Handling data is a crucial computational challenge today, arising in a variety of domains ranging from computational geometry to data mining and from optimization to machine learning. The course focuses on search and retrieval, clustering, and sampling in spaces of high dimension. The course surveys the best current algorithms and implementations. Every section concludes with open questions of both a theoretical and a practical nature.
Organization and Syllabus
The course is organized in four 4-hour lectures, from Tuesday 7/3 to Thursday 16/3, with a final project whose topic students choose from among those discussed – schedule.
Search in high dimensions and randomized kd-trees
We survey work in Approximate Nearest Neighbor search. Tree-based methods offer good practical performance and optimal space usage, but their query time complexity is exponential in the dimension d. We focus on kd-trees and show how their worst-case behavior can be avoided by randomization, leading to data structures that can efficiently handle objects in dimensions up to 1,000 or 10,000.
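To make the randomization idea concrete, here is a minimal sketch (not the course's reference implementation): a kd-tree whose split coordinate is drawn at random among the coordinates of largest spread, queried with a single root-to-leaf "defeatist" descent. The names `build_tree` and `query` are illustrative; practical libraries build several such trees and keep the best answer.

```python
import random

def build_tree(points, leaf_size=8):
    """Randomized kd-tree: the split coordinate is chosen at random
    among the few coordinates of largest spread, so no fixed input
    forces worst-case behavior on every tree."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    d = len(points[0])
    spreads = [(max(p[i] for p in points) - min(p[i] for p in points), i)
               for i in range(d)]
    top = [i for _, i in sorted(spreads, reverse=True)[:3]]
    axis = random.choice(top)
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"axis": axis, "split": points[mid][axis],
            "lo": build_tree(points[:mid], leaf_size),
            "hi": build_tree(points[mid:], leaf_size)}

def query(tree, q):
    """Defeatist search: follow one root-to-leaf path, scan the leaf."""
    while "leaf" not in tree:
        tree = tree["lo"] if q[tree["axis"]] < tree["split"] else tree["hi"]
    return min(tree["leaf"],
               key=lambda p: sum((a - b) ** 2 for a, b in zip(p, q)))
```

Building a small forest of independently randomized trees and returning the closest point found in any of them boosts the probability of recovering the true nearest neighbor.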
Locality-Sensitive Hashing avoids the curse of dimensionality
Geometric approximation algorithms have long sought to address the curse of dimensionality in similarity search. The main paradigm for obtaining complexity bounds polynomial in d has been Locality-Sensitive Hashing (LSH). We discuss algorithms and data structures based on this approach for vector and metric spaces, and present related implementation issues.
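As one concrete LSH family, the following sketch (an assumption-laden toy, not the course's implementation) uses the standard random-hyperplane scheme for cosine similarity: each of k random hyperplanes contributes one bit of the hash, and several independent tables are combined at query time. The class name `HyperplaneLSH` and all parameters are illustrative.

```python
import random
from collections import defaultdict

class HyperplaneLSH:
    """Random-hyperplane LSH: nearby vectors (in angle) agree on each
    hash bit with higher probability, so they collide in some table."""
    def __init__(self, dim, k=12, n_tables=8, seed=0):
        rng = random.Random(seed)
        # one set of k random hyperplanes per table
        self.planes = [[[rng.gauss(0, 1) for _ in range(dim)]
                        for _ in range(k)] for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _hash(self, planes, v):
        # one bit per hyperplane: which side of the hyperplane v lies on
        return tuple(sum(a * b for a, b in zip(h, v)) >= 0 for h in planes)

    def insert(self, v):
        for planes, table in zip(self.planes, self.tables):
            table[self._hash(planes, v)].append(v)

    def query(self, q):
        # candidates: union of the buckets q falls into, across tables
        cands = []
        for planes, table in zip(self.planes, self.tables):
            cands.extend(table[self._hash(planes, q)])
        if not cands:
            return None
        return min(cands, key=lambda p: sum((a - b) ** 2
                                            for a, b in zip(p, q)))
```

The trade-off is visible in the parameters: larger k makes buckets purer but sparser, while more tables recover recall at the cost of space.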
Randomized projections and the Johnson-Lindenstrauss Lemma
Random projections to lower-dimensional spaces offer sublinear query time, linear complexity in d, and optimal space consumption, which can be important in applications. We discuss generalizations of the seminal Johnson-Lindenstrauss lemma as well as simple data structures. Experiments with d up to 1,000 and n up to 1 million show performance better than brute force and comparable to LSH.
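The basic construction behind the lemma is simple enough to sketch directly (a minimal illustration, with a plain Gaussian matrix; practical variants use sparse or structured matrices): multiply the data by a random k × d Gaussian matrix scaled by 1/√k, and pairwise distances are preserved up to a 1±ε factor with high probability when k = O(log n / ε²).

```python
import math
import random

def random_projection(points, k, seed=0):
    """JL-style embedding: project d-dimensional points to k dimensions
    via a random Gaussian matrix scaled by 1/sqrt(k), which preserves
    pairwise Euclidean distances up to 1 +/- eps w.h.p."""
    rng = random.Random(seed)
    d = len(points[0])
    R = [[rng.gauss(0, 1) / math.sqrt(k) for _ in range(d)]
         for _ in range(k)]
    return [[sum(r[i] * p[i] for i in range(d)) for r in R]
            for p in points]
```

Note that the target dimension k depends only on the number of points n and the distortion ε, not on the original dimension d, which is the source of the complexity bounds linear in d.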
Clustering in vector and metric spaces by extending k-means
We discuss clustering and data organization in high-dimensional vector or metric spaces, based on the successful k-means criterion and its variations, such as k-medoids. We survey standard tools, including methods for estimating the number of clusters, as well as techniques to address issues that arise in implementation. We discuss recent advances and software that clusters 100 million images in less than an hour.
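For reference, the k-means criterion is usually optimized by Lloyd iterations; the sketch below is a minimal version (illustrative only — large-scale systems add k-means++ seeding and approximate nearest-neighbor assignment, which is how the image-scale results above become feasible).

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd iterations for the k-means criterion:
    alternate assigning points to their nearest center and
    moving each center to the mean of its cluster."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        # empty clusters keep their old center
        centers = [[sum(c) / len(cl) for c in zip(*cl)] if cl
                   else centers[j] for j, cl in enumerate(clusters)]
    return centers, clusters
```

Replacing the mean with the medoid (the cluster member minimizing total distance to the rest) gives k-medoids, which needs only pairwise distances and thus works in general metric spaces.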
Geometric random walks in general-dimensional polytopes
Sampling is crucial when dealing with big data, and serves as a preprocessing step in many algorithms. We focus on convex polytopes and volume approximation. The method of choice is to sample points uniformly by random walks, in particular Hit-and-run. We discuss the algorithmic aspects, and report on experiments with d up to 200.
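The Hit-and-run step itself is short enough to sketch (a minimal illustration for a polytope given as {x : Ax ≤ b}; production codes add rounding and careful mixing-time analysis): from the current interior point, draw a uniform random direction, intersect that line with the polytope, and jump to a uniform point on the resulting chord.

```python
import math
import random

def hit_and_run(A, b, x, steps=1000, seed=0):
    """Hit-and-run walk in the polytope {x : A x <= b}.
    Each step picks a uniform random direction, computes the chord
    where the line through x meets the polytope, and moves to a
    uniform point on that chord."""
    rng = random.Random(seed)
    d = len(x)
    x = list(x)
    for _ in range(steps):
        u = [rng.gauss(0, 1) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in u))
        u = [c / norm for c in u]
        # chord endpoints: largest interval [lo, hi] with x + t*u feasible
        lo, hi = -math.inf, math.inf
        for a_row, b_i in zip(A, b):
            au = sum(a * c for a, c in zip(a_row, u))
            ax = sum(a * c for a, c in zip(a_row, x))
            if abs(au) < 1e-12:
                continue  # line parallel to this facet
            t = (b_i - ax) / au
            if au > 0:
                hi = min(hi, t)
            else:
                lo = max(lo, t)
        t = rng.uniform(lo, hi)
        x = [xi + t * ui for xi, ui in zip(x, u)]
    return x
```

Volume approximation then runs such walks inside a sequence of nested bodies and multiplies the estimated ratios of successive volumes.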
Prof. Ioannis Emiris obtained his BScEng from Princeton University in 1989, and his MSc and PhD from UC Berkeley in 1991 and 1994, respectively, all in Computer Science. He has been a tenured researcher at INRIA Sophia-Antipolis (France) since 1995, on leave since 2002, when he joined the Department of Informatics and Telecommunications at the National and Kapodistrian University of Athens. Since 2012 he has also been an adjunct researcher at the ATHENA Research & Innovation Center. His research interests include Computer Algebra, Geometric Computing, Data Science, and applications to Robotics and Bioinformatics. He is the founder and director of the Lab on Geometric and Algebraic Algorithms, and has been involved in 6 European research projects, including a Sklodowska-Curie Network he currently coordinates and two FET-Open STREPs, as well as a number of bilateral and national research projects. He has been adviser to 11 PhD students and 8 postdoctoral fellows. He is the author of one book in Computational Geometry (in Greek), co-editor of two volumes published by Springer, and author of more than 100 journal articles and peer-reviewed conference publications, of which two earned the Distinguished Paper Award at the ACM International Symposium on Symbolic & Algebraic Computation (ISSAC). He is an associate editor of “Journal of Symbolic Computation” (Elsevier) and “Mathematics for Computer Science” (Birkhauser), and has been guest co-editor of “Theoretical Computer Science”, “Computational Geometry: Theory & Applications”, and “J. Symbolic Computation”. He was Program Committee Chair of ISSAC 2011, and an Invited Speaker at ISSAC 2016.