

Session

Dimensionality Reduction 2

Thu 12 July 8:00 - 8:20 PDT

Representation Tradeoffs for Hyperbolic Embeddings

Frederic Sala · Christopher De Sa · Albert Gu · Christopher Re

Hyperbolic embeddings offer excellent quality with few dimensions when embedding hierarchical data structures. We give a combinatorial construction that embeds trees into hyperbolic space with arbitrarily low distortion without optimization. On WordNet, this algorithm obtains a mean-average-precision of 0.989 with only two dimensions, outperforming existing work by 0.11 points. We provide bounds characterizing the precision-dimensionality tradeoff inherent in any hyperbolic embedding. To embed general metric spaces, we propose a hyperbolic generalization of multidimensional scaling (h-MDS). We show how to perform exact recovery of hyperbolic points from distances, provide a perturbation analysis, and give a recovery result that enables us to reduce dimensionality. Finally, we extract lessons from the algorithms and theory above to design a scalable PyTorch-based implementation that can handle incomplete information.
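As a rough illustration of the combinatorial idea (placing each node's children on a hyperbolic circle around it, so sibling subtrees spread apart rapidly in the Poincaré disk), here is a minimal 2-D Python sketch. It is not the paper's construction: the function names, the radius parameter `tau`, and the omission of the angular bookkeeping that keeps sibling subtrees from overlapping are all simplifying assumptions.

```python
import numpy as np

def poincare_dist(u, v):
    """Hyperbolic distance between points u, v inside the unit (Poincare) disk."""
    num = 2.0 * np.sum((u - v) ** 2)
    den = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + num / den)

def reflect_through(z, x):
    """Circle inversion (a hyperbolic isometry) that swaps z and the origin, applied to x."""
    c = z / np.sum(z ** 2)            # center of the inversion circle
    r2 = np.sum(c ** 2) - 1.0         # squared radius; the circle meets the unit circle at right angles
    return c + r2 * (x - c) / np.sum((x - c) ** 2)

def embed_tree(children, root, tau=0.8):
    """Place each node's children on a circle of Euclidean radius tau around it.

    children: dict mapping a node id to the list of its child ids.
    tau in (0, 1): larger values push children further out in hyperbolic distance,
    which lowers distortion. The angular bookkeeping that keeps sibling subtrees
    from overlapping is omitted in this sketch.
    """
    emb = {root: np.zeros(2)}
    stack = [root]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        for i, kid in enumerate(kids):
            theta = 2.0 * np.pi * (i + 1) / (len(kids) + 1)
            p = tau * np.array([np.cos(theta), np.sin(theta)])  # position if `node` sat at the origin
            if np.any(emb[node]):
                p = reflect_through(emb[node], p)               # carry it over to the actual parent
            emb[kid] = p
            stack.append(kid)
    return emb

# Tiny example: root 0 with children 1 and 2, and a grandchild 3 under 1.
emb = embed_tree({0: [1, 2], 1: [3]}, root=0)
print(poincare_dist(emb[0], emb[1]), poincare_dist(emb[1], emb[3]))
```

Because the reflection is a hyperbolic isometry, every parent-child pair ends up at the same hyperbolic distance, while the exponential growth of hyperbolic circles leaves room for arbitrarily bushy trees in only two dimensions.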

Thu 12 July 8:20 - 8:40 PDT

Massively Parallel Algorithms and Hardness for Single-Linkage Clustering under $\ell_p$ Distances

Grigory Yaroslavtsev · Adithya Vadapalli

We present the first massively parallel computation (MPC) algorithms and hardness-of-approximation results for computing Single-Linkage Clustering of $n$ input $d$-dimensional vectors under Hamming, $\ell_1$, $\ell_2$ and $\ell_\infty$ distances. All our algorithms run in $O(\log n)$ rounds of MPC for any fixed $d$ and achieve a $(1+\epsilon)$-approximation for all distances (except Hamming, for which we give an exact algorithm). We also show constant-factor inapproximability results for $o(\log n)$-round algorithms under standard MPC hardness assumptions (for sufficiently large dimension, depending on the distance used). We demonstrate the efficiency of our Apache Spark implementations through experiments on the largest available vector datasets from the UCI machine learning repository, observing speedups of several orders of magnitude.
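The sketch below shows only the classical sequential connection the paper builds on: single-linkage clustering into $k$ clusters is obtained by building a minimum spanning tree of the pairwise-distance graph and cutting its $k-1$ heaviest edges. The Borůvka-style rounds loosely mirror the $O(\log n)$-round flavor of parallel MST algorithms, but this is not the authors' MPC algorithm, and all names and parameters are illustrative.

```python
import numpy as np
from itertools import combinations

def single_linkage(points, k, p=2):
    """Single-linkage clustering of `points` into k clusters under the l_p metric.

    Classical reduction: build the minimum spanning tree of the complete distance
    graph, then remove its k-1 heaviest edges; the remaining connected components
    are exactly the single-linkage clusters. The MST is built with Boruvka-style
    rounds, in which every component grabs its lightest outgoing edge.
    """
    n = len(points)
    edges = sorted(
        (np.linalg.norm(np.asarray(points[i]) - np.asarray(points[j]), ord=p), i, j)
        for i, j in combinations(range(n), 2)
    )
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    while len(mst) < n - 1:                      # Boruvka rounds: O(log n) of them
        best = {}                                # component -> lightest outgoing edge
        for w, i, j in edges:
            ci, cj = find(i), find(j)
            if ci != cj:
                if ci not in best or w < best[ci][0]:
                    best[ci] = (w, i, j)
                if cj not in best or w < best[cj][0]:
                    best[cj] = (w, i, j)
        for w, i, j in set(best.values()):
            ci, cj = find(i), find(j)
            if ci != cj:
                parent[ci] = cj
                mst.append((w, i, j))

    # Cut the k-1 heaviest MST edges, then read off the components as labels.
    keep = sorted(mst)[: n - k]
    parent = list(range(n))
    for w, i, j in keep:
        parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

labels = single_linkage([[0, 0], [0, 1], [5, 5], [5, 6]], k=2)
print(labels)  # points 0 and 1 share one label, points 2 and 3 share the other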

Thu 12 July 8:40 - 8:50 PDT

Local Density Estimation in High Dimensions

Xian Wu · Moses Charikar · Vishnu Natchu

An important question that arises in the study of high-dimensional vector representations learned from data is: given a set $D$ of vectors and a query $q$, estimate the number of points in $D$ within a specified distance threshold of $q$. Our algorithm preprocesses the data with locality sensitive hashing and answers such queries accurately and efficiently via an unbiased estimator based on importance sampling. A key innovation is the ability to maintain a small number of hash tables via preprocessing data structures and algorithms that sample from multiple buckets in each hash table. We give bounds on the space requirements and query complexity of our scheme, and demonstrate the effectiveness of our algorithm through experiments on a standard word embedding dataset.
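A minimal sketch of the inverse-collision-probability idea behind such LSH-based counting estimators: every point found in the query's bucket is weighted by one over its probability of colliding with the query, which makes the resulting count unbiased. This sketch uses plain SimHash tables and scans whole buckets; the paper's data structure, which samples from multiple buckets to keep the number of hash tables small, is more involved. All names and parameters here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_tables(data, n_tables=10, n_bits=8):
    """SimHash tables: each table hashes a vector to the sign pattern of n_bits random hyperplanes."""
    planes = rng.standard_normal((n_tables, n_bits, data.shape[1]))
    tables = []
    for t in range(n_tables):
        buckets = {}
        codes = (data @ planes[t].T > 0)
        for idx, code in enumerate(codes):
            buckets.setdefault(code.tobytes(), []).append(idx)
        tables.append(buckets)
    return planes, tables

def collision_prob(angle, n_bits):
    """Probability that q and x land in the same SimHash bucket (n_bits hyperplanes)."""
    return (1.0 - angle / np.pi) ** n_bits

def estimate_count(q, data, planes, tables, max_angle):
    """Unbiased estimate of |{x : angle(q, x) <= max_angle}|, averaged over tables:
    each in-bucket neighbor contributes 1 / (its collision probability with q)."""
    estimates = []
    for t, buckets in enumerate(tables):
        code = (planes[t] @ q > 0).tobytes()
        total = 0.0
        for idx in buckets.get(code, []):
            x = data[idx]
            cos = np.clip(q @ x / (np.linalg.norm(q) * np.linalg.norm(x)), -1.0, 1.0)
            angle = np.arccos(cos)
            if angle <= max_angle:
                total += 1.0 / collision_prob(angle, planes.shape[1])
        estimates.append(total)
    return float(np.mean(estimates))

# Toy usage on random vectors.
data = rng.standard_normal((2000, 50))
q = rng.standard_normal(50)
planes, tables = build_tables(data)
print(estimate_count(q, data, planes, tables, max_angle=np.pi / 3))
```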

Thu 12 July 8:50 - 9:00 PDT

Improving Sign Random Projections With Additional Information

Keegan Kang · Wei Pin Wong

Sign random projections (SRP) is a technique that allows the user to quickly estimate the angular similarity and inner products between data points. We propose using additional information to improve these estimates; the resulting method is easy to implement and cost-efficient. We prove that the variance of our estimator is lower than the variance of SRP. Our proposed method can also be used together with other modifications of SRP, such as Super-Bit LSH (SBLSH). We demonstrate the effectiveness of our method on the MNIST test dataset and the Gisette dataset. We discuss how our proposed method can be extended to random projections or even other hashing algorithms.
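For reference, here is a minimal sketch of the baseline SRP estimator that this kind of work improves on: the fraction of random hyperplanes on which two vectors' signs disagree is an unbiased estimate of angle/π. The variance-reduction step using additional information is not reproduced here; names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def srp_signs(X, n_proj=256):
    """Sign random projections: sign patterns of the rows of X under n_proj random hyperplanes."""
    R = rng.standard_normal((X.shape[1], n_proj))
    return np.sign(X @ R)

def estimate_angle(sx, sy):
    """SRP angle estimate: P[sign disagreement on one projection] = angle / pi."""
    return np.pi * np.mean(sx != sy)

# Toy usage: compare the SRP estimate to the true angle.
x = rng.standard_normal(100)
y = x + 0.5 * rng.standard_normal(100)
S = srp_signs(np.vstack([x, y]))
est = estimate_angle(S[0], S[1])
true = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
print(est, true)
```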