Talk
Optimal Densification for Fast and Accurate Minwise Hashing
Anshumali Shrivastava

Wed Aug 9th 03:30 -- 03:48 PM @ C4.4

Minwise hashing is a fundamental hashing algorithm and one of the most successful in the literature. Recent advances based on the idea of densification~\cite{Proc:OneHashLSHICML14,Proc:ShrivastavaUAI14} have shown that it is possible to compute $k$ minwise hashes of a vector with $d$ nonzeros in a mere $(d + k)$ computations, a significant improvement over the classical $O(dk)$. These advances have led to an algorithmic improvement in the query complexity of traditional indexing algorithms based on minwise hashing. Unfortunately, the variance of the current densification techniques is unnecessarily high, which leads to significantly worse accuracy than vanilla minwise hashing, especially when the data is sparse. In this paper, we provide a novel densification scheme which relies on carefully tailored 2-universal hashes. We show that the proposed scheme is variance-optimal and that, without losing the runtime efficiency, it is significantly more accurate than existing densification techniques. As a result, we obtain a hashing scheme that is significantly more efficient yet has the same variance and collision probability as minwise hashing. Experimental evaluations on real sparse and high-dimensional datasets validate our claims. Given these significant advantages, we believe our method will replace minwise hashing implementations in practice.
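To make the idea of densification concrete, the following is a minimal Python sketch, not the authors' reference implementation. It assumes sets of non-negative integer IDs smaller than the modulus, hashes each nonzero once into one of $k$ bins while keeping the minimum per bin, and then fills each empty bin by probing with a 2-universal hash of the pair (bin index, attempt number) until a non-empty bin is found. All names (`make_hash`, `densified_sketch`, `jaccard_estimate`), the choice of modulus, and the way the pair is encoded as a single integer are illustrative assumptions.

```python
import random

P = (1 << 61) - 1  # Mersenne prime used as the hash modulus (an assumption)


def make_hash(rng):
    """Return h(x) = (a*x + b) mod P, drawn from a standard 2-universal family."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: (a * x + b) % P


def densified_sketch(nonzeros, k, seed=0):
    """Compute k hash values for a set of integer IDs in [0, P).

    The same seed must be used for every set that will be compared,
    so that all sets share the same hash functions.
    """
    rng = random.Random(seed)
    h_elem = make_hash(rng)  # hashes elements (one evaluation per nonzero)
    h_fill = make_hash(rng)  # picks donor bins for empty bins

    # Pass 1: one hash per nonzero; keep the minimum value in each bin.
    bins = [None] * k
    for x in nonzeros:
        v = h_elem(x)
        b = v * k // P  # split the hash range into k equal-width bins
        if bins[b] is None or v < bins[b]:
            bins[b] = v

    if all(v is None for v in bins):
        raise ValueError("cannot sketch an empty set")

    # Pass 2: fill every empty bin from a bin that was non-empty after pass 1.
    out = list(bins)
    for i in range(k):
        attempt = 1
        while out[i] is None:
            j = h_fill(attempt * k + i) % k  # encode (bin, attempt) as one int
            if bins[j] is not None:
                out[i] = bins[j]
            attempt += 1
    return out


def jaccard_estimate(s1, s2):
    """Fraction of positions at which two equal-length sketches agree."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)


A = set(range(0, 60))
B = set(range(30, 90))  # true Jaccard similarity with A is 30/90
sa = densified_sketch(A, k=128)
sb = densified_sketch(B, k=128)
print(jaccard_estimate(sa, sb))
```

Because the probe sequence for an empty bin depends only on the bin index and the attempt number, two sets consult the same candidate donor bins in the same order, which is what lets the filled-in positions still be compared across sets. Here the sets have fewer nonzeros than bins, so many bins start empty and are filled in the second pass.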

#### Author Information

##### Anshumali Shrivastava (Rice University)

Anshumali Shrivastava is an assistant professor in the computer science department at Rice University. His broad research interests include randomized algorithms for large-scale machine learning. In 2018, Science News named him one of the Top-10 scientists under 40 to watch. He is a recipient of a National Science Foundation CAREER Award, a Young Investigator Award from the Air Force Office of Scientific Research, and a machine learning research award from Amazon. His research on hashing inner products won the Best Paper Award at NIPS 2014, while his work on representing graphs won the Best Paper Award at IEEE/ACM ASONAM 2014. Anshumali completed his Ph.D. at Cornell University in 2015.