ICML 2021 CountSketches, Feature Hashing and the Median of Three Spotlight

Kasper Green Larsen · Rasmus Pagh · Jakub Tětek

Abstract: In this paper, we revisit the classic CountSketch method, which is a sparse, random projection that transforms a (high-dimensional) Euclidean vector $v$ to a vector of dimension $(2t-1)s$, where $t, s > 0$ are integer parameters. It is known that a CountSketch allows estimating coordinates of $v$ with variance bounded by $\|v\|_2^2/s$. For $t > 1$, the estimator takes the median of $2t-1$ independent estimates, and the probability that the estimate is off by more than $2\|v\|_2/\sqrt{s}$ is exponentially small in $t$. This suggests choosing $t$ to be logarithmic in a desired inverse failure probability. However, implementations of CountSketch often use a small, constant $t$. Previous work only predicts a constant factor improvement in this setting. Our main contribution is a new analysis of CountSketch, showing an improvement in variance to $O(\min\{\|v\|_1^2/s^2,\|v\|_2^2/s\})$ when $t > 1$. That is, the variance decreases proportionally to $s^{-2}$, asymptotically for large enough $s$.
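
To make the construction in the abstract concrete, here is a minimal sketch of a CountSketch with $2t-1$ rows of $s$ buckets each, where a coordinate is estimated as the median of the $2t-1$ per-row estimates. It is an illustration under stated assumptions, not the authors' implementation: the class name `CountSketch`, its method names, and the use of fully random lookup tables in place of compact hash functions are all choices made here for brevity.

```python
import random
from statistics import median


class CountSketch:
    """Illustrative CountSketch: (2t-1) rows of s buckets each."""

    def __init__(self, dim, t, s, seed=0):
        rng = random.Random(seed)
        self.rows = 2 * t - 1
        self.s = s
        # Assumption: fully random bucket and sign tables over a known
        # dimension. Practical implementations would use hash functions.
        self.bucket = [[rng.randrange(s) for _ in range(dim)]
                       for _ in range(self.rows)]
        self.sign = [[rng.choice((-1, 1)) for _ in range(dim)]
                     for _ in range(self.rows)]
        # The sketch itself: (2t-1) * s counters in total.
        self.table = [[0.0] * s for _ in range(self.rows)]

    def update(self, i, value):
        """Add `value` to coordinate i of the sketched vector."""
        for r in range(self.rows):
            self.table[r][self.bucket[r][i]] += self.sign[r][i] * value

    def sketch(self, v):
        """Add every nonzero coordinate of v to the sketch."""
        for i, x in enumerate(v):
            if x:
                self.update(i, x)

    def estimate(self, i):
        """Median of the 2t-1 independent per-row estimates of v[i]."""
        return median(
            self.sign[r][i] * self.table[r][self.bucket[r][i]]
            for r in range(self.rows)
        )


# Usage: with a single nonzero coordinate there are no collisions to
# disturb it, so its estimate is exact.
v = [0.0] * 100
v[7] = 5.0
cs = CountSketch(dim=100, t=2, s=16, seed=1)
cs.sketch(v)
print(cs.estimate(7))  # 5.0
```

Setting `t=1` gives a single row, so the median reduces to the plain one-row estimate; `t=2` gives the median of three referred to in the title.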
