Paper ID: 175
Title: Binary embeddings with structured hashed projections

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper considers memory-efficient nonlinear embeddings that perform a linear projection followed by a nonlinear hashing step. The linear projections considered are memory efficient in that they require only O(n) Gaussian random values, rather than O(kn) as in Johnson-Lindenstrauss. The paper provides guarantees showing that angles are approximately preserved with high probability. After describing the method and theoretical guarantees, the paper presents experimental results on MNIST data in which the data is first pre-processed in this way before either (a) running a deep learning network or (b) running a nearest-neighbor classifier.

Clarity - Justification:
The English is rough and could use a pass by a native English speaker. It is also hard to tell exactly what is going on until page 3. It would additionally help to make some of the definitions, and some of the statements in the proofs, more precise.

Significance - Justification:
This paper has a number of positives. First, proving a Johnson-Lindenstrauss-type result for new kinds of projections is challenging and could have a number of applications. The mapping given here is non-linear (a linear projection followed by a non-linear hashing step), so such results are even more challenging. On the negative side, (a) it is not quite clear what the projection buys you, (b) the writing is difficult to follow, and (c) the experiments seem mostly to show that the more you compress, the worse you do: is that the take-home message?

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Some comments in addition to the above:

Definition 3.1: Do you mean any number of left/right shifts? Exactly one left/right shift? Do you allow zero shifts?

Remark 3.1: Why do you say "special type"? It seems to me they are the same. Can you elaborate?

Remark 3.2: Do I understand correctly from this remark that the definition of circulant Gaussian requires all left-shifts or all right-shifts?

Overall: the results and direction seem interesting. However, the writing is difficult to follow. Also, it was not clear what the take-home message from the experiments was, other than that performance gets worse as you compress to lower dimensions.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose to do binary embeddings (or hashing) by thresholding the results obtained by applying a 'structured' random projection to the input data. They analyze such embeddings and show that they can preserve angles between data points, thus extending Johnson-Lindenstrauss (JL)-type results to structured random matrices. They also apply such a transformation as the first layer in a multi-layer neural network and show little degradation in performance while improving the time complexity.
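For concreteness, the following minimal sketch (my own illustration, not the authors' code) shows the kind of embedding described above: a circulant Gaussian projection, realized with an FFT so that only a single length-n Gaussian vector is stored, followed by a sign non-linearity. The function name, dimensions, and FFT-based realization are assumptions on my part.

    import numpy as np

    def circulant_sign_embedding(x, g, k):
        """Binary embedding sign(C x)[:k], where C is the circulant matrix whose
        first column is the Gaussian vector g, so only O(n) random values are
        stored. C x is computed in O(n log n) via the FFT (circular convolution)
        instead of materializing the n x n matrix."""
        proj = np.fft.ifft(np.fft.fft(g) * np.fft.fft(x)).real
        return np.sign(proj[:k])

    # Toy check: the fraction of disagreeing bits should be close, on average,
    # to angle(x, y) / pi, mirroring the angle-preservation guarantee.
    rng = np.random.default_rng(0)
    n, k = 1024, 256
    g = rng.standard_normal(n)            # one Gaussian vector, reused for all rows
    x = rng.standard_normal(n)
    y = x + 0.3 * rng.standard_normal(n)  # a nearby point
    hx, hy = circulant_sign_embedding(x, g, k), circulant_sign_embedding(y, g, k)
    angle = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
    print(np.mean(hx != hy), angle / np.pi)

With an i.i.d. Gaussian matrix in place of C this is exactly standard sign random projection (SimHash); the interesting question, as the comments below discuss, is how much the dependent rows and the sign non-linearity complicate the concentration argument.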
Clarity - Justification:
The paper is written relatively clearly.

Significance - Justification:
The authors address the important problem of analyzing properties of random projections with a limited budget of randomness. The authors also use interesting proof techniques.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
* I would like to disclose to the authors that I have reviewed a previous version of this manuscript.
* I would like to thank the authors for addressing many of the concerns I raised before.

As mentioned before, I like the overall idea of using and analyzing structured random matrices in settings where a fixed budget of randomness is necessary. The paper is well structured and would be very instructive to readers who are interested in the topic. I have a few suggestions below that I believe the authors should discuss in their text.

The key result, Corollary 4.1, is now presented more clearly. But, as indicated before, there are certain odd aspects of the bound. Specifically, the second term in the multiplier in the lower bound (i.e., line 603, the A term in ">= [1 - A - B] · [1 - C]") cannot be controlled regardless of the projection dimension. The authors should discuss this issue in the text and perhaps explain why it arises in their analysis.

A stark difference between the JL lemma and this work is the inclusion of a non-linearity as a final step. Since the authors present this work as an extension of the JL lemma, I think it would be beneficial to discuss exactly how the non-linearity helps (or hinders) the analysis. It would also be nice to discuss whether such non-linearities are necessary (or make the analysis easier) when studying concentration of angles for general Toeplitz matrices.

Minor typos:
- Remark 3.3 runs on: the subsequent sentences seem to be merged into the remark.
- Line 534: 'project then onto' should be 'project them onto'.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors discuss sign random projections for dimensionality and storage reduction. The main novelty of the paper is that the random projections are not based on standard i.i.d. Gaussian matrices, but rather on structured random matrices (circulant, Toeplitz), which lead to a further reduction in storage and computing time (fast matrix multiplications). The authors show a Johnson-Lindenstrauss-type result for the proposed scheme and empirically evaluate it on MNIST using a neural network and a one-nearest-neighbor classifier on the reduced data.

Clarity - Justification:
The paper is well written and structured; figures and tables are well organized.

Significance - Justification:
The paper clearly builds on well-known ideas such as sign random projections and structured matrices, which have been proposed, e.g., for compressed sensing, as the authors themselves state. The main theoretical contribution of the paper is a unified result covering a wide range of structured random matrices. In summary: slightly above average.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Maybe the authors can comment more on the rate k^{-1/3} in the theoretical results; this is slower than expected. Also, the variance in Fig. 5 does not decay linearly with k. Is this a consequence of using structured random matrices in place of standard Gaussian matrices?

=====
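A side note on the fast-multiplication point in Review #3's summary: Toeplitz projections also admit an O(n log n) matrix-vector product via the standard embedding into a circulant matrix of twice the size. The sketch below is my own illustration of that well-known trick (the function name and dimensions are mine), not code from the paper.

    import numpy as np

    def toeplitz_matvec(c, r, x):
        """Multiply an n x n Toeplitz matrix T (first column c, first row r,
        with c[0] == r[0]) by x in O(n log n), by embedding T into the
        upper-left block of a 2n x 2n circulant matrix and using the FFT."""
        n = len(x)
        col = np.concatenate([c, [0.0], r[1:][::-1]])  # first column of the circulant
        xx = np.concatenate([x, np.zeros(n)])          # zero-padded input
        return np.fft.ifft(np.fft.fft(col) * np.fft.fft(xx)).real[:n]

    # Sanity check against the dense Toeplitz product.
    rng = np.random.default_rng(1)
    n = 256
    c, r = rng.standard_normal(n), rng.standard_normal(n)
    r[0] = c[0]
    x = rng.standard_normal(n)
    T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)] for i in range(n)])
    assert np.allclose(toeplitz_matvec(c, r, x), T @ x)

This keeps the O(n) storage and O(n log n) multiplication benefits highlighted in Review #3 even when the projection is Toeplitz rather than circulant.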