Paper ID: 1120
Title: Recycling Randomness with Structure for Sublinear time Kernel Expansions

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes constructing Fastfood-style approximate kernel embeddings using more general classes of structured random matrices than the Hadamard matrices employed there, namely matrices of low displacement rank. It provides bounds on the performance of doing so, as well as an empirical evaluation.

Clarity - Justification:
The initial development of the approach is clear. Some discussion of why densifying the inputs with D_1 H D_0 is desirable might be useful, though it is discussed in the Fastfood paper. What is the function s in the last step of the algorithm in Section 3.3? Does it correspond to the S matrix of Fastfood? Probably the x in the second step of Section 3.3 should be x'.

Significance - Justification:
Generalizing the Fastfood approach certainly seems worthwhile, and the empirical advantages, though not enormous, seem reasonable on most of the examined datasets. The theoretical results, though, are somewhat limited (as discussed below).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The main theorem is somewhat unsatisfying: although the proposed methods seem to mostly outperform Fastfood, the bounds appear to be substantially worse for the proposed approaches than for Fastfood (since p_struct = 0 in that case, and the other terms are identical or, in the case of p_wrong, nearly identical). It is also somewhat unclear exactly how the bound scales with some of the variables, since there are two interdependent free parameters T and \varepsilon. It might be informative to numerically plot the bound optimized over T and \varepsilon for a few values of n, m, and d, since the bound otherwise depends only on the approximation scheme. It would also be preferable to note that p_struct depends on the chromatic number only through \sum_i \chi(i, i) and \sum_{i \le j} \chi(i, j), as well as its maximum \chi[P], by moving the summations in (10) after the exponential terms to clarify that nothing in those terms depends on i or j.

Figure 2 stops at a fairly small number of random features. It might be better to switch to a log scale for the reconstruction error and continue the plots up to perhaps 2048 or so, since people typically use more features than that. If maintaining the linear scale, it would probably be clearer to make the y-axis start at 0 rather than 0.015. How does Fastfood-DCT compare in terms of results (Table 1/Figure 2)?

The Matlab fwht implementation, as you remarked, is not very optimized; a C implementation, especially a highly tuned one like that from the Spiral project, gives better performance.

(I did not thoroughly review the proofs in the supplementary material.)

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a family of approximations for the product of a random Gaussian matrix and a given vector. The displacement rank, a tuning parameter, trades off computation against quality of approximation. The paper shows both theoretically and empirically that the proposed construction approximates the effect of multiplication by random Gaussian matrices better than existing alternatives such as the Fastfood construction.

Clarity - Justification:
The paper is well written, and Figure 1 in particular is extremely useful in understanding the construction.
However, I wonder if some generality could be sacrificed for more clarity. Is the P-model a standard construction in the literature, or is it something the authors came up with so that Fastfood and their proposal can be treated in the same framework?

Significance - Justification:
The paper provides a construction with a tuning knob. Previously, one could either choose the slow exact algorithm, or fast but less accurate alternatives such as Fastfood.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
In Table 1 it might be better to report the results as differences from the classification error of the exact kernel. Another possibility for comparison is the heuristic in "Practical and Optimal LSH for Angular Distance", where multiplication by a Gaussian matrix is replaced with H D_1 H D_2 H D_3 (H the Hadamard matrix, D_i i.i.d. diagonal Rademacher matrices).

The lines in Figures 2 and 3 cannot be distinguished in a black-and-white print, and the blue lines cannot be distinguished even in a color print.

Minor:
Line 174: stablishes -> establishes
Line 186: $i$ is previously used as the imaginary unit. Please reconcile the notation.
Line 269: Golub & Loan -> Golub & Van Loan
Line 358: recycle The Gaussian -> recycle the Gaussian

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes acceleration techniques for kernel machines, following previous approaches based on random embeddings. It includes the Fastfood construction as a special case and extends to other matrices such as circulant, Toeplitz, and Hankel matrices, as well as the broader family of structured matrices with low displacement rank. The improvement comes from replacing the Gaussian matrices with structured matrices that recycle the sampled bases. The authors prove that the resulting method is unbiased, with a slight increase in variance. The key observation is Theorem 2.2, which covers all of the different linear combinations of Toeplitz-like matrices for different values of $r$. Because these matrices are Toeplitz-like, FFT-like methods can be used to accelerate matrix-vector multiplication (useful in feature generation); see the brief sketch after this summary. Section 3 introduces the P-model, a random matrix functional defined on a sequence of $m$ matrices (each of these matrices has a column of unit norm) together with a parameter that trades off computation against uncertainty. The trick is to use a single randomly sampled Gaussian vector of dimensionality $t$ (this dimensionality is the trade-off parameter). A coherence measure is defined on the P-model; it measures the pairwise correlation of the columns of the $P$ matrices that define the sequence. A corresponding coherence graph can be defined, in which each vertex corresponds to a pair of columns whose dot product is non-zero, and an edge connects two vertices whose 2-element subsets intersect. As with the coherence measure, a smaller chromatic number is better. The sequence of matrices comprising the P-model can be either deterministic or random. Section 3.2 lists examples of these and demonstrates that the Fastfood construction is a special case. The authors demonstrate that the P-model can be used to generate more efficient random feature maps.
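To make the point about fast multiplication concrete, here is a minimal sketch of the standard FFT trick in the circulant special case (my own illustration, not the authors' code; the numpy/scipy usage is my assumption and nothing here is taken from the paper). Toeplitz and Hankel products reduce to this case by embedding into a larger circulant matrix.

# Sketch: multiplying a circulant matrix by a vector in O(n log n) time via the FFT,
# using the identity C x = ifft(fft(c) * fft(x)), where c is the first column of C.
import numpy as np
from scipy.linalg import circulant  # dense construction, used only to verify the result

def circulant_matvec(c, x):
    """Compute C @ x for the circulant matrix C with first column c,
    without ever forming the dense n x n matrix."""
    return np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real

rng = np.random.default_rng(0)
n = 8
c = rng.standard_normal(n)   # first column defining C
x = rng.standard_normal(n)   # input vector

assert np.allclose(circulant_matvec(c, x), circulant(c) @ x)  # matches the O(n^2) dense product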
Clarity - Justification:
I did not check all of the proofs carefully, but I believe the paper offers an interesting way of putting together previous techniques, as well as a unifying view of previous random-feature-based methods. I do have a minor complaint in the main section: why do "Circulant" and "Skew-circulant" use upper-case letters? My major complaints are in the experimental section:
- Figures 2 and 3 are extremely crowded.
- Only three datasets are considered.

Significance - Justification:
An interesting approach for unifying random-feature-based kernel machines via a graph-theoretic analysis.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I do recommend augmenting the experimental results to demonstrate the efficiency of the proposed method.

=====