Paper ID: 1167
Title: Learning and Inference via Maximum Inner Product Search

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper applies MIPS to sample from log-linear models.

Clarity - Justification:
I would like to indicate a mixed opinion here. The background and setup are presented clearly (above average), while the evaluation is less clear (below average). The title should probably be revised to indicate the focus on log-linear models.

This paper includes a large amount of clearly presented background material, which this reviewer found useful. Some readers might want more guidance about where your contributions will appear (why am I being led through all this background, and when will it end?). For example, you could indicate that you will use Theorem 2.1 later (this is not obvious upfront).

The evaluation is somewhat misleading. The added preprocessing time for the HSKM variant is a concern -- the presented example requires 10 min of preprocessing, so HSKM MIPS only starts to beat the exact gradient method when run for more than about 150 iterations. A plot of total time (including preprocessing) versus iteration number might have captured this in a more informative way. How many iterations did you need to achieve the presented estimation results? Timing measurements for the exact MIPS method are conspicuously missing and would provide a useful comparison.

Significance - Justification:
The method presented is interesting and should be relevant to members of the ICML community. It presents a construction for obtaining samples from log-linear distributions that connects to many known techniques and tools (such as LSH and Gumbel variables). The empirical results are underwhelming (e.g., a 6x-7x speedup per gradient iteration). Some of the results (e.g., Table 4) are difficult to interpret, since they don't compare to other methods.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Some other comments:
- "Note that we can vary the distance between $S_1$ and $S_2$ to trade off accuracy and runtime." -- This would have been a very nice thing to evaluate.
- What do the error bars in your figures represent?
- You use red and blue in your figures -- adding a pattern would help (especially when printing in black and white).
- The red and green colors in the supplement aren't good for the colorblind -- you might as well use another pairing, such as red and blue.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper applies results and ideas from the Maximum Inner Product Search (MIPS) literature to the problem of sampling discrete random variables. In particular, it considers a sequence of sampling queries to related discrete distributions. The solution is to frame the task as a randomly perturbed MIPS (via the Gumbel-Max trick) and amortize the cost of queries after paying a fixed upfront cost.

Clarity - Justification:
I found the background section on log-linear models a bit long and detailed. It is a matter of preference, but I think the authors could compress it significantly.

The notation x_i in (12) and (16) was not clearly defined and is potentially confusing. I suggest defining a random function G(x) with G(x) ~ Gumbel i.i.d. for each x in X. This makes the operation argmax_x \theta \cdot \phi(x) + G(x) very clear (a rough sketch of this reading appears below).

The MIPS formulation in Sec 3.1 is the most important section of the paper, and it left me with some significant questions. I suggest expanding that section and making sure that it is as detailed and careful as the background section. In particular, some graphics might help (e.g., a graphical representation of the components in the MIPS formulation: the array of Gumbels G_ij, the queries q_j, etc.). During my first read-through I significantly misread this section: the preprocessing cost for t queries appeared to be O(Nkt), so claiming a runtime cost of O(f(N) t) seemed a bit disingenuous. I understand now that the most important variable here is the number of queries t, whereas I was considering the growth in terms of both t and N. I suggest clarifying this.
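For concreteness, here is a rough sketch of my reading of the formulation, with a brute-force argmax standing in for the approximate MIPS query. The toy sizes, the single Gumbel column, and all names (phi, theta, gumbel_max_sample) are mine, not the authors':

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy setting: X is small enough to enumerate, so phi[i] stands for phi(x_i).
    N, d = 1000, 5
    phi = rng.normal(size=(N, d))
    theta = rng.normal(size=d)

    def gumbel_max_sample(theta, phi, rng):
        """One exact sample from p_theta(x) propto exp(theta . phi(x)) via the Gumbel-Max trick."""
        g = rng.gumbel(size=phi.shape[0])   # G(x) ~ Gumbel(0, 1), i.i.d. for each x in X
        return np.argmax(phi @ theta + g)   # argmax_x theta . phi(x) + G(x)

    idx = gumbel_max_sample(theta, phi, rng)

    # The same argmax written as a single inner-product (MIPS-style) query:
    # items [phi(x), g_x] are fixed up front, and each theta gives a query [theta, 1].
    # (Reusing a fixed g across many queries is where my correlation question below comes from.)
    g = rng.gumbel(size=N)
    items = np.hstack([phi, g[:, None]])
    query = np.append(theta, 1.0)
    assert np.argmax(items @ query) == np.argmax(phi @ theta + g)

A graphic along these lines -- items [phi(x), g_x] on one side, queries [theta, 1] on the other -- is what I had in mind above.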
Significance - Justification:
The goal of amortizing the cost of inference over related queries is an important one, and the authors make that clear. A large outstanding question is: how correlated are the samples as a function of k? This seems to be a major point, and the appeal to the correlations present in MCMC was not satisfying.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
line 158: "(see equation 2)" is not really necessary
line 318: possibly clarify that you are scaling by max_v ||v||
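For reference, the transformation I have in mind at line 318 is the standard MIPS-to-MCSS reduction; the paper's version (Sec 2.3.1) may differ in details. With M = max_v ||v||:

    \tilde{v} = \Big[ \tfrac{v}{M} ;\ \sqrt{1 - \tfrac{\|v\|^2}{M^2}} \Big], \qquad
    \tilde{q} = \Big[ \tfrac{q}{\|q\|} ;\ 0 \Big], \qquad
    \tilde{q} \cdot \tilde{v} = \frac{q \cdot v}{M\,\|q\|}, \qquad
    \|\tilde{v}\| = \|\tilde{q}\| = 1,

so maximizing cosine similarity over the transformed vectors recovers the same argmax as the original MIPS problem; the scaling by max_v ||v|| is what keeps the appended coordinate real-valued.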
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper contributes an idea for reducing the problem of computing the partition function in log-linear models to a maximum inner product search (MIPS) problem. A log-linear model has a partition function (normalizer) whose evaluation requires summing over all possible discrete states. The idea is to turn the sum into a MIPS problem by paying an upfront cost of preprocessing all possible state vectors. The normalizer for a given model parameter can then be evaluated efficiently with any chosen MIPS algorithm. The paper also gives theoretical guarantees relating the normalizer and model probabilities obtained from an exact MIPS algorithm to those obtained from an approximate MIPS algorithm.

Clarity - Justification:
The paper is mostly well written and self-contained. It gives concise background material on maximum inner product search (MIPS) and log-linear models. The introduction and background sections are themselves a good quick review of this subfield. Nonetheless, the clarity could still be improved by mentioning the end goals of the paper in the background material. That is, it would be nice to remind the reader along the way why a particular background topic is introduced. For instance, the sections on "model averaging", "Gumbel variables", and "Maximum of Gumbels" pretty much appear out of context. Also, the known result given in Theorem 2.1 needs some brief explanation, as it is the basis for the two main results; the current version simply states the result in its technical form. The authors may consider removing or shortening Sec 2.3.1, "conversion of MIPS to MCSS", since it does not seem to affect the main contribution. The claims in the abstract and first half of the introduction are slightly too broad. In particular, they seem to suggest that the proposed framework works for any "probabilistic inference query" and a "large class of probabilistic models".

Significance - Justification:
Although the paper itself does not offer a complete solution to evaluating a normalizer (i.e., it only converts one problem into another), the idea of turning the evaluation of a normalizer into a MIPS problem can open up new research directions in using existing MIPS techniques to compute (or approximate) the normalizer.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
V is the set of N vectors over which the maximum inner product is searched given a query vector q. In the context of a log-linear model, V is the set of all possible realizations from the model. That is, if there are m discrete variables, then |V| is exponential in m. This important fact is completely hidden by the choice of notation, i.e., using max_i instead of clearly writing max_{x \in V}. I recommend editing to make this clear.

From the previous observation, is it correct that essentially the problem of summing over exponentially many terms is converted into taking a max over exponentially many items (if exact MIPS is used)? The paper should mention this important fact as well. Of course, the max operation will be more efficient if an approximate MIPS is used.

Still, the two preprocessing steps -- stacking Gumbel draws onto the items in V (as in Algorithm 1), and preprocessing by the chosen MIPS algorithm -- can be prohibitively expensive. If the number of variables m is small, one can simply compute the normalizer by a brute-force sum, so it is sensible to assume that the proposed idea is intended for large m. Since N = |V| is exponential in m, and preprocessing is typically polynomial in N, I wonder how well a MIPS algorithm can perform on a large (m) log-linear model.

Theorem 3.3 bounds the Z computed with an approximate MIPS algorithm in terms of the values from exact MIPS. I believe the current proof of the theorem is not valid. H is a random variable, and there is a confusion between e^{-H} and its expectation E[e^{-H}]. That is, Z^{-1} = E[e^{-H}], but not Z^{-1} = e^{-H} as stated. Please correct me if I am wrong.
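For concreteness, these are the standard Gumbel facts my reading relies on (my notation, not a quote from the paper; G(x) i.i.d. standard Gumbel):

    H = \max_{x \in X} \big( \theta \cdot \phi(x) + G(x) \big) \sim \mathrm{Gumbel}(\log Z_\theta), \qquad
    e^{-H} \sim \mathrm{Exponential}(\text{rate } Z_\theta), \qquad
    \mathbb{E}\big[ e^{-H} \big] = Z_\theta^{-1},

so a single realization e^{-H} is an unbiased estimator of Z_\theta^{-1}, not equal to it; this is the distinction I am pointing to, and it also underlies my Theorem 3.1 comment below.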
The experiments focus on the accuracy of the probabilities given by the model. Besides accuracy, it is equally important to consider the runtime as the number of theta's (as in the model averaging context) increases. I imagine a plot of runtime (or average query time, as stated at line 292) vs. the number of theta's. The paper should verify the claim that paying the upfront cost of constructing a data structure for a fixed set V is worth it.

questions/comments:
* In Theorem 3.1, the statement "... is an exponential variable with rate Z_theta." is only true if max_i is replaced with max_{x \in X}, where X contains all possible realizations from the model. This should be made clear.
* Does Theorem 3.4 take into account the fact that the algorithm uses Gumbel samples and not the Gumbel random variables themselves? If not, this should be stated as an assumption, since in practice the samples are drawn and fixed after the preprocessing step.
* Theorem 3.2 about runtime and its proof are quite straightforward. Please consider shortening it and expanding Theorem 2.1. Also, in Theorem 2.1, P_{h \in \mathcal{H}} is not clear.
* Line 492: "Estimates for theta will have the correct marginal but will not be independent". Why not independent?

minor:
* G in (10) should be explicitly defined. I assume it is Gumbel(0, 1).
* In Algorithm 3, consider writing Z^{-1}_\theta instead of "Estimate".

Review summary:
The paper proposes a potentially useful idea: viewing the evaluation of a normalizer as a MIPS problem. However, the verification of the idea is not sufficient.

===== Review #4 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a novel approach to efficient estimation of the partition function of log-linear models using the Gumbel trick. The proposed method performs preprocessing over the set of sufficient statistics in order to sample from the model or estimate its partition function when queried with a particular set of canonical parameters. Experimental results demonstrate the effectiveness of the proposed method for parameter estimation and model averaging in log-linear models.

Clarity - Justification:
The paper is for the most part very clearly written. For a machine learning audience, one may argue that the introductory material on the log-linear model is more than necessary, while the ideas and techniques for MIPS could be reviewed more extensively.
-- A part of the introduction that needs clarification is "This observation has been used before in the so called persistent chains...", where the relation to this paper is not clear from the text.
-- f(.) is used both for the query time and in the reduction to MCSS.

Significance - Justification:
The paper approaches a very important problem using (to my knowledge) a novel idea. Its significance could be improved by considering a more general class of log-linear models within exponential family graphical models (see detailed comments).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper does a nice job of introducing the problem and its solution. However, the experimental results could be improved.

A discussion that could boost the significance of this method, and is currently missing from the paper, is its applicability to log-linear models with sufficient statistics (\phi_a(x_a)) that have overlapping domains. The paper currently assumes that the subsets of variables for the features in the log-linear model are distinct. In the more interesting case, where the proposed method could be used for learning exponential family graphical models, the commonly used low-order Gumbel trick only gives an approximation, and it would be interesting to see the efficiency and quality of the results produced by a combination of approximate MIPS and the low-order Gumbel trick.

On a different note, how does one choose t and k? "t" seems to be relatively large compared to "k" in the experiments, and the paper does not comment on this choice. Is there any gain in efficiency if one uses exact MIPS for learning?

=====