Paper ID: 1279
Title: Provable Algorithms for Inference in Topic Models

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper considers the problem of inference in topic models. Specifically, assume there are k topics, where each topic i is a probability distribution A_i over D words; a document y is produced by selecting an r-sparse mixture x^* over the topics and then making n random draws from the mixture distribution Ax^* over words. The goal is, given the document y, to recover an approximation \hat{x} to x^*. The paper considers this problem in the setting where the A_i are known (as in compressed sensing). It provides an algorithm with performance guarantees as given in Theorem 4.1, and then presents experiments on several data sets. The paper also presents lower bounds, showing that even for just the problem of determining the support of x^* (when components of x^* are either 0 or close to 1/r), the bounds of Theorem 4.1 are tight up to logarithmic factors.

Clarity - Justification:
The paper is pretty clear overall, though some things could be made clearer (see detailed comments).

Significance - Justification:
This paper (a) considers an interesting problem, (b) provides a nice algorithm for this problem for which it (c) gives novel performance guarantees in terms of natural properties of the input, i.e., \lambda_\delta(A), (d) gives lower bounds showing that the upper bounds are reasonably tight, and (e) presents interesting and fairly large-scale experiments on real data. Overall it seems like an excellent paper for ICML.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall I think the paper is very nice and would be an excellent addition to the conference, as mentioned above. One somewhat peculiar thing is that inference is usually an easier problem than learning, so I was wondering if the paper could say a bit more about the range of parameters under which more typical or standard methods would succeed or fail. For instance:

- It seems to me that if k << n, then there should be a simpler algorithm. For example, I imagine you could take the D words, randomly group them into d buckets for k < d << n, and then, by viewing each bucket as a "word", you get an instance where the empirical distribution of "words" in the document is a reasonably good approximation of the corresponding grouping of Ax^*. Does this sound right, and if so, is the key issue that you are interested in the setting where n is much larger than r but n is not much larger than k? It would be good to have, up front, a discussion of how one should think of the parameters of interest.

- A naive algorithm would be to compute the (log) likelihood score separately for each topic i (i.e., imagining the document was generated by just one topic, chosen uniformly at random from the k topics) and then to select the top r most likely topics in this ordering; a minimal sketch of what I have in mind appears after this list. Can you give a simple example satisfying the conditions of Theorem 4.1 where this would perform poorly?

- On page 1, "y" is introduced before it is defined.
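To be concrete, here is a minimal sketch of the baseline I have in mind. This is my own illustration, not anything from the paper: A is the D x k matrix of topic distributions as above, y is the document's word-count vector, and the function name is mine.

```python
import numpy as np

def naive_topic_selection(A, y, r):
    """Score each topic by the log-likelihood of the document under that
    single topic, then keep the r highest-scoring topics.

    A : (D, k) matrix whose columns are the topic distributions over words
    y : (D,) vector of word counts for the document
    r : number of topics to return
    """
    eps = 1e-12                          # guard against log(0) for zero-probability words
    scores = y @ np.log(A + eps)         # (k,) per-topic log-likelihoods (up to a constant)
    return np.argsort(scores)[-r:][::-1] # indices of the r highest-scoring topics
```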
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper considers inference in topic models. It is assumed that there is:
1) a dictionary of size D,
2) k topics, each of which is represented by a distribution on the words,
3) an "ideal" document x^* that is a distribution on the topics.
To generate an actual document of length n from x^*, one takes n i.i.d. samples. Each sample is obtained by sampling a topic t from x^* and then sampling a word from t. The goal is to (approximately) recover the ideal document given the actual one. The authors show that a vector x within L_1 distance at most eps of x^* can be found w.h.p. if n > gamma * r^2 * log(k) * (1/eps)^2. Here, r is the sparsity of x^*, k is the number of topics, and gamma is a certain parameter that depends on the topics, which is empirically shown to be a small constant on actual corpora.

Clarity - Justification:
The paper is well written.

Significance - Justification:
The paper takes an interesting and natural problem and provides a clean answer.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Topic models are very popular models for modeling text. In particular, inference in these models is of interest.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper is concerned with the inference problem in topic models. Here we assume there is a fixed, known "topic" matrix A in R^{D x k}, in which each column ("topic") is a probability distribution on the D words. One thinks of D >> k; e.g., k = 100..500, D = 10000..50000. Now we assume that an n-word "document" is generated randomly using an unknown probability distribution x supported on some very small number r of topics (say, r = 3..5). One also thinks of n as quite small; e.g., 200..1000. In other words, x in R^k is an r-sparse probability vector; we form z = Ax, and finally we produce n samples from [D] according to z. Call the resulting vector of words y. The task is to estimate x (in ell_1 norm), given y.

The authors' idea is the following (a minimal sketch of my reading of it appears after this summary). Try to find an (approximate) left-inverse B for A, all of whose entries are small in magnitude. Then consider the estimate x* = By (which is at least an approximately unbiased estimator), and set to 0 all coordinates of x* smaller than a certain threshold. The authors show the following: suppose A has a small "l_infty/l_1 condition number". Then there exists a good left-inverse B (which can be found in polynomial time by linear programming) such that, w.h.p., the recovery algorithm produces an eps-accurate estimate, ||x* - x||_1 < eps, provided n is roughly 1/eps^2 (ignoring small factors). The authors then show that for a few A's arising in practice (learned topic models for 3 corpora: NYT, Enron, NIPS), the condition number is pretty small.

The authors' experiments on synthetic data generated from the 3 models show the following:
. Finding B from A can be somewhat slow (on the order of hours). OTOH, it only has to be done once.
. Given B, estimating x is extremely fast (it's basically one matrix/vector multiply); maybe a couple of orders of magnitude faster than "Gibbs sampling".
. OTOH, Gibbs sampling gives error that's about 5 times smaller.
. The algorithm doesn't work well on non-synthetic data.
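For reference, here is one plausible form of the per-topic linear program for finding a small-entry approximate left-inverse B, row by row. This is only my reading of "found in polynomial time by linear programming"; the paper's actual constraints (and its l_infty/l_1 condition number) may well differ, and the function name, the tolerance delta, and the scipy-based setup are all my own.

```python
import numpy as np
from scipy.optimize import linprog

def left_inverse_row(A, i, delta=0.1):
    """LP for one row b of B: minimize max_j |b_j| subject to b @ A being
    within delta of e_i (the i-th standard basis vector in R^k).

    A     : (D, k) topic matrix (columns are word distributions)
    i     : index of the topic / row of B being solved for
    delta : assumed tolerance on how far b @ A may deviate from e_i
    """
    D, k = A.shape
    # Decision variables: b in R^D (free sign), followed by t >= max_j |b_j|.
    c = np.zeros(D + 1)
    c[-1] = 1.0                                    # objective: minimize t

    rows, rhs = [], []
    neg_t = -np.ones((D, 1))
    rows.append(np.hstack([np.eye(D), neg_t]));  rhs.append(np.zeros(D))   #  b_j - t <= 0
    rows.append(np.hstack([-np.eye(D), neg_t])); rhs.append(np.zeros(D))   # -b_j - t <= 0
    # (b @ A)_i >= 1 - delta, written as -(A[:, i] . b) <= -(1 - delta)
    rows.append(np.hstack([-A[:, i], [0.0]])[None, :]); rhs.append([-(1.0 - delta)])
    # |(b @ A)_l| <= delta for every other topic l
    others = np.delete(A, i, axis=1).T             # (k-1, D)
    pad = np.zeros((k - 1, 1))
    rows.append(np.hstack([others, pad]));  rhs.append(delta * np.ones(k - 1))
    rows.append(np.hstack([-others, pad])); rhs.append(delta * np.ones(k - 1))

    res = linprog(c, A_ub=np.vstack(rows), b_ub=np.concatenate(rhs),
                  bounds=[(None, None)] * D + [(0.0, None)],
                  method="highs")
    if not res.success:                            # infeasible means delta is too small for this A
        raise ValueError("LP infeasible; try a larger delta")
    return res.x[:D]
```

Stacking the k rows would give B; presumably this one-time preprocessing is where the hours-long running time reported in the experiments goes.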
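And here is a minimal sketch of the per-document recovery step as I understand it, again my own illustration: the normalization of the word counts, the threshold name tau, and the function name are my assumptions, not the paper's notation.

```python
import numpy as np

def recover_topic_proportions(B, word_counts, tau):
    """Thresholded pseudo-inverse estimate of the topic proportions.

    B           : (k, D) approximate left-inverse of A with small entries
                  (precomputed once per topic matrix)
    word_counts : (D,) word counts of the document
    tau         : threshold below which coordinates are zeroed out
    """
    y = word_counts / word_counts.sum()   # empirical word distribution of the document
    x_hat = B @ y                         # essentially one matrix/vector multiply
    x_hat[x_hat < tau] = 0.0              # keep only coordinates above the threshold
    return x_hat
```

This matches the reported behavior that, once B is in hand, per-document inference is orders of magnitude faster than Gibbs sampling.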
Clarity - Justification:
The paper is pretty well-written.

Significance - Justification:
I thought the findings were somewhat significant, if preliminary. The algorithm does have a provable guarantee, under an assumption about A, which apparently is reasonable in at least some real-world cases. I would not say either of the algorithms is at all innovative, but perhaps one is not necessarily looking for innovative algorithms here, just good results. And the results seem... okay. The slow linear-programming time is a bit disturbing, as you have to do it separately for each model, but then I guess the results are computed lightning-fast. One downside is that there is nothing you can do if n (the length of the documents) is not large enough. The authors don't let you trade better results for a slower running time; if their algorithm isn't working well on the document lengths one has, there is no alternative.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
.

=====