Paper ID: 66
Title: The Information Sieve

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents an interesting approach to hierarchically extracting latent structure from data without supervision. Unsupervised learning is an important problem that is far from being solved, and the approach presented brings in fresh, simple ideas. I think the Information Sieve is an important contribution. Unfortunately, in its current form the paper does not live up to its potential, largely due to incomplete presentation.

Clarity - Justification:
The paper is well written overall. However, the algorithm is presented poorly; see the detailed comments.

Significance - Justification:
See above.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Comments:

- The algorithm is not properly presented. Sec 3.1 partially describes how to optimise TC, and the reader is referred to App A for details. App A provides a bit more information and then refers the reader to previously published work. Finally, the description in the cited paper "Maximally Informative …" is also hard to parse (for the same reasons that App C is hard to parse). The procedure for constructing the remainder information in App C is not well presented: in particular, the description of the procedure is intertwined with explanations of why various steps work and how they can be modified. It is frustrating that the main algorithm is obscured, especially in a submission to what is primarily a Computer Science conference.

- What is the runtime of the algorithm? How does it scale, both empirically and in big-O terms? It is mentioned in Sec 4 that the method is linear in the number of variables, but this needs to be backed up and also related to the cardinality of the variables, the size of the dataset, and other parameters.

- Estimating mutual information, entropy and independence of random variables is delicate. How much data is required to construct reasonable estimates of the entropies and mutual informations required by the algorithm? (See the sketch after these comments.)

- There are important tradeoffs buried in how the optimisation procedure handles the cardinality of the constructed variables in "Controlling the cardinality of \bar{x}_i", App C. These tradeoffs and choices need to be better explained. Further, to circumvent a nonconvex optimisation the authors refer to a "route we used […] was to first construct a perfect solution using the procedure above". I don't know which procedure they are referencing.

- Sec 4, on Discrete ICA, is an intriguing idea. However, the execution falls short: the section sketches an intuitive relationship with ICA (ICA aims for independent variables; IS aims for variables that increase independence) and presents a single hand-constructed example.
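To make the estimation concern concrete, here is a minimal sketch (my own illustration in Python/numpy, not the authors' code) of how the naive plug-in estimate behaves: the estimated mutual information between two variables that are independent by construction is well above zero at small sample sizes and only approaches zero as the number of samples grows. The function name and the alphabet/sample sizes below are mine, chosen purely for illustration.

    import numpy as np

    def plugin_mutual_information(x, y):
        # Naive plug-in estimate of I[X;Y] in nats from paired integer samples.
        joint = np.zeros((x.max() + 1, y.max() + 1))
        for xi, yi in zip(x, y):
            joint[xi, yi] += 1
        joint /= joint.sum()
        px = joint.sum(axis=1, keepdims=True)   # marginal of X, shape (K, 1)
        py = joint.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, M)
        nz = joint > 0                          # avoid log(0) on empty cells
        return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

    rng = np.random.default_rng(0)
    for n in [50, 500, 5000, 50000]:
        x = rng.integers(0, 10, size=n)   # X uniform on 10 symbols
        y = rng.integers(0, 10, size=n)   # Y independent of X, so true I[X;Y] = 0
        print(n, round(plugin_mutual_information(x, y), 4))

If the sieve's TC estimates are built from plug-in quantities of this kind (as I assume they are), the paper should say how large the dataset needs to be, relative to the cardinalities involved, for the layer-wise decisions to be trustworthy.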
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a framework for iteratively identifying factors of multivariate mutual information and then removing those factors from a dataset. The dataset is then coded using the identified factors and a residual (or "remainder information"). If successful, this will learn a lossless representation of the data in terms of independent variables. Conceptually, the algorithm is related to ICA. This framework is quite nice, and I believe novel. Experiments were on toy problems only.

Clarity - Justification:
The paper was clearly written. The idea was well motivated in terms of earlier work, and clearly explained. I appreciated the simple example in Figure 2. (I think Figure 1 could be improved, though, and the notion of a layer seems somehow artificial ... it's more like an iterative process.)

Significance - Justification:
The paper presents a novel information-theoretic objective for unsupervised learning.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
My biggest complaint is that the technique is sold initially as being deep and hierarchical ... and it's really not either of these things. If you squint you can maybe spin it as hierarchical, in that units are added one at a time to the representation, and the value of the later units can depend on the value of the earlier units. Even this is a stretch, though.

I believe that if there are residual dependencies between the Y variables (e.g. because you convert a stochastic p(y|x) into a deterministic map, or because optimization wasn't perfect), then you are stuck with those dependencies forever. That is, I believe iterating the process removes dependencies in the remainder information, but not in the already specified Y variables. Is this correct?

Eq. 2 -- notation nit -- I found it confusing that TC(X;Y) and TC(X|Y) mean very different things. I would give the analogue of mutual information a different label (e.g. maybe I_{TC}); see the note at the end of these comments.

Line 229 -- Do bounds on TC(X) provide trivial bounds on H(X)? This isn't obvious to me.
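For concreteness, here is my reading of the three quantities in play (my own restatement in LaTeX, not a quote from the paper). The first two are sums of entropies, while the third is their difference and is the one that behaves like a mutual information:

    TC(X) = \sum_i H(X_i) - H(X)
    TC(X \mid Y) = \sum_i H(X_i \mid Y) - H(X \mid Y)
    TC(X ; Y) = TC(X) - TC(X \mid Y)

Writing the last quantity as I_{TC}(X;Y), or similar, would make it immediately clear that it is a difference of total correlations rather than a conditioned one.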
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors introduce a new framework for extracting hidden factors from multivariate discrete variables using an information-theoretic criterion that is related to infomax and redundancy reduction. The idea is to extract latent variables such that the remaining variables become less and less dependent. Each new latent factor has to explain the remaining dependencies in the data until the remainder eventually becomes independent. The authors show that their approach can be seen as a discrete version of ICA and demonstrate their method on lossy/lossless compression of MNIST digits as well as on a toy example.

Clarity - Justification:
The paper is clearly written, the math is well explained, previous work is sufficiently cited, and the experiments are reasonable.

Significance - Justification:
I think it is a very cool idea for unsupervised information-theoretic extraction of latent features. I personally have never seen anything like this and I think it would make a nice contribution to the conference.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Once I wrapped my head around what was going on, I really liked the paper. I understand it as a kind of information-theoretically motivated clustering, because the resulting features Y basically give rise to a mixture distribution on the X (although it is not exactly that, because the cardinality of X changes). If the authors see any links to mixture modelling, these might be worth discussing, since they could help readers understand the method more easily. In general, links to clustering might be good to discuss. I also like the fact that there is an iterative scheme that directly optimizes the total correlation without any further assumptions. The upper and lower bounds in (3) as well as the resulting decompositions in Corollaries 2.3 and 2.4 are interesting and add to the interpretation of the algorithm's results.

* Minor Comments *
* Notational inconsistency: in one place you use I[X:Y], in another I[X; Y].
* There is a q missing in the equation in line 1221.
* For the iterative scheme, an iteration index might be useful.

=====