Paper ID: 806
Title: Pixel Recurrent Neural Networks
Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This paper presents a family of generative models that directly model the statistical dependencies between pixels. These models include two PixelRNNs (the Row LSTM and the Diagonal BiLSTM), which differ mainly in the field of conditioning information they use to make their predictions; the PixelCNN; and the Multi-scale PixelRNN. Some effort is made to render these methods more computationally efficient (that is to say, more parallel) than a naive implementation would be.
Clarity - Justification: There is a lot going on in this paper. Four models with many details that are sometimes shared (such as the discrete output) and sometimes not (such as residual connections). I'm inclined to think that too much is going on. I like the way the authors break down the various novel aspects of their approaches into modules, such as Sec. 2.2 "Pixels as Discrete Variables" and Sec. 3.4 "Masked Convolution". But some of these sections are just too condensed to give a good idea of how the model works. For example, the PixelCNN description is very sparse (one paragraph) and, while I think I understand what is going on, I think it would be unrealistic to imagine someone could reproduce the authors' results based on this description. Both the PixelCNN and the Multi-scale PixelRNN would hugely benefit from an illustration.
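Since the review singles out Sec. 3.4 "Masked Convolution" as condensed, a minimal sketch of the idea may help: weights of a convolution kernel that would see pixels at or after the current position in raster-scan order are zeroed out. This is an illustrative NumPy reconstruction, not the authors' code; the helper name `causal_mask` and the `include_center` flag are hypothetical.

```python
import numpy as np

def causal_mask(k, include_center=False):
    """Spatial mask for a k x k conv kernel: 1 where the kernel may look
    (pixels strictly before the current one in raster-scan order), 0 elsewhere.
    include_center=True additionally allows the current position itself."""
    mask = np.zeros((k, k), dtype=np.float32)
    c = k // 2
    mask[:c, :] = 1.0   # all rows above the center row
    mask[c, :c] = 1.0   # pixels to the left within the center row
    if include_center:
        mask[c, c] = 1.0
    return mask

# Multiplying a kernel elementwise by this mask before each forward pass
# keeps the prediction for a pixel from depending on "future" pixels.
```

For a 3x3 kernel the strict mask admits 4 of the 9 positions; the variant that includes the center admits 5.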
Significance - Justification: Overall, this is an impressive paper that is worth publishing, but it does have some pretty frustrating flaws. In addition to the clarity issues outlined above, the experiments are incomplete in the sense that most models are only explored on a partial set of datasets. The choice of which models are explored on which datasets also appears somewhat arbitrary. It just seems like the authors took on a bit too much and could not deliver a complete and coherent scientific examination of the subject in the time and space constraints provided by the ICML format. The model achieves SOTA on binarized MNIST, an impressive result, but the authors should also report the performance of the PixelCNN and the Row LSTM. CIFAR-10 performance is good and, this time, the model comparisons are there. For ImageNet, the authors seem to only report results for the Row LSTM. How do the PixelCNN and Diagonal BiLSTM perform on this task? This set of missing results is strange. Regarding novelty, the models presented here have close relatives in the literature, which are discussed by the authors, so the novelty is not extraordinarily high. But the authors' efforts toward rendering the models more parallel, and therefore more efficient for GPU execution, are interesting and, I think, do constitute a significant contribution.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): I've included fairly detailed comments above. Random comments: - I really appreciate Figure 5 showing the distributions over the discrete colour values. This is really an interesting result and well worth disseminating. I wish there
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The paper introduces a new architecture for modeling natural images. It is a network that models the conditional probabilities of a pixel given the preceding pixels in some predefined order (left-to-right and top-to-bottom was used in the paper). The parameters are shared across the different probability distributions by employing recurrent neural networks that summarize the context seen so far to predict the next pixel to be modeled. In addition, the allowed context is restricted somewhat, and parts of the computation are shared through the use of convolutions in the RNN state updates, thus making the models more parallelizable. The approach is validated in experiments on the MNIST, CIFAR-10, and resized ImageNet datasets. The models achieve state-of-the-art log-likelihood scores on MNIST and CIFAR-10, and compelling generative performance on the resized ImageNet dataset (there are no previous published log-likelihood scores on this dataset). The samples from these generative models look crisp and diverse, and obviously capture some statistics of natural images.
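The raster-scan factorization summarized above, p(x) = prod_i p(x_i | x_1, ..., x_{i-1}), implies a sequential sampling loop: each pixel is drawn from a 256-way categorical conditioned on everything generated so far. A minimal sketch follows; `predict_probs` is a purely hypothetical stand-in for the trained network, shown here as a uniform distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_image(predict_probs, h, w, levels=256):
    """Generate an h x w image one pixel at a time in raster-scan order,
    drawing each pixel from the conditional distribution returned by
    predict_probs(img, i, j) over the discrete intensity levels."""
    img = np.zeros((h, w), dtype=np.int64)
    for i in range(h):
        for j in range(w):
            probs = predict_probs(img, i, j)  # shape (levels,), sums to 1
            img[i, j] = rng.choice(levels, p=probs)
    return img

# Stand-in predictor: uniform over 256 values. A real PixelRNN/PixelCNN
# would compute this from the already-generated context.
uniform = lambda img, i, j: np.full(256, 1.0 / 256)
```

This loop is inherently sequential at generation time; the parallelism the paper targets applies to training and likelihood evaluation, where all conditionals can be computed at once from a complete image.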
Clarity - Justification: The exposition is very clear. Most details about the experiments are provided, but not all of them (e.g. the procedure for choosing learning rates is vague). Source code is not provided.
Significance - Justification: The paper establishes a new state of the art in generation of low-resolution images, while building on previous work on using RNNs in a scalable way to model images. The suggested architectural choices add to our understanding of how to train very deep, but tractable generative models.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The reviewer suggests accepting the paper based on very strong experimental results, incremental improvements in model architecture (applied for the first time to modeling 2D images), and the establishment of a new state of the art in generative models of images.
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This paper proposes a deep recurrent architecture to model natural images. It introduces a series of variants, including diagonal bidirectional recurrences, residual connections and discrete output encodings, resulting in state-of-the-art likelihood results on several image datasets.
Clarity - Justification: The paper is well written, with the appropriate level of rigor, and the figures and diagrams make the model easy to understand.
Significance - Justification: The paper introduces a number of architectural variants. While taken individually they are relatively minor, the combined model is a significant innovation over current generative models, offering certain advantages (but also drawbacks) over alternative models.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Some minor comments: - Comparison with other models: The results from Table 4 and Table 5 are very compelling. However, I am wondering what the best way is to compare a model that only produces discrete values with baselines that by construction have densities over the real numbers. Are the methods from both Tables 4 and 5 "discretized"? My guess is that one should quantize the conditional distributions of continuous models to the set {0, ..., 255} before comparing them with the proposed solution. - Perhaps it would be good to discuss the shortcomings of the proposed model. For example, it seems that the model is not "flip" invariant, in the sense that p(x) may be very different from p(flip(x)). Is that a problem? If so, how could one improve the model to become invariant to that sort of transformation? The multiscale architecture is a possible step to mitigate that, but perhaps it is not enough. - A related question: I can see the advantages of having a sequentially conditional model, because it makes inference easy and fast. Suppose I say that I don't care about having expensive inference. How could one expand the model to use dependencies along all directions?
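The quantization the reviewer proposes is usually done by integrating the continuous density over unit-width bins centered on the integer intensity values, with the edge bins absorbing the tails. A minimal sketch assuming a Gaussian conditional; the function names are illustrative, not from the paper or any baseline.

```python
import math

def norm_cdf(x, mu, sigma):
    """CDF of a Gaussian with mean mu and std sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def discrete_log_prob(v, mu, sigma):
    """Log of the probability mass the Gaussian assigns to the integer
    value v, i.e. the bin [v - 0.5, v + 0.5); the bins at v = 0 and
    v = 255 absorb the left and right tails respectively."""
    lo = -math.inf if v == 0 else v - 0.5
    hi = math.inf if v == 255 else v + 0.5
    p = norm_cdf(hi, mu, sigma) - norm_cdf(lo, mu, sigma)
    return math.log(max(p, 1e-12))  # clamp to avoid log(0)
```

By construction the 256 bin masses sum to one, so the resulting discrete log-likelihoods are directly comparable, in nats or bits per dimension, with a model that outputs a 256-way softmax.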
=====