Paper ID: 1188
Title: A Theory of Generative ConvNet

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The main contributions are:
- The paper proposes a model that can be described as a convolutional energy-based model (EBM) trained by MCMC using Langevin dynamics.
- Experiments show that, at least for small datasets, the model is able to generate new samples whose statistics are similar to those of the training data.

Clarity - Justification:
I found the paper hard to follow, mainly due to the dense notation. The model itself is just a convolutional EBM and the training procedure is MCMC with Langevin dynamics. However, a lot of space is spent first formalizing a class-conditional EBM (then setting the number of classes to 1), defining a simplified prototype model, and then formalizing multi-layer convolution operations as an extension of the prototype. Most of this seems somewhat superfluous. The experimental details seem sufficient to reproduce the results. However, adding some clarifications (listed in the detailed comments) would help readers better understand the results.

Significance - Justification:
The proposed model is an instantiation of energy-based models in which the network that computes the energy function is convolutional. Properties of EBMs have been studied extensively before (LeCun et al., 2006; Ngiam et al., 2011). The training procedure is fairly standard and does not involve any specialization for convolutional nets. As such, the model is not particularly novel.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The main strength of this paper is that it gives a very comprehensive formalization of convolutional EBMs. Several properties of the model are highlighted. For example:
- The density function defined by the model is piecewise Gaussian.
- One-step CD learning and learning an autoencoder are related.
- Local energy optima are autoencoding, etc.
However, these properties hold in general for all EBMs and are not particularly novel or distinctive for this particular model. For example, the fact that local energy optima are autoencoding has been shown for any EBM (Swersky et al., 2011).

The paper highlights the "conceptual novelty" of this work, in particular its focus on revealing the "curious representational structure contained in the model" which is "unexpected for exponential family models" (lines 148-155). If the representational structure being referred to here is that of having multiple layers of activations, it seems that all deep EBMs would have this. It is not clear why it is unexpected or unique to the proposed model.

The generated samples in Figures 1 and 2 look interesting. Even though the model is trained on only one image, it is able to generate different but qualitatively similar images. However, it would be far more convincing if the model were trained on a large dataset of textures and were still able to produce samples like this. As it stands, it is possible that the model memorizes low-level statistics, since the training set consists of a single image. The same might be true for Figure 3, where the dataset size is 10 images.

The experimental section could be improved by adding some clarifications:
- It seems that the generated image samples were produced along with training (they are the negative particles). Negative particles get to mix a lot in the initial training stages. Is that necessary for generating samples, or can the fully trained model mix between the different modes using Langevin dynamics (see the sampling sketch after this review)? What do those samples look like?
- In Figure 3, for each generated image, is the closest training image shown, or are the top and bottom pairs for each category unrelated?
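To make the sampling question above concrete, here is a minimal sketch of what drawing samples from a fully trained model with Langevin dynamics could look like. It assumes a generic PyTorch scoring network f on top of a Gaussian reference; the energy form, step size, and chain length are illustrative and are not taken from the paper.

    import torch

    def langevin_sample(f, x, n_steps=100, step_size=0.01, sigma=1.0):
        # Langevin dynamics on the energy E(x) = ||x||^2 / (2 sigma^2) - f(x),
        # i.e. a ConvNet score f on top of a Gaussian reference, as described in the reviews.
        for _ in range(n_steps):
            x = x.detach().requires_grad_(True)
            energy = (x ** 2).sum() / (2 * sigma ** 2) - f(x).sum()
            (grad,) = torch.autograd.grad(energy, x)
            # gradient step on the energy plus Gaussian noise
            x = x - 0.5 * step_size ** 2 * grad + step_size * torch.randn_like(x)
        return x.detach()

    # e.g. starting a chain from white noise, as Reviews #1 and #2 ask about
    # (trained_score_net is a placeholder for the learned ConvNet):
    # x0 = torch.randn(1, 3, 64, 64)
    # samples = langevin_sample(trained_score_net, x0)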
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes an energy-based model whose energy is defined by a convolutional neural network. While the ideas build on previous work, the experimental results are interesting and could inspire other convolutional generative models.

Clarity - Justification:
In general, the paper is easy to read and easy to follow. The notation is a bit on the heavy side, but that is necessary for convolutional models with many indices, filters, and layers.

Significance - Justification:
The idea of using an energy-based model where the energy or score is defined by a feedforward neural network is not new. Using a ConvNet is also not new, as noted by the authors. The use of Langevin dynamics is a minor novelty; however, past algorithms such as "contrastive backprop" have used Hamiltonian Monte Carlo, of which Langevin dynamics is a special case.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper proposes to learn a deep energy-based model where the score or energy assigned to a particular image is defined by a convolutional neural network. As noted by the authors, much prior work has attempted this, notably Ref1. In past work, HMC has been used in conjunction with backpropagation to estimate the gradient of the partition function for maximum likelihood learning. The main novelty here is that this paper uses Langevin dynamics, which is a special case of HMC, together with a convolutional network instead of a standard fully connected one.

One question is what happens when the model is initialized at a random image (e.g., white noise)? The interesting property of Hopfield nets is that one reaches an attractor state from a random initialization. This model, by construction (real-valued and convolutional), seems likely to contain many local minima.

The experiments show pretty interesting results. However, it is still unclear how good a generative model has been learned. Did the Langevin updates for the negative phase mix well? Can the model generate images from random initializations? A common way to measure generative-model performance is to look at PSNR after denoising images to which random noise has been added. To make the model better, I would put more emphasis on learning: how to mix better, and analyzing what sort of representation is useful for generative modeling of images.

Some related papers are:
Ref1: Unsupervised Discovery of Non-Linear Structure using Contrastive Backpropagation, by Hinton et al.
Ref2: Convolutional Deep Belief Networks, by Honglak Lee et al.
Ref3: Convolutional RBMs for feature learning, by Norouzi et al.
Ref4: Deep Learning with Hierarchical Convolutional Factor Analysis
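For reference, the relation Review #2 invokes (Langevin dynamics as a special case of HMC) can be written out explicitly; here $E(x)$ denotes the ConvNet-defined energy and $\epsilon$ an illustrative step size, not the paper's notation. One Langevin update is

$$ x_{t+1} = x_t - \frac{\epsilon^2}{2}\,\nabla_x E(x_t) + \epsilon\, z_t, \qquad z_t \sim \mathcal{N}(0, I), $$

which coincides with HMC run for a single leapfrog step: resample the momentum $p \sim \mathcal{N}(0, I)$, take one leapfrog update of $(x, p)$ with step size $\epsilon$, and discard the momentum (the Metropolis correction is typically dropped for small $\epsilon$).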
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper extends recent work that considers a generative model based on the exponential family with CNN sufficient statistics. It is shown that when using ReLU nonlinearities the model can be seen as piecewise Gaussian. The model is trained by approximate maximum likelihood using Langevin dynamics, and experiments are presented on texture and non-ergodic image generation examples.

Clarity - Justification:
The paper overall reads quite well, with well-organized sections. However, my impression is that the notation could be somewhat lightened, in particular in Section 4. The algorithmic and numerical details are good enough.

Significance - Justification:
The proposed model is quite powerful and enjoys a number of nice theoretical properties. The novelty is moderate. Indeed, Gibbs models with sufficient statistics given by deep neural networks were proposed early on [Ngiam et al.] and [Dai et al.] (properly cited by the authors). In that respect, the main novelty of this paper is the training algorithm for that specific model (since contrastive divergence with Langevin dynamics is not a new algorithm either) and the analysis of the model as piecewise Gaussian when one considers half-rectifications. The numerical experiments are interesting (cf. below), but the question remains whether these models can be efficiently trained on a large-scale dataset.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Here are some specific comments and questions:
- Proposition 4 studies the "local" behavior of the Langevin dynamics, showing that when the model stays in the same linear piece its log-likelihood is simply an l2 reconstruction error (see the worked-out local form after this review). But, given that there are exponentially many linear pieces, how meaningful is this local result? In other words, the essential part of the model seems to be understanding how the dynamics traverse the different linear regions.
- Numerical experiments. There is a mysterious aspect, namely that the number of parameters of the model (over 100K) seems to be far larger than the number of pixels on which it is trained (a single image, if I understood correctly). I wonder how well these ML estimates generalize. In other words, if one trains the model on one realization x_1 of an underlying process X, how well does the model fit a second (independent) realization x_2?

=====
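For concreteness, the local form referred to in Review #3's first comment can be sketched under the reviews' description of the model (a ReLU ConvNet score $f_w(x)$ on top of a Gaussian reference); the notation below is illustrative rather than the paper's. Within a region of image space where the ReLU activation pattern is fixed, $f_w$ is linear, say $f_w(x) = a^\top x + b$, so

$$ \log p(x) = f_w(x) - \frac{\|x\|^2}{2\sigma^2} - \log Z = -\frac{1}{2\sigma^2}\,\bigl\| x - \sigma^2 a \bigr\|^2 + \text{const}, $$

i.e. within each linear piece the density is Gaussian and the log-likelihood is, up to constants, a negative l2 reconstruction error with respect to the "reconstruction" $\sigma^2 a$ determined by the active filters. The open question raised by the reviewer is how the dynamics move between the exponentially many such pieces.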