We thank the three reviewers for their time and effort. We first address common issues, then reply to each reviewer.
0.1 A special EBM
The main purpose of our paper is to highlight the distinct properties of the generative ConvNet, stated in the four propositions. These properties are essentially unique among EBMs (Energy-Based Models). They result from the happy marriage between the piecewise linear structure of the ReLU ConvNet and the Gaussian white noise reference distribution in the exponential tilting formulation.
Without the Gaussian reference distribution, which contributes the |I|^2/2 term (whose derivative is I), we would not have the auto-encoding local modes of the form I=a(I;w). Without the Gaussian reference, the distribution is not even integrable.
Without piecewise linearity, we would not have a piecewise Gaussian distribution.
Without ReLU, we won’t have binary activation variables, because max(0,r)=1(r>0)*r.
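The three "Without ..." statements above can be traced in a short derivation. The sketch below assumes a unit-variance Gaussian reference (consistent with the |I|^2/2 term) and, in its second part, a single fully connected ReLU layer with hypothetical filters w_k and biases b_k. The symbols f, U, B_\delta and c_\delta are introduced here for illustration only, a(I;w) is taken to be the gradient of f (an assumption of this sketch), and the multi-layer convolutional case of the paper is not reproduced.

```latex
% Assumed form: exponential tilting of unit-variance Gaussian white noise q(I).
\begin{align*}
p(I; w) &= \frac{1}{Z(w)} \exp\{ f(I; w) \}\, q(I),
  \qquad q(I) \propto \exp\{-\tfrac{1}{2}|I|^2\}, \\
U(I; w) &= \tfrac{1}{2}|I|^2 - f(I; w),
  \qquad \frac{\partial U}{\partial I} = I - \frac{\partial f(I; w)}{\partial I}, \\
\text{local modes:}\quad I &= \frac{\partial f(I; w)}{\partial I}.
\end{align*}
% Single ReLU layer (illustration only), with
% \delta_k(I) = 1(\langle w_k, I \rangle + b_k > 0):
\begin{align*}
f(I; w) &= \sum_k \max(0, \langle w_k, I \rangle + b_k)
         = \sum_k \delta_k(I)\,(\langle w_k, I \rangle + b_k) \\
        &= \langle B_\delta, I \rangle + c_\delta,
  \qquad B_\delta = \sum_k \delta_k w_k, \quad c_\delta = \sum_k \delta_k b_k, \\
p(I; w) &\propto \exp\{-\tfrac{1}{2}|I - B_\delta|^2\}
  \quad \text{on the region where } \delta(I) = \delta.
\end{align*}
```

In this toy case each linear piece is a Gaussian centered at B_\delta, and the mode equation I = B_{\delta(I)} has the auto-encoding form: binary activations are computed from I, and I is reconstructed from them.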
The piecewise linearity is crucial for the exact equivalence between the CD gradient and the auto-encoder reconstruction gradient, because it makes the tedious curvature term in score matching disappear.
The auto-encoder we elucidated is a novel harmonious fusion of bottom-up convolution and top-down deconvolution.
0.2 Novelty
Novelty can be in generalization, but it can also be in specialization. Although the generative ConvNet is an EBM, it is an extremely special EBM as explained above.
Given the central importance of ConvNet, the distinct properties of generative ConvNet should be of broad interest.
Our paper shows that generative ConvNet can reconstruct the observed images and synthesize realistic new images. Both are novel and interesting.
Our work may open the door to unsupervised learning of ConvNets from big unlabeled data. We have recently implemented the learning of the auto-encoder in Proposition 4. It is as fast as learning a discriminative ConvNet.
0.3 Random initialization and mixing
After learning, we can fix the parameters and run Langevin dynamics initialized from white noise. We tried this on the ivy example and obtained similar synthesized images, albeit with a bit less realism than those in the paper. We shall try more examples. The Langevin dynamics should be able to traverse different modes with the help of annealing or tempering. We shall study this issue carefully.
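The procedure described above (fix the parameters, start from white noise, run Langevin dynamics) can be sketched on a toy model. This is a minimal illustration, not our implementation: it assumes a single fully connected ReLU layer with a unit-variance Gaussian reference, and the names `relu_score_grad` and `langevin_sample`, the random stand-in parameters, and the step size are all hypothetical. Annealing and tempering are not included.

```python
import numpy as np


def relu_score_grad(image, filters, biases):
    """Gradient of f(I) = sum_k max(0, <w_k, I> + b_k) with respect to I.

    Toy single-layer stand-in for the ConvNet scoring function.
    """
    responses = filters @ image + biases             # bottom-up filter responses
    active = (responses > 0).astype(image.dtype)     # binary activations 1(r > 0)
    return filters.T @ active                        # top-down reconstruction


def langevin_sample(filters, biases, n_steps=200, step_size=0.1, rng=None):
    """Run Langevin dynamics on U(I) = |I|^2 / 2 - f(I), starting from white noise."""
    rng = np.random.default_rng() if rng is None else rng
    image = rng.standard_normal(filters.shape[1])    # white-noise initialization
    for _ in range(n_steps):
        reconstruction = relu_score_grad(image, filters, biases)
        noise = rng.standard_normal(image.shape)
        # Drift is driven by the reconstruction error (image - reconstruction).
        image = (image
                 - 0.5 * step_size ** 2 * (image - reconstruction)
                 + step_size * noise)
    return image


# Hypothetical "learned" parameters: random values used only so the sketch runs.
rng = np.random.default_rng(0)
filters = 0.3 * rng.standard_normal((8, 16))
biases = 0.1 * rng.standard_normal(8)
sample = langevin_sample(filters, biases, rng=rng)
```

The drift term in each update is the reconstruction error of the toy auto-encoder, which is the property referred to in 2.1 below.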
Reviewer 1
1.1 Proposition 4
Contrastive divergence (CD) is very popular for training generative models. Proposition 4 establishes the exact equivalence between CD and the auto-encoder, which holds only because of the piecewise linearity of the ConvNet. This result can be useful for scaling up unsupervised learning.
We shall follow your advice and study how the dynamics traverses different regions.
1.2 Number of parameters
We reduced the number of filters to 32, 16, and 10 at layers 1, 2, and 3, respectively. The model can still generate reasonably realistic images. We shall study the issues of model complexity and generalizability carefully.
Reviewer 2
2.1 Novelty relative to EBM and Langevin dynamics
See 0.1 and 0.2 above. As to Langevin dynamics, the novel property is that it is driven by the reconstruction error of an auto-encoder.
2.2 Initializing from white noise
See 0.3 above.
2.3 Denoising, learning
We shall follow your advice in future work.
2.4 References
Thank you. We will cite them.
Reviewer 3
3.1 Novelty in relation to EBM
See 0.1 and 0.2 above.
3.2 Do these properties hold in general EBMs?
With all due respect, we disagree with your statement that these properties hold in general EBMs. As explained in 0.1 above, these properties are very special.
The beautiful paper of Swersky et al. (2011) studies the score matching estimator for latent EBMs. However, it requires that the free energy can be calculated analytically, i.e., that we can integrate out the latent variables analytically. This is in general not the case for deep EBMs, such as the deep Boltzmann machine (DBM) with two layers of hidden units. So there is no way to obtain an explicit auto-encoder for a deep latent EBM such as the DBM. Also, none of the models in Swersky et al. is piecewise Gaussian.
Our work provides a deep realization of the auto-encoder, and is probably the right way to extend Swersky et al. to the deep regime.
3.3 Conceptual novelty
For deep EBMs with multiple layers of hidden variables, such as the DBM mentioned above, inference is in general intractable and requires variational approximation with iterative computation. In contrast, the generative ConvNet has explicit bottom-up convolution for computing binary activations, and explicit top-down deconvolution for reconstructing the image. This is very uncommon for deep latent EBMs.
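The explicit bottom-up and top-down passes can be illustrated with a one-dimensional toy. The sketch below assumes a single layer with one filter, and the names `bottom_up`, `top_down`, and `score`, as well as the example signal, are hypothetical. It checks numerically that the top-down deconvolution of the binary activations equals the gradient of the ReLU scoring function, which is what makes the reconstruction explicit.

```python
import numpy as np


def bottom_up(signal, kernel, bias):
    """Bottom-up pass: filter responses and their binary activations 1(r > 0)."""
    responses = np.correlate(signal, kernel, mode="valid") + bias
    return responses, (responses > 0).astype(signal.dtype)


def top_down(activations, kernel):
    """Top-down pass: place a copy of the kernel at every active position."""
    return np.convolve(activations, kernel, mode="full")


def score(signal, kernel, bias):
    """Toy scoring function f(I) = sum_k max(0, r_k), with r_k the responses."""
    responses, _ = bottom_up(signal, kernel, bias)
    return np.maximum(responses, 0.0).sum()


# Hypothetical example values, chosen so no response sits near the ReLU kink.
signal = np.array([1.0, -2.0, 3.0, 0.5, -1.0, 2.0])
kernel = np.array([1.0, -1.0])
bias = 0.25

_, activations = bottom_up(signal, kernel, bias)
reconstruction = top_down(activations, kernel)

# Central finite differences of f, compared against the top-down reconstruction.
eps = 1e-6
numeric_grad = np.array([
    (score(signal + eps * e, kernel, bias)
     - score(signal - eps * e, kernel, bias)) / (2 * eps)
    for e in np.eye(signal.size)
])
gradient_matches = np.allclose(numeric_grad, reconstruction, atol=1e-5)
```

No iterative inference is involved: one correlation gives the activations and one convolution gives the reconstruction.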
3.4 Learning from bigger data sets
Thanks. We will do it in future work.
3.5 Sample from fully trained model
See 0.3 above.
3.6 Fig. 3
The top image is the observed image; the bottom one is the reconstruction. It shows that the ConvNet can auto-encode and reconstruct. This result is novel.