Paper ID: 732
Title: Autoencoding beyond pixels using a learned similarity metric

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):

Arguing that variational autoencoders (VAEs) generate poor-quality samples because they rely on a similarity metric that measures distances between individual pixels, the authors propose training VAEs in higher-level feature spaces learned by the discriminators of generative adversarial networks (GANs). They achieve this by jointly training a VAE and a GAN, with the reconstruction error of the VAE computed in the feature space of one of the hidden layers of the GAN discriminator. To ensure that the two models operate on the same latent space, the VAE decoder and the GAN generator share parameters. The resulting models indeed generate much sharper samples than VAEs, though they exhibit artifacts similar to those of GAN samples.

Clarity - Justification:

The paper is clearly written and fairly easy to follow. The experiments are described in reasonable detail and should be straightforward to reproduce once the authors release their implementation, as promised in the paper.

Significance - Justification:

The combination of a VAE and a GAN is novel and interesting. As with GANs, it is unclear how to evaluate such models properly, though the authors do a reasonable job in the paper. Still, the paper does not quite convince me that this interesting development is a substantial advance.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

I liked how the authors cleanly combined two very different approaches to generative modeling. The resulting model does indeed produce sharper images than VAEs, though it is not clear to me why sharp images with odd artifacts are preferable to blurry images without such artifacts. It is hard to conclude much from the visual attribute vector results, even if they are certainly interesting visually.

The motivation given in the paper for avoiding pixel-wise objectives when training generative models is plausible, but it is somewhat undermined by the recent results of van den Oord et al. with Pixel Recurrent Neural Networks. I therefore encourage the authors to think of additional ways of motivating feature-wise objectives.

It might be worth discussing which evaluation metrics are available for VAE/GAN compared to plain VAEs and GANs. It would also be good to state what is gained and what is lost relative to these architectures. The experimental section could be made stronger by performing a hyperparameter search for each method.
=====

Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper proposes to combine two recently introduced generative models, the GAN and the VAE, sharing the GAN generator and the VAE decoder. The VAE objective is defined not on pixels but on a learned similarity metric, e.g., the l-th layer representation of the GAN discriminator, which allows for more realistic image generation. The paper provides useful tips for training the proposed network, followed by a comprehensive empirical comparison with competing methods via visual inspection.

Clarity - Justification:

The paper is mostly clearly written, but the motivation for using the discriminator's representations as a target could be made clearer.

Significance - Justification:

The experimental validation is mostly based on visual inspection, and there is a lack of control experiments.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

- I like the idea of training the VAE to target a learned similarity metric instead of pixel values. Since the discriminator's representation is used as the learned similarity metric, the paper should explore the properties of the discriminator, e.g., what kind of similarity metric it learns, to justify why the discriminator's representation is a reasonable target for the VAE.
- In addition, at the very beginning of training the discriminator's job is fairly easy (classifying real images against near-random images from a randomly initialized generator), so it is unclear how much useful information the discriminator's representation contains at that point to serve as a target for the VAE.
- Since the experimental validation is mostly based on visual inspection, the evaluation could be subjective. I would suggest including quantitative measures such as Parzen window-based log-likelihood estimates.
- To support the claim that using Dis_{l} as a target is effective, the paper may want to include results for a VAE/GAN whose VAE target is pixels (i.e., L_{prior} + L_{llike}^{pixel} + L_{GAN}).
- An analysis of the hyperparameter \gamma is missing.
- In Equation (10), how important is it to have the second term (log(1 - Dis(Dec(z)))) when the third term exists? (The objective is reconstructed below for reference.)
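For reference, the combined objective under discussion, as far as it can be reconstructed from the terms quoted above (the exact form should be checked against the submission's Equation (10)):

    L = L_{prior} + L_{llike}^{Dis_l} + L_{GAN}

    L_{GAN} = log(Dis(x)) + log(1 - Dis(Dec(z))) + log(1 - Dis(Dec(Enc(x))))

Here L_{llike}^{Dis_l} is the VAE reconstruction error measured in the discriminator's l-th layer feature space rather than in pixel space, the second GAN term scores samples Dec(z) with z drawn from the prior, and the third scores reconstructions Dec(Enc(x)); \gamma presumably weights L_{llike}^{Dis_l} against L_{GAN} when updating the decoder.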
=====

Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper proposes to train encoder/decoder pairs by mixing VAE and GAN objectives. The model is evaluated based on samples from a model trained on faces.

Clarity - Justification:

The paper is mostly well written and easy to follow.

Significance - Justification:

The subject of the paper is of great interest to the ICML community. Being able to train encoder/decoder pairs with a meaningful latent representation and the ability to generate realistic images would be very useful for manipulating images. There is also a need for better image metrics. On the other hand, the mishmash of objective functions and the many heuristics make it hard to understand what is going on when using the proposed objective function, which I expect will severely hamper the paper's impact. The model also doesn't seem to produce advances in image generation or representation learning significant enough that people would nevertheless adopt the approach. Still, I find the approach's ability to quickly manipulate images intriguing, and it may inspire other researchers to come up with more principled or better-working approaches.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

To me, the most interesting results in the paper are the image manipulations in Figure 5, since these seem to be the only results exploiting the strength of the model: being able to quickly jump between realistic images and latent representations. A comparison with a VAE or other approaches (not necessarily generative) on this task would have been great! A sketch of the manipulation pipeline follows below.
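For concreteness, the kind of manipulation discussed above amounts to roughly the following sketch: encode an image, shift its latent code along a visual attribute direction, and decode. The encode/decode functions here are hypothetical linear stand-ins for the paper's trained Enc/Dec networks, and the attribute-vector construction (difference of mean latent codes) is an assumption about how such vectors are typically computed, not the authors' code:

    import numpy as np

    # Hypothetical stand-ins for the trained Enc/Dec networks; in the paper
    # these are deep convolutional networks, and Dec doubles as the GAN
    # generator.
    latent_dim, image_dim = 128, 64 * 64 * 3
    rng = np.random.default_rng(0)
    W_enc = rng.normal(scale=0.01, size=(latent_dim, image_dim))
    W_dec = rng.normal(scale=0.01, size=(image_dim, latent_dim))

    def encode(x):
        """Map a flattened image to a latent code z."""
        return W_enc @ x

    def decode(z):
        """Map a latent code back to image space."""
        return W_dec @ z

    def attribute_vector(images_with, images_without):
        """Attribute direction: difference of mean latent codes between
        images that have an attribute (e.g. 'smiling') and images that
        do not."""
        z_with = np.mean([encode(x) for x in images_with], axis=0)
        z_without = np.mean([encode(x) for x in images_without], axis=0)
        return z_with - z_without

    def manipulate(x, attr_vec, strength=1.0):
        """Encode, shift along the attribute direction, decode."""
        return decode(encode(x) + strength * attr_vec)

    # Toy usage with random 'images'.
    smiling = [rng.normal(size=image_dim) for _ in range(8)]
    neutral = [rng.normal(size=image_dim) for _ in range(8)]
    smile_vec = attribute_vector(smiling, neutral)
    edited = manipulate(neutral[0], smile_vec, strength=1.5)
    print(edited.shape)  # (12288,) -- a flattened 64x64x3 image

=====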