We thank the reviewers for their helpful feedback.

The evaluation of our method and the conclusions drawn from the experiments appear to be a general concern among the reviewers. We fully acknowledge the lack of convincing quantitative measures in our evaluation. Unfortunately, current evaluation measures are pixel-based and favor methods with pixel-wise objectives. To evaluate feature-based similarities, we are left with the following two experiments.

1. We assess the visual fidelity of generated images and show that VAE/GAN images look more natural than VAE images, since blurry structures are unnatural. We acknowledge that this is subjective.

2. We evaluate the quality of generated visual attributes according to a separately trained regressor convnet. In this experiment, the feature-wise similarity of VAE/GAN performs significantly better than the pixel-wise similarity of the plain VAE. This is clear evidence that the generated samples better imitate the dataset in terms of the image structure recognized by the regressor. The open question is to what degree the regressor convnet captures natural image structure. We hope this is sufficiently convincing.

Reviewer #1: We agree that combining objectives from different models complicates our method. However, we do not see this hampering the impact potential of our main message, which is to move from pixel-wise to feature-wise similarity measures. Our work demonstrates a first step in this direction, and recently this idea seems to be gaining traction with the work of Dosovitskiy et al., 'Generating images with perceptual metrics [...]', and Lamb et al., 'Discriminative regularization for generative models'. We like the idea of also showing visual attribute vectors for a plain VAE, and we plan to include such a figure as supplementary material.

Reviewer #3: To answer your question regarding the similarity metric property of the discriminator, we have run an experiment on CIFAR-10: after training a GAN, we compute a feature representation of the test set by propagating it through the discriminator network. We then measure the k=5 nearest neighbor classification error. kNN with the feature-based metric gives an error of 33.73%, whereas kNN on the raw pixels gives 66.05%. We plan to include these results in the revised manuscript.
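For concreteness, the evaluation could be set up along the following lines. This is only a minimal sketch: it assumes scikit-learn, numpy arrays for images and labels, fitting the kNN on training-set features, and a hypothetical discriminator_features() helper standing in for the forward pass through the trained discriminator up to an intermediate layer.

    # Sketch: k=5 nearest-neighbor classification error with a feature-based
    # metric vs. raw pixels. `discriminator_features` is a placeholder for
    # propagating images through the trained GAN discriminator and reading
    # out an intermediate layer; it is not a library function.
    from sklearn.neighbors import KNeighborsClassifier

    def knn_error(train_x, train_y, test_x, test_y, k=5):
        # Flatten inputs so the same routine handles raw pixels and feature maps.
        train_x = train_x.reshape(len(train_x), -1)
        test_x = test_x.reshape(len(test_x), -1)
        clf = KNeighborsClassifier(n_neighbors=k).fit(train_x, train_y)
        return 1.0 - clf.score(test_x, test_y)  # error = 1 - accuracy

    # err_feat = knn_error(discriminator_features(x_train), y_train,
    #                      discriminator_features(x_test), y_test)
    # err_pix  = knn_error(x_train, y_train, x_test, y_test)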
We agree that early during training, the discriminator representation is a random projection of the data, which is problematic for the VAE reconstruction term. In our experiments, though, this is not a problem, as the discriminator quickly captures useful structure; in fact, face structures start appearing after only a few minutes of training. We will add this discussion to the manuscript. It is a good idea to demonstrate the effectiveness of using Dis_l by comparing L_{llike}^{Dis_l} with L_{llike}^{pixel}; we will include these results in Figures 3 and 4. To better explain the \gamma parameter, we propose to include a figure showing reconstructed samples for models trained with different \gamma values. Regarding the importance of the term log(1 - Dis(Dec(z))) in Eq. (10): we have observed the best results when training the GAN on both reconstructed dataset samples and samples from p(z). This may not hold in all cases, but we include the term in Eq. (10) for completeness.

Reviewer #4: We agree that sharper images with GAN artifacts are not necessarily better than blurry images. The experiment in Sec. 4.2 shows that the generated attributes are easier to recognize for a regressor network, which we would argue can be a useful improvement in some circumstances. The PixelRNN method is definitely an interesting approach to generative image modeling. However, we do not see that its results invalidate ours, as the models are quite different from each other. As we have mentioned, it is hard to come up with representative evaluation metrics for generative image models. We could report a Parzen window-based log-likelihood in the feature space of a separately trained feed-forward network (see the sketch at the end of this response), but we suspect that this will not be convincing to those who are not already enthusiastic about feature-based representations.

Regarding the hyperparameter search for the experiments: we have manually tried to find the best set of hyperparameters for each method/experiment. It is very difficult to thoroughly search the hyperparameter space given the instability of the GAN and the lack of quantitative evaluation measures. This is indeed problematic.
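Such a feature-space estimate could look roughly as follows. Again, this is a sketch only: it assumes scikit-learn's KernelDensity as the Parzen estimator, a placeholder feature_net() for the separately trained feed-forward network, and a bandwidth that would in practice have to be chosen on held-out data.

    # Sketch: Parzen-window log-likelihood measured in a learned feature
    # space rather than pixel space. `feature_net` is a placeholder for the
    # separately trained network's forward pass (returning a 2-D feature array).
    import numpy as np
    from sklearn.neighbors import KernelDensity

    def feature_space_parzen_ll(generated_x, test_x, feature_net, bandwidth=0.1):
        # Fit a Gaussian Parzen window on features of generated samples ...
        kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth)
        kde.fit(feature_net(generated_x))
        # ... and report the mean log-likelihood it assigns to test-set features.
        return np.mean(kde.score_samples(feature_net(test_x)))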