We thank the reviewers for their feedback, which we will incorporate in the final version, including their advice on referencing the supmat and the points discussed next.

ER_2: lacks a comparative analysis of the two sampling strategies. ER_4: the image quality isn't clear. ER_4: if something simple like SSIM can be used to evaluate.
Fig. 6 does provide a quantitative comparison of the quality of the generated images, since it reports a texture similarity loss. As implied by ER_4, defining such a similarity is one of the key challenges here. We further assessed the results using the STSIM metric [Zujovic TIP13] (SSIM is not a texture similarity measure and does not apply), but could not find any clear correlation between this metric and the perceptual quality of the results. Quantitatively, when run on 12 sample textures from the supmat, the STSIM scores between the reference texture and the generated ones are: Texture nets: 0.782 \pm 0.020; Gatys et al.: 0.784 \pm 0.022; Portilla & Simoncelli: 0.780 \pm 0.020. Thus all methods perform nearly the same under STSIM, yet the improvement from using CNN statistics, in our approach and in that of Gatys et al., over Portilla & Simoncelli is perceptually obvious. Finally, note that if a more powerful texture similarity did become available, all methods could be modified to optimize it instead.

ER_2: [Gatys et al.] is an approximation of the maximum entropy distribution, which gives some guarantees... How about [the proposed method]? Do they have comparatively similar entropy? ER_3: "theoretical analysis of the main drawback of the paper".
The "sampling by projection" method of [Portilla and Simoncelli, 2000] (applied by Gatys et al. to CNN-based statistics) does not appear to have formal guarantees. They construct a function x = f(w) that projects an initial random sample w to an image x with (approximately) the desired statistics. However, there is no guarantee that the resulting p(x) is a good approximation of the maximum entropy distribution on such images. Instead, Portilla et al. verify empirically that "in practice, this choice seems to produce an image with fairly high entropy." Our method can be seen as defining one such x = f(w) directly, instead of indirectly as the result of an optimization. Inspired by the reviewer, we have verified that the entropy of the samples produced by our method and by that of Portilla et al. is about the same for simple statistics such as color distributions (estimating the entropy of x as a whole is very challenging).

ER_2: How well do they match the empirical moments on average?
The value of the loss function directly measures how closely the moments are matched. In Fig. 6 we can observe that our method and the optimization method of Gatys et al. achieve comparable matching errors.

ER_2: Is it the best possible texture representation? If we take DCGAN and plug in Gram matrices to obtain a completely unsupervised algorithm (adapted to stationary processes), what happens?
We tried substituting the L2 loss between Gram matrices with a neural-network-based discriminator, as in GAN and DCGAN. We found almost no visual difference in the samples between modeling the target distribution through such a discriminator and matching the Gram matrices; a minimal sketch of the Gram-matrix matching loss is given below.
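For concreteness, the following is a minimal sketch of the Gram-matrix matching loss referred to above, assuming the CNN feature maps are already available as tensors; the toy inputs and function names are placeholders for illustration, not our exact training code.

```python
import torch

def gram_matrix(features):
    # features: (B, C, H, W) activations from one CNN layer
    b, c, h, w = features.shape
    f = features.reshape(b, c, h * w)
    # (B, C, C) Gram matrix, normalized by the number of elements
    return f @ f.transpose(1, 2) / (c * h * w)

def texture_loss(gen_features, ref_features):
    # Sum of squared differences between Gram matrices over the chosen layers.
    loss = 0.0
    for fg, fr in zip(gen_features, ref_features):
        loss = loss + torch.sum((gram_matrix(fg) - gram_matrix(fr)) ** 2)
    return loss

# Toy usage with random tensors standing in for CNN activations.
gen = [torch.randn(1, 64, 32, 32), torch.randn(1, 128, 16, 16)]
ref = [torch.randn(1, 64, 32, 32), torch.randn(1, 128, 16, 16)]
print(texture_loss(gen, ref).item())
```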
ER_4: "In what way are the results degraded when training on more images? Is image quality degraded on unseen images? Or on training images? Or both?"
A: Both. While training on more images does reduce the loss on held-out images (as expected), the visual quality degrades uniformly across training and held-out images. We believe we understand why this disconnect between loss and quality emerges when more images are considered, and will spend more space discussing and explaining this effect if the paper is accepted. In short, the disconnect arises because the loss places too much emphasis on preserving the proportions of textures at the macro scale (e.g. if the image is covered with 75% strokes of type I and 25% strokes of type II, then the loss tries to associate 75% of each training image with type I and 25% with type II). Perceptually, this preservation of proportions is not important for humans (e.g. a 50%-50% split between type I and type II is fine for a human observer as long as both types of strokes are transferred in a plausible way). When more training images are considered, the training process has to spend too much effort/learning capacity on preserving global proportions, and less on the veracity of local details, hence the degradation of perceptual quality. Note that this conflict between global proportions and local detail reproduction does not arise during texture synthesis.

ER_4: "the texture loss L_T is measured up to relu5_1 layer, while the content loss L_C is only measured on relu4_2. This seems strange."
Mainly, we reproduced the setting of "A Neural Algorithm of Artistic Style" [Gatys et al.] for a fair comparison; a sketch of this layer assignment is given below.
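For reference, a minimal sketch of that layer assignment, assuming the relu*_1 layers commonly used by Gatys et al. for the texture term and relu4_2 for the content term; the feature dictionaries and weights alpha/beta are hypothetical placeholders for illustration.

```python
import torch

TEXTURE_LAYERS = ["relu1_1", "relu2_1", "relu3_1", "relu4_1", "relu5_1"]  # up to relu5_1
CONTENT_LAYER = "relu4_2"                                                 # single layer

def gram_matrix(f):
    b, c, h, w = f.shape
    f = f.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def combined_loss(gen_feats, ref_feats, content_feats, alpha=1.0, beta=1.0):
    # Each argument: dict mapping layer name -> (B, C, H, W) activation tensor.
    l_t = sum(torch.sum((gram_matrix(gen_feats[l]) - gram_matrix(ref_feats[l])) ** 2)
              for l in TEXTURE_LAYERS)
    l_c = torch.sum((gen_feats[CONTENT_LAYER] - content_feats[CONTENT_LAYER]) ** 2)
    return alpha * l_t + beta * l_c

# Toy usage: identical features give zero loss by construction.
feats = {l: torch.randn(1, 64, 8, 8) for l in TEXTURE_LAYERS + [CONTENT_LAYER]}
print(combined_loss(feats, feats, feats).item())
```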