Paper ID: 639
Title: Texture Networks: Feed-forward Synthesis of Textures and Stylized Images

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a feed-forward neural network model for texture synthesis that generates textures without the time-consuming iterative optimization procedure used in prior work, greatly speeding up synthesis.

Clarity - Justification:
The paper is well written.

Significance - Justification:
Empirically, the improvement in texture synthesis speed is significant and may make this type of texture synthesis and style transfer usable in real-time applications. On the modeling side, the texture loss, the content loss, and the idea of transforming a noise distribution into a desired distribution are not new, but the authors combine them to solve the generation-efficiency problem. Some of the experimental results are a bit questionable.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This paper is about performing texture synthesis much faster than the previous approach while, ideally, not sacrificing much texture quality. The method trains a feed-forward generative model that produces a texture in a single pass, avoiding the expensive iterative optimization of prior work and therefore improving speed substantially. The speed improvement is clear, and from the examples the proposed approach appears to perform roughly as well as the previous approach of Gatys et al. However, texture quality is hard to judge reliably by eye, and some kind of quantitative quality measure would be very helpful in assessing it. I imagine quality assessment is a difficult problem in itself, but I wonder whether something simple like SSIM could be used to give us some sense of it (a rough sketch of what I mean is included at the end of these comments).

The multi-scale architecture is interesting, and it is good to see some exploration of the effect of different layers in Fig. 4 of the supplementary material; this should be referenced in the main paper. There still seems to be space available, so some of this material could be moved into the main paper.

The results mentioned in the paragraph around line 580 are strange to me. That paragraph states that training is resilient to overfitting and that the authors obtain good performance with only 16 images, yet training on more images actually degrades the generated image quality. In what way are the results degraded when training on more images? Is image quality degraded on unseen images, on training images, or both? In general, training with more data should not hurt performance; the model trained on 16 images can only do "better" if the test data looks similar to the training data, in which case the model may simply be overfitting the small training set and producing outputs that resemble the training set and happen to look good.

In Sec. 4 (experiment details), the texture loss L_T is measured on layers up to relu5_1, while the content loss L_C is measured only on relu4_2. This seems strange, as it is unusual for texture to sit at an even higher level than "content", which is expected to be high level (the two losses are written out at the end of these comments, just to fix notation).

Overall, the speed improvement is clear; the quality is harder to judge, but from the examples the results do not look bad. I lean toward accepting this paper.
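To make the quantitative-evaluation suggestion above concrete, the kind of crude check I have in mind is sketched below (this assumes scikit-image and NumPy; the helper function is my own illustration, not something from the paper):

# Crude quality proxy: average SSIM between a reference texture and synthesized
# samples. SSIM is alignment-sensitive, so for textures this is at best a rough
# sanity check rather than a faithful perceptual measure.
import numpy as np
from skimage.metrics import structural_similarity as ssim  # scikit-image >= 0.19

def texture_ssim(reference, samples):
    # reference: HxWx3 float array in [0, 1]; samples: list of arrays of the same shape.
    scores = [ssim(reference, s, channel_axis=-1, data_range=1.0) for s in samples]
    return float(np.mean(scores))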
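For reference, and only to fix notation for the relu5_1 / relu4_2 comment above, the two losses (following Gatys et al., with normalization constants omitted) are

L_T(x, x_0) = \sum_{l \in L_T} || G^l(x) - G^l(x_0) ||_F^2,  with  G^l_{ij}(x) = \sum_k F^l_{ik}(x) F^l_{jk}(x),

L_C(x, y_0) = \sum_{l \in L_C} \sum_{i,k} ( F^l_{ik}(x) - F^l_{ik}(y_0) )^2,

where F^l_{ik}(x) is the activation of the i-th filter at spatial position k in layer l of the pretrained descriptor network. The Gram matrices G^l average over spatial positions, so the texture loss constrains only spatially averaged second-order statistics, whereas the content loss constrains the feature maps pointwise.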
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes Texture Networks, which perform texture generation and image stylization purely as a feed-forward process using a convnet. The paper largely uses the same loss function as Gatys et al., but differs in that it trains a new model of the synthesis function rather than running an optimization process on top of a pre-trained VGG network. The benefit is that the synthesis process is much faster. The paper provides training details as well as visual samples comparing the method to Gatys et al. and Radford et al.

Clarity - Justification:
The paper is easy to follow and well written. If I could suggest one thing, the authors should collect the drawbacks of the method in a small separate section, as they are currently scattered across the paper.

Significance - Justification:
The idea itself is very incremental, and there is no justification for calling it novel; in that sense, the paper is weak. However, Gatys et al.'s Neural Style paper has received an enormous amount of attention (and usage by others), and this paper addresses one of its major drawbacks, namely inference time. The considerable practical significance and the impact on much ongoing work in this subfield justify its acceptance.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The authors should spend some time on a theoretical analysis of the paper's main drawback, i.e. that style transfer only works for a limited set of style images per network. Without this, the paper remains weak for a conference such as ICML, which has higher standards of mathematical rigor. A sketch of the training setup is given below to make explicit where this limitation comes from.
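To be concrete about what is trained once per texture/style, the setup amounts to something like the following sketch. This is my own PyTorch-style illustration, not the authors' code: the generator below is a trivial stand-in for their multi-scale architecture, the VGG layer indices are assumed, and input normalization is omitted.

# Sketch of the training setup: a small generator is trained so that the Gram
# statistics of its outputs (measured in a fixed, pretrained VGG) match those of
# one target texture. Synthesis afterwards is a single forward pass.
import torch
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights

# Indices into vgg19().features assumed to correspond to relu1_1 ... relu5_1.
TEXTURE_LAYERS = (1, 6, 11, 20, 29)

def gram(f):
    # f: (B, C, H, W) feature maps -> (B, C, C) normalized Gram matrices.
    b, c, h, w = f.shape
    f = f.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

class VGGFeatures(nn.Module):
    # Frozen descriptor network returning activations at the chosen layers.
    def __init__(self, layer_ids=TEXTURE_LAYERS):
        super().__init__()
        self.features = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in self.features.parameters():
            p.requires_grad_(False)
        self.layer_ids = set(layer_ids)

    def forward(self, x):
        feats = []
        for i, layer in enumerate(self.features):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

class TinyGenerator(nn.Module):
    # Placeholder for the paper's multi-scale generator: noise in, image out.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(z)

def train_texture_generator(texture, steps=2000, batch=4, lr=1e-3):
    # texture: (1, 3, H, W) tensor in [0, 1] (ImageNet normalization omitted here).
    descriptor = VGGFeatures()
    generator = TinyGenerator()
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    with torch.no_grad():
        target_grams = [gram(f) for f in descriptor(texture)]
    for _ in range(steps):
        z = torch.rand(batch, 3, texture.shape[2], texture.shape[3])
        loss = sum(((gram(f) - g) ** 2).sum()
                   for f, g in zip(descriptor(generator(z)), target_grams))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator  # new samples are now just generator(noise), no optimization loop

The limitation mentioned above is visible here: the target Gram matrices are baked into the training objective, so each trained generator covers only the texture or style it was trained on.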
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a feed-forward CNN architecture that performs texture synthesis and artistic style transfer. The training algorithm follows the framework of Generative Adversarial Networks, replacing the trainable discriminator with a pretrained CNN that produces texture representations. The resulting samples are of quality comparable to the state of the art, at a fraction of the computational cost.

Clarity - Justification:
The paper is clearly written, with enough rigor and description of prior art.

Significance - Justification:
The main contribution of this paper is a fast algorithm that produces high-quality texture synthesis and style transfer. It is an interesting adaptation of two recent models (the texture models of Gatys et al. on the one hand, and the Generative Adversarial Networks of Goodfellow et al. on the other) that is of relevance to the computer vision community.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- Besides the (excellent) numerical results, and given that this is primarily a machine learning conference, I find that the paper lacks a comparative analysis of the two sampling strategies: the expensive sampling of Gatys et al. versus the fast sampling proposed in the current paper. The first is an approximation of the maximum-entropy distribution, which gives some guarantees and insight into the corresponding statistical model. What about the second? Do the two have comparable entropy? How well do they match the empirical moments on average? (A sketch of the kind of check I have in mind is given after these comments.)
- Although pretrained CNNs produce good texture representations through their Gram matrices, a question remains: is this the best possible texture representation? If we take DCGAN and plug in Gram matrices to obtain a completely unsupervised algorithm (adapted to stationary processes), what happens?
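To make the moment-matching question concrete, the simplest comparison would measure, under a fixed descriptor, how far samples from each method fall from the target's Gram statistics. A minimal sketch (assuming PyTorch; features_fn stands for any feature extractor -- the pretrained VGG used in the paper, the intermediate layers of a DCGAN discriminator, or even a randomly initialized convnet):

# Average relative mismatch of Gram (second-order) statistics between synthesized
# samples and a target texture, under an arbitrary feature extractor.
import torch

def gram(f):
    # f: (B, C, H, W) feature maps -> (B, C, C) normalized Gram matrices.
    b, c, h, w = f.shape
    f = f.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def gram_discrepancy(features_fn, target, samples):
    # features_fn: callable mapping an image batch to a list of feature maps.
    # target: (1, 3, H, W) texture; samples: list of (1, 3, H, W) synthesized images.
    with torch.no_grad():
        target_grams = [gram(f) for f in features_fn(target)]
        errs = []
        for x in samples:
            errs.append(sum(((gram(f) - g) ** 2).sum() / (g ** 2).sum()
                            for f, g in zip(features_fn(x), target_grams)))
        return torch.stack(errs).mean()

Comparing this quantity (and, if tractable, an entropy estimate of the two samplers) between the samples of Gatys et al. and the feed-forward samples would already say something about how the two distributions differ; the same code with a DCGAN discriminator as features_fn is essentially the unsupervised variant asked about in the second point.
=====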