Paper ID: 485
Title: Generative Adversarial Text to Image Synthesis

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes to tackle the problem of generating images from sentences. Building on recent advances in image generation using adversarial networks, the paper presents a way to condition the generation on captions as well. Very nice generated images and experimental results on zero-shot learning are presented.

Clarity - Justification:
The paper is very clear and easy to follow. The supplementary material is very rich and helpful.

Significance - Justification:
This paper goes one step further in terms of image generation, and for that reason it is significant, even if imperfect. The main limitation of this model and other similar ones is that, so far, they are only able to generate images from the very narrow distribution they were trained on (birds or flowers here). For instance, what happens when trying to generate using the "flower" network but conditioned on captions from the "birds" dataset?

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The -CLS idea is nice, but from the experiments it is not actually obvious that it helps. Is it that important? What is its main advantage? On the other hand, -INT seems crucial. How many interpolated captions are generated in this setting? How does this number compare to the size of the dataset?

Zero-shot evaluation is a good idea. Unfortunately, the results remain quite anecdotal, and the setting raises some questions.
- How many images were generated for the zero-shot recognition?
- Is there any explanation for why the number of training iterations has no impact? And what does "training" mean in this context? From what I understand of Section 3.2, once the image and text encoders are trained, one can use them without retraining to classify new images.

The text model has a very specific and sophisticated architecture. Does it really have an impact? Can this be quantified against simple baselines?

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper extends generative adversarial networks (GANs), learning a generator network and a discriminator network to generate images from a sentence and also map an image to a sentence. On top of the GAN baseline, the authors propose a matching-aware discriminator (GAN-CLS), which trains the system with real image + false text pairs, and learning with manifold interpolation (GAN-INT), which generates a large number of additional text embeddings by interpolating between embeddings of training-set captions. Extensive experiments were conducted to show the effectiveness of the algorithm.

Clarity - Justification:
Overall, the paper is well written and easy to follow. The algorithm is sensible and enough details have been provided for re-implementation. I have concerns about the size of the images: Figure 4 squeezes far too many pictures into half a page. It is really difficult to see those images on a printout; I had to go back to the screen and zoom in a lot. The authors should consider taking out some examples and showing only the most typical ones. Another concern is the fine-grained domain of birds and flowers. Most readers are far from being botanists or zoologists, so connecting the synthesized images to the text will be somewhat difficult. At the very least, the paper could have provided the ground-truth images associated with the sentences to give readers some sense of the data.

Significance - Justification:
I would not say that the models designed in the paper are ground-breaking, but they are sensible and I like the extensions that make the GAN more robust. I especially like the results in Figure 6, where the description is changed line by line and the synthesized images change accordingly.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I recommend that the paper be accepted, as it makes a solid improvement over the GAN framework with not only good tweaks to the learning algorithm but also interesting applications such as style transfer and sentence interpolation. My biggest concern is that the data used in the paper are somewhat difficult for readers to comprehend: the only attribute that is easy to verify is color, and the other descriptions are harder to judge. I wonder whether such a method can be applied to other, more general text-to-image (or image-to-text) datasets where the text-to-image correspondence is more apparent.

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose a model for generating images conditioned on text descriptions - image captioning in reverse. The model is based on a char-cnn-rnn text encoder, an upconvolutional network, and a Generative Adversarial Network (GAN)-style discriminator. The contribution lies in the architecture as well as in two non-intuitive components: a matching-aware discriminator (CLS) and a manifold interpolation technique (INT). The qualitative experiments on the flowers and birds datasets look nice and convincing, and the authors show good evidence of trying to analyze and debug the model both qualitatively and quantitatively.

Clarity - Justification:
The paper is well written and clear.

Significance - Justification:
I would be content with the basic char-cnn-rnn -> image architecture, but the authors go beyond this and introduce the CLS/INT techniques, which I find non-obvious and which clearly work better than the naive, straightforward approach.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This area is important and interesting, and this paper is clear and well written, contains non-obvious contributions, and provides a good analysis of the model. The results look fun! I do feel that this paper is a little bit "in the air", in the sense that someone was going to develop a similar model about now, so it does not come as a huge surprise; still, the architecture departs sufficiently from an obvious one, and the CLS/INT techniques are interesting additions. It does seem a little odd that the GAN alone performs so badly, giving identical results, e.g. in Figure 3; it feels a little suspicious and could be discussed better. I also wish the authors had compared more explicitly to Mansimov et al. (2015) if possible, since they share the task. Considering all of these points, I would place this paper somewhere between weak and strong accept, but I will lean towards the stronger accept. Good work!

=====
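For readers who want a concrete picture of the GAN-CLS and GAN-INT variants that all three reviews refer to, the following is a minimal sketch based on the reviewers' descriptions, not the authors' code. It assumes caption embeddings produced by the char-cnn-rnn text encoder and a conditional discriminator D(image, text_embedding) with a sigmoid output; the function names, the 0.5 weighting of the two "fake" terms, and the interpolation coefficient beta are illustrative assumptions.

# Hedged sketch of the GAN-INT and GAN-CLS ideas discussed in the reviews.
# All names, shapes, and constants are illustrative, not the paper's exact setup.
import torch
import torch.nn.functional as F

def gan_int_embedding(phi_a: torch.Tensor, phi_b: torch.Tensor, beta: float = 0.5) -> torch.Tensor:
    """GAN-INT: create an additional text embedding by interpolating
    between the embeddings of two training-set captions."""
    return beta * phi_a + (1.0 - beta) * phi_b

def gan_cls_discriminator_loss(D, real_img, fake_img, txt_emb, mismatched_txt_emb):
    """GAN-CLS: besides (real image, matching text) and (generated image, text),
    the discriminator is also shown (real image, mismatching text) as a fake pair."""
    s_real  = D(real_img, txt_emb)             # real image + matching text    -> target 1
    s_wrong = D(real_img, mismatched_txt_emb)  # real image + mismatching text -> target 0
    s_fake  = D(fake_img, txt_emb)             # generated image + text        -> target 0
    ones, zeros = torch.ones_like(s_real), torch.zeros_like(s_real)
    return (F.binary_cross_entropy(s_real, ones)
            + 0.5 * (F.binary_cross_entropy(s_wrong, zeros)
                     + F.binary_cross_entropy(s_fake, zeros)))

In this reading, the interpolated embeddings from gan_int_embedding are fed to the generator and discriminator just like real caption embeddings, which is why no additional labels are needed for them.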