We sincerely thank all reviewers for their thoughtful comments. We respond to individual questions and suggestions below in order:

AR3

- Generating flowers from bird captions: We generated images from bird descriptions using a GAN-CLS-INT trained on flowers and a text encoder trained on bird captions. Interestingly, the generated images correctly matched the predominant color of the bird in many cases. As expected, bird shape and realism were not preserved.

- Advantage of the -CLS method: Anecdotally, we observed that -CLS improved the apparent speed of convergence. Models trained with -CLS appeared to match colors in the text description at an earlier epoch than models without it (GAN and GAN-INT).

- Number of interpolated samples: For every adjacent pair of real captions in a randomly sampled batch, we generate one embedding interpolation. A batch of 64 ground-truth captions therefore yields 32 interpolated embeddings, for a combined batch of size 96 (see the sketch at the end of this response).

- Effect of # training iterations when training the zero-shot image classifier: Our aim was to check whether an image encoder trained only on GAN samples would quickly overfit; the idea being that if the samples lacked diversity, test accuracy would drop after a few iterations. Instead, we found that GAN-INT and GAN-CLS-INT yielded much better samples than GAN and GAN-CLS, but we did not observe any rapid drop in the image classifier's test accuracy due to overfitting. We will clarify our zero-shot experiment analysis in the text.

- Impact of the text encoder: We chose char-CNN-RNN for its generality and discriminative power; however, we could also start from a bag-of-words (BoW) encoding. BoW loses information when word order matters, e.g. in a caption mentioning the colors and shapes of multiple bird parts. This results in lower zero-shot prediction accuracy with BoW text embeddings (54% for char-CNN-RNN vs. 44% for BoW in our latest CUB experiments). To investigate further, we trained a GAN-CLS-INT on the CUB data conditioned on bag-of-words encodings. The overall quality was similar to the GAN baseline on CUB shown in figure 3: most samples included the predominant color mentioned in the text, but lacked diversity and realism.

AR4

- Clarity: We thank AR4 for pointing out areas to improve the presentation. We will enlarge the samples in the figures to make them easier to match with the text, and place additional figures in the supplementary material. We will also add ground-truth images next to the ground-truth text so that non-experts can more easily verify the accuracy of our generated samples.

AR5

- Poor performance of the GAN baseline on CUB: We investigated this further by training 10 instances each (varying only the random seed) of GAN, GAN-CLS, GAN-INT and GAN-CLS-INT on the 100 CUB training classes for 200 epochs. Using samples from each of these GAN models, we trained a zero-shot image classifier from scratch, following the same protocol described in section 5.5. On 3/10 trials the GAN generated plausible images and yielded a classifier with ~8% zero-shot accuracy; on the remaining 7/10 trials the GAN performed very poorly, resulting in an overall average of 4.5% (stdev 2.4%). All other variants we tried (GAN-CLS, GAN-INT, GAN-CLS-INT) consistently generated plausible images, yielding zero-shot accuracies of 7.5%, 7.9% and 8.6%, respectively. Note that we report higher accuracies (fig. 9) for GAN-CLS-INT, achieved by training on the 150 training+val classes for 600 epochs.
However, the baseline GAN and GAN-CLS performed worse on the 150 classes than when training on only 100. Our impression is that the classification (-CLS) and especially the interpolation (-INT) regularizers stabilize training, significantly reducing the incidence of "failed" GANs.

- Comparison to Mansimov et al. on COCO data: As suggested, we ran our code on the COCO dataset and observed compelling results for many queries, e.g. "two horses that are eating grass by the dirt", "a row of meatball sandwiches sitting on a plate", etc. As is characteristic of GAN models, the samples are fairly detailed and sharp (i.e. they do not appear blurry). We can share these results upon request and/or include them in the final version.
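For reference, below is a minimal sketch of the interpolation batch construction described in our response to AR3 ("Number of interpolated samples"). It is not our released code: the embed function, the variable names, and the interpolation coefficient beta = 0.5 are illustrative assumptions, not values taken from the paper or this response.

import numpy as np

def make_interpolation_batch(captions, embed, beta=0.5):
    """Combine real caption embeddings with interpolations of adjacent pairs.

    captions: list of ground-truth captions in the minibatch (even length).
    embed:    callable mapping a caption string to a 1-D embedding vector.
    beta:     interpolation coefficient between the two embeddings in a pair.
    """
    assert len(captions) % 2 == 0
    real = np.stack([embed(c) for c in captions])            # e.g. (64, d)
    # Interpolate each non-overlapping adjacent pair: (0,1), (2,3), ...
    interp = beta * real[0::2] + (1.0 - beta) * real[1::2]   # (32, d)
    # Interpolated embeddings have no matching ground-truth image; they are
    # only used to condition the generator (the -INT regularizer).
    return np.concatenate([real, interp], axis=0)            # (96, d)

With a batch of 64 ground-truth captions, this yields the combined batch of 96 embeddings mentioned above.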