To all reviewers:

We would like to thank you for your fair criticisms and constructive feedback. Here are our responses to some of your questions and comments.

--------------------------------------------------------------------------------------------------

To Reviewer #1:

You are right that we cannot claim that the improvement in generalization we observe when adding noise to the first layer is more profound than what we would obtain by adding it to one of the other layers. We did not test that and should rephrase our interpretation of those results. What we meant to say is that adding noise to all the layers does not improve generalization much compared to adding it only to the first layer, but this was apparently not clear (an illustrative sketch of the two corruption schemes is included at the end of this response). The initial idea of stripping the Ladder Network down to a more standard architecture, such as a Denoising Auto Encoder, is what motivated us to try adding noise only to the first layer.

--------------------------------------------------------------------------------------------------

To Reviewer #2:

As you correctly mention, it is definitely interesting to see how well the results extend to other datasets or even other models. Our motivation was to start with a model that already performs reasonably well, remove or modify one architectural element at a time, and observe when it breaks down. Visualizing the learned "combinator function" is a nice idea. Since each unit has a separate MLP, we will choose to visualize only one or two of them (a sketch of how one such combinator could be visualized is also given at the end of this response). Due to regulations, we cannot provide URLs here, and because we have reached the 8-page limit, we will add the plots to the supplementary material for the final version.

--------------------------------------------------------------------------------------------------

To Reviewer #4:

Your comments definitely encourage us to continue working on semi-supervised learning. We look forward to revitalizing the trend of using massive amounts of unlabeled data to help supervised tasks. Two other interesting recent papers in this area are "Stacked What-Where Autoencoders" and "Improving Semi-Supervised Learning with Auxiliary Deep Generative Models". In the former, the notion of "where" loosely echoes the idea of the lateral connections. Regarding your point on removing batch normalization, we have tried removing it, but we still need to do more experiments along this line. Our preliminary results suggest that batch normalization plays a crucial role in training for the semi-supervised tasks.
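
--------------------------------------------------------------------------------------------------

Sketch for Reviewer #1: to make the comparison concrete, here is a minimal, purely illustrative NumPy sketch of corrupting only the first layer's input (denoising-autoencoder style) versus corrupting the input of every layer. The layer sizes, the noise level of 0.3, and the helper `forward` are assumptions made for this example, not our actual training code.

    import numpy as np

    rng = np.random.default_rng(0)
    sizes = [784, 1000, 500, 10]                      # illustrative MLP layer sizes
    weights = [rng.normal(0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

    def forward(x, noise_std=0.3, noisy_layers=(0,)):
        """Run the encoder, adding isotropic Gaussian noise to selected layers.

        noisy_layers=(0,)                 -> corrupt only the first layer's input
        noisy_layers=range(len(weights))  -> corrupt every layer's input
        """
        h = x
        for i, w in enumerate(weights):
            if i in noisy_layers:
                h = h + rng.normal(0.0, noise_std, h.shape)
            h = np.maximum(0.0, h @ w)                # ReLU activations
        return h

    x = rng.normal(size=(32, 784))                    # a dummy mini-batch
    first_only = forward(x, noisy_layers=(0,))
    all_layers = forward(x, noisy_layers=tuple(range(len(weights))))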
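
Sketch for Reviewer #2: one possible way to visualize a single per-unit MLP combinator is to evaluate it on a grid of its two inputs, the lateral (noisy) signal and the top-down signal, and plot the output as a heat map. The tiny two-layer MLP below uses random weights as stand-ins for the learned parameters; in practice one would load the trained parameters of a single unit's combinator instead.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)    # 2 inputs -> 4 hidden units
    W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)    # 4 hidden -> 1 output

    def combinator(z_tilde, u):
        """Per-unit MLP combinator: combines the lateral (noisy) signal z_tilde
        with the top-down signal u to produce a denoised estimate."""
        inp = np.stack([z_tilde, u], axis=-1)
        h = np.maximum(0.0, inp @ W1 + b1)
        return (h @ W2 + b2)[..., 0]

    # Evaluate the combinator on a grid of its two inputs and plot a heat map.
    z_grid, u_grid = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
    out = combinator(z_grid, u_grid)
    plt.pcolormesh(z_grid, u_grid, out, shading="auto")
    plt.xlabel("lateral input z~")
    plt.ylabel("top-down input u")
    plt.colorbar(label="combinator output")
    plt.savefig("combinator_unit0.png")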