Paper ID: 1054
Title: Deconstructing the Ladder Network Architecture

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper has the commendable objective of teasing apart the components of the successful Ladder network and analyzing which pieces matter in which settings. The analysis is empirical, with experiments performed on permutation-invariant MNIST. Multiple variants of the ladder network are compared: noise is added only at the first layer instead of at all layers, the reconstruction cost is considered only at the first layer, lateral connections are removed, different ways of combining the lateral and top-down inputs are explored, etc. The results are somewhat qualitatively unsurprising (the reconstruction cost is crucial, additive noise in the layers is one of the most important contributors, lateral connections are crucial), but the paper also provides finer quantitative results, e.g. how removing a component hurts performance for 100 labels vs. 1000 labels vs. fully supervised, and the effect of initialization and of the nature of the combinator between the top-down and lateral inputs.

Clarity - Justification:
The paper reads well for the most part. It is unclear what experiment justifies that applying noise to each layer *and especially the first layer* helps generalization: the ablation experiments that add noise to only one layer are limited to the first layer -- from which experiment would we infer the effect of, say, adding noise only to the second layer? It would be better to use a formulation more faithful to the actual experiments, e.g. something like: "applying noise to each layer helps generalization, and the bulk of that improvement can already be obtained by simply adding noise to the first layer."

Significance - Justification:
This is not an especially flashy paper, and it is resolutely empirical. However, I do think it would be valuable to practitioners, because it tries lots of "tweaks" and tests what happens when simple initialization choices are changed (e.g., random initialization instead of 0-1). The fact that an MLP as a combinator yields better performance is also worth noting.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, this is a clear paper that tinkers with a successful architecture to see which pieces matter and which do not. While it isn't groundbreaking, it is interesting to see what the effects of the various parts are.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a study of variations of the Ladder Network architecture. After presenting the ladder network architecture, the paper presents the considered variants of the ladder network and why they were considered. The authors then present the methodology and the results of their experiments, which show that the lateral connections are the most important component of the ladder architecture, followed by the combinator function (and its multiplicative term), the sigmoid, and the noise injection that regularises the model.

Clarity - Justification:
The paper is very clear.

Significance - Justification:
The paper allows the reader to gain a better understanding of the (intricate) Ladder network architecture. Unfortunately, the experiments are limited to MNIST, where the vanilla variant is already very performant, which leaves little room for improvement. The contribution is incremental and limited in scope but serious.
Another, more difficult dataset would have made the arguments more convincing.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Section 5.3 gives perspectives for further research and is not very interesting. It should probably be summarised at the end of the conclusion.

Pros:
- exceptionally clear
- interesting variants give a better understanding of the ladder network's components
- modest improvements using a different combinator function. Since the MLP has two inputs and one output, it would be interesting to plot the learned function.

Cons:
- The paper examines variants of a known method and presents only a few novel avenues for improvement.
- The experiments are limited to MNIST. This leaves little room for improvement, since the vanilla variant is already very good on MNIST, and it makes it difficult to assume that the results would generalise well to other datasets.
- The paper is a bit short.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper undertakes a systematic deconstruction of the ladder network architecture, investigating exactly which components lead to its excellent performance, particularly on semi-supervised tasks. The paper tests 19 variants of the ladder architecture using identical hyperparameter selection methods and tasks. The central findings are that lateral connections are crucial to the ladder network's excellent performance on semi-supervised tasks, while noise injection is useful for supervised tasks with many examples. Several of the model variants outperform the vanilla ladder network model, yielding new state-of-the-art results on MNIST.

Clarity - Justification:
The paper is extremely easy to follow and provides the details necessary for reproducing the results in the supplementary material.

Significance - Justification:
This paper undertakes a commendably rigorous analysis of the central architectural elements of the ladder network and arrives at the convincing conclusion that the ladder architecture confers a decisive advantage on semi-supervised tasks. This may contribute to renewed interest in semi-supervised and other approaches beyond the currently dominant purely supervised learning approaches.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Major comments:
This paper contains a masterfully fair, rigorous, and extensive comparison of the main architectural elements of the ladder network. The quality of the experiments is extremely high, with multiple random restarts and statistics reported, and the best-performing hyperparameters listed in the supplementary materials. The architecture variants have been thoughtfully chosen and usefully organized, though one potentially informative additional variant would be to remove the batch normalization step. The conclusions are useful and, encouragingly, point to a strong role of the lateral connections (the main architectural innovation of the ladder network) in improving semi-supervised classification performance. These results should be interesting to many. It is remarkable that using 600x as many labeled samples only drops the error from 1% to 0.57%. The paper already contains extensive experiments, but it would be very interesting to evaluate even more extreme versions of the semi-supervised setting -- e.g., with 1 or 10 labeled samples -- to see when the performance begins to degrade substantially.
These conclusions may contribute to renewed interest in semi-supervised and other approaches beyond the currently dominant purely supervised learning approaches.

Minor comments:
Ln 295: "networks" -> "network"
Ln 634: "small" -> "small"

=====
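[Editor's note] Several reviews discuss the two-input combinator (its sigmoid and multiplicative term), and Review #2 suggests plotting the learned MLP combinator, which maps two inputs to one output. Below is a minimal, illustrative sketch of how such a plot could be produced. It assumes the per-unit "vanilla" combinator form described in Rasmus et al. (2015); the parameter values are hypothetical placeholders, not values learned by the paper's models, and the code is not taken from the paper's implementation.

```python
# Sketch: evaluate a two-input combinator on a grid and plot it.
# The same grid-and-contour approach could visualize a trained MLP combinator.
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def vanilla_combinator(z_tilde, u, a):
    # Per-unit combinator z_hat = (z_tilde - mu(u)) * v(u) + mu(u),
    # where mu and v each mix a sigmoid and a linear term in the
    # top-down signal u (form as in Rasmus et al., 2015).
    mu = a[0] * sigmoid(a[1] * u + a[2]) + a[3] * u + a[4]
    v  = a[5] * sigmoid(a[6] * u + a[7]) + a[8] * u + a[9]
    return (z_tilde - mu) * v + mu

# Hypothetical parameter values, for illustration only.
a = np.array([1.0, 1.0, 0.0, 0.5, 0.0, 1.0, 1.0, 0.0, 0.5, 0.5])

# Grid over the two inputs: lateral (noisy) activation z_tilde and
# top-down signal u.
z_grid, u_grid = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
z_hat = vanilla_combinator(z_grid, u_grid, a)

plt.contourf(z_grid, u_grid, z_hat, levels=30)
plt.xlabel("lateral input z_tilde")
plt.ylabel("top-down input u")
plt.colorbar(label="z_hat")
plt.title("Combinator output over its two inputs (illustrative parameters)")
plt.show()
```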