Paper ID: 303
Title: Augmenting Supervised Neural Networks with Unsupervised Objectives for Large-scale Image Classification

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper describes augmenting convolutional networks trained for image classification with reconstructive pathways that help regularize the network. The paper reviews the literature and evaluates different architectures for the decoding pathways.

Clarity - Justification:
The paper is clear, but it spends comparatively little time describing the experiments considering the long introductory sections.

Significance - Justification:
The paper could be important, as it may revive a direction of research that had recently lost steam. The authors show for the first time that semi-supervised learning may help improve performance on a large-scale image classification dataset.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper spends a significant portion of time discussing the image reconstructions from the decoding pathways. However, it is not clear what important conclusion we are to take from this. It shows that the reconstructions are successful, but one figure would have been sufficient to show this instead of two (Figure 4).

The main claim of the paper is that the decoding pathways can help improve performance on a large-scale dataset. This claim is weakened by the fact that the weakest VGGNet is used as the baseline for this work. The authors choose to experiment with the VGGNet with around 10% top-5 error, but it is never explained why the better VGGNet with 8.0% top-5 error (proposed in the same paper, http://arxiv.org/pdf/1409.1556.pdf) is not used. This is important because the decoding pathways only improve performance to 8.13%. Nonetheless, it is still impressive that the decoding pathways are able to improve the performance that much.

What is missing from the paper is more information about the experiments on classification using the SWWAE. How fast is training for this model? A learning curve of error vs. time would be interesting. Can we say anything interesting about how the decoding pathways reduce the error? How do the features learned by the model change after being adapted with the SWWAE? Do the filters become sharper, for instance?
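For concreteness, the kind of augmentation being evaluated can be sketched in a few lines, assuming a PyTorch-style API; the model below is a toy one-macro-layer stand-in rather than the authors' VGG-based implementation, and the single reconstruction weight lam only stands in for the per-layer weights used in the paper:

    import torch.nn as nn
    import torch.nn.functional as F

    class AugmentedNet(nn.Module):
        """Toy classifier with a decoding pathway added for reconstruction."""
        def __init__(self, num_classes=1000):
            super().__init__()
            # Encoder (classification pathway).
            self.conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
            self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)  # remember the "where"
            self.fc = nn.Linear(64, num_classes)
            # Decoder (reconstruction pathway).
            self.unpool = nn.MaxUnpool2d(2, stride=2)
            self.deconv = nn.ConvTranspose2d(64, 3, kernel_size=3, padding=1)

        def forward(self, x):
            h = F.relu(self.conv(x))
            p, switches = self.pool(h)
            logits = self.fc(p.mean(dim=(2, 3)))           # global average pool, then classify
            recon = self.deconv(self.unpool(p, switches))  # decode back to image space
            return logits, recon

    def combined_loss(logits, recon, x, labels, lam=1.0):
        # Supervised cross-entropy plus a weighted reconstruction term.
        return F.cross_entropy(logits, labels) + lam * F.mse_loss(recon, x)

Training then minimizes combined_loss instead of the cross-entropy alone; at test time only the logits are needed.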
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes to use auto-encoder modules and reconstruction losses to regularize supervised deep neural networks. The idea is not new, but the authors show a few good empirical results in reconstructing the input and in image classification when the proposed regularization is applied to a well-trained supervised 16-layer VGG network.

Clarity - Justification:
The paper is well written.

Significance - Justification:
Reconstructing the input from the learned representations of a deep convolutional network was explored in (Zeiler and Fergus, ECCV'14; Dosovitskiy & Brox, 2015). The authors did this with SWWAE networks on larger images and got better reconstruction results. Using an auto-encoder reconstruction loss to regularize supervised learning and to do semi-supervised learning was explored in (Zhao et al., 2016). The authors did this with the 16-layer VGG network and got better single-model results on ImageNet classification for a particular training and testing setting. Even though the results are only for a single model under a particular crop/resizing setting and do not report advances over the current state of the art, this is, as far as I know, the first work to improve the performance of a well-trained supervised deep convolutional network trained on ImageNet classification data, such as the 16-layer VGG model, which justifies its significance.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The authors compared three different ways to use an auto-encoding reconstruction loss for regularizing supervised training. However, in the experiments on ImageNet classification, the results did not give a clear picture of when to use which approach for reconstruction-based regularization, and the authors did not provide any insights or suggestions on this.

In the appendix, the loss weights and learning-rate parameters for the different layers are listed, but for completeness it would be better to state clearly how these parameters were chosen. It also looks a bit odd that the loss weight for the first macro layer is much larger than for the other layers. It would be interesting to see whether there are any patterns in how changes to these parameters affect classification performance.

For an empirical paper, the exploration of different variants of the idea and different experimental setups is quite limited. The paper only includes results for one particular model under one particular setting and does not improve the current state of the art. It would be a much stronger submission if more results were available for more variants or more models. However, I do think the improvement over the 16-layer VGG network is significant, and I lean slightly toward accepting this paper.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper introduces a simple yet effective way of combining a supervised classification loss with an unsupervised reconstruction cost in order to improve object classification accuracy. It also demonstrates that by remembering which units were selected during the feed-forward max-pooling stage, reconstruction is significantly more accurate than when fixed unpooling switches are used.

Clarity - Justification:
This paper is well written and is clear about its motivations and methodology.

Significance - Justification:
This paper is more of an incremental advance. The idea of combining unsupervised and supervised learning objectives has been around for a long time.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Although the proposed methodology is not very novel, this paper demonstrates two things:
1. Much better reconstruction when the pooling switches are remembered; this is somewhat obvious, but the difference is striking (see the sketch after the reviews).
2. Superior classification accuracy after fine-tuning with the combined objectives.

This paper also investigates different ways to enforce a reconstruction loss at every intermediate layer. How would this method compare against the Ladder network?

I think accepting the paper could be useful in demonstrating how to incorporate more unlabeled data to boost the performance of supervised training of convolutional nets.

=====
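For reference, the pooling-"switch" mechanism that Review #3 contrasts with fixed unpooling can be illustrated with PyTorch's MaxPool2d/MaxUnpool2d; the tensor sizes below are arbitrary and purely illustrative:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 1, 4, 4)

    pool = nn.MaxPool2d(2, stride=2, return_indices=True)
    unpool = nn.MaxUnpool2d(2, stride=2)

    pooled, switches = pool(x)                 # "what" (max values) and "where" (switches)
    recon_switched = unpool(pooled, switches)  # each value returns to its original position

    # Fixed (non-remembered) unpooling: every value goes to the top-left of its window.
    recon_fixed = torch.zeros_like(x)
    recon_fixed[..., ::2, ::2] = pooled

    print(recon_switched)
    print(recon_fixed)

recon_switched places each retained maximum back at the location recorded during pooling, whereas recon_fixed always uses the same corner of each window, which is consistent with the much sharper reconstructions reported for remembered switches.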