Paper ID: 1229
Title: ADIOS: Architectures Deep In Output Space

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper tackles multilabel classification via deep architectures. The main contribution is that the authors propose using another deep architecture to model relationships across output labels. This is further based on the idea of Markov blanket chains.

Clarity - Justification:
The paper is very well written and provides good insight into the problem as well as the proposed solution. The experiments are exhaustive and clearly show the benefit of the approach.

Significance - Justification:
The insight behind the Markov blanket chain is perhaps the main contribution. In fact, the same insight could be used in other ML architectures as well (not just deep ones). The ideas are clearly articulated, and as a reader I enjoyed learning about them.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, a very well written paper that clearly articulates the problem and provides a reasonable solution.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
Dealing with a large label space is one of the important research issues in multi-label classification (MLC). In this paper, a new MLC approach is proposed by exploiting the dependency structures among labels. Specifically, the label space is partitioned into two disjoint subsets where one subset is the Markov blanket of the other. The corresponding dependency structure is implemented with a multi-layer perceptron with ReLU hidden units, trained with stochastic gradient descent. Experiments on five datasets with large label spaces are conducted to show the effectiveness of the proposed approach.
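For concreteness, the partition described in this summary can be read as the factorization $p(y \mid x) = p(y_{G_1} \mid x)\, p(y_{G_2} \mid y_{G_1}, x)$, where $G_1$ and $G_2$ are the two disjoint label groups and $G_1$ serves as the Markov blanket of $G_2$ (illustrative notation; the paper's exact formulation may differ): one output layer predicts $y_{G_1}$ from the input, and a second predicts $y_{G_2}$ from the input together with the predictions for $y_{G_1}$.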
Clarity - Justification:
The whole paper is well written and easy to follow. Technical details as well as experimental studies have been clearly presented.

Significance - Justification:
Designing multi-label learning techniques which can effectively handle a large number of labels in the output space has been found useful in many real-world applications. The idea of exploiting label dependency by identifying Markov blanket relationships among partitions of the label space is interesting. Performance of the proposed approach is compared against several state-of-the-art MLC methods dealing with large label spaces.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
1. Several related works on MLC with neural networks (deep learning) or large label spaces haven't been properly discussed in this paper:
a) Read J, Perez-Cruz F. Deep learning for multi-label classification. arXiv:1502.05988, 2014.
b) Zhang M-L, Zhou Z-H. Multi-label neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 2006, 18(10): 1338-1351.
c) Li X, Zhao F, Guo Y. Conditional restricted Boltzmann machines for multi-label learning with incomplete labels. In: Proceedings of AISTATS'15, 2015, 635-643.
d) Charte F, Rivera A J, del Jesus M J, Herrera F. LI-MLC: A label inference methodology for addressing high dimensionality in the label space for multilabel classification. IEEE Transactions on Neural Networks and Learning Systems, 2014, 25(10): 1842-1854.
2. The proposed approach has a similar working mechanism to MLC methods based on stacking techniques, i.e., making predictions by taking the outputs for other labels as inputs. More information on stacking-based MLC can be found in the following literature and the references therein: Montanes E, Senge R, Barranquero J, Ramon Quevedo J, Jose del Coz J, Hullermeier E. Dependent binary relevance models for multi-label classification. Pattern Recognition, 2014, 47(3): 1494–1508.
3. Section 2, 1st paragraph: the actual meaning of "trainable end-to-end" should be further explained. Furthermore, in Algorithm 1, the condition "$|G_1| < K$" should be "$|G_1| > K$".
4. It is nice that the proposed approach has been compared against a number of state-of-the-art approaches which work for MLC with a large label space. Nonetheless, it would be better if datasets with a larger number of labels (e.g. >10K labels, as those used in Prabhu & Varma 2014) could be employed for the experimental studies.
5. As shown in Algorithm 1, the proposed approach has two important parameters which need to be set, i.e. the partition size K and the approximation parameter k. However, the concrete setting of these two parameters is not given in Section 5. In addition, it is desirable to show whether the performance of ADIOS is sensitive to the configuration of these parameters.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a method for exploiting label dependence based on the assumption that the entire label set can be split into two disjoint groups, where one group helps a model predict labels in the other. The disjoint label sets are constructed by making use of the submodularity of information gain rather than relying on external knowledge. The authors extend a neural network architecture by introducing two output layers, where computing the predictive scores of labels in one output layer requires those of labels in the other subset. Both output layers are jointly considered during model training. The authors have carried out experiments on five multi-label datasets across multiple domains. The proposed model achieves good performance on rare labels as well as frequent labels.
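As an illustration of the two-output-layer wiring this summary describes, below is a minimal sketch of one way such an architecture could be put together. It is not the authors' implementation; the framework (PyTorch), the single shared ReLU hidden layer, the sigmoid outputs, and all layer sizes are assumptions made only for illustration.

# Minimal sketch (illustrative only, not the authors' code) of a two-output-head MLP
# where predictions for label group G1 are fed, together with the hidden
# representation, into the output layer for label group G2.
import torch
import torch.nn as nn

class TwoHeadMLP(nn.Module):
    def __init__(self, n_features, n_hidden, n_g1, n_g2):
        super().__init__()
        self.hidden = nn.Linear(n_features, n_hidden)    # shared ReLU hidden layer
        self.out_g1 = nn.Linear(n_hidden, n_g1)          # scores for label group G1
        # G2 scores are conditioned on the hidden units and the G1 predictions
        self.out_g2 = nn.Linear(n_hidden + n_g1, n_g2)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        y_g1 = torch.sigmoid(self.out_g1(h))                              # ~ P(y_G1 | x)
        y_g2 = torch.sigmoid(self.out_g2(torch.cat([h, y_g1], dim=1)))    # ~ P(y_G2 | y_G1, x)
        return y_g1, y_g2

# Joint training: both heads contribute to a single loss (sizes here are arbitrary).
model = TwoHeadMLP(n_features=500, n_hidden=256, n_g1=40, n_g2=60)
x = torch.randn(8, 500)                          # dummy mini-batch of 8 examples
y1 = torch.randint(0, 2, (8, 40)).float()        # binary targets for G1
y2 = torch.randint(0, 2, (8, 60)).float()        # binary targets for G2
p1, p2 = model(x)
loss = nn.functional.binary_cross_entropy(p1, y1) + nn.functional.binary_cross_entropy(p2, y2)
loss.backward()

The key design choice this sketch highlights is that the second head sees the first head's predictions as extra inputs, which is how the dependency between the two label groups is modeled during joint training.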
Clarity - Justification:
This paper is written very clearly, although it requires background knowledge of submodularity to see why the authors' claims hold and how to extend this work further towards using more than two subsets of labels. The experiments demonstrate that the proposed approach is effective at learning from label dependence.

Significance - Justification:
The authors combine neural networks with useful findings from a well-studied area, submodular functions, to address the problem of modeling label dependence in multi-label classification. This is quite an interesting approach which, to the best of my knowledge, has not yet been studied in multi-label classification. A weakness of this paper is the lack of new technical contributions that could draw attention from other research areas.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Performance evaluation in terms of label size would be helpful to convince those who wonder how effective the proposed method is on rare labels. It might be better to have a few interesting examples from the learned models, which can show why some abstract but rare labels can be predicted.

Is there any explanation for why ADIOS_RND performs better than MLP? What if we compare MLP and ADIOS_RND while keeping the number of parameters the same for both? I'm wondering whether the performance improvements of the proposed method, including ADIOS_RND, are partially attributable to the enhanced expressive power from more learnable parameters with regularization.

In the manuscript, ADIOS with only two output layers is demonstrated. Would it be possible to extend it to more general cases where the model benefits from having multiple output layers? If it is possible, does the complexity of the MBC construction grow linearly or exponentially?

If I understand correctly, a small number of correlated labels are used to reduce the complexity of the MBC construction. However, I cannot find how many such labels are used in the experiments. Is the proposed method sensitive to the choice of k?

I'm not convinced that ADIOS (the multi-layer approach in the output space) and recurrent neural networks are similar. ADIOS can be seen as a special case of recurrent neural networks in the sense that the objective of learning is to learn a joint distribution of random variables. However, the architectures used to achieve that goal are quite different. For example, one of the reasons why recurrent neural networks work well at modeling a joint distribution by factorizing it into a product of conditional probabilities is weight sharing while processing inputs over time. There is no weight sharing, or anything similar, in the proposed method.

Lines 123~125 contain incorrect information about the neural machine translation architecture, where a variant of importance sampling is used to overcome the large vocabulary problem.

In Algorithm 1, please provide more information on what ``update C_i`` means or how this can be done.

Minor comments:
Line 286: G_1 -> G_2
Line 316: \ell^{ast} -> \ell
Two different versions of the dropout paper are cited at lines 94 and 454.

=====