Paper ID: 755
Title: From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a variation on the softmax that can enforce sparsity in the returned posterior probability distribution over classes. The paper explains how this loss can be used to replace a softmax in various applications such as multi-label classification or attention mechanisms for RNNs.

Clarity - Justification:
The paper is very well written and very interesting to read. I found the mathematics quite elegant and all the derivations that follow from the sparsemax very nice. The connection with the Huber loss is nice too.

Significance - Justification:
This is nice work, and the direction is interesting and promising. However, beyond being a nice tool, I think that to be widely used and have a major impact, sparsemax should demonstrate a practical advantage over softmax in at least one of (1) speed/scalability, (2) accuracy, or (3) interpretability. The paper gives glimpses of all three aspects without being convincing on any of them, unfortunately (see 6.). Without that, I am afraid the impact might be limited.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
As I said in 5., I think the paper could be much more convincing if it were stronger in any of the following three aspects when comparing softmax and sparsemax.

* Speed/scalability: the paper states that sparsemax "can lead to faster backpropagation" and "appears better suited for problems with larger numbers of labels," but this is never demonstrated or quantified empirically. Results showing that sparsemax scales better than softmax as the number of classes increases would be important.

* Accuracy: from the experiments, it is not completely clear that sparsemax leads to better results than softmax (apart from the synthetic data). Results on multi-label classification are unclear. On SNLI, sparsemax is better than softmax on the test set but not on the dev set. On this dataset, (Rocktaschel et al., 2015) actually report results with soft attention higher than those of the soft attention presented here and also higher than the sparse attention (dev: 83.2, test: 82.3), and they further improve upon them using word-level attention (dev: 83.7, test: 83.5).

* Interpretability: one interest of sparsity is that it can lead to better interpretation of predictions, but this is not shown very well in the paper. Table 4 illustrates it, but it should also show what attention is generated by the softmax (for instance, by taking all words above a certain threshold). How do the two compare? In general, what degree of sparsity does sparsemax achieve when predicting?

Other comment: the use of the constant t in Section 4.2 looks tricky and somewhat ad hoc. How is t set?

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper introduces a transformation the authors call sparsemax, together with an accompanying loss function, that is suitable for the top layer of a gradient-based classifier. These are similar to the softmax and the log loss, respectively, but are designed to produce sparse probability distributions. The sparsemax function has several desirable and sensible properties, and the paper includes experimental results demonstrating that it achieves its intended purpose. The paper presents experimental results on sparse attention in neural nets and on multi-label classification.

Clarity - Justification:
The paper is quite clear.
It is well organized and I did not find myself struggling with the prose. The mathematical notation seems consistent and not overly cumbersome. Enough detail is included to make the contribution and experiments clear.

Significance - Justification:
What is in effect a multi-class Huber loss fills an important hole in the literature and could be useful for sophisticated neural net architectures that use external memory and attention. The sparsemax function and corresponding loss have great promise to be useful in a variety of applications as well. This paper could have been written years ago and is useful enough that it should have been, so it is a strong contribution.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is ready for publication as is.

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper introduces a new activation function, dubbed sparsemax, which can produce sparse probability distributions over a set of K discrete outcomes (e.g., classes). The paper shows how to compute this function and its Jacobian, and provides a comprehensive mathematical analysis of their properties. It also introduces a loss function based on the sparsemax function, which can be used to learn sparse posterior distributions. Finally, the paper presents experimental validation on several multi-label classification benchmarks and applies the sparsemax function to obtain a sparse neural attention mechanism that provides more interpretable results.

Clarity - Justification:
The paper is well written. The contributions are clearly outlined in the context of previous work, and the mathematical derivations are clearly presented.

Significance - Justification:
The problem of producing sparse posterior distributions is very important, especially, as the authors mention, for filtering out a large set of outputs. Both the proposed activation and loss functions are novel, and they have a high potential for being applied to a wide range of applications.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I have several comments/questions:

I would like to get a sense of how sparse the produced distributions are when using the sparsemax function. I would encourage the authors to address that point. In addition, can we control the sparsity?

The authors should elaborate more on the computational aspect of the sparsemax function. The implementation they provide is O(K log K), which is a disadvantage compared to softmax (a sketch of this sort-based computation is given after the reviews). They mention in a footnote that there exists an O(K) algorithm, and I think they should elaborate more on this. I would recommend that they discuss how, in practice, sparsemax would compare to softmax in terms of computational speed (both in training and at inference).

Finally, a minor typo should be corrected in footnote 3, where the equal sign should be replaced with the set membership symbol.

=====
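For context on the computational points raised in Reviews #1 and #3, the sketch below is not the authors' code; it is a minimal NumPy illustration, assuming the standard definition of sparsemax as the Euclidean projection of the score vector onto the probability simplex, of (a) the O(K log K) sort-based evaluation the reviewer refers to and (b) a Jacobian-vector product that only touches the nonzero output coordinates, which is presumably what the "faster backpropagation" claim quoted in Review #1 refers to. The function names (sparsemax, sparsemax_jvp) are illustrative, not the paper's API.

import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of the score vector z onto the probability simplex."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]              # O(K log K): sort scores in decreasing order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum      # condition defining the support size k(z)
    k_z = k[support][-1]                     # number of nonzero output coordinates
    tau = (cumsum[k_z - 1] - 1.0) / k_z      # threshold tau(z)
    return np.maximum(z - tau, 0.0)

def sparsemax_jvp(p, v):
    """Jacobian-vector product of sparsemax at an output p = sparsemax(z).

    The Jacobian is Diag(s) - s s^T / |S|, where s is the 0/1 indicator of the
    support S of p, so the product only involves the nonzero coordinates of p."""
    p = np.asarray(p, dtype=float)
    v = np.asarray(v, dtype=float)
    s = p > 0.0
    out = np.zeros_like(p)
    out[s] = v[s] - np.mean(v[s])            # subtract the mean of v over the support
    return out

p = sparsemax([0.1, 1.1, 0.2, 1.3])
print(p)                                     # [0.  0.4 0.  0.6] -- two coordinates are exactly zero

The JVP form suggests why backpropagation can be cheap when the output is sparse: its cost scales with the size of the support rather than with K. How large that support actually is on real data, and how the end-to-end speed compares with softmax, is exactly what Reviews #1 and #3 ask the authors to quantify.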