We thank all reviewers for their comments.

R1 and R2 asked about the degree of sparsity achieved by sparsemax when predicting. We agree with both reviewers that this is an interesting quantity to report, and we already have these numbers: on the SNLI dev set, the fraction of premise words selected by sparsemax is 24.6%. This will be reported in the final version.

R1 asked if we can control the sparsity. The answer is yes: we can do so by multiplying the argument of sparsemax by a positive constant t, which can be interpreted as an inverse temperature. The larger this constant, the sparser the result (cf. the limit case in Prop. 2, item 1); a small code sketch at the end of this response illustrates this. In the multi-label classification experiments described in 4.2, we actually used this constant to control the sparsity -- we will make this connection clearer in the final version. Regarding R2's question: for those experiments we tuned t the same way as we did for the probability thresholds in the softmax and logistic baselines, with cross-validation on held-out data (see lines 678-679).

Following R1 and R2's recommendation, we will also provide more details about the computational aspects of the sparsemax function. The short answer is: (1) at training time, sparsemax can backpropagate gradients faster due to the sparsity (cf. last paragraph of 2.5; see also the backward-pass sketch at the end of this response); (2) at inference time, the softmax forward pass is faster, but asymptotically both are linear time. The projection onto the simplex (required by sparsemax) is well studied in the literature. The O(K) algorithm mentioned in footnote 1 relies on partitioning and median-finding, but in practice other algorithms with observed linear time are often a better choice. A full description of these algorithms is out of scope, but see, e.g., Table 1 of the following reference, where an empirical comparison is also provided:

Condat, Laurent. "Fast projection onto the simplex and the ℓ1 ball." (2014). http://hal.univ-grenoble-alpes.fr/hal-01056171v2

In our experiments, even with an O(K log K) implementation, the runtimes achieved with softmax and sparsemax were similar.
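
To make the temperature and runtime points concrete, below is a minimal sketch (ours, for illustration only; it assumes NumPy, and the function name and parameter t are not from the paper) of sparsemax computed via the O(K log K) sort-based projection onto the simplex, with the inverse temperature exposed as a parameter:

```python
import numpy as np

def sparsemax(z, t=1.0):
    """Sparsemax of z with inverse temperature t, computed via the
    O(K log K) sort-based projection onto the probability simplex."""
    z = t * np.asarray(z, dtype=float)   # larger t -> sparser output
    z_sorted = np.sort(z)[::-1]          # sort in decreasing order
    cssv = np.cumsum(z_sorted)           # cumulative sums of sorted scores
    k = np.arange(1, len(z) + 1)
    # support condition: 1 + k * z_(k) > sum of the k largest scores
    support = 1 + k * z_sorted > cssv
    k_z = k[support][-1]                 # size of the support
    tau = (cssv[support][-1] - 1) / k_z  # threshold
    return np.maximum(z - tau, 0.0)
```

For example, with z = [1.0, 0.9, 0.1], t = 1 gives two nonzero entries ([0.55, 0.45, 0.0]), while t = 10 already selects only the top coordinate ([1.0, 0.0, 0.0]), matching the limit behavior in Prop. 2, item 1.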
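
For the training-time point, here is a similarly hedged sketch of the backward pass, assuming the standard Jacobian of the Euclidean projection onto the simplex (diagonal on the support, minus a rank-one correction): only coordinates in the support receive a nonzero gradient, which is why backpropagation gets cheaper as the output gets sparser.

```python
import numpy as np

def sparsemax_grad(p, dout, t=1.0):
    """Vector-Jacobian product of sparsemax at output p for an incoming
    gradient dout; the cost scales with the size of the support."""
    support = p > 0.0
    # mean of the incoming gradient over the support
    v_hat = dout[support].sum() / support.sum()
    # coordinates outside the support get zero gradient; the factor t
    # accounts for the chain rule through the inverse-temperature scaling
    return np.where(support, t * (dout - v_hat), 0.0)
```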