Paper ID: 1375
Title: PD-Sparse: A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper motivates a Crammer-Singer style loss function for performing multi-class and multi-label classification. The final algorithm combines an elastic-net style regularization with the Crammer-Singer loss, which is optimized in the dual. A block conditional gradient approach is proposed to solve the dual problem; the dual variables corresponding to each label constitute a block. The solver exploits the presence of primal and dual sparsity to speed up gradient computations by maintaining both primal and dual iterates. An active set of labels is maintained for each data point, and dual updates are made only to the dual variables corresponding to these labels. Individual updates are found by solving a quadratic subproblem involving a cheap upper bound on the Hessian; this subproblem can be solved using a single sorting operation followed by a sweeping step. On multiclass datasets, the method performs comparably to other algorithms such as the multi-class SVM with smaller training times. On multilabel datasets, the performance is better in terms of accuracy as well as training and prediction time.

Clarity - Justification:

Please see the comments. The paper makes several minute optimizations to speed up the solver.

Significance - Justification:

The paper presents a solver for elastic-net regularized optimization of the Crammer-Singer loss. The main contribution of the paper is in the minute changes it makes to the overall block conditional gradient method to speed it up for extreme classification problems.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

- At several places, the paper makes the mistake of putting \alpha_k <= 0 as the constraint. This should be corrected to \alpha_k >= 0.

- There is recent work that achieves better accuracies than FastXML on several of the datasets used in the paper (see the reference below). This method also has a much smaller memory footprint than FastXML. Bhatia et al., Sparse Local Embeddings for Extreme Multi-label Classification, NIPS 2015.

- How was accuracy measured for the multi-label datasets?

- What is LMO (line 502)?

- Please give a more detailed proof of Theorem 1 to make the paper self-contained. The paper makes several approximations (the approximation of the Hessian as well as the choice of the descent direction), and it needs to be clarified which of these approximations can be tolerated by the convergence bound.

- Theorem 1 gives a very weak result. Since Q ~ O(N), and the denominator of the expression in Theorem 1 behaves as t/N for large t, we would need t ~ N^2 iterations before convergence even begins. Please comment.
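To spell out this arithmetic (I am paraphrasing the apparent form of the bound rather than quoting Theorem 1 exactly): if the bound reads

    \epsilon_t \le \frac{c\,Q}{1 + t/N}, \qquad Q = O(N),

then for t \gg N the right-hand side behaves as O(N^2/t), so guaranteeing \epsilon_t \le \epsilon requires t = \Omega(N^2/\epsilon) iterations, i.e., a number of iterations quadratic in N.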
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper aims to reduce the computational cost of extreme multiclass/multilabel problems by using a block-coordinate Frank-Wolfe algorithm. The algorithm solves an l1-regularized multiclass SVM problem so that intermediate variables stay sparse, thus reducing computation. Experiments on large real-world datasets demonstrate the advantage of the proposed algorithm over its competitors.

Clarity - Justification:

The paper is difficult to read, mainly due to inconsistent notation and unclear writing: many symbols are used without definition, and the notation is not consistent. For example, the variables alpha_i and alpha_k seem to mean totally different things in Section 2. As for the writing, the paper lacks introduction and motivation when describing the details of the algorithm. For example, in lines 411-432, why suddenly talk about the Hessian and a quadratic approximation when you are describing a Frank-Wolfe algorithm? It would also help to explain why such a projection still keeps the update sparse.

Significance - Justification:

The algorithm solves an interesting problem and indeed achieves nice computational savings while preserving accuracy. The technical novelty is incremental, as the method amounts to applying block-coordinate Frank-Wolfe to the l1-regularized multiclass SVM problem.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

Applying Frank-Wolfe to large-scale multiclass/multilabel problems is certainly interesting, as it produces sparse solutions with sparse intermediate iterates. The experimental results also look promising, and the method may be of interest to practitioners. The technical contribution, however, seems limited: it applies block-coordinate Frank-Wolfe to an l1-regularized multiclass SVM problem in such a way that the primal variables remain sparse as well.

Some comments about the analysis. In Section 2, the paper analyzes the optimality conditions and argues that the optimal solutions should be very sparse. However, the iterates do not necessarily inherit this property. For example, the primal and dual variables W_t and A_t at the t-th iteration do not necessarily satisfy the relationship dictated by Corollary 1, i.e., nnz(W_t) <= nnz(A_t). A closer look at the algorithm reveals that the sparsity of W_t depends strongly on the l1 regularization parameter lambda, so a large enough lambda is required to maintain the sparsity of W (see the sketch at the end of this review).

A minor issue: the active sets seem redundant. Frank-Wolfe already gives you sparse updates, and you do not need to solve the subproblem restricted to the active set.
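To make the dependence on lambda concrete, here is a schematic and purely illustrative sketch of the primal recovery I have in mind, assuming the elastic-net primal-dual relationship takes the usual soft-thresholding form (this is my reading of Section 2; the function and variable names are mine, not the paper's):

import numpy as np

def soft_threshold(v, lam):
    # Elementwise soft-thresholding: the proximal operator of lam * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def recover_primal(X, A, lam):
    # Schematic primal recovery from the dual under elastic-net regularization.
    # X: (n, d) feature matrix, A: (n, K) dual variables, lam: l1 parameter.
    V = X.T @ A
    # Only coordinates of V whose magnitude exceeds lam survive, so nnz(W)
    # shrinks as lam grows; with a small lam, W_t can be far denser than the
    # optimality conditions alone would suggest.
    return soft_threshold(V, lam)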
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper proposes an algorithm for fitting sparse linear models for extreme classification problems which exploits sparsity both in the primal and in the dual. The algorithm relies on the fact that the positive classes are a small fraction of the available classes, so with a smart choice of loss and a good Frank-Wolfe implementation it is possible to fit sparse linear models in time competitive with, or better than, "sublinear"-time approaches. The experimental results are very convincing, covering many large multiclass and multilabel datasets on which the method and several well-recognized baselines are compared in terms of training time, prediction time, test accuracy, and model size. The proposed method is always the best performing in at least one of these metrics.

Clarity - Justification:

The paper is well written and easy to follow (except for Section 4.1).

Significance - Justification:

The paper exploits sparsity for both statistical and computational savings, with great empirical results on several extreme classification datasets.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

One of the three citations in line 086 has not been generated correctly.

Please rewrite Section 4.1; it is really hard to understand what is going on. Even better would be to release software and have Section 4.1 refer to particular files/functions/lines in the code.

In Table 3, consider including metrics such as F1 and Precision@k, which are usually reported in the experimental sections of other extreme classification papers.

In Table 3, FastXML has better training time for LSHTC wiki and better prediction time for EUR-Lex, so make these entries bold and make the corresponding entries for PD-Sparse not bold.

Line 077: learnec -> learned
Line 088: label effective -> labels effectively
Line 142: I think the indices p and j should be the same symbol?
Line 148: Loss of Dual Sparsity -> Loss with Dual Sparsity (when I first read it, I thought it meant losing dual sparsity).
Line 438: w_k^t and v_k^t both carry a t, but I think the two t's refer to different things; the second t is, I believe, the minimizer from Algorithm 2. Please clarify.
Line 481: stablizes -> stabilizes
Lines 501 and 502: expand acronyms (RHS, LMO) that have not been defined.
Line 544: Wouldn't it be better to use importance sampling? E.g., let q_j = x_j^2/||x||^2, sample i from q, and return w_{ik}/q_i (a rough sketch follows this list).
Line 549: shrinked -> shrunk
Line 596: Practical Issue -> Practical Issues
Line 637: Experiment -> Experiments
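To make the Line 544 suggestion concrete, here is a rough, purely illustrative sketch (the names are mine, and I am assuming the quantity being estimated is the inner product <w_k, x>, hence the extra factor of x_i in the estimator):

import numpy as np

def sampled_score(w_k, x, num_samples=10, seed=0):
    # Unbiased importance-sampling estimate of <w_k, x> (illustrative only).
    rng = np.random.default_rng(seed)
    q = x ** 2 / np.dot(x, x)                    # q_j = x_j^2 / ||x||^2
    idx = rng.choice(len(x), size=num_samples, p=q)
    # E[x_i * w_ik / q_i] = sum_i x_i * w_ik = <w_k, x>, so the average is
    # unbiased; sampling coordinates with large |x_j| more often can reduce
    # the variance relative to uniform sampling.
    return np.mean(x[idx] * w_k[idx] / q[idx])

=====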