Paper ID: 805
Title: Dictionary Learning for Massive Matrix Factorization

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
By combining the ideas of online dictionary learning [Mairal et al., 2010] and random subsampling [Szabó et al., 2011], the authors propose a dictionary learning algorithm that can efficiently handle high-dimensional (p is large), massive (n is even larger) data. Experiments on a 200,000-by-480,000 fMRI matrix are used to evaluate the efficiency of the proposed algorithm.

Clarity - Justification:
In general, the paper reads okay to me. Unfortunately, there seems to be a system error in the uploaded PDF file: the document is somehow corrupted, and I cannot see the contents beyond page 6.

Significance - Justification:
It is certainly significant to develop a dictionary learning algorithm that scales in both the data dimension (p) and the dataset size (n). However, the presented techniques are simply adapted from online dictionary learning [Mairal et al., 2010] and random subsampling [Szabó et al., 2011], without much adjustment. Moreover, no theoretical analysis is provided. I feel that the novelty and significance are below the threshold of this conference.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
1. The document is corrupted from page 6 onward. Please check that.
2. It would be much better if some theoretical analysis of the proposed algorithm could be provided. As an online algorithm, its convergence property is a particularly pressing issue.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors present a method for matrix factorization that scales well, allows for regularization in both dimensions of a matrix, can incorporate both L_1 and L_2 regularization, and naturally accommodates missing data. The work can be viewed as extending the streaming matrix factorization algorithm of Mairal et al., 2010 by allowing for partial observations of the columns of a matrix at each time point. The authors present results demonstrating that the method achieves results comparable to an existing online dictionary learning algorithm (that of Mairal et al.) while achieving up to an 8x speedup in runtime on a real neuroscience (fMRI) dataset. The authors also present results on three collaborative filtering datasets and compare against an existing method (spira) for collaborative filtering, showing that for moderate to large datasets they achieve similar prediction error with significant runtime speedups.

Clarity - Justification:
The paper is extremely clear and easy to follow. The authors do a good job of positioning their work in the context of what has been done before. The methods are crisp, and the results are presented in a manner that allows for a clear comparison of the presented algorithm to existing methods.

Significance - Justification:
The authors acknowledge that they extend the work of Mairal et al. The proposed extensions, while natural, provide substantial speedups in the demonstrated scenarios and are therefore impactful. The paper would be strengthened by the inclusion of theory relating the ratio r, which governs the amount of data sampled at each iteration, to the performance of the proposed method relative to a method that uses full columns.
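To make the partial-observation extension concrete, here is a minimal sketch of a single streaming step as I understand it (my own NumPy reconstruction, not the authors' code: the actual method uses an l1-regularized code and the block-coordinate dictionary update of Mairal et al., 2010, whereas this sketch uses a ridge code and a plain gradient step for brevity). The point is that only the observed rows of the dictionary are touched, so the per-step cost depends on the number of observed entries rather than on p.

    import numpy as np

    def masked_streaming_step(D, x, mask, lam=0.1, lr=0.01):
        # D: (p, k) current dictionary; x: (p,) column; mask: (p,) boolean,
        # True where the entry of x is observed. All names are mine.
        D_obs = D[mask]          # (q, k), q = number of observed entries
        x_obs = x[mask]          # (q,)

        # Ridge-regularized code computed on the observed entries only.
        k = D.shape[1]
        alpha = np.linalg.solve(D_obs.T @ D_obs + lam * np.eye(k), D_obs.T @ x_obs)

        # Gradient step on the observed rows of the dictionary only.
        residual = D_obs @ alpha - x_obs
        D[mask] -= lr * np.outer(residual, alpha)
        return D, alpha

    # Tiny usage example on synthetic data.
    rng = np.random.default_rng(0)
    p, k = 1000, 10
    D = rng.standard_normal((p, k))
    x = rng.standard_normal(p)
    mask = rng.random(p) < 0.1   # roughly 10% of the entries observed
    D, alpha = masked_streaming_step(D, x, mask)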
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
While the work seems a fairly natural extension of what has come before, the clarity of the presentation and the substantial demonstrated speedups render the paper a valuable contribution. Additional theory would further strengthen the paper, but even without it, I view the paper in its present form as a valuable contribution to the community.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
Given a matrix X (p x n) of n signals in R^p, the goal is to compute a factorization X = DA with structural requirements such as a sparse D or A. This is achieved by minimizing the norm of the error X - DA while penalizing the L1 norm of D and A. The paper mainly focuses on the missing-data setting, where only a few entries of each column of X are observed. The number of factors k is also assumed to be much smaller than p and n. Considering the setting where a few entries of each signal x_i are observed in an online fashion, the paper proposes an alternating-minimization-style heuristic to update both D and A. Further, the updates are modified so that the computation in each iteration depends only on the sparsity of x_i and on k, not on the dimension p. Simulation results for the proposed heuristic are then presented for factorization of fMRI data and for recommendation systems. In these two settings, the proposed heuristic achieves speedups compared to the fully observed data setting and to coordinate descent (Yu et al., 2012), respectively, while maintaining a similar error rate.

Clarity - Justification:
The introduction of the paper mixes various settings such as matrix completion, dictionary learning, and sparse PCA, and is very confusing. The problem considered in this paper is closer in flavor to sparse PCA than to dictionary learning (where the number of factors is larger than the dimension). The rest of the paper, the proposed methods, and the simulation results are easy to follow.

Significance - Justification:
This paper considers the problem of learning sparse factors from data in an online fashion. The loss is minimized with an L1 penalty to encourage sparsity, using a stochastic alternating minimization method. Comparison with many existing sparse factorization methods is missing and is limited to only one alternative method in some experiments.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The standard dictionary learning problem refers to the setting where the number of factors k is larger than p and n, and the factors A are sparse. The problem considered in this paper is not dictionary learning and is more like sparse PCA (sparse eigenvectors of XX^T will give the sparse features of D). The introduction mixes the discussion of the matrix factorization (k << p), dictionary learning (k > p), and sparse PCA settings and is very confusing. Computing sparse factors of data (sparse PCA) is a very well studied problem; however, no comparisons with existing algorithms are made. A detailed comparison with existing algorithms, both a discussion of their merits and a comparison in the simulation results, would help highlight the benefits of the proposed method.
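For reference, the objective described in the summary above — squared reconstruction error on the observed entries plus l1 penalties on the factors — can be sketched as below. This is my reading of the setup, not the authors' exact formulation (the paper may penalize only one of the factors and may include an l2 term as well).

    import numpy as np

    def masked_objective(X, D, A, mask, lam_D=0.0, lam_A=0.1):
        # X: (p, n) data; D: (p, k); A: (k, n); mask: boolean (p, n) array,
        # True where an entry of X is observed. Names are mine.
        R = (X - D @ A) * mask   # residual restricted to observed entries
        return 0.5 * np.sum(R ** 2) + lam_D * np.abs(D).sum() + lam_A * np.abs(A).sum()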
See some references for large-scale sparse factor computation below:
http://papers.nips.cc/paper/5905-approximating-sparse-pca-from-incomplete-data
http://projecteuclid.org/euclid.aos/1368018173
http://arxiv.org/abs/1303.0551

Also, for a comparison with the latest dictionary learning work, see
http://arxiv.org/abs/1503.00778
http://research.microsoft.com/en-us/um/people/alekha/dict-learning-colt.pdf
and the follow-up works.

By now it is well understood that the L1 regularizer promotes sparsity. In Equation 17, the update on the right-hand side should be added and not subtracted, right? (See the additive recursion sketched below.)

=====
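As a reference point for the sign question above: in the online dictionary learning recursion of Mairal et al. (2010), the surrogate statistics are accumulated additively at each step; a subtraction there would discard past samples rather than aggregate them. A minimal sketch in standard notation (not the paper's Equation 17 verbatim):

    import numpy as np

    def accumulate_statistics(A_stat, B_stat, alpha, x):
        # Additive accumulation of the surrogate statistics from
        # Mairal et al. (2010): A_t = A_{t-1} + alpha alpha^T and
        # B_t = B_{t-1} + x alpha^T, where alpha is the code of the
        # current sample x. Variable names are mine.
        A_stat += np.outer(alpha, alpha)   # (k, k)
        B_stat += np.outer(x, alpha)       # (p, k)
        return A_stat, B_stat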