Paper ID: 304
Title: Online Low-Rank Subspace Clustering by Basis Dictionary Pursuit

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper establishes an online algorithm to speed up the computational procedure of LRR (Low-Rank Representation) [Liu et al., ICML 2010], a useful method in subspace clustering and data recovery. The presented techniques are largely adapted from Online Robust PCA (ORPCA) [Feng et al., NIPS 2013]. Due to the specifics of LRR, the method of ORPCA is not directly applicable, and the authors therefore spend considerable effort to build their Online Low-Rank Subspace Clustering (OLRSC). Both theoretical analysis and empirical evaluation are performed to show the superiority of the proposed OLRSC method.

Clarity - Justification:
The paper reads well to me.

Significance - Justification:
Considering the fact that large-scale datasets are now rather common in practice, addressing the efficiency issues of LRR is certainly highly meaningful. However, the proposed OLRSC method is not a true online method, because OLRSC does not handle fresh samples well. When a new data sample arrives, the computational procedure of OLRSC still has to be re-run on all available samples. I understand that it is rather difficult to establish a truly online algorithm for LRR, especially in the context of subspace clustering, but overcoming this challenge is the authors' responsibility, and doing so would dramatically improve the significance and novelty of this work. The current OLRSC method looks like an incremental variation of ORPCA [Feng et al., NIPS 2013].

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Table 2 is confusing. What is the time unit for OLRSC, ORPCA, LRR and LRR2? It seems that SSC is the fastest method?

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper addresses the computation and memory bottleneck of optimizing the nuclear-norm-regularized Low-Rank Representation problem. By invoking a low-rank matrix factorization and introducing auxiliary basis dictionary variables, the authors reformulate the problem as a non-convex stochastic optimization model that is amenable to efficient online updating rules. The proposed algorithm enjoys a very cheap iteration cost and is guaranteed to converge to a stationary point asymptotically. Extensive numerical results are provided to verify its efficiency.

Clarity - Justification:
The paper is crystal clear and very well written.

Significance - Justification:
The reformulation of the LRR model as well as the proposed online algorithm seem novel and useful. Given the online nature and scalability of the algorithm, this work could potentially have a significant impact on the community.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
1. It would be interesting to have a non-asymptotic regret bound or convergence rate for the algorithm instead of only asymptotic convergence as the sample size goes to infinity.
2. Simply comparing the per-iteration cost to other algorithms without taking the iteration complexity into account is not quite convincing. The authors need to show that the algorithm requires less computation while achieving at least the same accuracy as the alternatives.
3. Some of the subproblems in the algorithm are solved iteratively using alternating minimization or coordinate descent, so it might be imprecise to claim that the iteration cost is just O(pd^2).
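As a point of reference for the reformulation summarized above: the low-rank matrix factorization it invokes is presumably the standard variational characterization of the nuclear norm (see, e.g., Recht et al., 2010), namely

\[
\|X\|_* \;=\; \min_{\substack{U \in \mathbb{R}^{m \times d},\; V \in \mathbb{R}^{n \times d} \\ X = U V^{\top}}} \tfrac{1}{2}\big(\|U\|_F^2 + \|V\|_F^2\big), \qquad d \ge \operatorname{rank}(X),
\]

which replaces the nuclear norm by smooth Frobenius-norm terms over the factors and thereby makes sample-by-sample updates possible.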
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents an online method for solving the low-rank subspace clustering problem (LRSC). The approach aims at improving two shortcomings of LRSC, namely its computational cost and memory footprint. By reformulating the objective loss, the authors state the problem in a way that is amenable to online optimization with significantly fewer optimization variables than the original problem. The authors present convergence results along with some experimental evaluation.

Clarity - Justification:
The paper is overall well written. The description of the method is presented clearly, and intuitive discussion of the different aspects of the method is provided. However, the problem presentation could be improved, clarifying the reach of the obtained results. These things should be easy to fix with minor changes.

Significance - Justification:
The aim of the work is to improve the computational and memory cost of the LRR algorithm. To do so, the authors reformulate the problem in a way that is well suited for an online setting. Looking at the LRR formulation, there are two main difficulties in doing so: (i) the computational cost involved in dealing with the nuclear norm of large matrices, and (ii) storing in memory the full dictionary Y (which usually is the full dataset).
(i) The solution proposed for (i) is to use a factorization of the coefficient matrix X following results by (Recht et al., 2010). This approach has been used extensively in the context of rank-regularized optimization problems, in particular in RPCA. The authors themselves cite some works; there are others in the literature, see [A] and references therein (reference given in the detailed comments).
(ii) The novel aspect of the work consists in introducing an auxiliary variable explicitly modeling a basis for the union of the subspaces present in the data (which is considerably smaller than the data size). The authors state a regularized version of the original problem and show that, under certain conditions on the data, the optimization algorithm converges to a stationary point of the expected loss. In this way the algorithm is guaranteed to recover a matrix D that generates the union of the subspaces.
While the results provided in this work are interesting and non-trivial, the authors should clarify the benefit of this approach. One of the claimed advantages of the work is to save memory. However, unlike the sparse dictionary learning setting, the coefficients (U and V in this case) need to be stored to be able to perform the subsequent clustering stage, so it is not clear how this would impact the claimed memory savings. Still, it is interesting that, by using the basis explicitly, the problem can be decoupled into several smaller problems.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
For clarity of presentation, the authors should define the subspace clustering problem and how it is solved via LRR (use of spectral clustering, etc.).
By explicitly modeling a basis, the approach can be thought of (as the authors say) as a regularized version of (non-sparse) dictionary learning. The paper follows the same structure as the work by (Mairal et al., 2009) on online sparse dictionary learning, but in the LRR setting.
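For concreteness, a minimal sketch of that Mairal-style online structure is given below. This is a generic skeleton under assumed placeholder update rules (a ridge-regression coefficient step and an unconstrained dictionary step from accumulated sufficient statistics); it is not the authors' actual per-sample solver, which additionally handles the corruption term and the U/V coupling.

```python
import numpy as np

def online_dictionary_learning(X, d, lam=0.1, n_passes=1):
    """Generic Mairal-style online alternation (illustrative sketch only).

    X   : (p, n) data matrix, one column per sample
    d   : number of dictionary atoms (target rank)
    lam : ridge penalty standing in for the paper's regularizers
    """
    p, n = X.shape
    rng = np.random.default_rng(0)
    D = rng.standard_normal((p, d))   # basis dictionary, updated online
    A = lam * np.eye(d)               # accumulated v v^T statistics
    B = np.zeros((p, d))              # accumulated x v^T statistics

    for _ in range(n_passes):
        for t in range(n):
            x = X[:, t]
            # Coefficient step: placeholder ridge regression for the
            # per-sample subproblem (the actual algorithm solves a more
            # structured problem).
            v = np.linalg.solve(D.T @ D + lam * np.eye(d), D.T @ x)
            # Accumulate sufficient statistics; past samples need not
            # be stored for the dictionary update.
            A += np.outer(v, v)
            B += np.outer(x, v)
            # Dictionary step: unconstrained closed-form minimizer of the
            # accumulated surrogate (Mairal et al. instead use block
            # coordinate descent with column-norm constraints).
            D = B @ np.linalg.inv(A)
    return D
```

The point of the sketch is only the structure: per-sample coefficient updates plus a dictionary update driven by running statistics, so that in this generic setting the data and past coefficients never need to be held in memory.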
Unlike the sparse dictionary learning setting, the coefficients (U and V in this case) need to be stored to be able to perform the clustering stage, so it is not clear how this would impact the claimed memory savings. The authors discuss the reduced memory cost of the algorithm in the paragraph starting at line 448 and describe it as one of the main contributions. While U and V do not need to be recorded for finding the optimal D, they are needed to perform the clustering stage. D captures the union of the subspaces but does not provide the subspaces themselves. Clearly U and V can be obtained from D (using (3.1) and (3.2)), but they eventually need to be recorded for solving the clustering problem. Still, it is interesting that using the basis explicitly the problem can be decoupled into several smaller problems. Maybe this could be avoided by considering an online version of spectral clustering.

In the abstract the authors mention that the computational cost per iteration goes down from O(np^2) to O(pd^2). These are a batch and an online method, which makes the comparison difficult and might be misleading.

The authors claim in Remark 2 that the proposed setting is superior to LRR and also show results in this direction. As LRR (like SSC) comes with subspace recovery guarantees, it seems natural to think that this advantage would lie outside the region in which those results are valid. Please explain in more detail. The solution to the original problem (1.1) has some guarantees for recovering the correct subspaces (as shown by Liu et al.); does the proposed algorithm maintain these properties? In other words, the algorithm converges to a stationary point of the expected loss f(D); how does this stationary point relate to the solution of the batch version? The authors might find it useful to look at the results provided in [A] for the online RPCA case, see Section IV.C.

Section 5.1 evaluates subspace recovery. What is evaluated is the recovery of the union of the subspaces (generated by Dt). It would be informative to also include an analysis of subspace recovery (not only of the union) with synthetic data and to include SSC in the comparisons. As shown in those experiments, given the dimensionality and number of subspaces considered, PCP is very good at recovering the union of the subspaces. One could consider applying PCP and then solving SSC on the dimensionality-reduced data. This approach would not work if the union of the subspaces is of high dimension (while the proposed method should still work).

The experiments in Section 5.2 are interesting but go beyond the setting in which subspace clustering is expected to work well. Standard problems in this setting are motion segmentation from feature point trajectories and face clustering under varying illumination. It would be good to include one such example.

The comparisons in terms of computational time are not very informative, as implementations might differ significantly. Which implementation of SSC was used?

Please include a comment in Section 5 explaining (or referring to) where the optimal parameters lambda_1, lambda_2 and lambda_3 come from. In Remark 3 it is stated that lambda_3 has (or grows to) a fixed value, while in this section lambda_3 grows with ''t''.

Efficient (or online) versions of SSC have been proposed. The authors should include such a citation and maybe consider this work in the comparisons. See for example [B].
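To make the memory point above and the spectral clustering cost (raised below) concrete, here is a minimal sketch of the final clustering stage, assuming an LRR-style affinity built from the stored n x d coefficient matrix U. The paper's exact affinity construction may differ, so this is an illustration rather than the authors' pipeline.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_coefficients(U, n_clusters):
    """Illustrative final clustering stage (not the paper's exact construction).

    U : (n, d) stored coefficient matrix, one row per sample.
    """
    # Symmetric non-negative affinity from the low-dimensional representation.
    # Note: forming W costs O(n^2 d) time and O(n^2) memory, and spectral
    # clustering adds an eigen-decomposition of the n x n affinity,
    # costs that the per-iteration analysis does not account for.
    U_norm = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    W = np.abs(U_norm @ U_norm.T)
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity='precomputed',
                                assign_labels='kmeans',
                                random_state=0).fit_predict(W)
    return labels
```

The O(n^2) affinity is exactly where the stored coefficients re-enter, which is why the memory claim deserves clarification.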
There are other subspace tracking algorithms that could be used for clustering, such as [C]. Also, it would be important to state the cost of running the spectral clustering step.

Minor comments: in line 520, ''more slower''. The description in Section 3 is a little bit confusing, specifically because equations (3.1) to (3.3) are displayed inside Algorithm 1.

References:
[A] Mardani et al., "Subspace learning and imputation for streaming big data matrices and tensors," IEEE Transactions on Signal Processing, 63(10):2663-2677, 2015.
[B] Peng et al., "Scalable sparse subspace clustering," CVPR, 2013.
[C] Balzano et al., "k-Subspaces with missing data," IEEE Statistical Signal Processing Workshop (SSP), 2012.

=====