Paper ID: 823
Title: Stochastic Optimization for Multiview Representation Learning using Partial Least Squares

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper considers two stochastic optimization algorithms for solving partial least squares. The paper's contributions include the introduction of these two algorithms, a fairly extensive theoretical analysis of their iteration complexity bounds, and an experimental evaluation with both synthetic and real data.

Clarity - Justification:
This paper was largely a joy to read. I do not consider myself an expert in the subject matter of the paper, but the authors present a clear and coherent development of their ideas, including especially the theoretical analysis. There were a few typos that did somewhat obscure the clarity of the presentation; some of these I've highlighted below.

Significance - Justification:
I am not an expert in the subject matter of the paper, so I am unable to confidently assess the novelty or significance of this work. That said, the proposal and analysis of the two convex relaxations, if novel, do seem to constitute a significant contribution. The empirical evaluations show that while the proposed approaches are not quite competitive with the SOTA, they are not far off and do offer theoretical advantages that the SOTA does not share.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Minor comments, typos, etc.:
line 136: "so far as been in the context": been -> being
line 162: Was the term "tracks of flesh points" really intended? If so, I don't know what this is.
lines 370 and 375: M' should probably be M_{t+1}.
line 434: "of M are [the] same as"
line 462: "maximized" should probably be "minimized".
line 477: "derivative of [the] Lagrangian".
line 518: "inequality" should probably be "equality".

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose a new stochastic optimization algorithm for partial least squares (PLS) in multi-view settings. The authors introduce two stochastic approximations, matrix stochastic gradient and matrix exponentiated gradient, and show that these methods compare reasonably well to an accurate but expensive version of incremental PLS.

Clarity - Justification:
The paper is well written and the mathematical details are fairly easy to follow.

Significance - Justification:
Given that data is increasingly encountered in streaming settings and big-data regimes, the algorithms presented here are particularly useful.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I generally think this is a strong paper. It is well written, the justification for the algorithms is strong, and their performance compared to the incremental PLS baseline looks good. That being said, I do think that the experimental results could be improved with more extensive comparisons to previous approaches. While the synthetic and real-world datasets are reasonable for comparing the proposed methods, the authors could strengthen their results by considering additional algorithms such as the recent online PCA algorithm of Boutsidis et al. and Brand's "Fast Low Rank Modifications of the Thin Singular Value Decomposition." In particular, the latter method could provide a good baseline to compare against: unlike the proposed incremental partial least squares algorithm, Brand's algorithm does not require a full singular value decomposition at each timestep. It is fast and numerically stable; however, it lacks theoretical guarantees and may exhibit poor accuracy if the data is presented in a suboptimal order.
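To make the suggested baseline concrete, below is a minimal numpy sketch of the rank-one thin-SVD update at the heart of Brand's method (variable names and the function are illustrative choices of mine, not taken from the paper under review or from Brand's paper). The point of the comparison is that only a small (k+1)-by-(k+1) core matrix is re-decomposed per update, rather than the full d_x-by-d_y cross-covariance matrix:

```python
# Sketch of a Brand-style rank-one update of a thin SVD:
# given M ~= U diag(s) V^T, return a rank-k thin SVD of M + a b^T
# without forming or decomposing the full d_x-by-d_y matrix.
import numpy as np

def rank_one_svd_update(U, s, V, a, b, k):
    m = U.T @ a                      # part of a inside span(U)
    p = a - U @ m                    # part of a orthogonal to span(U)
    p_norm = np.linalg.norm(p)
    P = p / p_norm if p_norm > 1e-12 else np.zeros_like(p)

    n = V.T @ b                      # same decomposition for b w.r.t. span(V)
    q = b - V @ n
    q_norm = np.linalg.norm(q)
    Q = q / q_norm if q_norm > 1e-12 else np.zeros_like(q)

    # Small (k+1)-by-(k+1) core matrix capturing the rank-one update.
    K = np.zeros((len(s) + 1, len(s) + 1))
    K[:len(s), :len(s)] = np.diag(s)
    K += np.outer(np.append(m, p_norm), np.append(n, q_norm))

    Uk, sk, Vkt = np.linalg.svd(K)   # cheap: only the small core is decomposed

    U_new = np.column_stack([U, P]) @ Uk
    V_new = np.column_stack([V, Q]) @ Vkt.T
    return U_new[:, :k], sk[:k], V_new[:, :k]   # truncate back to rank k
```

A running estimate of Sigma_{xy} could then be maintained by scaling s by (t-1)/t and applying the update with a = x_t / t and b = y_t at step t. As noted above, this is fast and numerically stable in practice, but it comes without the convergence guarantees enjoyed by the methods proposed in the paper.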
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper explores online convex optimization approaches to partial least squares (PLS) with multiview data. The authors analyze the convergence rate of (1) the online matrix exponentiated gradient PLS algorithm of Arora et al. (2012) and (2) a new matrix stochastic gradient algorithm for PLS. Both algorithms can be viewed as applications of stochastic mirror descent to convex relaxations of the population PLS problem, and preexisting stochastic mirror descent analyses are applied to obtain the results. The algorithms studied are compared with other online PLS approaches (by plotting the PLS objective versus running time or iteration count) on one synthetic dataset and one X-ray microbeam dataset, and the proposed algorithms are found to typically perform similarly to or worse than the preexisting incremental PLS algorithm.

Clarity - Justification:
The paper on the whole is very clearly written, but please note the following opportunities to improve clarity:
-L013: For those unfamiliar with PLS, could you provide specific examples of applications?
-L0188: This sentence suggests that we typically want to do CCA, not PLS, but that PLS will suffice if we are content and able to whiten our data (a minimal sketch of this whitening reduction follows this list). Are there other cases in which the PLS objective is clearly preferable to the CCA objective?
-L547-551: Could the authors provide citations for the claims in these sentences?
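For concreteness, here is a minimal sketch of the whitening reduction referred to in the L0188 comment, under the assumption of centered data matrices X and Y (the names, data, and helper functions are illustrative, not taken from the paper): PLS takes the singular vectors of the empirical cross-covariance, and applying it to whitened views yields the CCA solution, with singular values equal to the canonical correlations.

```python
import numpy as np

def whiten(Z, eps=1e-8):
    """Multiply Z by the inverse square root of its empirical covariance."""
    C = Z.T @ Z / Z.shape[0]
    w, E = np.linalg.eigh(C)
    return Z @ (E @ np.diag(1.0 / np.sqrt(w + eps)) @ E.T)

def top_pls_directions(X, Y, k):
    """Top-k PLS directions: singular vectors of the empirical cross-covariance."""
    Sigma_xy = X.T @ Y / X.shape[0]
    U, s, Vt = np.linalg.svd(Sigma_xy)
    return U[:, :k], Vt[:k].T, s[:k]

rng = np.random.default_rng(0)
n, dx, dy, k = 10000, 6, 4, 2
X = rng.standard_normal((n, dx))
Y = X[:, :dy] + 0.5 * rng.standard_normal((n, dy))   # two correlated views
X, Y = X - X.mean(0), Y - Y.mean(0)

# PLS on the raw views: directions depend on the scaling of the features.
_, _, s_pls = top_pls_directions(X, Y, k)

# PLS on whitened views: the singular values are the canonical correlations,
# i.e., this coincides with CCA expressed in the whitened coordinates.
_, _, s_cca = top_pls_directions(whiten(X), whiten(Y), k)
print(s_pls, s_cca)
```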
Significance - Justification:
-Does the paper contribute a major breakthrough or an incremental advance?
The main contributions of the paper appear to be (1) the derivation of a new stochastic gradient descent procedure to solve a convex relaxation of the PLS problem and (2) the provision of stochastic mirror descent analyses for this stochastic gradient algorithm and for the preexisting matrix exponentiated gradient algorithm of Arora et al. These contributions are significant but closer to incremental advances than major breakthroughs.
-Are other people (practitioners or researchers) likely to use these ideas or build on them?
I cannot tell whether practitioners are likely to adopt the proposed algorithms over incremental PLS. On the one hand, incremental PLS appears to be both faster and more accurate than the proposed procedures in practice. On the other hand, the proposed algorithms come equipped with guarantees of correctness and rates of convergence to optimality. I could imagine researchers building on the theoretical tools presented here to close the theory-practice gap and develop an algorithm with both theoretical guarantees and performance matching incremental PLS. In addition, the paper would benefit from more concrete motivation for solving the PLS problem.
-Does the paper address a difficult problem in a better way than previous research?
The authors propose a new algorithm for online PLS, but it typically performs no better than the preexisting incremental PLS algorithm. The authors analyze the convergence rates of two online PLS algorithms, but the analyses appear to be straightforward applications of existing convergence rate analyses.
-Is it clear how this work differs from previous contributions and is related work adequately referenced?
It is not entirely clear which contributions are new to this paper and which are being rehashed for the reader's convenience, since many of the results and algorithms presented (including the algorithm of Sec. 3.2) are restatements of prior work. I would recommend that the authors highlight their principal contributions early in the paper, clarify that the algorithm in Section 3.1 is new, and, when referencing related prior work, discuss the key differences being introduced in this paper.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
-Technical questions:
L547: How do the eigenvalues of C relate to the singular values of Sigma_{xy} when singular values are repeated?
L547: Would we expect repeated singular values if we are first whitening the x and y views?
-Is this a complete piece of work, or merely a position paper?
This is a complete piece of work and not merely a position paper.
-Finally, please provide a short 1-2 sentence summary of your review.
The main contributions of the paper appear to be (1) the derivation of a stochastic gradient descent procedure that is quite similar to preexisting algorithms for online PLS (e.g., the algorithms of Arora et al.) and (2) the provision of standard stochastic mirror descent analyses for this stochastic gradient algorithm and for the preexisting matrix exponentiated gradient algorithm of Arora et al. I believe these must be classified as incremental advances.
=====