We thank the reviewers for their valuable comments. We hope that reviewer 6 was able to read the whole article, especially as the experimental validation on real-world datasets that we propose is central to our contribution.

# Terminology

We specify in the introduction that we tackle general low-rank (k < p) matrix factorization problems, with or without sparsity constraints on the factors, and with or without complete knowledge of the data. This indeed differs from the over-complete dictionary learning setting, which assumes k > p. However, to the best of our knowledge, the dictionary learning terminology is not reserved to the over-complete setting and has also been used in the k <= p case: it covers any factorization setting with an l2 reconstruction loss and additional sparsity constraints or penalties on the factors. As matrix completion problems can also be formalized in this setting, we cover all three frameworks, with little semantic distinction, in the introduction. We use the matrix factorization and dictionary learning terminologies interchangeably in the rest of the paper. The algorithm we propose is a non-trivial extension of a stochastic matrix factorization algorithm that was first proposed to tackle dictionary learning, which makes the DL terminology convenient. We make an explicit remark on this aspect in section 2. If our contribution is accepted, we will state this in the introduction to avoid any terminological ambiguity.

# Sparse PCA

As mentioned by reviewer 7, low-rank matrix decomposition with sparse factors has been addressed by sparse principal component analysis methods. The classical SPCA formulation differs from the dictionary learning objective: it seeks a sparse orthonormal basis onto which the projected data has maximal variance, whereas the problem for which we propose a solver does not enforce orthogonality. Most importantly, classical sparse PCA algorithms do not scale to very large datasets, as they involve a QR decomposition of the sample covariance matrix, whose computation costs O(np^2) (see [Ma '11] for instance). As a consequence, the SPCA methods mentioned by the reviewer are tested on real or generated ~1000 x 1000 matrices, which are five orders of magnitude smaller than our setting. Due to limited space, we chose to review state-of-the-art work with scalability requirements closer to ours, e.g. methods whose complexity is linear in the sample dimension. Following the reviewer's guidance, we will explain how we differ from the SPCA setting and cite advances in this domain. On the other hand, assessing classical SPCA performance on the full HCP dataset is simply beyond reasonable computational power.

# Contribution importance

To our knowledge, randomly reducing the dimension of samples before streaming them to an online algorithm is entirely novel for matrix factorization. Although based on simple ideas, our algorithm thus builds on a new perspective for improving the scalability of data decomposition. Like our work, [Szabo et al. '11] tackles online matrix factorization with missing values, yet its computational complexity depends on the total number of features rather than on the number of observed features (see line 292). The algorithm we propose solves this issue. As we show extensively, this makes it possible to speed up decomposition convergence by introducing artificial missing values in a fully known dataset. Such acceleration is original: the authors of [Szabo et al. '11] do not mention random subsampling, and their algorithm is slower in the fully-observed case.
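To make the setting concrete, the masked streaming objective underlying this discussion can be sketched as follows; the notation below is ours, chosen for illustration, and simplifies the exact formulation given in the paper:

$$
\min_{D \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \; \min_{\alpha_t \in \mathbb{R}^k} \; \frac{1}{2} \big\| M_t \, (x_t - D \alpha_t) \big\|_2^2 \; + \; \lambda \, \Omega(\alpha_t)
$$

where x_t in R^p is the t-th sample, M_t is a binary diagonal matrix masking the features that are unobserved or randomly dropped, D in R^{p x k} is the dictionary constrained to a set C, and Omega is an l1 (sparse) or l2 (dense) penalty on the code. Because both the code computation and the dictionary update only involve the rows selected by M_t, the per-iteration cost scales with the number of observed features rather than with p.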
Our algorithm is not restricted to settings involving sparse factors: it accelerates both dense and sparse decompositions. In fact, speeding up matrix factorization is more challenging in the sparse setting, as the l1 projection has no closed-form solution. On real-world fMRI data, our algorithm provides a 10x speed-up compared to the fastest sparse decomposition algorithm currently available, namely the online decomposition of [Mairal et al. '10], presently the only method that scales to such data sizes. Ensuring that the complexity depends linearly on the number of observed features also proves critical to outperform existing algorithms for collaborative filtering, which makes our algorithm useful in the related domain of recommender systems. We improved our implementation after the submission and performed experiments on the full HCP dataset: we were able to compute a 70-component decomposition of a 2 000 000 x 200 000 matrix in less than 10 hours, where the currently most efficient algorithm takes nearly a week. For practitioners, such a speed-up is a significant improvement, especially as we show that it comes with no loss in the accuracy of the results.

# Theoretical analysis

The additional randomness that we introduce with stochastic masks makes the theoretical analysis of our algorithm highly involved, which is why we focused on providing substantial real-world validation before addressing these theoretical difficulties.
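For the reviewers' convenience, a minimal NumPy sketch of the feature-subsampling step discussed above is given below; the function name, the ridge code and the plain gradient update are illustrative simplifications, not the paper's exact update rules:

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_update(D, x, reg=0.1, keep_ratio=0.1, lr=0.01):
    """One illustrative streaming step: draw a random feature mask, compute the
    code on the retained rows only, then update those rows of the dictionary."""
    p, k = D.shape
    q = max(1, int(keep_ratio * p))              # number of features kept
    idx = rng.choice(p, size=q, replace=False)   # random subsampling mask
    D_obs, x_obs = D[idx], x[idx]                # subsequent work scales with q, not p
    # Ridge code on the retained rows (an l1-penalized code would be used for sparse factors)
    alpha = np.linalg.solve(D_obs.T @ D_obs + reg * np.eye(k), D_obs.T @ x_obs)
    # Gradient step on the masked reconstruction error, touching only the retained rows
    D[idx] -= lr * np.outer(D_obs @ alpha - x_obs, alpha)
    return alpha

# Toy usage on synthetic data
p, k = 10_000, 10
D = rng.standard_normal((p, k)) / np.sqrt(p)
for _ in range(100):
    masked_update(D, rng.standard_normal(p))
```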