We thank all the reviewers for their positive and helpful comments.

We emphasize that our main contribution is to provide insight into when alternating minimization for weighted low rank approximation can recover the underlying ground truth. We take a first step in this direction, identify the crucial technical difficulties that make this problem different from the special case of matrix completion, and propose a whitening step to address them.

Our analysis also gives practical hints. In general, the ideal weight will depend on the problem and the noise. However, even if the ideal weight W' does not satisfy the assumptions, a simple practical heuristic to rectify this is to use a linear combination of W' and the all-ones matrix E, i.e., set W = c E + (1-c) W' for some c (see the short sketch at the end of this response). The error term ||W \hp N|| is then bounded by c||N|| + (1-c)||W' \hp N||, where \hp denotes the Hadamard product, which is better than using the all-ones matrix alone. Our analysis also suggests a simple trick to make the algorithm more practical: since we show that the iterates move towards the ground truth as long as they have good incoherence and spectral properties, one can run vanilla alternating minimization, check these properties every few iterations (which is cheap), and perform whitening only when necessary.

Assigned_Reviewer_1:
1) Assumption A2 is satisfied under the traditional notion of "spectral gap", which is why we chose that name. However, we agree that a better name is warranted, since the assumption is significantly more general: it only requires that the weight matrix is not arbitrarily far from the all-ones matrix, and thus allows a wide range of matrices. For example, even a random 0-1 matrix with a few 1 entries satisfies it w.h.p. (after proper scaling).
2) Thresholding can guarantee incoherence, but not the spectral property we need (see Line 729). The latter is unique to weighted low rank approximation and is the main reason for introducing the whitening step. In matrix completion, this property is automatically maintained by the randomness (see Line 711), which is not available in our setting.
3) In the special case of matrix completion, our algorithm requires a slightly larger number of sampled entries m. With that sample size, choosing W to be 0 on unobserved entries and n^2/m on observed ones, our guarantee on the recovery error is comparable with prior work (e.g., within a factor of k sqrt{log n} of Keshavan et al., 2009). In general, see the comments at the top for how to choose W. We will add experimental results for demonstration.

Assigned_Reviewer_4:
--Heteroskedasticity: Our analysis applies to any noise model, since we make no assumption on the noise (of course, the recovery error will depend on the noise level). In particular, it applies to heteroskedastic noise models such as the one in GloVe (Pennington et al., 2014).

Assigned_Reviewer_5:
--On the motivation for the problem: The problem studied is well motivated and important (as also pointed out by Assigned_Reviewer_4), since in many applications the noise is not i.i.d. Even in cases where unweighted low rank approximation gives good results, it has been shown empirically that incorporating weights can lead to better results (see Srebro & Jaakkola, 2003). Furthermore, there are important applications where non-uniform weights lead to much better performance. In particular, in the word embedding applications studied in (Pennington et al., 2014; Arora et al., 2015), weighted low rank approximation achieves accuracy of about 70%, while unweighted low rank approximation gives accuracy below 50%.
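For concreteness, the following is a minimal numpy sketch of the kind of non-uniform weighting used in such word embedding applications. The GloVe-style weight f(x) = min(1, (x/x_max)^0.75), the synthetic data, and the helper names glove_weights and weighted_objective are illustrative assumptions on our part, not the exact setup analyzed in the paper.

import numpy as np

# GloVe-style weights (assumed form): f(x) = min(1, (x / x_max)^alpha),
# so entries backed by more co-occurrence counts (less noise) get larger weight.
def glove_weights(counts, x_max=100.0, alpha=0.75):
    return np.minimum(1.0, (counts / x_max) ** alpha)

# Weighted low rank objective: sum_ij W_ij * (M_ij - (U V^T)_ij)^2.
def weighted_objective(W, M, U, V):
    R = M - U @ V.T
    return np.sum(W * R * R)

# Toy usage with a synthetic co-occurrence matrix.
rng = np.random.default_rng(0)
counts = rng.poisson(30.0, size=(50, 50)).astype(float)
M = np.log1p(counts)          # target matrix, e.g. log co-occurrence
W = glove_weights(counts)
U = rng.normal(size=(50, 5))
V = rng.normal(size=(50, 5))
print(weighted_objective(W, M, U, V))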
Alternating minimization is arguably one of the most common algorithms for this problem, so we believe it is important to study when and how it works, as we do in this paper. The analysis yields insight into how to rectify the algorithm when it fails, and suggests ways to make it more practical, as demonstrated below.
--On the practicality of the algorithm: See the comments at the top for how to make it more practical.
--On the importance of non-binary weights: In the general case of the problem, all entries are revealed but different entries have different noise levels. Intuitively, one should then focus more on the entries with less noise, which corresponds to assigning them larger weights. This has been used empirically (Lines 77-95). Our paper gives a mathematical justification for why such weights yield better error bounds (there always exists a W giving an error bound at least as good as uniform weights, since uniform weights are themselves a special case satisfying our assumptions). Another way to see this: with uniform weights, the minimization outputs a solution fit to the noise, which can be fairly large; with an appropriate choice of W, ||W \hp N|| (where \hp denotes the Hadamard product) can be much smaller, so our result implies a much better recovery error.
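Returning to the mixing heuristic W = c E + (1-c) W' from the general comments, below is a minimal numerical sketch of the resulting bound on ||W \hp N||. The heteroskedastic noise model, the choice of W' proportional to the inverse noise level, and c = 0.3 are illustrative assumptions only, not part of our analysis.

import numpy as np

# Mixing heuristic: W = c*E + (1-c)*W'. By the triangle inequality,
# ||W \hp N|| <= c*||N|| + (1-c)*||W' \hp N||  (spectral norm, \hp = Hadamard product).
rng = np.random.default_rng(1)
n = 100
scale = rng.uniform(0.1, 2.0, size=(n, n))   # heteroskedastic noise levels (assumed)
N = scale * rng.normal(size=(n, n))          # noise matrix
W_ideal = 1.0 / scale                        # down-weight noisier entries
W_ideal /= W_ideal.mean()                    # keep weights on the scale of E

c = 0.3
E = np.ones((n, n))
W = c * E + (1 - c) * W_ideal

def spec(A):
    # spectral norm (largest singular value)
    return np.linalg.norm(A, 2)

print(spec(W * N) <= c * spec(N) + (1 - c) * spec(W_ideal * N) + 1e-9)
print(spec(W * N), spec(E * N))              # mixed weights vs. uniform weights

In this toy example the last line typically shows ||W \hp N|| noticeably smaller than ||E \hp N|| = ||N||, consistent with the point above about non-uniform weights improving the recovery error.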