We thank all the reviewers for their helpful comments.
To Reviewer 2
We agree that the variational inference algorithm we presented is similar to K-SVD. However, our algorithm includes extra regularization terms arising from the prior, and these make a significant difference. For example, the prior we introduce encourages sparse correlations between pairs of latent features. We went back and tried K-SVD in our genome experiment. K-SVD performed somewhat worse in the predictive result (a little less than -530 with respect to Fig. 3) and in the visual result (Fig. 2). We also want to emphasize that K-SVD is much more similar to BPFA, and their performance in these experiments was comparable. The major difference in our formulation is the Markov structure (not present in K-SVD or BPFA), which we argue improves modeling ability.
In the denoising experiment, K-SVD can exhibit reasonable correlations if we collect the transition statistics and display them. (The same can be said for BPFA.) However, these correlations are not explicitly modeled by K-SVD, as they are in our algorithm. We believe that by imposing this structure in the model, we encourage it to learn correlations explicitly, which leads to better performance because this information is part of the model itself.
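To make the post hoc comparison concrete, the following is a minimal sketch of how such transition statistics might be collected from any dictionary-learning output. It assumes that "transition statistics" means counts of which dictionary element follows which in each signal's ordered selection; the function names `transition_counts` and `row_normalize` and the input format are hypothetical and are not taken from the paper or from the K-SVD code.

```python
import numpy as np


def transition_counts(sequences, num_features):
    """Count how often feature b directly follows feature a.

    `sequences` is assumed to be a list of ordered feature-index lists,
    one per signal (a hypothetical input format for this sketch).
    """
    counts = np.zeros((num_features, num_features), dtype=int)
    for seq in sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    return counts


def row_normalize(counts):
    """Turn counts into empirical transition frequencies per row.

    Rows with no observed transitions are left as all zeros.
    """
    totals = counts.sum(axis=1, keepdims=True)
    out = np.zeros(counts.shape, dtype=float)
    return np.divide(counts, totals, out=out, where=totals > 0)


# Toy usage with three signals and three features.
toy_sequences = [[0, 1, 2], [0, 1], [2, 0]]
toy_counts = transition_counts(toy_sequences, num_features=3)
toy_freqs = row_normalize(toy_counts)
```

Statistics of this kind are computed after fitting, so they describe the learned codes but play no role in learning them, which is the distinction drawn above.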
The reviewer also mentioned that it would be better to learn the noise variance. This is a good point; however, we found that this additional inference is typically much more sensitive than reported. In fact, we find with these algorithms that the noise variance decreases to zero in the BNP setting, since including more dictionary elements can always improve modeling performance; it fails to do so only when certain locally optimal solutions are found (which can be made to happen through clever initializations). We found empirically that the algorithm we use to set the noise variance (Liu et al., 2013) is extremely accurate in approximating the noise for the problems we considered.
Regarding performance of compared methods: There are multiple inference solutions we can use for these models (for example, MCMC). One of our interests is scalability (although we acknowledge we don't use massive data sets in this paper). Therefore, we care about the performance of deterministic inference algorithms for all models. In our experiments we compare with a variational solution of BPFA, rather than the MCMC method presented by Zhou et al., which leads to the different numbers. For K-SVD, we used the code provided by its authors with their recommended settings, so the difference is perhaps due to the new noisy data we generate.
To Reviewer 5
We used a greedy search algorithm to sequentially pick features that optimize the variational bound. The latent features we picked for all instances form a single Markov chain, so the greedy search algorithm is like filling a "bridge" between two consecutive return times, and the return time continues to grow through this process. We can therefore set the inferred return time to be equal to the time point at which we stop in our algorithm (i.e., the iteration number in the while loop in Algorithm 1).
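The control flow of such a greedy search can be sketched as follows. This is only an illustration under stated assumptions, not Algorithm 1 itself: it handles a single data vector, and the objective `penalized_fit` (negative squared residual minus a per-feature penalty) is a hypothetical stand-in for the variational bound. The names `greedy_select`, `penalized_fit`, and `penalty` are likewise invented for this sketch.

```python
import numpy as np


def penalized_fit(x, D, chosen, penalty=0.1):
    """Stand-in objective (NOT the paper's variational bound).

    Scores a feature set by least-squares fit quality minus a
    per-feature penalty, so adding useless features lowers the score.
    """
    if not chosen:
        return -float(x @ x)
    A = D[:, chosen]
    w, *_ = np.linalg.lstsq(A, x, rcond=None)
    r = x - A @ w
    return -float(r @ r) - penalty * len(chosen)


def greedy_select(x, D, objective=penalized_fit):
    """Add one feature per iteration while the objective improves.

    Returns the chosen feature indices and the number of completed
    iterations, which plays the role of the stopping time here.
    """
    num_features = D.shape[1]
    chosen = []
    best = objective(x, D, chosen)
    steps = 0
    while len(chosen) < num_features:
        candidates = [k for k in range(num_features) if k not in chosen]
        scores = [objective(x, D, chosen + [k]) for k in candidates]
        i = int(np.argmax(scores))
        if scores[i] <= best:
            break  # no candidate improves the objective, so stop
        chosen.append(candidates[i])
        best = scores[i]
        steps += 1
    return chosen, steps


# Toy usage: an identity dictionary and a vector using two of its atoms.
D_toy = np.eye(4)
x_toy = np.array([3.0, 0.0, 2.0, 0.0])
chosen_toy, steps_toy = greedy_select(x_toy, D_toy)
```

In this toy case the loop stops once both active atoms have been added, and the returned iteration count is what the sketch treats as the stopping time.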
To Reviewer 6
We thank the reviewer for the positive comments.