We thank the reviewers for their valuable comments, and will improve the paper accordingly.
Contributions (R1 & R2):
We make two major contributions. First, we propose the MABN prior, which is one of the two proper priors that can induce diversity in LVMs (the other is DPP). MABN has two advantages over DPP: (1) it enables variational inference (VI), which is usually more efficient than MCMC; (2) it extends to Bayesian nonparametric LVMs with an unbounded number of components. Second, we develop a VI algorithm, which is nontrivial due to the complexity of MABN. In particular, bounding the partition function (Lemma 1) is technically challenging (see supplements for details).
Interpretability (R1 & R2):
The idea of improving interpretability via diversification was proposed by Wang et al. (2015), who argue that components (phenotypes in their case) that are mutually distinct are more interpretable. Components learned by conventional LVMs usually exhibit substantial overlap and redundancy (Zou & Adams, 2012), making them ambiguous and difficult to interpret. For example, Wang et al. (2015) observed that the phenotypes learned by standard tensor factorization (an LVM) overlap heavily, causing confusion (e.g., two similar treatment plans are learned for the same type of disease). Diversified LVMs encourage components to be non-overlapping, which makes it cognitively easier for humans to associate components with different concepts in the physical world, i.e., they achieve better interpretability.
To illustrate interpretability, we ran experiments on the Adult dataset, where the task is to predict whether income exceeds $50K/yr based on census data. We set the number of experts to 5 and visualize each expert by picking the top 4 features associated with the largest classification coefficients (in vector \beta). The first 3 experts learned by the non-diversified expert model are (the other 2 are omitted to save space):
E1: capital-gain, capital-loss, hours-per-week, education-num
E2: capital-gain, capital-loss, education-num, hours-per-week
E3: capital-gain, hours-per-week, france, age
These experts share many features, which makes them hard to interpret. The first 3 experts learned by the diversified expert model are:
E1: capital-gain, exec-managerial, hours-per-week, Prof-specialty
E2: white, france, age, Married-AF-spouse
E3: education-num, capital-gain, Prof-school, Doctorate
They are more mutually distinct and more amenable to interpretation: E1-3 relate to occupation, demographics, and education, respectively.
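The visualization procedure described above can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the function names are hypothetical, the Jaccard overlap score is an extra summary that is not reported in the paper, and the expert feature lists are copied from the lists above.

```python
from itertools import combinations


def top_k_features(beta, feature_names, k=4):
    """Return the k feature names with the largest coefficients in beta."""
    ranked = sorted(range(len(beta)), key=lambda i: beta[i], reverse=True)
    return [feature_names[i] for i in ranked[:k]]


def jaccard(a, b):
    """Jaccard overlap between two feature lists (1.0 means identical sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def mean_pairwise_overlap(experts):
    """Average Jaccard overlap over all pairs of experts."""
    pairs = list(combinations(experts, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Top-4 feature lists of the first 3 experts, as listed in this response.
non_diversified = [
    ["capital-gain", "capital-loss", "hours-per-week", "education-num"],
    ["capital-gain", "capital-loss", "education-num", "hours-per-week"],
    ["capital-gain", "hours-per-week", "france", "age"],
]
diversified = [
    ["capital-gain", "exec-managerial", "hours-per-week", "Prof-specialty"],
    ["white", "france", "age", "Married-AF-spouse"],
    ["education-num", "capital-gain", "Prof-school", "Doctorate"],
]

print("non-diversified overlap:", mean_pairwise_overlap(non_diversified))
print("diversified overlap:", mean_pairwise_overlap(diversified))
```

A lower average overlap indicates that the experts rely on more distinct feature sets. Ranking by signed coefficient follows the description above; ranking by absolute value would be a different design choice.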
To R1:
1. The reparametrization (Eq.3) still encourages larger mutual angles via -\sum_{j=1}^{i-1}a_j^T a_i, as Eq.(1) does. The difference is that Eq.(3) moves ||\sum_{j=1}^{i-1}\tilde{a}_j||_2 from the denominator to the normalizer Z=1/C_p(\kappa ||\sum_{j=1}^{i-1}\tilde{a}_j||_2), because in variational inference the expectation w.r.t. the denominator is hard to compute. However, this introduces a new problem: the expectation w.r.t. the normalizer Z is also difficult to compute. We address this issue by upper bounding Z (Lemma 1) and then computing the expectation w.r.t. the upper bound.
2. VI stands for variational inference.
3. The expectation of the angle is A_p(\kappa)\mu (see supplements).
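For readers who want to evaluate the quantity in item 3 numerically, a minimal sketch follows. It assumes that A_p(\kappa) is the standard von Mises-Fisher ratio I_{p/2}(\kappa) / I_{p/2-1}(\kappa) of modified Bessel functions, as the C_p(\kappa) notation suggests; this identification and all function names are assumptions for illustration and are not taken from the paper. The Bessel function is computed from its power series in log space, which is adequate for moderate \kappa only.

```python
import math


def bessel_i(v, x, terms=100):
    """Modified Bessel function of the first kind I_v(x), x > 0, v >= 0.

    Uses the power series sum_m (x/2)^(2m+v) / (m! * Gamma(m+v+1)),
    with each term evaluated in log space to avoid overflow in the factorials.
    """
    if x <= 0:
        raise ValueError("x must be positive")
    log_half_x = math.log(x / 2.0)
    return sum(
        math.exp((2 * m + v) * log_half_x
                 - math.lgamma(m + 1)
                 - math.lgamma(m + v + 1))
        for m in range(terms)
    )


def a_p(p, kappa):
    """Assumed definition: A_p(kappa) = I_{p/2}(kappa) / I_{p/2-1}(kappa)."""
    return bessel_i(p / 2.0, kappa) / bessel_i(p / 2.0 - 1.0, kappa)


def vmf_mean(mu, kappa):
    """Mean A_p(kappa) * mu of a vMF-distributed unit vector with mean direction mu."""
    scale = a_p(len(mu), kappa)
    return [scale * m for m in mu]


# Example: a 3-dimensional mean direction with concentration kappa = 2.
print(vmf_mean([0.0, 0.0, 1.0], 2.0))
```

Under this assumption, A_p(\kappa) lies between 0 and 1 and grows with \kappa, so a more concentrated distribution has an expectation closer to \mu. For p = 3 the ratio reduces to coth(\kappa) - 1/\kappa, which gives a quick sanity check.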
To R2:
1. The idea of capturing long-tail patterns via diversification was proposed by Xie et al. (2015). A possible reason why standard LVMs fail to capture long-tail patterns lies in the design of their training objective. For example, a maximum likelihood estimator is rewarded for modeling the dominant patterns well (specifically, for allocating a number of components to cover them as closely as possible), because these patterns are the major contributors to the likelihood function. Long-tail patterns, on the other hand, contribute much less to the likelihood, so modeling them well is not very rewarding and LVMs tend to ignore them. Diversified LVMs address this problem by encouraging model components to be far apart from each other. One would then expect the components to overlap less and to be less concentrated on the dominant patterns, and therefore to be more likely to be pushed toward the long-tail patterns.
The above explanations regarding interpretability and long-tail patterns apply to all definitions of diversity, including the angle-based one, which encourages the components to have larger mutual angles and thereby reduces correlation, overlap and redundancy.
A formal theory that explains the long-tail and interpretability effects is needed, but it is not the focus of this paper and is left for future study.
2. We’ll reorganize the equations and compress the discussion of DPP.
To R4:
Thanks. We’ll move the longer equations to appendix and present results on more models in an extended version.