Paper ID: 1237
Title: Conditional Dependence via Shannon Capacity: Axioms, Estimators and Applications

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose a novel method for estimating the strength of a causal relationship based on Shannon capacity. They prove its consistency and illustrate its use through an application to single-cell flow cytometry.

Clarity - Justification:
Some minor concerns (see below) and typos (e.g., an extra } in line 566).

Significance - Justification:
Fine approach, solid theoretical results.

Detailed comments (Explain the basis for your ratings while providing constructive feedback.):
An interesting but quite technical paper. Some relatively minor issues:
- D(.|.) in line 085 is not really defined (though clear from context); similarly for D(.||.|.) in line 270.
- This makes it difficult to follow the reasoning about Axiom 3: why can the arguments be swapped? What is the convex hull of a distribution (line 222) / the convex hull of the range of a map (line 295)?
- It is unclear why Axiom 3 is relevant/desirable.
- Where does \rho_{k,i} in (12) come from, and how should it be chosen?
- Why the additional constraint on w_i in line 551?
- How should the other hyperparameters, such as k, h_N, and a, be tuned?

PS. Thanks for the reply, which clarifies some of my concerns (\rho_{k,i}, my mistake) but not all of them (why can min and max be swapped in line 272?).

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
Given a known causal relationship between two variables, this paper proposes a nonparametric method, CMI, to measure its strength. The paper offers a list of axioms that CMI satisfies, convergence rates between the empirical and population versions, and experimental results on both synthetic and real-world data.

Clarity - Justification:
The paper is clearly written, easy to read, and mathematically sound.
Significance - Justification:
Measuring the strength of causal relationships is a fundamental problem across many areas of the applied sciences. As a main limitation, the proposed method only works for pairs of variables and does not account for confounding. The experimental evaluation is rather weak (one real dataset).

Detailed comments (Explain the basis for your ratings while providing constructive feedback.):
Overall a good paper. I would like to see more experimental evidence (real if possible, otherwise synthetic) to better understand when we should expect CMI to work *or* fail. The authors could also better discuss the optimization side of things: how expensive is it to compute CMI? How bad is the situation when CMI gets stuck at a local minimum? That is, how much do two different runs of CMI differ on the same data? Regarding theory, how do your convergence results compare to those of (http://www.princeton.edu/~samory/Papers/kNNModeRates.pdf)?

Minor:
- Line 657: Is the term P(X|T=t) missing?
- Should the synthetic-data experiments appear before the real-data experiments?

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper is about estimating the mutual information of two jointly distributed variables X and Y, based on a distribution derived from P_Y|X and a distribution over X that is either uniform or selected to maximize the Shannon capacity. The authors provide estimators for this quantity and offer some data validation of their methods on gene data. I was less sure about the justification for the interest in this quantity; see below.

Clarity - Justification:
The paper was fairly easy to follow.

Significance - Justification:
The quantity the authors are interested in is motivated by causal considerations, but (a) this only works if P_Y|X is purely causal, which implies hidden common causes are assumed away.
The application is gene pathways, where unobserved variables are a perennial worry, and (b) if the assumptions for (a) do not hold, these results may say something about discriminative classifiers in machine learning, but there does not seem to be much interest in this. Why? Putting the word "causal" in does not in itself make something intrinsically more interesting! The estimators seem interesting to me, but I am not an expert on these types of estimators. I think the motivation for the target quantity is weak from a causal point of view, and that acceptance of this paper should hinge on estimator quality and on motivation from statistical, rather than causal, considerations.

It is insufficient to assert the causal direction X -> Y as known; it is also necessary to rule out hidden common causes of X and Y, otherwise any function of p(Y | X) will capture confounding strength and causal strength together. The axiomatic characterization is interesting for a measure of dependence, but it does not try to rule out confounding in any way, so the axioms could hold just as well for X and Y that have no direct causal relationship but instead a hidden common cause. It is therefore unclear in what sense these axioms intuitively separate causal from non-causal relationships. If confounding has to be assumed away by fiat, the axioms are not about causality but about dependence.

Detailed comments (Explain the basis for your ratings while providing constructive feedback.):
pp. 1-2: "Where P_Y represents the distribution at the output of a channel P_Y|X with input given by U_X." This notation is slightly confusing, because in standard notation P_Y would be the marginal distribution of Y obtained from P_Y|X P_X, not from P_Y|X U_X. Perhaps denote the output of P_Y|X U_X by another letter, Q or P*. Also, isn't P_U = U_X? Later, P_Y refers to a marginal of P_Y|X Q_X, which is very confusing and overloaded.

What motivates the proposal (10)? This is never explained.
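For concreteness, the two quantities the reviews contrast can be sketched for a discrete channel: UMI evaluates mutual information with a uniform input, while CMI maximizes over inputs, i.e., it is the Shannon capacity, computable with the standard Blahut-Arimoto algorithm. This is an illustrative sketch with my own function names, not the paper's estimators (which operate on samples, not on a known channel):

```python
import numpy as np

def mutual_information(p, W):
    """I(X;Y) in nats for input distribution p and channel W[x, y] = P(Y=y | X=x)."""
    py = p @ W  # output marginal of Y
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(W > 0, W * np.log(W / py), 0.0)
    return float(p @ terms.sum(axis=1))

def blahut_arimoto(W, iters=1000):
    """Shannon capacity max_p I(p; W) of a discrete memoryless channel."""
    p = np.full(W.shape[0], 1.0 / W.shape[0])  # start from the uniform (UMI) input
    for _ in range(iters):
        py = p @ W
        with np.errstate(divide="ignore", invalid="ignore"):
            # d[x] = D( W(.|x) || P_Y ), the per-symbol information density
            d = np.where(W > 0, W * np.log(W / py), 0.0).sum(axis=1)
        p = p * np.exp(d)
        p /= p.sum()
    return mutual_information(p, W), p

# Z-channel: input 0 is noiseless, input 1 flips to output 0 with probability 0.5.
W = np.array([[1.0, 0.0],
              [0.5, 0.5]])
umi = mutual_information(np.array([0.5, 0.5]), W)  # uniform input
cmi, p_star = blahut_arimoto(W)                    # capacity-achieving input
```

For this channel the capacity-achieving input is non-uniform (p* puts mass 0.6 on the noiseless symbol), so cmi strictly exceeds umi, which is exactly the gap between the two measures under discussion.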
===== Review #4 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose using Shannon capacity, which they call Capacitated Mutual Information (CMI), to quantify the strength of the causal relationship between two variables. They justify this approach by proposing five axioms that a causality measure should satisfy and proving that CMI satisfies all of them, whereas a previously proposed measure based on mutual information with a uniform prior (UMI) does not. They then propose estimators to compute both measures from data samples, and show theoretical guarantees that these estimators converge to the true quantities under regularity assumptions.

Clarity - Justification:
The paper is easy to follow for the most part. I had some trouble confirming Eq. (5) from the given reference, and I have not checked the proofs in the supplementary rigorously. Introducing the discrete estimators before the continuous ones would be a more sensible choice in Section 3. The experimental setup in Section 5.1 is also not clear without looking at the reference paper: the reader should at least be able to understand what "error" means and how it is computed (e.g., in relation to Fig. 1) without having to consult the cited paper. There is also some inconsistency in the notation: the weights w are subscripted both as w_i (with i the sample index) and as w_x, w_{X_i} (with x the sample itself). The authors should also revise the supplementary, since it has many typos and other issues such as undersized parentheses/brackets.

Significance - Justification:
I find the axiomatic approach the authors have taken interesting, but I am not very convinced by it. However, the estimators and their analysis for the two measures are interesting in their own right (cf. detailed comments). The experimental results are overall nice but not explained very well.

Detailed comments
(Explain the basis for your ratings while providing constructive feedback.):
Concerning the axiomatic approach: it is not very clear to me how Axioms 3 and 4 are related to causality; these points should be made clearer. I also believe there is a question of whether causality can be inferred from a given set of samples at all (without temporal relationships, knowledge of the generating system, or the ability to influence it), which I am setting aside.

About the estimators: the authors use a data weighting term that is inversely related to the density f(x) in order to uniformize the underlying density and estimate the UMI. They then show that maximizing over these weights subject to constraints gives an estimate for CMI, which is an interesting result. It should be noted that the theorem for CMI currently holds only in the discrete case. The assumptions enumerated in the supplementary are also restrictive, in that everything is bounded by constants. Still, the results are a good contribution to the literature.

Although the authors do not provide much detail about how the experimental setup is evaluated in the gene experiments, the results seem promising. The synthetic experiment confirms the accuracy of their estimates, but it does not actually prove anything about measuring causality.

I am confused by point (d) in the discussion section: if the aim is to estimate the direction of causality between two variables, this is a binary problem, so any estimator with below 50% accuracy is worse than a random guess and can be flipped to obtain higher than 50% accuracy. Some clarification regarding this would be welcome.

To sum up, my main issue is with the axioms regarding causality, as noted above. I think the authors need to justify the axioms much better. They should also provide trivial and non-trivial examples of probability models that have differing levels of causality (not correlation) according to their definition and axioms.

=====
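The "uniformize the input" idea that Review #4 describes is easiest to see in the discrete case, where it reduces to a plug-in computation: estimate the channel P(Y|X) from counts, then evaluate mutual information with a uniform input over the observed X values. The sketch below is my own illustration of that reduction (hypothetical naming), not the authors' sample-weighted estimator:

```python
import numpy as np

def umi_plugin(xs, ys):
    """Plug-in UMI sketch for discrete samples: empirical P(Y|X) with a uniform
    input over observed X values (illustration only, not the paper's estimator)."""
    X, Y = sorted(set(xs)), sorted(set(ys))
    xi = {x: i for i, x in enumerate(X)}
    yi = {y: j for j, y in enumerate(Y)}
    counts = np.zeros((len(X), len(Y)))
    for x, y in zip(xs, ys):
        counts[xi[x], yi[y]] += 1
    W = counts / counts.sum(axis=1, keepdims=True)  # empirical P(y | x)
    p = np.full(len(X), 1.0 / len(X))               # uniform input distribution
    py = p @ W
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(W > 0, W * np.log(W / py), 0.0)
    return float(p @ terms.sum(axis=1))

# Deterministic relationship gives UMI = ln(#symbols); independence gives 0.
print(umi_plugin([0, 0, 1, 1], [0, 0, 1, 1]))  # ln 2
print(umi_plugin([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0
```

Re-weighting samples by 1/f(x), as the paper does in the continuous case, plays the role of the uniform row-averaging here; maximizing over the weights instead of fixing them uniform is what turns this UMI estimate into a CMI estimate.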