We thank all reviewers for their insightful comments. We shall revise the paper in response to all their suggestions. Following the ICML feedback guidelines, we address below the issues for improvement that call for some explanation.

== Assigned_Reviewer_1: I would like to see some elaboration about why their approach does not violate computational complexity proofs. Is it because the "expected" norms of weight vectors which are NP-hard are usually much bigger than 1?

Answer: Hardness results from computational complexity consider the general problem of approximating the partition function over all problem instances, i.e., all combinations of graphs and parameter vectors. If one restricts attention to a subset of all problem instances, such hardness results need not hold. For instance, computing the partition function of tree-structured models is in P for arbitrary parameter vectors. Analogously, we have shown that the partition function of models with arbitrary graphical structure admits a polynomial-time approximation scheme for fixed parameter norms. Hence, our result does not contradict the general hardness of the problem.

== Assigned_Reviewer_2: The runtime is claimed to scale as n^{2k}. The experiments reported have n=128*128, k=16. It seems that n^{2k} > (10000)^{32} should be way too big. Hidden constants in the big-O notation?

Answer: The reported asymptotic runtime is indeed the worst-case runtime. The complexity is caused by the computation of the Q(i) values, which depend on the graphical structure. For large k, our implementation exploited the repetitive structure of grid models, so that it did not reach the worst-case runtime. Such implementation details could be reported in an extended version of this article.

== Assigned_Reviewer_2: In the introduction and table, the relationship between k and epsilon should be made clear [..] similarly when you talk about the stochastic version, epsilon and the probability of failure should affect the reported runtime.

Answer: Theorem 5 describes the dependence between k and epsilon, and Theorem 6 describes the dependence between m (the number of samples), epsilon, and zeta. Thank you very much for this hint. We shall adjust the notation to reflect the dependences among these variables, as also pointed out by Assigned_Reviewer_2.

== Assigned_Reviewer_2, Assigned_Reviewer_4: in definition 1, "j" is never defined or explained.

Answer: In Definition 1, Pi_phi is declared as a function from k natural numbers to the real line. This implicitly declares the domain of "j". Nevertheless, we will adjust Definition 1 to make this explicit.

== Assigned_Reviewer_4: Experimentally, it does not appear that this approach significantly outperforms the naive mean field lower bound.

Answer: This seems to be a misunderstanding of our experimental setup. The experiment (results in Fig. 2) is designed to compare the performance predicted by Theorem 5 with the actual performance. Although Theorem 5 predicts poor performance, we find it remarkable that the quality is still in the range of known heuristics. Moreover, the naive mean field lower bound does not yield a worst-case error. In contrast, our new method allows us to compute the error in advance, i.e., before actually running the algorithm.

== Assigned_Reviewer_4: I'm also not sure how useful the described algorithms actually are. [..] it seems that all of that work is required just to barely compete with current methods - not to outperform them.

Answer: The comparison with other methods was not designed to outperform them in terms of accuracy. However, Mean Field, Belief Propagation, and TRW are in fact outperformed in terms of *guaranteed* quality. Since these methods do not provide any quality guarantee, we made this comparison to give the reader an intuition of the price that must be paid for the guarantee. Moreover, with our approximation, we tread a new path (the quadrature) for the first time and report on the landscape we encountered, which may itself be regarded as a form of usefulness.

== Assigned_Reviewer_4: there may exist applications in which this technique is ideally suited. If so, the authors should explain what these applications are and compare against the baseline approaches in this setting.

Answer: To the best of our knowledge, we present a novel method for a relevant problem that is an alternative to all previously presented approaches. Hence, it is too early to discuss applications. However, at the end of Section 3.1, we point out that the most costly part of the computation can be precomputed on regular hardware, so that the remaining computation can be carried out on sensors, embedded, or wearable devices. This opens up opportunities for applications, which we shall investigate in future work.
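As an aside for the reader, the claim in our first answer that the partition function of a tree-structured model is computable in polynomial time can be illustrated with a minimal sketch. The example below uses a 4-node binary chain with arbitrary illustrative edge parameters (not taken from the paper) and checks the O(n) dynamic program against brute-force enumeration:

```python
import itertools
import math

n = 4
theta = [0.3, -0.5, 0.8]  # illustrative parameters for edges (i, i+1)

def edge_potential(t, x, y):
    # Ising-style pairwise potential: reward agreement, penalize disagreement.
    return math.exp(t * (1 if x == y else -1))

# Brute force: sum the unnormalized weight over all 2^n configurations.
Z_brute = 0.0
for config in itertools.product([0, 1], repeat=n):
    w = 1.0
    for i, t in enumerate(theta):
        w *= edge_potential(t, config[i], config[i + 1])
    Z_brute += w

# Dynamic programming (sum-product along the chain): O(n * 2^2) instead of O(2^n).
msg = [1.0, 1.0]  # message into node 0 (uniform unary potentials)
for t in theta:
    msg = [sum(msg[x] * edge_potential(t, x, y) for x in (0, 1)) for y in (0, 1)]
Z_dp = sum(msg)

print(abs(Z_brute - Z_dp) < 1e-9)  # prints True: the two computations agree
```

On a tree, the same message-passing recursion runs from the leaves toward a root, which is why exact computation stays polynomial there while the general (loopy) case does not.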