Reviewer 1: Thank you for the detailed review. The reviewer would like clarification on our Lagrange updates. These updates are a major contribution of the paper: conventional updates fail to converge, while the updates we use here are quite stable.

The Bregman iterative method solves problem (9) from the paper by minimizing the energy J(u) - <u, p> + 1/2||Au - b||^2, where p is the Lagrange multiplier. When p = 0, this is the objective function J(u) with an added quadratic penalty, and minimizing it only approximately enforces the constraint Au = b. A Bregman iterative approach therefore begins by solving with p = 0 to obtain a minimizer u', then computes a (sub)gradient of J at u' and adds it to p. It then solves again with the new p. This makes the quadratic penalty more dominant, and we obtain a new solution with a smaller value of ||Au - b||^2. The method keeps alternating between minimizing the objective and adding the subgradient of J to p until ||Au - b||^2 is close to zero. Convergence of this method is well understood (see the review article by Yin et al. '08).

Now consider our neural network formulation. Suppose the vector u of unknowns contains the vectors {z_l}, {W_l}, and {a_l} for all l. The objective J is the loss function from equation (5), and the operator A encodes all constraints in (5). Because the loss function depends only on the vector z_L (the outputs of the final layer), and on no other vectors or layers, only the derivative of J with respect to z_L is non-zero; all other partial derivatives vanish. Note that <u, p> is an inner product involving all unknowns. However, since the derivative of J with respect to every vector of unknowns except z_L is zero, the associated Lagrange multiplier vectors can be neglected: the corresponding entries of p are always zero.
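The procedure described above can be sketched in a few lines. This is a minimal illustration only, not the paper's formulation: we assume a linear constraint operator A and a simple quadratic J(u) = (lam/2)||u||^2 so that each subproblem is a linear solve; the dimensions and the penalty weight are arbitrary choices for the sketch.

```python
import numpy as np

# Bregman iteration sketch for: min J(u) subject to Au = b,
# with the illustrative choice J(u) = (lam/2)||u||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))   # underdetermined linear constraint
b = rng.standard_normal(3)
lam = 1.0

u = np.zeros(6)
p = np.zeros(6)                   # plays the role of the multiplier p
for k in range(200):
    # Subproblem: argmin_u J(u) - <u, p> + 0.5 * ||Au - b||^2
    u = np.linalg.solve(lam * np.eye(6) + A.T @ A, p + A.T @ b)
    # Bregman update: p is replaced by a (sub)gradient of J at the new u.
    # By the subproblem's optimality condition this is the same as
    # p <- p - A.T @ (A @ u - b), so here p ends up equal to lam * u.
    p = p - A.T @ (A @ u - b)

print(np.linalg.norm(A @ u - b))  # constraint residual, driven toward zero
```

Each pass makes the quadratic penalty more dominant, so the constraint residual ||Au - b|| shrinks across iterations, exactly as described above.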
This Bregman procedure updates p using the gradient of J, rather than using the constraint error Au - b as in the classical multiplier method. It is easily shown (using the conditions in Section 4.1) that Bregman iteration and ADMM are equivalent when A is a linear operator (this is elaborated in Yin et al. '08). However, this equivalence breaks down for non-linear A. For this reason, we will shorten/remove the ADMM discussion and expand the Bregman explanation in the paper to make the justification of the Lagrange update clearer.

So why use Bregman iteration and not the method of multipliers? As the reviewer has observed, the classical method of multipliers requires more complex inner-product terms involving non-linear constraints. We have tried *many* formulations of ADMM for this problem, and the standard ADMM formulations always become unstable, whereas the Bregman iteration is highly stable because it avoids inner products with non-linear functions. Unfortunately, we do not have space in a conference paper for methods that don't work, but the failure of classical ADMM is something we can elaborate on in an arXiv version.

Q1 & Q4: Our formulation is a direct application of Bregman iteration to the constrained problem (5). The reviewer points out that a classical ADMM formulation would require a term of the form <z_L - W_L a_{L-1}, \lambda>. We do not use standard ADMM updates, but rather Bregman iterative updates that only require the term <z_L, \lambda>.

Q2: Thank you. We accidentally left the Lagrange multipliers out of the z_L update. This has been fixed.

Q3: W_l denotes the weights of the l-th layer of the network, not the iterate obtained after l steps of ADMM. The "for" loop in Algorithm 1 updates each layer's weights, rather than counting iterates of ADMM.

Regarding experiments: We will fit a graph of the objective function / training error over time into the paper. We originally had such a graph, but cut it to save space.
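To make the Q2 fix concrete, here is a hypothetical closed-form z_L update. We assume a squared loss l(z, y) = 0.5||z - y||^2 and a quadratic coupling penalty with weight beta; the actual loss and update in the paper may differ, and this sketch only shows where the multiplier lambda enters under the sign convention J(u) - <u, p> used above.

```python
import numpy as np

def zL_update(y, m, lam, beta):
    """argmin_z 0.5*||z - y||^2 - <z, lam> + 0.5*beta*||z - m||^2,
    where m stands for the pre-activation W_L a_{L-1} (hypothetical names).
    Setting the gradient (z - y) - lam + beta*(z - m) to zero gives:"""
    return (y + lam + beta * m) / (1.0 + beta)

# Small numeric check that the returned z is a stationary point.
y = np.array([1.0, -2.0])
m = np.array([0.5, 0.5])
lam = np.array([0.1, -0.3])
beta = 2.0
z = zL_update(y, m, lam, beta)
grad = (z - y) - lam + beta * (z - m)
```

Dropping lam from this formula recovers the (incorrect) update the reviewer flagged, which is exactly the omission fixed in Q2.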
We can make space by paring down the ADMM interpretation in Section 4.2.

Reviewer 2: Thank you for the review. Hopefully the discussion above clears up the confusion about the Lagrange multipliers. Additional experiments would be useful, but the limited space forced us to pick. As such, we chose the scenario most favorable to our competition: a well-known, customized implementation running on a GPU. In comparison, our new approach was built from the ground up in an assuredly less optimized way (Python vs. the special-purpose compiled code in Torch) and run on CPUs. Nonetheless, we were able to achieve roughly a 100x speedup with 7500 cores. We will make code available after publication.

Reviewer 3: The plateaus in the performance curves correspond to periods where the training accuracy is improving but the test accuracy is nearly constant; the plateaus are not present in the training error curves. It seems that during this period the model is fitting noisy outlier data rather than the true distribution. We will fit a training curve into the camera-ready paper.