We thank the reviewers for their comments and will incorporate the appropriate changes in the final version.

REVIEWER 1

1. "In line 086, the authors summarise...Wiatowski 2015...Can we rely on these findings?"
Our paper is fully self-contained and does not depend on any of the results in Wiatowski, 2015.

2. "What is the definition of completeness...and why is it important intuitively...?"
The set of translated atoms is complete for H_N if every element of H_N can be written as a linear combination of elements in the set (a schematic formalization is given at the end of our responses to Reviewer 1). Completeness guarantees that potentially relevant features do not get "lost" in the network.

3. "Figure 2 (right)...but itself is not a cartoon function..."
Indeed, the graph in Figure 2 (right) is a *linear combination* of 3 cartoon functions. Our theory readily generalizes to linear combinations of cartoon functions; we refrain from carrying this out for simplicity of exposition.

4. "The experiment reported in Table 1 seems to have no conclusions. ...How does this translate into the proposed theory...?"
As Mallat's architecture uses wavelet filters and the modulus non-linearity, and does not include pooling, it is interesting to compare the practical performance of different filters, non-linearities, and pooling operators to that of Mallat's architecture. Our experiment shows that alternative architectures, covered by our theory, can deliver classification performance on par with Mallat's, but at significantly reduced computational complexity (see lines 739-744). Our analytical results show that Lipschitz non-linearities and pooling operators lead to translation and deformation stability; these properties are hence shared by a wide class of networks. This offers a theoretical explanation for the impressive performance of DCNN-based feature extraction in a wide range of practical applications.

5. "The reviewer's main concern lies with the clarity of theorems and lack of any illustrations...examples, intuitive explanations and more extensive discussion..."
The notation is admittedly cumbersome, but this is a consequence of the need to formalize the elaborate network structure. As for explanations of the results, we feel that, given the length constraints, this is already done where possible: (i) stability to deformations and translations: lines 627-651, 870-876; (ii) translation covariance: lines 660-666, 870-876; (iii) trade-off between deformation stability and energy conservation: lines 646-651; (iv) motivation for the deformation model in Eq. (11): lines 539-541; (v) sampled cartoon functions: lines 400-412, 456-464.
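For concreteness, a schematic formalization of the completeness notion referred to in our response to point 2 above is the following; the symbols g_\lambda for the atoms, \Lambda for their index set, and T_m for the (circular) translation operator are placeholders and need not match the paper's notation:

\mathrm{span}\,\{\, T_m g_\lambda \;:\; \lambda \in \Lambda,\ m \in \{0, \dots, N-1\} \,\} \;=\; H_N,

i.e., every f \in H_N can be written as f = \sum_{\lambda, m} c_{\lambda, m}\, T_m g_\lambda for suitable coefficients c_{\lambda, m}.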
REVIEWER 2

1. "...theoretical analysis...similar to earlier work on ScatNet...consider contribution incremental."
We assume that the reviewer refers to Bruna & Mallat, 2013, which does not contain proofs for the discrete case, but refers to Mallat, 2012, for the proofs. Mallat, 2012, however, applies only to continuous-time wavelet-modulus-based networks without pooling. It is therefore unclear how Mallat's theory could be generalized to the case considered here, which, besides applying to the discrete case, covers arbitrary filters, non-linearities, and pooling operators.

2. "A key feature of ScatNet is its translation invariance. However, neither Theorem 1 nor Theorem 2 establish the translation invariance..."
Strict translation invariance was proved only for the continuous-time case in Mallat, 2012, and Wiatowski and Bolcskei, 2015. ScatNets, as in Bruna & Mallat, 2013, are not translation-invariant but rather translation-*covariant* on the rough grid induced by the factor 2^{J} corresponding to the coarsest wavelet scale. Our result in Eq. (19) is hence in the spirit of Bruna & Mallat, 2013, with the differences that (i) we prove this covariance property rather than providing a heuristic justification only, and (ii) the grid in our case is induced by the product of the pooling factors, which can be chosen freely and can differ across layers, properties arguably more natural than dependence on a fixed wavelet scale (a schematic rendering of this covariance relation is sketched at the end of this response).

3. "...the proposed method was not compared with any baseline approach..."
We compare to Bruna & Mallat, 2013, in lines 739-744; see also our response to issue 4 of Reviewer 1 (comparable classification performance at significantly reduced computational complexity).

4. "...the number of features will be exponentially large. How...handle the computational intractability? ...running time...on MNIST?"
The new architectures we propose allow for different strategies: (i) individual layers can be made not to contribute to the feature vector (see lines 326-36), and (ii) dimensionality reduction via pooling. Running times (Matlab + libSVM implementation): 0.32-2.47 s, depending on the network configuration.

REVIEWER 3

1. "...the empirical evaluation is comparably weak..."
See our responses to issues 4 and 5 of Reviewer 1. Our architectures achieve performance on par with Bruna & Mallat, 2013, at significantly reduced computational complexity.
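To make the covariance statement in our response to issue 2 of Reviewer 2 concrete, a schematic version of the relation reads as follows; the notation \Phi^n for the features generated in the n-th layer, S_k for the pooling factor employed in layer k, and T_m for translation by m is used here as a placeholder and need not match the paper's statement exactly:

\Phi^n(T_m f) \;=\; T_{m/(S_1 S_2 \cdots S_n)}\, \Phi^n(f), \qquad \text{for all } m \text{ with } m/(S_1 S_2 \cdots S_n) \in \mathbb{Z},

i.e., a translation of the input signal by m results in a translation of the layer-n features by m/(S_1 \cdots S_n), so that the covariance grid is determined by the product of the (freely chosen, possibly layer-dependent) pooling factors rather than by a fixed wavelet scale 2^{J}.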