We would like to kindly thank the reviewers. We will address the concerns one by one:

Re: Related work
We will discuss and give credit to Sifre & Mallat, Gens & Domingos, and Zhang et al. These are relevant works, which differ from ours in the following ways:
- Sifre & Mallat consider a fixed feature transform (no representation learning).
- The symmetry nets of Gens & Domingos are only approximately equivariant due to their use of sparse feature maps. In our experience, exact equivariance is necessary to make G-convs work in deep networks. Furthermore, symmetry nets require an iterative optimization for inference, while G-convs have essentially no computational overhead.
- Zhang et al. present a network which is inspired by group-theoretical considerations, but in practice reduces to a feature transform followed by a standard translational CNN.

We will not include:
- Simnets (not related to invariance; worse performance on CIFAR-10)
- Shepard nets (which aim to break translation equivariance)

Our latest results show that G-CNNs outperform many recent architectures on CIFAR-10, including Maxout, DropConnect, NIN, deeply supervised nets, and highway nets (some of which are much deeper). It is thus not true that "there are many other algorithms that outperform the proposed methods". The only better results that we are aware of use massive data augmentation, ensembling, or even deeper networks.

Re: Novelty
The general idea of building invariance or equivariance into a model is not new, but the manner in which this is done is very important. G-CNNs retain all the practical advantages of CNNs, while reducing the number of parameters by a large factor (e.g. 4 or 8). We know of no other method that:
- cuts the number of parameters by a large factor,
- is easy to use (simply replace convolutions by G-convolutions),
- has essentially no computational overhead, and
- outperforms many recent deep architectures on a competitive dataset such as CIFAR-10.

So while our network "does not improve the state of the art in what kinds of symmetries can be represented" [in a mathematical theory of CNNs], this is the first time similar ideas have been turned into a practical algorithm that actually improves results on a competitive dataset, and the first time these ideas have been presented in a correct, elegant, and accessible manner. It is also not true that these are "old ideas that have not been favored": our work represents the first time a non-trivial group convolution has been used in a CNN. We show that G-CNNs work better than the currently favored method (data augmentation).

Re: Group theory
Depending on the background of the reader, the math in our paper may appear highly abstract or completely trivial. Previous work on group-theoretical learning (e.g. from Poggio's group) has been completely inaccessible to deep learning practitioners, due to its use of sophisticated mathematical machinery (e.g. locally compact topological groups). This level of abstraction furthermore leads to a theory that cannot easily be translated into actual computations, and increases the risk of errors. We have therefore opted to work with discrete groups, for which G-convs can be computed exactly. All equations in our paper reduce to sums, products, and indexing operations. The use of elementary abstractions such as the group operation gh, the group action T_g, and the group convolution f * psi allows us to present simple equations that hide horrendously complicated indexing expressions (the abstractions themselves can also be coded; see the sketch below).
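To make this concrete, here is a minimal NumPy sketch of a first-layer G-correlation on the roto-translation group p4 (single input channel, single square filter; the helper name z2_to_p4_corr is ours and chosen for illustration, and the sketch is not taken from our actual implementation):

    import numpy as np
    from scipy.signal import correlate2d

    def z2_to_p4_corr(image, filt):
        # First-layer p4 G-correlation (illustrative sketch, square filter):
        # correlate the image with the filter rotated by r * 90 degrees,
        # for r = 0, 1, 2, 3. Output shape (4, H', W'): one planar
        # feature map per rotation.
        return np.stack([correlate2d(image, np.rot90(filt, k=r), mode='valid')
                         for r in range(4)])

The four output planes together form a single feature map on the group p4; only sums, products, and indexing are involved.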
Even with these simple abstractions, several highly regarded deep learning practitioners have confided to us that, in their opinion, these architectures are far from trivial. We want to emphasize that the G-conv used in the second and higher layers is *not* computed by a simple rotation of the filters; it is a more complicated transformation. If this is not understood, the method may seem trivial. The correct filter transformation is easy to see only by using group theory (a minimal sketch is included at the end of this response). Finally, we have added a discussion of coset pooling, i.e. pooling a feature map over each coset gH of a subgroup H, which leads to feature maps that are functions on the quotient of the group by the subgroup.

Re: Data augmentation
We used translation + flip augmentation, because this is what most of the mentioned competing methods use. We show improved performance relative to the baseline and competing methods in this setting. This may be attributed to improved learning (more signal per filter) and to the fact that the network is guaranteed to be equivariant even away from the data distribution (thus aiding generalization). We have added a direct comparison to dihedral augmentation, which has not been favored by the community.

Re: References
We will clean up the references.

New CIFAR-10 results (test error, %):

Augmentation   | none | flip+trans | dihedral+trans (new)
CNN            | 9.44 | 8.86       | 14.5
P4-CNN         | 8.84 | 7.67       | 8.3
P4M-CNN (new)  | 7.59 | 7.04       | 7.03
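For reference, here is a minimal sketch of the filter transformation used in the second and higher layers of a p4 network, assuming a filter is stored as an array of shape (4, H, W) with one plane per rotation (the helper name rotate_p4_filter is ours, for illustration; orientation conventions aside, this is not taken verbatim from our implementation):

    import numpy as np

    def rotate_p4_filter(filt, r):
        # Rotate a p4 filter of shape (4, H, W) by r * 90 degrees:
        # [L_r psi](s, x) = psi(r^{-1} s, r^{-1} x).
        # 1) rotate each of the 4 planes spatially by r * 90 degrees,
        # 2) cyclically shift the rotation-plane index by r.
        spatially_rotated = np.rot90(filt, k=r, axes=(1, 2))
        return np.roll(spatially_rotated, shift=r, axis=0)

The point is that the transform both rotates each plane spatially and cyclically permutes the planes; a planar rotation alone would not give an equivariant network.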