Paper ID: 1335
Title: Group Equivariant Convolutional Networks

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a replacement for the convolutional layers in a deep network that extends their equivariance from translations to a larger group containing rotations by multiples of 90 degrees and reflections. The paper is clearly motivated, and presents experiments showing that the explicitly equivariant features lead to improved performance on CIFAR-10, both without and with data augmentation. The text reads well and is very clear.

Clarity - Justification:
The paper is well organized. Section 4, which introduces the framework of groups and transformations, is easy to follow -- the paper patiently unpacks most of the equations given. The difference between equivariance and invariance, while definitely an old idea, is clearly presented here.

Significance - Justification:
The overall contribution is limited in the sense that these are old ideas that have not been favored: practitioners have often concluded that it is more efficient to simply augment the data with the desired transformations than to enforce them at the model level. The fact that the proposed equivariant net performs better on CIFAR even on the augmented set is not surprising, given that the augmented set contains translated and flipped copies, but not rotated ones. Granted, this method reduces the number of parameters and the size of the dataset to store -- but are these truly the current bottlenecks in vision architectures?

It is also unclear whether this explicit encoding of a select number of transformations would be effective in a setting with much more data than CIFAR: enforcing these equivariances dictates that rotations by multiples of 90 degrees all deserve the same weight in terms of parameters -- but perhaps the real data does not match that assumption, and favors only small rotations, or some other subset of transformations. In that case, an abundance of data would let the filters "decide for themselves" which transformations are the most relevant.

An advantage of an explicitly equivariant architecture over data-induced equivariance is that the filters corresponding to transformations of one another are grouped together -- but besides learning invariance, which the paper shows not to be the best choice, the paper does not propose another way to take direct advantage of this. So, while this paper is very clear and gives a good presentation of the differences between equivariance and invariance, it may be more of an interesting conceptual object than a tool that truly ends up being used by practitioners -- larger-scale experiments would be needed to show the contrary.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Additional remarks:

- The main contribution of this paper is to extend equivariance to transformations beyond translations (symmetries, etc.). As such, it would be good to also mention work that has focused on the particular subcase of invariance to these transformations, e.g. Sifre and Mallat's 2013 paper "Rotation, scaling and deformation invariant scattering for texture discrimination", which examines invariance beyond translation (rotation, scaling, small deformations) in the practical setting of texture classification. (Bruna & Mallat 2013 only deals with translations.)

- Section 3, first paragraph, contrasts empirical findings of equivariance with the lack of invariance. It would be clearer to replace "In agreement with this finding" with a reminder that equivariance strictly contains invariance, and then to add that the observed lack of invariance supports the idea that it is the non-invariant kind of equivariance that seems more useful (otherwise, the first sentence could still apply even if the nets had been shown to be invariant). A tiny numerical illustration of this containment is included at the end of these remarks.

- Detail: the formatting of the references is inconsistent (some first names are listed in full, some as initials). I find the format "Last name, initials" easier to parse for multiple authors than "Full last name, full first name". The arXiv link for Graham's fractional max pooling is missing (http://arxiv.org/abs/1412.6071), as is the year (2014). Also, Olivier & Simoncelli 2016 should be Henaff & Simoncelli (Olivier is the first name).
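As promised above, here is a tiny numerical illustration of the containment (my own NumPy sketch, not code from the paper): an equivariant map transforms its output in a predictable way when the input is transformed, and an invariant map is the special case where the output does not change at all.

    # Sketch (not the authors' code): equivariance vs. invariance under
    # 90-degree rotations, using plain NumPy.
    import numpy as np

    def rot90(x):
        # Rotate a feature map by 90 degrees.
        return np.rot90(x)

    def blur(x):
        # Average each pixel with its four neighbours (circular boundary).
        # Equivariant: blur(rot90(x)) == rot90(blur(x)).
        return (x + np.roll(x, 1, 0) + np.roll(x, -1, 0)
                  + np.roll(x, 1, 1) + np.roll(x, -1, 1)) / 5.0

    def global_mean(x):
        # Invariant: global_mean(rot90(x)) == global_mean(x); this is the
        # special case of equivariance where the output is left unchanged.
        return x.mean()

    x = np.random.randn(8, 8)
    print(np.allclose(blur(rot90(x)), rot90(blur(x))))          # True (equivariant)
    print(np.allclose(global_mean(rot90(x)), global_mean(x)))   # True (invariant)

The invariant map still satisfies the equivariance equation (with the identity as the output transformation), which is the containment I would like the paper to state explicitly before interpreting the empirical findings.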
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a framework called G-CNN that generalizes the convolution over the 2D translation group of a standard CNN to a convolution over a 3D parameter space for the affine transformation group. By replacing the standard convolution layers with G-convolution layers, better performance than the baseline is reported on rotated-MNIST and CIFAR-10. However, there is no comparison to similar algorithms that explicitly model affine transformations, even though the idea of generalizing CNNs to affine transformations is not new.

Clarity - Justification:
The presentation of this paper is relatively clear, although quite a few forward references to later sections make things a bit complicated on a first reading.

Significance - Justification:
In terms of improving the performance of image classification, the significance is minor; as also reported in the paper, there are many other algorithms that outperform the proposed method. In terms of explicitly handling equivariance/invariance to affine transformations, the significance is hard to assess due to the lack of comparison to related methods. The most straightforward way of handling variations under known transformations (e.g. affine) is data augmentation (with the same transformations used in the proposed convolution group), and this should be compared against. Previous work on generalizations of CNNs (some examples are given in the detailed comments) should also be compared before the significance of the proposed work can be properly evaluated. I am putting "Below Average" for significance because I did not see enough supporting evidence for a proper evaluation.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This paper presents a generalization of CNNs to affine transformations. I think the paper could be improved if the authors gave a more focused illustration of the significance of this work: several different points are touched upon, but without strong supporting evidence. For example, the first line of the abstract mentions "reduced sample complexity" via exploitation of symmetry, yet neither a theoretical analysis nor an empirical observation concerning sample complexity is presented.

By contrasting the terms "equivariant" and "invariant", the paper makes much of the previous work on understanding and generalizing CNNs (which focuses on "invariance") appear only remotely related to this work. This might confuse readers who are not familiar with the topic, because even though a different term is used, most of the mentioned previous work only uses local pooling/averaging. Rather than producing a globally invariant feature map, those methods effectively have the same property as the equivariant convolution plus pooling-with-subsampling used in this paper.
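To spell out what I mean, here is a rough sketch of my reading of a first-layer G-convolution for 90-degree rotations, followed by pooling over the rotation axis (a hypothetical illustration, not the authors' implementation):

    # Hypothetical sketch of a first-layer G-convolution and G-pooling.
    import numpy as np
    from scipy.signal import correlate2d

    def g_conv_first_layer(image, filt):
        # Correlate the image with the four rotated copies of one filter;
        # the output gains an extra axis indexed by the rotation applied.
        return np.stack([correlate2d(image, np.rot90(filt, k), mode='same')
                         for k in range(4)])

    def pool_over_rotations(maps):
        # Max over the rotation axis removes the dependence on which rotated
        # copy of the filter fired -- much like the local pooling/averaging
        # used in the earlier "invariance" work.
        return maps.max(axis=0)

    image = np.random.randn(16, 16)
    filt = np.random.randn(3, 3)
    g_maps = g_conv_first_layer(image, filt)   # shape (4, 16, 16)
    pooled = pool_over_rotations(g_maps)       # shape (16, 16)
    # Rotating the input rotates each map spatially and cyclically permutes
    # the rotation axis (up to boundary handling), which is the equivariance
    # property the paper builds on.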
A clear discussion here might help to place the proposed work better in the context of existing work, and comparison with closely related algorithms is critical for evaluating the proposed work, since the idea is not new. That being said, here are a few related works that show up in a Google search for "Group Convolutional Neural Networks" and are neither discussed nor compared against in this paper:

- "Deep SimNets", arXiv:1506.03059 [cs.NE].
- "Deep Symmetry Networks", NIPS 2014.
- "Discriminative Template Learning in Group-Convolutional Networks for Invariant Speech Representations", INTERSPEECH 2015.
- "Shepard Convolutional Neural Networks", NIPS 2015.

It seems that one contribution of the paper is to simplify the implementation of G-CNNs by computing the affine transformation first with a rotation and then with a translation, but no analysis of computational complexity and no empirical speedups are presented. Nor is there any analysis of how restricting to rotations by multiples of 90 degrees (a consequence of this decomposition) affects performance.

In the experiments, it is mentioned that "because data augmentation could potentially reduce the benefits of using G-convolutions, all experiments reported in this section use random rotations on each instance presentation". It is a bit unclear to me whether these experiments are run on a subset, on a newly generated set, or on the original rotated-MNIST dataset. Moreover, the fact that "data augmentation could potentially reduce the benefits of using G-convolutions" is exactly why a direct comparison to data augmentation (with the same transformations used in the design of the G-convolution) should be included: if simple data augmentation could achieve the same performance, then the more complex model would become less interesting.
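For concreteness, the augmentation baseline I have in mind simply applies a random element of the same transformation group to every training example at each presentation and trains an ordinary CNN (a hypothetical sketch, not code from the paper):

    # Hypothetical augmentation baseline: random 90-degree rotations, i.e. the
    # same transformations the G-convolution is designed around.
    import numpy as np

    def augment_batch(images, rng):
        # images: array of shape (batch, height, width)
        return np.stack([np.rot90(img, k=rng.integers(4)) for img in images])

    rng = np.random.default_rng(0)
    batch = np.random.randn(32, 28, 28)
    augmented = augment_batch(batch, rng)   # feed this to the baseline CNN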
Lastly, the references of this paper should be reworked: many of them are missing basic information such as the conference, journal, or archive name.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper introduces terminology and notation for "G-CNNs", convolutional neural networks that are equivariant to transformations of their inputs under certain symmetry groups. The inputs considered are images transformed under the symmetry groups of rotation, reflection, and translation (the last being the usual "classical" symmetry to which CNNs are equivariant).

Clarity - Justification:
The paper states that it aims to rectify the inaccessibility of the group-theoretical learning literature by explaining the basic ideas of group invariance in a simple way. It succeeds at this to some extent, in that it perhaps serves as a primer on group-theoretic notation and terminology for the deep learning community. However, it fails to use any interesting results from group theory to make advances in the state of the art for convolutional networks (more on this below). As a result, the use of group-theoretic terminology has a very low power-to-weight ratio in this paper: the settings considered (simple discrete groups such as 90-degree rotations and mirroring) are so simple, and the resulting CNN architectures so obvious to practitioners, that it is not clear what is accomplished by translating the basic idea of CNNs into group-theoretic language.

Significance - Justification:
At this point, the application of symmetries to CNNs has been very thoroughly explored, and group theory is the tool of choice for talking about symmetry in mathematics and physics, so it is no surprise that a lot of work has focused on formulating CNNs with group symmetries. Unfortunately, this particular work does not use any interesting ideas from group theory, and therefore does not advance the state of the art in what kinds of symmetries can be represented (beyond simple discrete symmetries that permute input pixels in very regular ways). If one is familiar with group-theoretic notation, I do not think the content of this paper would be new to many practitioners; if one is not, the paper might be somewhat helpful in introducing the notation, but it would not motivate why group theory should be used to describe concepts that are simple enough not to need the heavyweight mathematical framework.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
As the literature review shows, the work proposed here sits in a space that has been thoroughly explored from many different perspectives. In my opinion, the particular terminology introduced does not represent any new insight into the problem space, and perhaps serves more as an introduction to group theory. In particular, the significant advances to be made are in encoding non-trivial group symmetries (perhaps even arbitrary rotations, scalings, and affine transformations). This is still an open problem (Gens and Domingos, 2014, only scratches the surface) and will probably benefit from deeper ideas in group theory. However, if one is talking only about simple groups (e.g. the dihedral group), then I feel there is not enough conceptual advancement in CNN architectures to justify what is, in the end, just a translation of existing ideas into group-theoretic terminology.

=====