Thank you for your helpful reviews. We tried to provide an intuitive presentation by using CNNs on images as a starting point. This was the motivation for Figs. 1 and 2: we wanted the approach to be understood as a generalization of CNNs on images. We take Reviewers 1 and 3's concerns seriously and will improve the presentation. (Additional suggestions are welcome!)

The work addresses the problem of learning features for graphs. These features are used for classification or regression. Its goals are identical to those of graph kernels, which is why we compare to graph kernels on benchmark data sets. The nodes of different graphs are not in correspondence with each other. This is exactly the core problem: what if we have a collection of graphs and we want to learn a feature representation? The graphs may have different numbers of nodes and edges. (This is common in graph-based ML: chemicals, social networks, knowledge graphs, etc. all have this property.) Finding the "best correspondence" is what we define as graph normalization. This distinguishes our work from that mentioned by Reviewers 1 and 2 (we clearly should have discussed, and will discuss, these papers as related work). There, the input features of the problems are the nodes, and they are in correspondence; the training data is used to infer and exploit graph structures on top of these features. We do not discover the graph structure, as it is part of the input (a collection of graphs). The labeling approaches are used to align nodes that are similar with respect to their structural roles within the graphs.

Perhaps another helpful analogy is that of invariance to operations on the input. CNNs on images are known to be effective also because the learned features are invariant to translations. Similarly, recent work has introduced methods that make CNNs invariant to rotations. Our approach can be understood along these lines: the representation should be (approximately) invariant to graph isomorphisms, that is, to the many ways a graph can be represented in a vector space (via its adjacency matrices). Our work attempts to find a "canonical form" by first applying appropriate graph labeling methods and then NAUTY to break symmetries that remain after the labeling has been applied (a minimal sketch is included below).

We do not use only one conv layer. Arbitrarily many layers are possible, and we use two conv layers and one dense layer in the experiments.

The run-time experiments are intended to show the efficiency of the approach for several real-world graphs. It is true, however, that a clear motivation and a baseline are currently missing. We will compare the receptive-fields-per-second rates to the rates achieved by CNNs without the receptive field computation. The goal is to show that the creation of the receptive fields is efficient enough to feed downstream CNNs (also CNNs with more layers).

Reviewer 1 also asked several questions about the graph classification experiments:
3) The graph kernels were used in combination with an SVM classifier (as explained in the paper).
4) The problems here have no edge attributes. Different kernels use the node attributes differently. We will attempt to give a more detailed explanation without sacrificing other parts of the paper.
5) See 3).
Fig. 3: The green and yellow nodes simply represent the distance to the root node (i.e., distance 1 and 2, respectively).
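To make the normalization step above more concrete, here is a minimal Python/networkx sketch of ordering a root node's neighborhood by a graph labeling. The function name normalize_receptive_field, the choice of betweenness centrality, and the node-id tie-break are illustrative assumptions only; the actual method breaks the remaining symmetries with NAUTY rather than by node id.

```python
import networkx as nx

def normalize_receptive_field(G, root, k=10):
    # Sketch only: order a neighborhood of `root` by a graph labeling.
    # Betweenness centrality stands in for the labeling procedures discussed
    # in the paper; remaining ties are broken by node id here, whereas the
    # actual method uses NAUTY to break symmetries left after the labeling.

    # 1. Collect the k nodes closest to the root (breadth-first distance).
    dist = nx.single_source_shortest_path_length(G, root)
    nodes = sorted(dist, key=dist.get)[:k]

    # 2. Apply a labeling to the induced neighborhood subgraph.
    cent = nx.betweenness_centrality(G.subgraph(nodes))

    # 3. Order nodes by distance to root, then labeling, then node id.
    order = sorted(nodes, key=lambda v: (dist[v], -cent[v], v))

    # 4. Pad with dummy entries if the neighborhood has fewer than k nodes.
    return order + [None] * (k - len(order))

G = nx.krackhardt_kite_graph()
print(normalize_receptive_field(G, root=0, k=6))
```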
Reviewer 3 also provided some helpful comments:
L111: Intuitively speaking, the features learned in the lowest layer represent local graph neighborhoods. The higher conv layers combine these local features into features that represent larger subgraphs, and so on; they therefore aggregate local neighborhood structure into features that represent more and more global graph structure.
L311: Thanks!
L382: We agree that it is confusing to use both "vector" and "tensor". We used "tensor" because there can be several attributes per node/edge; we agree, however, that it is better to stick with "vector" when describing the mapping.
L489: Yes, a procedure to generate a labeling (such as certain centrality measures).
L519: Thanks!
L520: Theorem 2 is intended as a way to compare existing labeling procedures such as WL and centrality measures. The perhaps more interesting question is whether we can learn an (approximately) optimal labeling. This seems to be a hard problem, but it is something we are thinking about.
L525: The main reason for writing this is that it is true for common measures of graph and matrix distance. We will mention this.
L546: NAUTY is used to break ties after the labeling has been applied.
L595: A comparison to CNNs on images might be helpful. Here, the receptive field is fixed in size and so is the order of the nodes within it (so it is also 1D in that sense). The spatial structure of images is captured by the structure of the receptive fields; analogously, the local spatial structure of the graphs is captured by the receptive fields. (Apologies if we misunderstood the question. A small sketch of the 1D view follows below.)
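To illustrate this 1D view (and the two-conv-plus-one-dense architecture mentioned above), here is a minimal PyTorch sketch of how normalized receptive fields could feed a standard 1D CNN. The dimensions w, k, d as well as the channel counts and kernel sizes are illustrative assumptions, not the settings used in the paper.

```python
import torch
import torch.nn as nn

# Assumed (illustrative) dimensions: w receptive fields per graph, each of
# k nodes, d attributes per node; a graph becomes a (d, w * k) sequence.
w, k, d = 18, 10, 3

model = nn.Sequential(
    # First conv layer: kernel size and stride k, so each filter sees exactly
    # one normalized receptive field (the analogue of an image patch).
    nn.Conv1d(in_channels=d, out_channels=16, kernel_size=k, stride=k),
    nn.ReLU(),
    # Second conv layer combines features of neighboring receptive fields.
    nn.Conv1d(16, 8, kernel_size=5),
    nn.ReLU(),
    nn.Flatten(),
    # Dense layer for (binary) graph classification;
    # 8 channels * (w - 5 + 1) positions = 8 * 14 = 112 inputs.
    nn.Linear(8 * (w - 5 + 1), 2),
)

x = torch.randn(4, d, w * k)   # a batch of 4 graphs as attribute sequences
print(model(x).shape)          # torch.Size([4, 2])
```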