R3:
- "... Battaglia et al. ...": The contribution of this paper is to address a problem similar to that of Battaglia et al. by training feed-forward models, rather than relying on top-down, hand-engineered models of the world. As such, our approach has the potential to address many other problems involving physical understanding for which a rigid rendering engine would be impractical or impossible to use.

- We will revise Fig. 6 to improve clarity.

- Stochasticity: This is an interesting question that we intend to explore further. However, in the 4-block case, the diffuse predictions show that the model captures the rough sense of the outcome, even when the precise block locations are unclear (due to inherent randomness in how far the blocks roll or how they collide with one another). Thus the model is able to handle the degree of stochasticity present in this case.


R2:

- L.122-124. We agree and will reword.

- The results reported in the paper are indeed on the validation set. We were not very concerned about overfitting to the validation set, since we performed only very limited architecture tuning. We subsequently re-tested all of our models on an independent 30k-image test set, and the accuracy on the validation and test sets matched to within 1%. We will clarify this and update the reported results in the final version.

R6:

- "authors only focus on convolutional neural networks": While the focus of this paper was on CNNs, we also used other ML approaches (kNN, logistic regression) as baselines.