Paper ID: 413
Title: A Comparative Analysis and Study of Multiview CNN Models for Joint Object Categorization and Pose Estimation

Review #1
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes several variations of a base convolutional neural network that is tuned for object (not human) pose recognition and classification. The authors discuss the merits of each approach and analyze what makes each particular instantiation work. In particular, the authors analyze the structure of the convolutional network upon which all the variants are based, and motivate each variant and its performance from the initial analysis of a pre-trained model. The main contributions of this paper are the thorough analysis of the base network and the additional architectures. In addition, the results are several percent better than the state of the art. I also highly appreciate the fact that the authors took the effort of fully describing the architectures in the supplemental materials.

Clarity - Justification:
This paper is very easy to follow. Since this is an analysis/architecture paper, there isn't much to say except that the authors explain everything very clearly. The only seemingly missing detail is how many quantization levels were used for the angle. The discussion of the RGB-D dataset implies that 3 elevation angles were used, but this is never stated explicitly, as in "we used 3 possible elevation angle values". Figure 1 lists differences in angles between 20 and 100 degrees, which makes me wonder whether this was indeed the case. Again, this is the only area where it would really help to be explicit.

Significance - Justification:
This paper is definitely above average in the sense that the authors motivated their architectural decisions with empirical evidence. The number of experiments run was rather impressive, and a lot of care was taken to explain exactly how data was used.
Given this, and the supplemental materials, and of course the results, this makes for an interesting paper that's definitely above average.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This paper does a very good job of discussing the problem of joint object classification and pose estimation using variants of the same network, while explaining the design choices that were made for the variants presented. The most interesting aspect of the paper is that the EBM network was very nicely motivated by the experiments. Of course, it also helps that this paper presents state-of-the-art results on two benchmarks (and not by a small margin). Therefore I think this effort should be commended. I think it was also very nice of the authors to include detailed supplemental materials further explaining the details of the networks, thus making it possible for other researchers to fully replicate this work. In the final version of the paper, if accepted, I would definitely try to emphasize the number of possible angles (crucial for the cross-product network).

=====
Review #2
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents 3 slightly different CNN architectures for joint object pose and object class estimation. The 3 architectures differ in the layer depth at which the pose outputs are separated from the object category outputs. The authors evaluate their 3 models on the RGBD dataset and Pascal 3D+ and show that their best model, which shares the low-level feature layers but learns specialized feature maps for pose and category estimation, beats all previous approaches.

Clarity - Justification:
The paper was clearly written.

Significance - Justification:
I would qualify the contribution as incremental.

Detailed comments.
(Explain the basis for your ratings while providing constructive feedback.):
Positive points:
- Joint pose estimation and object categorization was, to the authors' and my knowledge, not explored before.
- The developed method is efficient at solving both tasks.

Negative points:
- The application of CNNs here is quite straightforward; there is no real new scientific knowledge to be gained from this work. The fact that early branching, i.e., specialized feature maps for different modalities, works better is not a surprise, as it was investigated for instance in [5].
- l.92: I disagree when the authors write that pose estimation has not received a lot of attention in the context of deep learning. As stated in the related work section, there are several works, such as [1-4], that employ CNNs for pose estimation, and in particular [1].

[1] Tulsiani et al. Viewpoints and Keypoints, CVPR 2015
[2] Carreira et al. Human Pose Estimation with Iterative Error Feedback, ICCV 2015
[3] Tulsiani et al. Pose Induction for Novel Object Categories, ICCV 2015
[4] Tompson et al. Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation, NIPS 2014
[5] Neverova et al. ModDrop: Adaptive Multi-modal Gesture Recognition, PAMI 2015

=====
Review #3
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a comparison of different techniques that extract viewpoint and category information from images of different objects. The main question that the submission addresses is from which layers to extract viewpoint information in the common AlexNet architecture. Experiments are conducted on different datasets, but with ground-truth localization available.

Clarity - Justification:
The content of the paper is presented sufficiently clearly. Most figures are of poor quality; they need to be bigger and augmented with axis descriptions.
The content of the sections was a bit unclear at first. I would have expected different sections for the different questions asked, and the section names "CNN Layer Analysis" and "Experiments" do not separate those clearly.

Significance - Justification:
The topic of viewpoint (pose) and joint category estimation is an important one in the field of computer vision and object detection. I am not convinced that the submission would generate enough interest from the audience of ICML. The scope of the paper is rather narrow, focussing on one single network architecture and one single task: pose estimation. It reports on experimental results that practitioners would be interested in, but for a comparative analysis I would have expected a broader set of experiments.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The approach is to take the AlexNet architecture and try out different strategies to also train it for viewpoint estimation. One shortcoming is that the viewpoint estimation is discrete, i.e., the viewpoint is discretized into a finite number of viewpoint labels. This will not generate a precise estimate of the viewpoint, at the level of single degrees. If I am not mistaken, the work of Pepik et al. 2015b also presents a regression approach for the very same problem. This could be compared on equal ground, measuring the angular error and not the classification error.

An important missing experiment is the following. As the paper correctly states, the comparison in Table 3 to Xiang et al. 2014, Tulsiani & Malik, 2014, and Pepik et al. 2015b is (slightly) unfair, as those works perform joint detection, categorization and viewpoint estimation. In other words, the experiments in this submission have the ground-truth bounding boxes available, whereas the others do not. I strongly advise separating the table to highlight this distinction; it is very likely that a non-expert will miss this point.
Also, the proposed method needs to be tested in the same scenario, e.g., by evaluating on bounding box proposals. This will provide the only valid comparison against the aforementioned works; the current Table 3 is not "slightly" unfair, but misleading. It is very easy to perform this experiment. One may argue that it is sufficient to compare pose estimation in isolation; that would be fine, but it should nonetheless be possible to relate the existing works along this dimension, so either adapt to the settings of previous works or apply those works in the new setting using ground-truth bounding boxes.

Eq. (1) is unclear; probably what is meant is E_{p(pose|x)}[pose] = ...

All figures are of poor quality, e.g., Figure 6 is too small, has no text that describes the labels, and further violates the formatting instructions in that it overflows at the top.

In summary, this paper is too narrow in scope, crucial experiments are missing, and it is unlikely to be of interest to the ICML audience.

=====
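
As an illustration of the distinction Review #3 draws between classification error and angular error, the following is a minimal sketch and not the paper's evaluation code. It assumes equal-width azimuth bins over [0, 360) and uses hypothetical helper names (`bin_center`, `angular_error`, `expected_pose`). The last function is one possible reading of the expectation the reviewer suggests for Eq. (1), computed as a circular mean so that angles near 0 and 360 degrees average sensibly.

```python
import math


def bin_center(bin_index, num_bins):
    """Center angle in degrees of a bin, assuming equal-width bins over [0, 360)."""
    width = 360.0 / num_bins
    return (bin_index + 0.5) * width


def angular_error(pred_deg, true_deg):
    """Smallest absolute difference between two angles, in degrees (range 0 to 180)."""
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def expected_pose(probs):
    """Circular mean of the bin centers, weighted by a predicted distribution over bins.

    This is an assumed interpretation of an expectation of the pose under
    p(pose | x); the source does not spell out the right-hand side of Eq. (1).
    """
    n = len(probs)
    s = sum(p * math.sin(math.radians(bin_center(i, n))) for i, p in enumerate(probs))
    c = sum(p * math.cos(math.radians(bin_center(i, n))) for i, p in enumerate(probs))
    return math.degrees(math.atan2(s, c)) % 360.0


if __name__ == "__main__":
    # A prediction in the neighboring bin counts as a full classification error,
    # yet its angular error is only one bin width.
    num_bins = 8
    predicted = bin_center(0, num_bins)
    actual = bin_center(1, num_bins)
    print(angular_error(predicted, actual))
```

Under this kind of metric, a method that predicts discrete viewpoint labels and a regression method can be scored on the same scale, which is the comparison the reviewer asks for.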