We thank all the reviewers for their valuable feedback.

R1's review does not give credit to several efforts in the paper, which suggests there may be a misunderstanding. We do not focus on "one single network architecture and one single task: pose estimation". As R4 and R5 also pointed out, we study jointly optimizing over both categorization and pose estimation (not pose alone). Our setting takes an image and predicts both the category and the pose; localization is not the goal of this work. This setting is already studied and motivated in prior work [e.g., Savarese et al. 2007 and 2008; Zhang et al. 2013 and 2015], which we compare with. Our main goal is to study joint category/pose estimation with a set of well-motivated experiments on models that vary in structure and/or loss. The dichotomous nature of this problem is at the heart of our paper. We studied several dimensions on the largest available datasets (RGBD and Pascal3D). We enumerate the explored dimensions below and also address R1's feedback.

A) We studied and compared multiple models: (1) BaseNetwork, (2) PM, (3) CPM, (4) LBM, (5) EBM. These models are motivated by the results in Fig. 1, as R4 mentioned.

B) Layer-by-layer analysis of the models on the two datasets. All of the following aspects were studied on the five models and on every layer (i.e., each is applied 5 × num_of_layers times):
(B.1) (Category)-SVM: shown as layer-classification performance in Fig. 3 (right) for RGBD and Fig. 4 (right) for Pascal3D.
(B.2) (Pose)-SVR: shown as layer-pose performance in Fig. 3 (left) for RGBD and Fig. 4 (left) for Pascal3D.
(B.3) (Pose/Category)-NN: see Fig. 3 (left) for pose on RGBD and Fig. 4 (left) for Pascal3D; the category-NN analysis is in the supplementary.
(B.4) Local pose measurements: additional local pose metrics were evaluated per layer (see L610-618).
Additional results are in the supplementary.

C) We discussed several findings from our analysis (Sec. 6 and 8).

D) Overall performance of the models: RGBD dataset in Table 1 and Pascal3D in Table 3 (rows 6 to end).

E) We showed the convergence advantage of EBM over the other models in Fig. 6.

F) Fair comparison to the state of the art:
(F.1) RGBD dataset, Table 2: all the related methods in Table 2 use exactly our setting, so the comparison is completely fair and shows the value of our method.
(F.2) Pascal3D dataset, Table 3 (rows 5 to end, including [Zhang et al. 2015]): these use our exact problem setting. Our results are still much better than [Zhang et al. 2015].
(F.3) To our knowledge, our results fairly report the best performance on both the RGBD and Pascal3D datasets under our setting (no localization).
(F.4) Pascal3D dataset: in Table 3, the methods in rows 1 to 4 perform pose estimation after localization; the aim is to give the reader broader context by reporting other related results on Pascal3D while being explicit about fair comparison. We already mention this twice, in L684-689 and L743-745. We will further highlight the differences between the methods in the final version with an additional column in Table 3 indicating whether a method localizes the object first. Note that we only compare pose: localization is not our goal, and we argue that the existing results are conclusive and support our claims. Integrating localization using, for instance, RCNN as (Pepik et al., 2015) did is interesting, but that work does not perform joint prediction and does not cover several analysis aspects addressed in our work. The methods in Table 3 (rows 1-4) do not jointly learn category and pose as we do.

R4: The number of pose bins is 16 in all the models. The output layer has 16 × num_object_classes nodes for CPM. We also tried 8 and 32 bins; we found that 16 bins capture a good trade-off between being as fine as possible and still discriminating between poses.
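The pose discretization and the CPM output layout described in our answer to R4 can be sketched as follows (a minimal illustrative sketch; the function name, bin convention, and class count are assumptions, not the exact implementation):

```python
# Illustrative sketch, not the paper's exact code: the viewing angle is
# quantized into 16 bins, and the CPM output layer has
# 16 * num_object_classes nodes (one pose distribution per category).
NUM_POSE_BINS = 16

def pose_to_bin(angle_deg, num_bins=NUM_POSE_BINS):
    """Map a continuous azimuth in degrees to one of `num_bins` bins."""
    return int(angle_deg % 360 // (360 / num_bins))

num_object_classes = 51  # e.g. the RGBD dataset has 51 categories
cpm_output_size = NUM_POSE_BINS * num_object_classes  # 816 output nodes
```

With 16 bins each bin spans 22.5 degrees, so e.g. `pose_to_bin(180)` falls in bin 8.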
We will release the trained models (~15 models on RGBD split 1, RGBD split 2, and Pascal3D).

R5: Regarding line 92 in the related-work section, we agree that this part should be improved in the context of the new references [1-4]. We will fix it in the final version. Apart from the summarized performance in Tables 1-3, we highlight the layer-by-layer analysis and the convergence results (Fig. 6).

R1: Eq. 1 is the conditional expectation of the pose given the image, E(\phi(pose) | x) = \sum_i p(pose_i | x) \phi(pose_i), where p(pose_i | x) is produced by the softmax pose layer and \phi(pose_i) is the discrete pose corresponding to the i-th bin. This enables continuous prediction of the pose. We used consistent loss functions for both categorization and pose to avoid biasing one task over the other.

Deeper layers: one dimension of exploration is to try our models with deeper layers, but this is not our main goal. Due to space constraints, we will try the proposed models with deeper layers in an extended version.

Figures: we apologize if some figures appear small, but they are very high resolution and can be seen clearly electronically by zooming in. We will improve them in the final version.
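The Eq. 1 readout explained in our response to R1 can be sketched as follows (a minimal NumPy sketch under assumptions: the bin centers and the stable-softmax formulation are illustrative, and angle wrap-around is ignored for brevity):

```python
import numpy as np

def expected_pose(logits, bin_centers):
    """Eq. 1 sketch: E(phi(pose) | x) = sum_i p(pose_i | x) * phi(pose_i),
    where p(. | x) is the softmax over the pose-layer logits and
    phi(pose_i) is the discrete pose (bin center) of the i-th bin."""
    exp = np.exp(logits - np.max(logits))  # numerically stable softmax
    probs = exp / exp.sum()
    # Weighted average of bin centers = continuous pose prediction.
    return float(np.dot(probs, bin_centers))

# 16 pose bins spanning 360 degrees (assumed bin centers for illustration).
bin_centers = np.arange(16) * 22.5
```

For example, logits sharply peaked on bin 4 yield a prediction near 90 degrees, while uniform logits yield the mean of the bin centers; this is how the discrete softmax bins produce a continuous pose estimate.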