We thank all three reviewers for appreciating our theoretical contribution. All three reviewers acknowledged that end-to-end training of rank pooling for video data is an interesting and useful contribution. We demonstrate the advantages of rank pooling on two datasets. However, some valid concerns were raised regarding the experimental results, and we address them in this rebuttal with new experiments. These new experiments show that the improvements indeed come mainly from the end-to-end training of the rank-pooling operator. We hope that these new experiments are convincing and show that our main contribution is technically useful.
As requested by Assigned_Reviewer_4, we performed an experiment to determine whether the improvement comes from end-to-end training or merely from fine-tuning combined with the rank-pooling operator.

In this controlled experiment, we start from the same pre-trained model as before and fine-tune the Caffe reference model on frame data, treating each frame as an instance of its video's action category. We then extract FC7 features from the frames of each video and encode the temporal information of these fine-tuned FC7 features using rank pooling. Finally, we use a soft-max classifier to classify the videos. This baseline obtains 73.0% on the UCF Sports dataset and 36.3 mAP on the Hollywood2 dataset, which is much lower than what our end-to-end trained network obtains: 87% (UCF Sports) and 40.6 mAP (Hollywood2).
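
As a rough illustration of the temporal encoding step in this baseline, the sketch below pools per-frame features into a single video descriptor. It is not our implementation: the function names `time_varying_mean` and `rank_pool`, the parameter `reg`, and the use of ridge regression onto the frame index as a simple stand-in for the ranking objective are all assumptions made for the example.

```python
import numpy as np

def time_varying_mean(frames):
    """Running mean of per-frame features; frames has shape (T, D)."""
    frames = np.asarray(frames, dtype=np.float64)
    counts = np.arange(1, frames.shape[0] + 1)[:, None]
    return np.cumsum(frames, axis=0) / counts

def rank_pool(frames, reg=1e-3):
    """Return a (D,) descriptor w whose scores v_t . w grow with time t.

    Assumption: ridge regression of the frame index on the smoothed
    features replaces the ranking objective for the sake of brevity.
    """
    v = time_varying_mean(frames)                      # (T, D)
    t = np.arange(1, v.shape[0] + 1, dtype=np.float64) # target order 1..T
    d = v.shape[1]
    # Closed-form ridge solution: (V^T V + reg * I) w = V^T t
    return np.linalg.solve(v.T @ v + reg * np.eye(d), v.T @ t)

# Toy video: 6 frames, 2-dimensional stand-in for FC7 features.
frames = np.stack([np.arange(1, 7), np.ones(6)], axis=1)
w = rank_pool(frames)
scores = time_varying_mean(frames) @ w
```

In the baseline, a descriptor of this kind would be computed once per video from the fine-tuned FC7 features and then passed to the soft-max classifier; no gradient flows back through the pooling step.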

We also evaluate a second baseline based on frame-level fine-tuning, in which we fuse the FC8 classifier scores by summing them over the frames of each video. This baseline obtains 70% on UCF Sports and 34.1 mAP on Hollywood2. In this case, there is no temporal encoding. We observe that the results increase steadily: fine-tuning < fine-tuning + rank pooling < end-to-end (fine-tuning + rank pooling).
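
The score fusion used in this second baseline can be sketched as follows. The function name `fuse_frame_scores` and the toy scores are hypothetical and serve only to show the summation over frames.

```python
import numpy as np

def fuse_frame_scores(frame_scores):
    """Sum per-frame class scores of shape (T, C) into one video-level score.

    Returns the predicted class index and the summed scores. The frame
    order plays no role, so no temporal information is encoded.
    """
    summed = np.asarray(frame_scores, dtype=np.float64).sum(axis=0)
    return int(np.argmax(summed)), summed

# Toy example: 3 frames, 2 classes (stand-in for FC8 outputs).
label, summed = fuse_frame_scores([[0.2, 0.8], [0.9, 0.1], [0.6, 0.4]])
```

Because the sum is invariant to any permutation of the frames, this baseline isolates the effect of fine-tuning from that of temporal encoding.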

These results indicate that fine-tuning alone does not account for the improvement in action recognition results. Moreover, considering these new results together with those in Table 1 and Table 2, we conclude that end-to-end training of rank pooling is useful for action recognition.

Some concerns were raised regarding the state of the art on Hollywood2. In the original rank pooling paper of Fernando et al. (CVPR15), the authors combined four hand-crafted features: HOG, HOF, MBH and TRJ. Moreover, Fernando et al. (CVPR15) used several kinds of data augmentation (forward and reverse rank pooling, and mirrored video data) to reach 70.0 mAP after combining all features. Taken individually, these features obtained 45.3 for HOG, 59.8 for HOF, 60.5 for MBH and 49.8 for TRJ. We show in this paper that we can improve the performance of CNN features from 31.0 (vanilla rank pooling) to 40.6 using end-to-end training (note that fine-tuned network + rank pooling + soft-max gives 36.3 mAP). Our objective is not to obtain state-of-the-art results but to show that the rank-pooling operator (Fernando 2015) can be improved in the context of CNN-based video classification. If we simply use the vanilla rank-pooling operator on CNN features as in Fernando et al. (CVPR15), we obtain only 31.0 mAP.

Furthermore, our end-to-end training is useful for obtaining state-of-the-art results when the end-to-end CNN features are combined with HOG+HOF+MBH features, which yields a superior performance of 73.4 mAP without any data augmentation. If we use vanilla rank pooling with CNN features and combine it with HOG+HOF+MBH, we obtain only 71.4 mAP. These results indicate that our end-to-end training of video networks with rank pooling is also useful when combined with the hand-crafted HOG, HOF and MBH features.

We agree to remove some of the claims, such as the use of the word "elegant" and phrases such as "large video collections are hard to come by". However, such terms and phrases do not detract from the main contribution of the paper.

Our main contribution is to show that it is possible to use rank pooling in an end-to-end learning network. To the best of our knowledge, no one has previously shown how to compute the gradients of the rank-pooling operator and use it inside a neural network as a temporal encoding mechanism. We believe that our method could be useful in many other sequence classification domains, as also pointed out by the reviewers.

In the final version of the paper, we will include all of these new results and new baselines. We will also incorporate all the minor changes requested by Assigned_Reviewer_7. Once again, we thank all three reviewers and the area chair for reading our paper and providing very useful comments. We hope that we have successfully addressed the main concerns of the reviewers.