Paper ID: 554
Title: Learning End-to-end Video Classification with Rank-Pooling

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
Having observed that it is challenging to design a good video representation for recognition, the authors propose an end-to-end CNN learning system containing a module that learns a rank-pooling over a temporal sequence of features extracted from frames. The paper discusses in depth how to derive the learning algorithm for this module through an approximate Hessian. Experiments are conducted to show the effectiveness of the method.

Clarity - Justification:
Overall, the paper is well written and easy to follow, but it contains some overclaims. Lines 74 to 76 state, "For sequence data such as videos, obtaining labelled data is more costly than obtaining labels for static images," and claim that this is reflected in the sizes of existing datasets. However, the Sports-1M dataset (http://cs.stanford.edu/people/karpathy/deepvideo/) contains more than one million labeled sports videos, and it is not clear how video labels are harder to collect than image labels, beyond the extra storage required. The motivation is therefore somewhat problematic. On line 127, the paper says "we present an elegant method..."; whether the method is elegant should be judged by the audience, and a technical paper should avoid describing its own method as "elegant" (it is probably fine to call other methods elegant). On the Hollywood2 dataset, the proposed algorithm achieves only 40% mAP, while the state of the art (VideoDarwin) is 70% mAP. The paper attributes this to "improved trajectory features (MBH, HOG, HOF) [being] highly engineered for this task..." (lines 751 to 753). Why not run the proposed method with the same features to verify that the gap is indeed due to better features? This conclusion is very shaky unless the relevant experiments are conducted.
Significance - Justification:
Reducing representation learning for video to a bilevel optimization problem is interesting, but it is hard to say what is new in this paper beyond what has already been published in the bilevel optimization literature. Results-wise, the proposed system works well on the UCF-sports dataset but falls significantly behind the state of the art on Hollywood2. Unfortunately, the paper does not fully investigate this issue, so it is unclear whether the proposed bilevel optimization would really benefit a large-scale video recognition system.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I really want to like this paper despite the overclaims, because of the interesting rank-pooling representation solved by bilevel optimization. But the experimental results tell a different story than the paper does: the 40.6% mAP on Hollywood2 is far behind the 70% state of the art. The numbers, as they stand, suggest that other factors, such as the choice of features, matter far more than the representation itself; thus I would draw the opposite conclusion from the authors. I wish the authors had conducted more thorough experiments to back up the paper's points with strong numerical evidence.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper introduces a new model for representing and learning video sequences, making use of the rank pooling of Fernando et al. (CVPR 2015). It demonstrates that a network structure consisting of per-frame CNNs followed by a rank-pooling layer can be learned in an end-to-end fashion. The paper reports action classification results on two datasets (UCF sports and Hollywood2).

Clarity - Justification:
The paper was very easy to read and follow. The derivations in particular are quite clear.

Significance - Justification:
The results are potentially quite compelling.
The insight that a network can be fine-tuned through the pooling layer to improve video classification accuracy is not novel, but the fact that fine-tuning through a more advanced rank-pooling layer can yield a 16% improvement in accuracy over fine-tuning through an average- or max-pooling layer (Table 1) is valuable. However, the experiments do not convincingly demonstrate whether the improvements come from (1) fine-tuning the lower layers (below the "Temporal pooling" layer), which is well established to be important for accuracy, (2) rank-pooling the fine-tuned features, which is also established in the Fernando et al. CVPR 2015 work, or (3) the proposed method of end-to-end training through the pooling layer. A simple baseline that would help ascertain that it is in fact (3):
- Treat each video frame as positive or negative depending on the video label and fine-tune a per-frame CNN to build a frame-level classifier (this bears similarity to "avg-pool-CNN-end-to-end" of Table 1).
- Rank-pool the fc7 features of the frame-level classifier and feed them to a softmax classifier.
If this baseline is significantly below the end-to-end trained model, it is reasonable to conclude that the accuracy improvements come specifically from end-to-end training rather than anything else.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is well written and proposes a reasonable classification framework. I would have been comfortable voting for acceptance if the baseline experiment above had been present.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper describes an approach to video classification using rank-pooling. The main novelty of the paper is a method that enables end-to-end learning with rank-pooling. Results show that enabling end-to-end learning yields significant boosts in accuracy.
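The two-stage baseline proposed in Review #2 above could be sketched as follows. This is a minimal pipeline sketch, not the paper's implementation: `frame_cnn` (an fc7 feature extractor from the already fine-tuned frame-level classifier) and `rank_pool` (the pooling of Fernando et al.) are assumed to be supplied by the experimenter.

```python
import numpy as np

def baseline_representation(videos, frame_cnn, rank_pool):
    """Stage 2 of the proposed baseline: rank-pool fc7 features from an
    already fine-tuned frame-level CNN (stage 1), yielding one fixed-length
    vector per video to feed a separate softmax classifier.

    videos    : iterable of videos, each an iterable of frames
    frame_cnn : assumed fc7 feature extractor, frame -> (D,) vector
    rank_pool : assumed pooling function, (T, D) features -> (D,) vector
    """
    return np.stack([
        rank_pool(np.stack([frame_cnn(frame) for frame in video]))
        for video in videos
    ])
```

The key point of the sketch is what it omits: no gradients ever flow through `rank_pool` into the CNN, which is exactly the contrast with the paper's end-to-end training that the reviewer wants measured.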
Clarity - Justification:
Given the highly technical content, the paper was surprisingly easy to understand. Very well written.

Significance - Justification:
While the authors show results on video classification, the method could be applied to many classification tasks that take sequential data as input.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This is a well-written paper with a novel approach and good results. I see only two negatives:
1. Results are not shown on larger video classification datasets, such as UCF101.
2. The results on Hollywood2 are significantly below the state of the art.
I am less concerned about (2), since the paper's main contribution is not "better accuracy" but a method for doing end-to-end training with rank-pooling. Results on UCF101 would have greatly strengthened the paper, but I am still in favor of acceptance.

Minor comments:
1. Table 2 is hard to read and to gather trends from. Could the results be shown visually?
2. Figure 1 could be collapsed; it is not necessary to show all the Conv/ReLU/Pool layers. An illustration of the rank-pooling method would be more useful.
3. Since rank-pooling is central to the paper, I recommend describing it in more detail earlier in the paper for readers who are not familiar with it.

=====
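Since rank-pooling is central to all three reviews, a minimal sketch of the idea from Fernando et al. (CVPR 2015) may help: frame features are smoothed with a time-varying mean, and the video is then represented by the parameters of a linear function that orders the smoothed features in time. The least-squares fit below is a simplified surrogate for the ranking-SVM objective used in the original work, not the exact formulation.

```python
import numpy as np

def rank_pool(frames):
    """Rank-pool a (T, D) sequence of frame features into one D-dim vector.

    Simplified surrogate: instead of a ranking SVM, fit least-squares
    weights u such that u . v_t increases with the frame index t, where
    v_t is the time-varying mean of the features up to time t.
    """
    T = frames.shape[0]
    V = np.cumsum(frames, axis=0) / np.arange(1, T + 1)[:, None]  # smoothed features v_t
    t = np.arange(1, T + 1, dtype=float)                          # ordering targets
    u, *_ = np.linalg.lstsq(V, t, rcond=None)                     # u is the video representation
    return u
```

The vector u summarizes the temporal evolution of the video and can be fed to a classifier; the question raised by the reviews is whether gradients should also flow backward through this pooling step into the per-frame CNN (end-to-end training), and how much of the accuracy gain that step alone accounts for.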