
Workshop: ICML Workshop on Human in the Loop Learning (HILL)

Interpretable Video Transformers in Imitation Learning of Human Driving

Andrew Dai


Transformers perform strongly on high-level vision tasks, in large part because their self-attention sublayers compute affinity weights across tokens that correspond to image patches. A simple Vision Transformer encoder can also be trained on video clips from popular driving datasets in a weakly supervised imitation learning task, framed as predicting future human driving actions as a time series over a prediction horizon. In this paper, we propose this task as a simple, scalable approach to autonomous vehicle planning that matches human driving behaviour. We present initial results for this method, along with model visualizations that help interpret which features of the video input contribute to the sequence predictions.
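
As a rough illustration of the framing described above, the following minimal NumPy sketch splits a video clip into patch tokens, computes single-head self-attention affinity weights across those tokens, and maps the pooled representation to a sequence of actions over a prediction horizon. It is not the architecture from the paper: the function names, the single attention head, the absence of positional embeddings, the mean pooling, the linear prediction head, the two-dimensional action (for example steering and speed), and the mean squared error loss are all assumptions made for illustration.

```python
import numpy as np


def patchify(clip, patch):
    """Split a clip of shape (T, H, W, C) into flattened patch tokens."""
    t, h, w, c = clip.shape
    assert h % patch == 0 and w % patch == 0, "frame size must be divisible by patch"
    x = clip.reshape(t, h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # group the pixels of each patch together
    return x.reshape(t * (h // patch) * (w // patch), patch * patch * c)


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention; also returns the affinity weights."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # row i: how much token i attends to each token
    return weights @ v, weights


def init_params(rng, patch_dim, d_model, horizon, action_dim):
    """Random, untrained weights (hypothetical layout, for illustration only)."""
    scale = 0.02
    return {
        "embed": rng.normal(0.0, scale, (patch_dim, d_model)),
        "w_q": rng.normal(0.0, scale, (d_model, d_model)),
        "w_k": rng.normal(0.0, scale, (d_model, d_model)),
        "w_v": rng.normal(0.0, scale, (d_model, d_model)),
        "head": rng.normal(0.0, scale, (d_model, horizon * action_dim)),
    }


def predict_actions(clip, params, patch=8, horizon=5, action_dim=2):
    """Predict `horizon` future actions from one clip; returns (actions, attention weights)."""
    tokens = patchify(clip, patch) @ params["embed"]
    attended, weights = self_attention(
        tokens, params["w_q"], params["w_k"], params["w_v"]
    )
    pooled = attended.mean(axis=0)  # assumed pooling: average over all tokens
    actions = (pooled @ params["head"]).reshape(horizon, action_dim)
    return actions, weights


def imitation_loss(predicted, recorded):
    """Mean squared error between predicted and recorded human actions (assumed loss)."""
    return float(np.mean((predicted - recorded) ** 2))
```

With a clip of 4 frames at 16x16 resolution and a patch size of 8, `predict_actions` produces 16 tokens, a 16x16 matrix of attention weights whose rows each sum to 1, and a (5, 2) array of predicted actions. Attention matrices of this kind are one possible input to the sort of visualization the abstract mentions.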
