Publications

Space-time Mixing Attention for Video Transformer (a.k.a. Video TimeShifter)

Published

Neural Information Processing Systems (NeurIPS)

Date

2021.12.08

Abstract

This paper is on video recognition using Transformers. Very recent attempts in this research direction have demonstrated promising results in terms of recognition accuracy, yet they have also been shown to induce significant computational overheads due to the additional modelling of the temporal information. In this work, we propose a transformer-based model for video recognition whose complexity scales linearly with the number of frames in the video sequence and hence induces no overhead compared to an image-based Transformer model. To achieve this, our model makes two approximations to the full spatio-temporal attention used in video transformers: (a) it uses temporal channel shifting to attend to both spatial and temporal locations without inducing any additional cost on top of a spatial-only attention model; (b) it restricts temporal attention to a local temporal window and capitalizes on the Transformer's depth to obtain full temporal coverage of the video sequence. Our second contribution is to show how to integrate two very lightweight mechanisms for global temporal-only attention, which provide additional accuracy improvements at minimal computational cost. We demonstrate that our model is surprisingly effective at capturing long-term temporal dependencies, producing very high recognition accuracy on the most popular video recognition datasets, including Something-Something, Kinetics, and Epic Kitchens, while at the same time being significantly more efficient than other transformer-based video recognition models.
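To make the core idea concrete, the snippet below is a minimal, hypothetical PyTorch sketch of attending across a local temporal window by shifting a fraction of key/value channels between neighbouring frames before spatial-only attention. It is not the paper's released implementation; names such as `temporal_shift`, `SpaceTimeMixingAttention`, and the `shift_ratio` of 0.25 are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def temporal_shift(x, num_frames, shift_ratio=0.25):
    """Illustrative sketch: shift a fraction of channels along the time axis.

    x: (batch * num_frames, num_tokens, dim) token embeddings.
    A `shift_ratio` fraction of channels is taken from frame t-1 and another
    fraction from frame t+1, so spatial-only attention over the shifted tokens
    also mixes information from a local temporal window at no extra attention cost.
    """
    bt, n, d = x.shape
    b = bt // num_frames
    x = x.view(b, num_frames, n, d)
    fold = int(d * shift_ratio)

    out = torch.zeros_like(x)
    out[:, 1:, :, :fold] = x[:, :-1, :, :fold]                   # channels from frame t-1
    out[:, :-1, :, fold:2 * fold] = x[:, 1:, :, fold:2 * fold]   # channels from frame t+1
    out[:, :, :, 2 * fold:] = x[:, :, :, 2 * fold:]              # remaining channels unchanged
    return out.view(bt, n, d)


class SpaceTimeMixingAttention(nn.Module):
    """Spatial self-attention whose keys/values carry temporally shifted channels
    (a sketch of the idea, not the authors' code)."""

    def __init__(self, dim, num_heads=8, shift_ratio=0.25):
        super().__init__()
        self.num_heads = num_heads
        self.shift_ratio = shift_ratio
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, num_frames):
        # x: (batch * num_frames, num_tokens, dim)
        bt, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Mix channels from adjacent frames into keys and values only;
        # queries stay per-frame, so complexity matches spatial-only attention.
        k = temporal_shift(k, num_frames, self.shift_ratio)
        v = temporal_shift(v, num_frames, self.shift_ratio)

        def split_heads(t):
            return t.view(bt, n, self.num_heads, d // self.num_heads).transpose(1, 2)

        q, k, v = map(split_heads, (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)            # attention over spatial tokens only
        out = out.transpose(1, 2).reshape(bt, n, d)
        return self.proj(out)


if __name__ == "__main__":
    frames, tokens, dim = 8, 196, 768
    x = torch.randn(2 * frames, tokens, dim)                     # 2 clips of 8 frames each
    attn = SpaceTimeMixingAttention(dim)
    print(attn(x, num_frames=frames).shape)                      # torch.Size([16, 196, 768])
```

Because attention is still computed over the spatial tokens of each frame and only the key/value channels are shifted, the attention cost in this sketch grows linearly with the number of frames, which is the scaling behaviour the abstract describes.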