Publications

Low-fidelity end-to-end pre-training for temporal action localization

Published

Neural Information Processing Systems (NeurIPS)

Date

2021.12.07

Abstract

Temporal action localization is a fundamental yet challenging task in video understanding. However, most existing models developed for this task are pre-trained on general video action classification tasks, which do not acquire localization sensitivity: the representations of the same video at different temporal locations are similar. In this paper, for the first time, we investigate model pre-training for temporal action localization by introducing a novel method. Instead of designing proxy tasks that are sensitive to action location, we propose to down-scale existing temporal action localization solutions so that the video encoder can be trained in a localization-sensitive manner. When we train the down-scaled temporal action localization model, gradients from a temporal action localization loss back-propagate to the video encoder, which enables the learning of video representations that transfer much better to temporal action localization. Extensive experiments show that the proposed LSP is superior to its existing action classification-based pre-training counterpart, and achieves new state-of-the-art performance.
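The core idea above (a localization loss whose gradient reaches the video encoder, rather than a classification loss stopping at a detached head) can be illustrated with a toy numerical sketch. Everything here is hypothetical and not from the paper: a linear "encoder" and a linear per-frame localization head are trained end-to-end, and one gradient step driven purely by the localization loss updates the encoder weights.

```python
import numpy as np

# Hypothetical minimal sketch (names illustrative, not from the paper):
# a linear "video encoder" and a linear localization head, trained
# end-to-end so the localization loss updates the ENCODER weights.
rng = np.random.default_rng(0)
T, D_in, D_feat = 8, 16, 4             # time steps, input dim, feature dim

X = rng.normal(size=(T, D_in))         # per-frame clip features over time
y = (np.arange(T) >= 3).astype(float)  # toy per-frame "inside action" labels

W_enc = rng.normal(scale=0.1, size=(D_in, D_feat))  # encoder weights
w_head = rng.normal(scale=0.1, size=(D_feat,))      # localization head

def forward(W_enc, w_head):
    F = X @ W_enc                      # encoder output: per-frame features
    p = F @ w_head                     # per-frame action score
    loss = np.mean((p - y) ** 2)       # localization (frame-labelling) loss
    return F, p, loss

F, p, loss0 = forward(W_enc, w_head)

# Back-propagation by hand: the localization gradient flows past the
# head and into the encoder, which is the point of end-to-end training.
dp = 2.0 * (p - y) / T                 # dL/dp
dW_head = F.T @ dp                     # gradient at the head
dF = np.outer(dp, w_head)              # gradient flowing back into features
dW_enc = X.T @ dF                      # gradient at the encoder weights

lr = 0.01
W_enc -= lr * dW_enc                   # encoder update driven by the
w_head -= lr * dW_head                 # localization loss, not a proxy task
_, _, loss1 = forward(W_enc, w_head)
print(loss1 < loss0)
```

With a classification-style pre-training setup, `dW_enc` would instead come from a clip-level label shared by all temporal positions, so representations at different locations would receive the same supervisory signal; here each frame's localization error contributes its own gradient into the encoder.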