ReGen: A Good Generative Zero-Shot Video Classifier Should Be Rewarded
Published at the International Conference on Computer Vision (ICCV)
Abstract
Generative video classifiers have recently emerged as a promising direction for open-set action recognition, as they can produce sample-specific outputs. However, such models tend to lose their zero-shot ability when trained directly on class names. To alleviate this, we propose a novel reinforcement-learning-based framework with a tri-fold objective and corresponding reward functions: 1) a taxonomy-level discrimination reward that encourages the predicted text to be correctly classified as the ground-truth class, 2) a fine-grained CLIP reward that encourages sample-specific outputs, and 3) a grammar and coherence reward that preserves the fluency of the generated text. When evaluated on standard classification benchmarks, our approach significantly outperforms prior generative approaches to zero-shot classification.
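The tri-fold objective above can be sketched as a single scalar reward for the RL policy. The sketch below is a minimal illustration under assumptions: the component names, weights, and scoring functions are hypothetical placeholders, not the paper's actual implementation (e.g., the taxonomy reward would in practice come from a text classifier over the generated caption, the CLIP reward from video-text embedding similarity, and the grammar reward from a language-model fluency score).

```python
# Hypothetical sketch of the tri-fold reward; weights and signatures are
# illustrative assumptions, not the authors' implementation.

def taxonomy_reward(pred_class: str, true_class: str) -> float:
    # 1) Taxonomy-level discrimination: 1 if the generated text is
    # classified as the ground-truth class, else 0.
    return 1.0 if pred_class == true_class else 0.0

def clip_reward(similarity: float) -> float:
    # 2) Fine-grained CLIP reward: similarity between the generated
    # text and the video embedding (here passed in precomputed).
    return similarity

def grammar_reward(fluency: float) -> float:
    # 3) Grammar and coherence: a fluency score for the generated text.
    return fluency

def total_reward(pred_class: str, true_class: str,
                 similarity: float, fluency: float,
                 w_tax: float = 1.0, w_clip: float = 1.0,
                 w_gram: float = 0.5) -> float:
    # Weighted sum of the three components; the policy is then updated
    # to maximize this scalar (e.g., via a policy-gradient method).
    return (w_tax * taxonomy_reward(pred_class, true_class)
            + w_clip * clip_reward(similarity)
            + w_gram * grammar_reward(fluency))
```

With the default weights, a correctly classified caption with CLIP similarity 0.8 and fluency 0.9 would receive a reward of 1.0 + 0.8 + 0.45 = 2.25.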