Token Fusion: Bridging the Gap between Token Pruning and Token Merging


IEEE Winter Conference on Applications of Computer Vision (WACV)





Vision Transformers (ViTs) have gained traction as powerful backbones in computer vision, outstripping many traditional CNNs. However, their computational overhead, largely attributable to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging. One emergent solution revolves around token reduction techniques, including token pruning and token merging. In this paper, we introduce Token Fusion (ToFu), an innovative method that combines the strengths of both token pruning and token merging. Token pruning excels when the model is sensitive to input interpolations, whereas token merging shines when the model is smoother with respect to the input; ToFu exploits this distinction to combine the two strategies. Furthermore, we address a shortcoming of average merging, which fails to conserve the inherent feature norm and thereby induces a distribution shift. To counteract this, we present MLERP merging, an evolution of the SLERP technique, adept at consolidating multiple tokens while preserving the norm distribution. ToFu can be applied to ViTs with or without training, and our empirical evaluations suggest that it sets a new benchmark in both classification and image generation tasks in terms of computational efficiency and model accuracy.
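To illustrate the norm-shrinkage issue that motivates MLERP merging, the sketch below compares plain average merging against a hypothetical norm-preserving merge (average the token directions, then rescale to the mean input norm). This is only an illustration of the general idea, under our own assumptions; the paper's exact MLERP formulation may differ.

```python
import numpy as np

def average_merge(tokens):
    # Plain average merging: when the input tokens are not perfectly
    # aligned, the merged token's norm shrinks below the input norms,
    # shifting the feature distribution.
    return tokens.mean(axis=0)

def norm_preserving_merge(tokens):
    # Hypothetical norm-preserving merge in the spirit of MLERP:
    # average the directions, then rescale so the output norm equals
    # the mean of the input norms. (Assumed formulation, not the
    # paper's exact definition.)
    norms = np.linalg.norm(tokens, axis=1)
    mean_dir = tokens.mean(axis=0)
    mean_dir = mean_dir / np.linalg.norm(mean_dir)
    return mean_dir * norms.mean()

rng = np.random.default_rng(0)
toks = rng.normal(size=(4, 8))  # 4 tokens, 8-dim features

print(np.linalg.norm(toks, axis=1).mean())              # mean input norm
print(np.linalg.norm(average_merge(toks)))              # shrinks
print(np.linalg.norm(norm_preserving_merge(toks)))      # preserved
```

Running this shows the averaged token's norm falling well below the mean input norm, while the rescaled merge matches it by construction.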