Abstract
Transformers have attracted growing interest for processing different modalities, including language and images. As a result, we can process vision and language data using transformers that are architecturally similar. This property of transformers provides us with several opportunities. This study explores weight sharing across two transformer backbones, weight sharing within the same backbone, and pruning, all in a unified framework. More specifically, we investigate weight sharing and pruning for two components of the transformers: (1) Multi-Head Self-Attention (MSA) and (2) Feed-Forward Network (FFN) layers. To jointly perform weight sharing and pruning, we propose to use a regularization term that aligns the model weights with the desired structure during the pre-training step. The structure vectors for sharing and pruning are generated by a hypernetwork, which can capture complex interactions between pruning and sharing across layers and modalities. The hypernetwork and model weights are trained iteratively so that the learned structure evolves along with the model weights. After minimizing the proposed objective in the pre-training step, we perform weight sharing and pruning and fine-tune the model on downstream tasks. We perform experiments on vision and language tasks, including Referring Expression Comprehension (REC) and Visual Question Answering (VQA), using the state-of-the-art models MDETR and GLIP. Our experiments show that we can reduce the size of MDETR and GLIP by 35-40% by sharing and pruning MSA and FFN weights without a significant loss in accuracy.

1 INTRODUCTION

The dominant architecture in natural language processing (NLP) is the Transformer (Vaswani et al., 2017). Beyond NLP, recent advances in computer vision show that transformer-based models, such as ViT (Dosovitskiy et al., 2021) and DeiT (Touvron et al., 2020), can achieve performance similar to or even better than convolutional neural networks (CNNs) on various tasks. As a result, we can use architecturally similar models on cross-modal tasks with vision and language data. This setting naturally provides a foundation for structurally sharing weights across different modalities. The advantage of weight sharing is that it encourages weight reuse and thus reduces the number of parameters while maintaining model capacity to some extent. On the other hand, existing weight-sharing techniques have many limitations. Most of them (Lee et al., 2021; You et al., 2022; Lan et al., 2019; Reid et al., 2021) use manually designed sharing rules that share a whole layer or block, largely restricting the flexibility of weight sharing. This reduced flexibility can lead to a drastic performance drop.

To maximally utilize model parameters, we propose to unify cross-modal sharing, layer-wise sharing, and pruning in a single framework. Unlike previous works, the minimal unit of these operations is a weight vector instead of a whole layer or block, which drastically increases the flexibility of sharing and pruning. Instead of relying on manually designed strategies, the positions of sharing and pruning are learned in an end-to-end differentiable manner.

To pursue a better trade-off between model performance and parameter efficiency, we aim to maximize flexibility by exploiting the structure of the transformer backbones. If only cross-modal sharing is considered, there is an upper bound on the compression rate (~50%), reached when all layers of one backbone are shared with the other. Another direction is to share layers within a single backbone
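To make the mechanism described in the abstract more concrete, the following is a minimal sketch of a hypernetwork that emits per-unit structure vectors (keep / prune / share) together with a regularizer that pulls model weights toward that structure. All names (HyperNetwork, alignment_loss), the three-way parameterization, and the dimensions are illustrative assumptions for exposition, not the actual implementation used with MDETR or GLIP.

```python
# Minimal PyTorch sketch of jointly learned sharing/pruning structure.
# Names and dimensions are hypothetical; this is not the authors' code.
import torch
import torch.nn as nn

class HyperNetwork(nn.Module):
    """Produces per-unit structure logits: keep, prune, or share the unit
    with the corresponding unit in the other modality's backbone."""
    def __init__(self, num_units, hidden_dim=64):
        super().__init__()
        # One learned embedding per weight-vector unit (e.g., per head or FFN slice).
        self.unit_emb = nn.Parameter(torch.randn(num_units, hidden_dim))
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 3),  # logits for [keep, prune, share]
        )

    def forward(self):
        # Soft structure vectors; a discretization (e.g., Gumbel-softmax)
        # could be applied before the final sharing/pruning step.
        return torch.softmax(self.mlp(self.unit_emb), dim=-1)

def alignment_loss(w_vision, w_language, structure):
    """Regularizer aligning weights with the desired structure:
    units marked for sharing are pulled together across backbones,
    units marked for pruning are pulled toward zero."""
    p_keep, p_prune, p_share = structure.unbind(dim=-1)
    share_term = (p_share.unsqueeze(-1) * (w_vision - w_language) ** 2).mean()
    prune_term = (p_prune.unsqueeze(-1) * (w_vision ** 2 + w_language ** 2)).mean()
    return share_term + prune_term

# Toy usage: 96 attention units per backbone, each a 64-d weight vector
# (both numbers are made up for the example).
hyper = HyperNetwork(num_units=96)
w_v = torch.randn(96, 64, requires_grad=True)   # vision-backbone weight vectors
w_l = torch.randn(96, 64, requires_grad=True)   # language-backbone weight vectors
reg = alignment_loss(w_v, w_l, hyper())
reg.backward()  # in practice, added to the pre-training loss and optimized
                # iteratively along with the hypernetwork
```

In this sketch, the hypernetwork and the backbone weights would be updated in alternating steps, so the learned structure and the weights co-evolve before the final sharing/pruning and fine-tuning stage described in the abstract.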