Numerical Optimizations for Weighted Value Decomposition on Language Models


Conference on Empirical Methods in Natural Language Processing (EMNLP)




Singular value decomposition (SVD) is one of the most popular compression methods that approximate a target matrix with smaller matrices. However, standard SVD treats all parameters within the matrix as equally important, which is a simple but unrealistic assumption. In practice, the parameters of a trained neural network affect task performance unevenly, suggesting unequal importance among them. Therefore, this paper proposes Fisher information weighted Value Decomposition (FVD) to compress a neural network model with awareness of parameter importance. Unlike standard SVD, FVD is a non-convex optimization problem that lacks a closed-form solution, so optimizing FVD is non-trivial.
We systematically investigate multiple optimization strategies for this problem and evaluate our method by compressing transformer-based language models.
Further, we design a metric that predicts when SVD may introduce a significant performance drop, in which case our FVD can serve as a rescue strategy.
Extensive evaluations demonstrate that FVD performs comparably to, or even better than, current state-of-the-art methods for compressing Transformer-based language models.
Moreover, an analysis of Transformer blocks shows that FVD achieves significant performance improvements over SVD when factorizing sub-structures.
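To make the weighted objective concrete, the following is a minimal NumPy sketch of one common surrogate for Fisher-weighted factorization: aggregating the Fisher information row-wise, scaling the rows of the weight matrix by the square root of that importance, applying standard SVD, and then undoing the scaling. The function name and the row-wise aggregation are our own illustration of the idea, not necessarily the exact optimizer studied in the paper.

```python
import numpy as np

def fisher_weighted_factorize(W, fisher, rank):
    """Low-rank factors A, B with A @ B ~= W, weighted by Fisher information.

    W      : (m, n) weight matrix to compress.
    fisher : (m, n) per-parameter Fisher information estimates (non-negative).
    rank   : target rank of the factorization.
    """
    # Aggregate importance per row and take the square root; the small
    # constant guards against all-zero rows.
    d = np.sqrt(fisher.sum(axis=1) + 1e-8)          # shape (m,)
    # SVD of the importance-scaled matrix D @ W.
    U, s, Vt = np.linalg.svd(d[:, None] * W, full_matrices=False)
    # Truncate to `rank` and fold the scaling back out of the left factor.
    A = (U[:, :rank] * s[:rank]) / d[:, None]        # shape (m, rank)
    B = Vt[:rank]                                    # shape (rank, n)
    return A, B
```

Because the scaling is inverted after the SVD, the factorization is exact at full rank, while at reduced rank the reconstruction error is concentrated on low-importance rows.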