Publications

Compress & Cache: Vision token compression for efficient generation and retrieval

Published

Neural Information Processing Systems (NeurIPS)

Date

2025.12.03

Abstract

This work aims to compress the vision tokens of an LVLM into a representation that is simultaneously suitable for (a) generative and (b) discriminative tasks, (c) nearly lossless, and (d) storage-efficient. To this end, we propose C&C, a novel compression method that leverages the LVLM itself for task-agnostic visual token compression. Unlike prior methods that perform token reduction on-the-fly, our approach offloads computation to a dedicated, upfront indexing stage, effectively decoupling compression from generation. This enables more powerful representations to be learned for generation at inference time. At the core of C&C is a "double-forward pass" training strategy. In the first forward pass, the LLM (of the LVLM) creates a bottleneck by compressing the dense visual tokens into a few summary tokens. The second forward pass then processes the language instruction(s) alongside the summary tokens, which serve as a direct replacement for the image tokens. Training of C&C is guided by two key losses: an autoregressive loss, applied after the second pass, that provides a direct optimization objective for reconstructing the original information flow; and a contrastive loss, applied after the first pass, that bolsters the representational strength of the summary tokens, particularly for discriminative tasks. Moreover, we propose stage-specific adapters that further enhance performance. C&C produces highly informative compressed representations, and an in-depth ablation study confirms the efficacy of our approach. On generative tasks, we achieve a 2x higher compression rate without compromising capabilities, setting a new state of the art. On discriminative tasks, we establish new state-of-the-art results on image retrieval and compositionality benchmarks.
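The abstract describes the mechanism but not an implementation. The sketch below is a minimal, hypothetical PyTorch rendering of the double-forward pass and the two training losses; all names (`ToyLLM`, `double_forward`, `info_nce`), shapes, the non-causal toy backbone, and the choice of a symmetric InfoNCE contrastive loss are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

DIM, HEADS, VOCAB = 64, 4, 1000  # hypothetical toy sizes

class ToyLLM(torch.nn.Module):
    """Stand-in for the LVLM's language-model backbone.

    A real LVLM would use a causal decoder; a single non-causal
    encoder layer is used here only to keep the sketch runnable.
    """
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.TransformerEncoderLayer(
            d_model=DIM, nhead=HEADS, batch_first=True)
        self.lm_head = torch.nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        return self.layer(tokens)

def double_forward(llm, visual_tokens, summary_queries, text_tokens):
    # Pass 1: append learnable summary queries to the dense visual
    # tokens and keep only the query outputs -- the bottleneck that
    # compresses the image into a few summary tokens.
    n_summary = summary_queries.shape[1]
    first_out = llm(torch.cat([visual_tokens, summary_queries], dim=1))
    summary_tokens = first_out[:, -n_summary:, :]

    # Pass 2: the summary tokens directly replace the image tokens
    # and are processed alongside the language instruction.
    second_out = llm(torch.cat([summary_tokens, text_tokens], dim=1))
    return summary_tokens, second_out

def info_nce(img_emb, txt_emb, temperature=0.07):
    # Symmetric contrastive loss (applied after pass 1) to strengthen
    # the summary tokens for discriminative tasks.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.shape[0])
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# Toy batch: 2 images, 576 visual tokens compressed to 16 summaries.
llm = ToyLLM()
visual = torch.randn(2, 576, DIM)
queries = torch.randn(2, 16, DIM)          # learnable summary queries
text = torch.randn(2, 32, DIM)             # embedded instruction tokens
labels = torch.randint(0, VOCAB, (2, 32))  # toy next-token targets
caption_emb = torch.randn(2, DIM)          # toy paired text embedding

summaries, hidden = double_forward(llm, visual, queries, text)

# Autoregressive loss after pass 2, on the text positions.
logits = llm.lm_head(hidden[:, -32:, :])
ar_loss = F.cross_entropy(logits.reshape(-1, VOCAB), labels.reshape(-1))

# Contrastive loss after pass 1, on mean-pooled summary tokens.
ctr_loss = info_nce(summaries.mean(dim=1), caption_emb)
total_loss = ar_loss + ctr_loss
```

Note that, per the abstract's indexing stage, the first pass would run once per image ahead of time and its summary tokens would be cached; only the cheaper second pass runs at inference.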