Blog
As state-of-the-art AI models have grown ever larger, model compression has been attracting increasing attention as a way to deploy models on edge devices without relying on cloud servers.
Publications
Progressive Mixed-Precision Decoding for Efficient LLM Inference
Authors: Hao (Mark) Chen, Fuwen Tan, Alexandros Kouris, Royson Lee, Stylianos Venieris
Published in: International Conference on Learning Representations (ICLR)
Date: 2025-04-25
QBB: Quantization with Binary Bases for LLMs
Authors: Adrian Bulat, Yassine Ouali, Georgios Tzimiropoulos
Published in: Neural Information Processing Systems (NeurIPS)
Date: 2024-12-11
Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers
Authors: Junhan Kim, Chungman Lee, Eulrang Cho, Kyungphil Park, Ho-young Kim, Joonyoung Kim, Yongkweon Jeon
News
Research in Automatic Speech Recognition (ASR) continues to show that larger models yield better results. But as state-of-the-art networks grow to billions of parameters, deploying these models on device becomes increasingly difficult.
At the British Machine Vision Conference (BMVC) 2021, we presented work on mobile inverse tone mapping. In this work, we tackled converting high-resolution images to high dynamic range (HDR) images in real time with a mobile-focused model that uses low-bit quantization of its parameters to accelerate inference.
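To make the idea concrete, here is a minimal sketch of the kind of low-bit weight quantization that work builds on. The function names and the simple symmetric per-tensor 4-bit scheme below are illustrative assumptions for this sketch, not the method from the BMVC paper.

```python
import numpy as np

def quantize_symmetric(weights: np.ndarray, bits: int = 4):
    """Symmetric uniform quantization of a weight tensor to `bits` bits.

    Illustrative sketch only: one per-tensor scale, signed integer codes.
    """
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for signed 4-bit
    scale = np.abs(weights).max() / qmax        # map largest magnitude to qmax
    codes = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

def dequantize(codes: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the integer codes."""
    return codes.astype(np.float32) * scale

# Quantize a random weight matrix and check the reconstruction error.
w = np.random.randn(256, 256).astype(np.float32)
codes, scale = quantize_symmetric(w, bits=4)
w_hat = dequantize(codes, scale)
print("mean abs error:", np.abs(w - w_hat).mean())
```

Storing the integer codes instead of float32 weights shrinks the model roughly 8x at 4 bits, and low-bit integer arithmetic is what mobile hardware can execute quickly, which is the source of the inference speedup.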