arXiv:2405.15025 [cs.LG]

OAC: Output-adaptive Calibration for Accurate Post-training Quantization

Ali Edalati, Alireza Ghaffari, Masoud Asgharian, Lu Hou, Boxing Chen, Vahid Partovi Nia

Published 2024-05-23 (Version 1)

Deployment of Large Language Models (LLMs) incurs major computational costs due to their rapidly growing size. Compressing LLMs reduces the memory footprint, latency, and energy required for inference. Post-training Quantization (PTQ) techniques have been developed to compress LLMs while avoiding expensive re-training. Most PTQ approaches formulate the quantization error using a layer-wise $\ell_2$ loss that ignores the model output. Each layer is then calibrated using its layer-wise Hessian to update the weights so as to minimize the $\ell_2$ quantization error; the same Hessian is also used to detect the weights most salient to quantization. Such PTQ approaches are prone to accuracy drops at low-precision quantization. We propose Output-adaptive Calibration (OAC), which incorporates the model output into the calibration process. We formulate the quantization error based on the distortion of the output cross-entropy loss. OAC approximates the output-adaptive Hessian of each layer under reasonable assumptions to reduce the computational complexity. The output-adaptive Hessians are used to update the weight matrices and to detect the salient weights so that the model output is preserved. Our proposed method outperforms state-of-the-art baselines such as SpQR and BiLLM, especially at extremely low-precision (2-bit and binary) quantization.
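
For context, here is a minimal sketch of the contrast the abstract draws, written in common GPTQ-style notation; the symbols $W$, $\widehat{W}$, $X$, $H$, and $\mathcal{L}_{\mathrm{CE}}$ are illustrative assumptions rather than the paper's exact formulation. Most layer-wise PTQ methods calibrate each layer against a local reconstruction objective,

$$\min_{\widehat{W}} \; \big\| W X - \widehat{W} X \big\|_2^2, \qquad H_{\mathrm{layer}} = 2\, X X^\top,$$

where $X$ collects the calibration inputs to that layer, so the Hessian depends only on local activations and not on the model output. An output-adaptive formulation instead scores a weight perturbation $\Delta W = \widehat{W} - W$ by the second-order change it induces in the model's cross-entropy loss (ignoring the first-order term, as is standard in PTQ),

$$\Delta \mathcal{L}_{\mathrm{CE}} \approx \tfrac{1}{2}\, \operatorname{vec}(\Delta W)^\top\, H_{\mathrm{out}}\, \operatorname{vec}(\Delta W), \qquad H_{\mathrm{out}} = \nabla^2_{W}\, \mathcal{L}_{\mathrm{CE}},$$

so that weight updates and saliency detection are driven by the distortion of the model output rather than by a per-layer $\ell_2$ error.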

Related articles:
arXiv:2410.14713 [cs.LG] (Published 2024-10-09)
QuAILoRA: Quantization-Aware Initialization for LoRA
arXiv:2310.00034 [cs.LG] (Published 2023-09-29)
PB-LLM: Partially Binarized Large Language Models
arXiv:2208.11580 [cs.LG] (Published 2022-08-24)
Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning