
arXiv:2305.17333 [cs.LG]

Fine-Tuning Language Models with Just Forward Passes

Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, Sanjeev Arora

Published 2023-05-27, Version 1

Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12x memory reduction; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.
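
The two-forward-pass, in-place idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it is a toy version in plain Python, assuming the random perturbation is regenerated from a stored seed rather than kept in memory. The names `perturb`, `zo_sgd_step`, `quadratic_loss`, and `run_demo`, the toy quadratic objective, and the hyperparameter values are all hypothetical and chosen only for illustration.

```python
import random


def perturb(params, scale, seed):
    """Add scale * z to params in place; z ~ N(0, I) is regenerated from the seed."""
    rng = random.Random(seed)
    for i in range(len(params)):
        params[i] += scale * rng.gauss(0.0, 1.0)


def zo_sgd_step(params, loss_fn, lr=1e-2, eps=1e-3, seed=0):
    """One two-point zeroth-order SGD step, applied in place to `params`.

    Only two evaluations of loss_fn (two "forward passes") are used, and the
    perturbation vector z is never stored: it is rebuilt from `seed` each time.
    """
    perturb(params, +eps, seed)            # params = theta + eps * z
    loss_plus = loss_fn(params)
    perturb(params, -2.0 * eps, seed)      # params = theta - eps * z
    loss_minus = loss_fn(params)
    perturb(params, +eps, seed)            # params back to theta (up to rounding)

    # Scalar estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)

    # SGD update along the same direction: theta -= lr * projected_grad * z.
    perturb(params, -lr * projected_grad, seed)
    return projected_grad


def quadratic_loss(params):
    """Toy objective (an assumption for this sketch): minimized at all ones."""
    return sum((p - 1.0) ** 2 for p in params)


def run_demo(steps=2000):
    """Run the toy optimization and return (initial_loss, final_loss)."""
    params = [0.0, 0.0, 0.0, 0.0]
    initial = quadratic_loss(params)
    for step in range(steps):
        zo_sgd_step(params, quadratic_loss, seed=step)
    return initial, quadratic_loss(params)


if __name__ == "__main__":
    initial, final = run_demo()
    print(f"loss before: {initial:.4f}, loss after: {final:.3e}")
```

Because `loss_fn` is only ever evaluated, never differentiated, the same loop would in principle accept a non-differentiable score in its place, which is the property point (4) of the abstract relies on.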

Comments: Code available at https://github.com/princeton-nlp/MeZO

Categories: cs.LG, cs.CL

Keywords: fine-tuning language models, forward passes, classical zo analyses suggesting, significantly outperforms in-context learning, single a100 80gb gpu

Tags: github project

Related articles:

arXiv:2206.00761 [cs.LG] (Published 2022-06-01)

On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting

Tomasz Korbak, Hady Elsahar, Germán Kruszewski, Marc Dymetman

arXiv:2402.05406 [cs.LG] (Published 2024-02-08)

Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes

Lucio Dery, Steven Kolawole, Jean-Francois Kagey, Virginia Smith, Graham Neubig, Ameet Talwalkar

arXiv:2404.08080 [cs.LG] (Published 2024-04-11)