{
  "id": "2305.17333",
  "version": "v1",
  "published": "2023-05-27T02:28:10.000Z",
  "updated": "2023-05-27T02:28:10.000Z",
  "title": "Fine-Tuning Language Models with Just Forward Passes",
  "authors": [
    "Sadhika Malladi",
    "Tianyu Gao",
    "Eshaan Nichani",
    "Alex Damian",
    "Jason D. Lee",
    "Danqi Chen",
    "Sanjeev Arora"
  ],
  "comment": "Code available at https://github.com/princeton-nlp/MeZO",
  "categories": [
    "cs.LG",
    "cs.CL"
  ],
  "abstract": "Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12x memory reduction; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.",
  "revisions": [
    {
      "version": "v1",
      "updated": "2023-05-27T02:28:10.000Z"
    }
  ],
  "analyses": {
    "keywords": [
      "fine-tuning language models",
      "forward passes",
      "classical zo analyses suggesting",
      "significantly outperforms in-context learning",
      "single a100 80gb gpu"
    ],
    "tags": [
      "github project"
    ],
    "note": {
      "typesetting": "TeX",
      "pages": 0,
      "language": "en",
      "license": "arXiv",
      "status": "editable"
    }
  }
}