
arXiv:2402.09401 [cs.LG]

Reinforcement Learning from Human Feedback with Active Queries

Published 2024-02-14, updated 2025-02-11 (Version 2)

Aligning large language models (LLMs) with human preferences plays a key role in building modern generative models and can be achieved by reinforcement learning from human feedback (RLHF). Despite their superior performance, current RLHF approaches often require a large amount of human-labelled preference data, which is expensive to collect. In this paper, inspired by the success of active learning, we address this problem by proposing query-efficient RLHF methods. We first formalize the alignment problem as a contextual dueling bandit problem and design an active-query-based proximal policy optimization (APPO) algorithm with an $\tilde{O}(d^2/\Delta)$ instance-dependent regret bound and an $\tilde{O}(d^2/\Delta^2)$ query complexity, where $d$ is the dimension of the feature space and $\Delta$ is the sub-optimality gap over all contexts. We then propose ADPO, a practical version of our algorithm based on direct preference optimization (DPO), and apply it to fine-tuning LLMs. Our experiments show that ADPO, while making only about half as many queries for human preferences, matches the performance of the state-of-the-art DPO method.
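To make the active-query idea concrete, here is a minimal Python sketch. It assumes the standard DPO implicit reward margin and a simple rule that asks for a human label only when the absolute margin is small. The function names, the `beta` value, and the `threshold` rule are illustrative assumptions, not the paper's exact ADPO procedure.

```python
import math


def implicit_margin(logp_a, ref_logp_a, logp_b, ref_logp_b, beta=0.1):
    """Standard DPO implicit reward margin of response a over response b.

    Inputs are sequence log-probabilities under the policy and under the
    frozen reference model. beta=0.1 is an illustrative value.
    """
    return beta * ((logp_a - ref_logp_a) - (logp_b - ref_logp_b))


def dpo_loss(margin):
    """DPO loss -log(sigmoid(margin)), where margin is winner minus loser.

    Computed as a numerically stable softplus(-margin).
    """
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def should_query(margin, threshold=0.5):
    """Hypothetical uncertainty rule: query a human only when the model's
    own preference between the two responses is weak (small |margin|)."""
    return abs(margin) < threshold


def split_by_uncertainty(margins, threshold=0.5):
    """Return (indices to send to human labellers, indices to skip)."""
    queried = [i for i, m in enumerate(margins) if should_query(m, threshold)]
    skipped = [i for i, m in enumerate(margins) if not should_query(m, threshold)]
    return queried, skipped
```

In this sketch, pairs with a large absolute margin are never sent to annotators, which is where the label savings would come from. How such pairs are then used in training (for example, skipped or pseudo-labelled) is a design choice the abstract does not specify.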

Comments: 28 pages, 1 figure, 4 tables

Categories: cs.LG, cs.AI, cs.CL, math.OC, stat.ML

Keywords: human feedback, reinforcement learning, active queries, instance-dependent regret bound, state-of-the-art DPO method

Related articles:

arXiv:2402.17747 [cs.LG] (Published 2024-02-27, updated 2024-06-08)

When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback

Leon Lang, Davis Foote, Stuart Russell, Anca Dragan, Erik Jenner, Scott Emmons

arXiv:2305.18438 [cs.LG] (Published 2023-05-29)

Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism

Zihao Li, Zhuoran Yang, Mengdi Wang

arXiv:2411.11761 [cs.LG] (Published 2024-11-18)

Mapping out the Space of Human Feedback for Reinforcement Learning: A Conceptual Framework