arXiv:2402.09401 [cs.LG]

Keywords: human feedback, reinforcement learning, active queries, instance-dependent regret bound, state-of-the-art DPO method