arXiv:2406.09068 [cs.LG]

Dispelling the Mirage of Progress in Offline MARL through Standardised Baselines and Evaluation

Claude Formanek, Callum Rhys Tilbury, Louise Beyers, Jonathan Shock, Arnu Pretorius

Published 2024-06-13, Version 1

Offline multi-agent reinforcement learning (MARL) is an emerging field with great promise for real-world applications. Unfortunately, the current state of research in offline MARL is plagued by inconsistencies in baselines and evaluation protocols, which ultimately make it difficult to accurately assess progress, trust newly proposed innovations, and build upon prior work. In this paper, we first identify significant shortcomings in existing methodologies for measuring the performance of novel algorithms through a representative study of published offline MARL work. Second, by directly comparing to this prior work, we demonstrate that simple, well-implemented baselines can achieve state-of-the-art (SOTA) results across a wide range of tasks. Specifically, we show that on 35 out of 47 datasets used in prior work (almost 75% of cases), we match or surpass the performance of the current purported SOTA. Strikingly, our baselines often substantially outperform these more sophisticated algorithms. Finally, we correct for the shortcomings highlighted in this prior work by introducing a straightforward standardised methodology for evaluation and by providing our baseline implementations with statistically robust results across several scenarios, useful for comparisons in future work. Our proposal includes simple and sensible steps that are easy to adopt, which, in combination with solid baselines and comparative results, could substantially improve the overall rigour of empirical science in offline MARL moving forward.
