arXiv:2311.04179 [cs.LG]

On Leakage in Machine Learning Pipelines

Leonard Sasse, Eliana Nicolaisen-Sobesky, Juergen Dukart, Simon B. Eickhoff, Michael Götz, Sami Hamdan, Vera Komeyer, Abhijit Kulkarni, Juha Lahnakoski, Bradley C. Love, Federico Raimondo, Kaustubh R. Patil

Published 2023-11-07, Version 1

Machine learning (ML) provides powerful tools for predictive modeling. ML's popularity stems from the promise of sample-level prediction, with applications across a variety of fields from physics and marketing to healthcare. However, if not properly implemented and evaluated, ML pipelines may contain leakage, which typically results in overoptimistic performance estimates and failure to generalize to new data. This can have severe negative financial and societal implications. Our aim is to expand understanding of the causes of leakage when designing, implementing, and evaluating ML pipelines. Illustrated by concrete examples, we provide a comprehensive overview and discussion of the various types of leakage that may arise in ML pipelines.
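
To make the idea of leakage concrete, here is a minimal sketch of one widely known form: computing preprocessing statistics on the full dataset before splitting it. It is an illustration only and is not taken from the paper. The toy feature values and the helper `standardize` are hypothetical.

```python
import statistics

def standardize(values, mean, std):
    """Scale values using externally supplied statistics (hypothetical helper)."""
    return [(v - mean) / std for v in values]

# Hypothetical toy feature values; the test set is shifted on purpose.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Leaky: statistics are computed on train + test together,
# so information about the test set influences the training features.
leaky_mean = statistics.mean(train + test)
leaky_std = statistics.pstdev(train + test)
train_leaky = standardize(train, leaky_mean, leaky_std)

# Leak-free: statistics come from the training set only
# and are then reused, unchanged, on the test set.
clean_mean = statistics.mean(train)
clean_std = statistics.pstdev(train)
train_clean = standardize(train, clean_mean, clean_std)
test_clean = standardize(test, clean_mean, clean_std)

print("leaky mean:", leaky_mean, "train-only mean:", clean_mean)
```

In the leaky variant the training features depend on the test values, so any performance estimate obtained on that test set may be overoptimistic. Fitting every data-dependent step on the training portion alone avoids this particular problem.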

Comments: first draft

Categories: cs.LG, cs.AI

Keywords: machine learning pipelines, ML's popularity stems, overoptimistic performance estimates, concrete examples, contain leakage

Related articles:

arXiv:2410.19643 [cs.LG] (Published 2024-10-25)

Impact of Leakage on Data Harmonization in Machine Learning Pipelines in Class Imbalance Across Sites

Nicolás Nieto et al.

arXiv:2108.07915 [cs.LG] (Published 2021-08-18)

Data Pricing in Machine Learning Pipelines

Zicun Cong, Xuan Luo, Pei Jian, Feida Zhu, Yong Zhang

arXiv:1810.09942 [cs.LG] (Published 2018-10-23)

Preprocessor Selection for Machine Learning Pipelines

Brandon Schoenfeld, Christophe Giraud-Carrier, Mason Poggemann, Jarom Christensen, Kevin Seppi
