arXiv:2002.03860 [stat.ML]

Missing Data Imputation using Optimal Transport

Boris Muzellec, Julie Josse, Claire Boyer, Marco Cuturi

Published 2020-02-10, Version 1

Missing data is a crucial issue when applying machine learning algorithms to real-world datasets. Starting from the simple assumption that two batches extracted randomly from the same dataset should share the same distribution, we leverage optimal transport (OT) distances to quantify that criterion and turn it into a loss function to impute missing data values. We propose practical methods to minimize these losses using end-to-end learning, which may or may not exploit parametric assumptions on the underlying distributions of values. We evaluate our methods on datasets from the UCI repository, in missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) settings. These experiments show that OT-based methods match or outperform state-of-the-art imputation methods, even for high percentages of missing values.
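
The core idea of the abstract, that two random batches from the same dataset should be close in an optimal transport sense and that the imputed values can be adjusted to make them so, can be illustrated with a minimal sketch. This is not the authors' implementation. It makes several simplifying assumptions: exact OT is computed by brute force over permutations (only practical for tiny batches), missing entries start at the observed column mean, and plain gradient steps are used. The names `ot_matching` and `impute` and all parameters are hypothetical.

```python
import itertools
import random


def ot_matching(xs, ys):
    """Exact OT between two equal-size batches with uniform weights.

    In this setting an optimal plan is a permutation, so brute force over
    permutations is enough for tiny batches (assumption: batch size <= ~6).
    Returns (cost, perm), where xs[i] is matched to ys[perm[i]].
    """
    def cost(perm):
        return sum(
            0.5 * sum((a - b) ** 2 for a, b in zip(xs[i], ys[j]))
            for i, j in enumerate(perm)
        ) / len(xs)

    best = min(itertools.permutations(range(len(ys))), key=cost)
    return cost(best), best


def impute(data, mask, batch_size=4, steps=200, lr=0.5, seed=0):
    """Fill entries where mask[i][d] is True by lowering the OT cost
    between pairs of random batches. Needs len(data) >= 2 * batch_size."""
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    x = [row[:] for row in data]  # work on a copy; observed entries stay fixed

    # Initialise each missing entry with the mean of the observed column.
    for d in range(dim):
        observed = [data[i][d] for i in range(n) if not mask[i][d]]
        mean = sum(observed) / len(observed)
        for i in range(n):
            if mask[i][d]:
                x[i][d] = mean

    for _ in range(steps):
        # Two disjoint random batches drawn from the same dataset.
        idx = rng.sample(range(n), 2 * batch_size)
        a, b = idx[:batch_size], idx[batch_size:]
        _, perm = ot_matching([x[i] for i in a], [x[j] for j in b])

        # Gradient of the matched squared distance w.r.t. missing entries,
        # collected first so every step uses the same current values.
        # The 1/batch_size factor is absorbed into lr.
        updates = []
        for k, i in enumerate(a):
            j = b[perm[k]]
            for d in range(dim):
                diff = x[i][d] - x[j][d]
                if mask[i][d]:
                    updates.append((i, d, -lr * diff))
                if mask[j][d]:
                    updates.append((j, d, lr * diff))
        for i, d, delta in updates:
            x[i][d] += delta
    return x
```

Calling `impute(rows, mask)` on a list of numeric rows returns a completed copy in which only the masked entries have changed. A practical version would replace the brute-force matching with a scalable, differentiable OT solver and an automatic differentiation framework.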

Categories: stat.ML, cs.LG

Keywords: missing data imputation, optimal transport distances, state-of-the-art imputation methods, machine learning algorithms

Related articles:

arXiv:1206.2944 [stat.ML] (Published 2012-06-13, updated 2012-08-29)

Practical Bayesian Optimization of Machine Learning Algorithms

Jasper Snoek, Hugo Larochelle, Ryan P. Adams

arXiv:2302.00911 [stat.ML] (Published 2023-02-02)

Conditional expectation for missing data imputation

Mai Anh Vu, Thu Nguyen, Tu T. Do, Nhan Phan, Pål Halvorsen, Michael A. Riegler, Binh T. Nguyen

arXiv:1610.09075 [stat.ML] (Published 2016-10-28)

Missing Data Imputation for Supervised Learning

Jason Poulos, Rafael Valle