arXiv:2501.13905 [cs.LG]

On Learning Representations for Tabular Data Distillation

Inwon Kang, Parikshit Ram, Yi Zhou, Horst Samulowitz, Oshani Seneviratne

Published 2025-01-23 (Version 1)

Dataset distillation generates a small set of information-rich instances from a large dataset, reducing storage requirements, privacy or copyright risks, and the computational cost of downstream modeling; however, most research has focused on the image modality. We study tabular data distillation, which introduces novel challenges such as inherent feature heterogeneity and the common use of non-differentiable learning models (such as decision tree ensembles and nearest-neighbor predictors). To address these challenges, we present $\texttt{TDColER}$, a tabular data distillation framework based on column embeddings and representation learning. To evaluate this framework, we also present a tabular data distillation benchmark, ${{\sf \small TDBench}}$. Through an extensive evaluation on ${{\sf \small TDBench}}$, spanning 226,890 distilled datasets and 548,880 models trained on them, we demonstrate that $\texttt{TDColER}$ boosts the distilled data quality of off-the-shelf distillation schemes by 0.5-143% across 7 different tabular learning models.
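To make the column-embedding idea concrete, the sketch below is a minimal, hypothetical illustration (not the authors' implementation): each numeric column gets its own linear embedding and each categorical column its own lookup table, and the pooled per-column embeddings yield a differentiable row representation in which an off-the-shelf distillation scheme could operate before non-differentiable models (tree ensembles, nearest-neighbor predictors) are trained on the distilled rows. All names here (ColumnEmbeddingEncoder, d_embed, etc.) are assumptions for illustration only.

```python
# Illustrative sketch of column-embedding representation learning for
# heterogeneous tabular data. Names and design choices are hypothetical,
# not taken from the TDColER paper or its code.
import torch
import torch.nn as nn


class ColumnEmbeddingEncoder(nn.Module):
    """Embed each column (numeric or categorical) into a shared d-dim space,
    then pool the per-column embeddings into a single row representation."""

    def __init__(self, num_numeric, categorical_cardinalities, d_embed=16):
        super().__init__()
        # One linear "embedding" per numeric column: scalar -> d_embed.
        self.numeric_embeds = nn.ModuleList(
            [nn.Linear(1, d_embed) for _ in range(num_numeric)]
        )
        # One lookup table per categorical column.
        self.categorical_embeds = nn.ModuleList(
            [nn.Embedding(card, d_embed) for card in categorical_cardinalities]
        )

    def forward(self, x_num, x_cat):
        # x_num: (batch, num_numeric) float tensor
        # x_cat: (batch, num_categorical) long tensor of category indices
        cols = []
        for j, layer in enumerate(self.numeric_embeds):
            cols.append(layer(x_num[:, j : j + 1]))  # (batch, d_embed)
        for j, layer in enumerate(self.categorical_embeds):
            cols.append(layer(x_cat[:, j]))          # (batch, d_embed)
        # Mean-pool the column embeddings into one row embedding; distillation
        # (e.g., clustering-based selection) can then run in this learned space.
        return torch.stack(cols, dim=1).mean(dim=1)


if __name__ == "__main__":
    enc = ColumnEmbeddingEncoder(num_numeric=3, categorical_cardinalities=[4, 7])
    x_num = torch.randn(8, 3)
    x_cat = torch.cat(
        [torch.randint(0, 4, (8, 1)), torch.randint(0, 7, (8, 1))], dim=1
    )
    rows = enc(x_num, x_cat)
    print(rows.shape)  # torch.Size([8, 16])
```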

Related articles:
arXiv:1507.08104 [cs.LG] (Published 2015-07-29)
Learning Representations for Outlier Detection on a Budget
arXiv:1806.10069 [cs.LG] (Published 2018-06-26)
Deep $k$-Means: Jointly Clustering with $k$-Means and Learning Representations
arXiv:1812.03928 [cs.LG] (Published 2018-12-10)
Learning Representations of Sets through Optimized Permutations