{
  "id": "2209.12309",
  "version": "v1",
  "published": "2022-09-25T19:41:23.000Z",
  "updated": "2022-09-25T19:41:23.000Z",
  "title": "Feature Encodings for Gradient Boosting with Automunge",
  "authors": [ "Nicholas J. Teague" ],
  "comment": "10 pages, 4 figures, preprint",
  "categories": [ "cs.LG" ],
  "abstract": "Selecting a default feature encoding strategy for gradient boosted learning may consider metrics of training duration and achieved predictive performance associated with the feature representations. The Automunge library for dataframe preprocessing offers a default of binarization for categoric features and z-score normalization for numeric features. The presented study sought to validate those defaults by benchmarking encoding variations with tuned gradient boosted learning on a series of diverse data sets. We found that on average our chosen defaults were top performers from both a tuning duration and a model performance standpoint. Another key finding was that one-hot encoding did not perform in a manner consistent with suitability as a categoric default when compared to categoric binarization. We present here these and further benchmarks.",
  "revisions": [
    {
      "version": "v1",
      "updated": "2022-09-25T19:41:23.000Z"
    }
  ],
  "analyses": {
    "keywords": [
      "gradient boosting",
      "default feature encoding strategy",
      "model performance standpoint",
      "diverse data sets",
      "gradient boosted learning"
    ],
    "note": {
      "typesetting": "TeX",
      "pages": 10,
      "language": "en",
      "license": "arXiv",
      "status": "editable"
    }
  }
}