{
  "id": "2209.12309",
  "version": "v1",
  "published": "2022-09-25T19:41:23.000Z",
  "updated": "2022-09-25T19:41:23.000Z",
  "title": "Feature Encodings for Gradient Boosting with Automunge",
  "authors": [ "Nicholas J. Teague" ],
  "comment": "10 pages, 4 figures, preprint",
  "categories": [ "cs.LG" ],
  "abstract": "Selecting a default feature encoding strategy for gradient boosted learning may consider metrics of training duration and achieved predictive performance associated with the feature representations. The Automunge library for dataframe preprocessing offers a default of binarization for categoric features and z-score normalization for numeric features. The presented study sought to validate those defaults by benchmarking encoding variations with tuned gradient boosted learning on a series of diverse data sets. We found that on average our chosen defaults were top performers from both a tuning duration and a model performance standpoint. Another key finding was that one-hot encoding did not perform in a manner consistent with suitability as a categoric default when compared to categoric binarization. We present here these and further benchmarks.",
  "revisions": [
    {
      "version": "v1",
      "updated": "2022-09-25T19:41:23.000Z"
    }
  ],
  "analyses": {
    "keywords": [
      "gradient boosting",
      "default feature encoding strategy",
      "model performance standpoint",
      "diverse data sets",
      "gradient boosted learning"
    ],
    "note": {
      "typesetting": "TeX",
      "pages": 10,
      "language": "en",
      "license": "arXiv",
      "status": "editable"
    }
  }
}