GBM Model Evaluation — Beyond Prediction Accuracy

PiML Tutorials
9 min read · Feb 10, 2024

Gradient Boosting Machine (GBM) is a powerful ensemble learning technique used widely in machine learning for classification and regression problems. It constructs a sequence of weak learners, typically decision trees, by iteratively fitting the residuals of previous learners. Through this progressive refinement process, GBM gradually improves the overall prediction accuracy. This method has proven highly effective for tabular data modeling, where its predictive performance remains the king of the hill even when benchmarked against state-of-the-art deep learning methods; see this Kaggle competition and the corresponding paper.

The Kaggle leaderboard is typically ranked by prediction accuracy on hold-out testing data. However, mere reliance on prediction accuracy may inadvertently promote a culture of over-optimism, encouraging the development of overly complex machine learning models at the expense of model simplicity, interpretability and generalizability. Consequently, this approach may misdirect efforts away from developing models deployable for real applications in production environments. Google Research underscored this concern in their paper titled “Underspecification Presents Challenges for Credibility in Modern Machine Learning”.

Several implementations of GBM have gained popularity in the machine learning community, including XGBoost, LightGBM and CatBoost.

In this tutorial, we pick XGBoost as an example tool to train a GBM model based on a simulated credit dataset. Subsequently, we demonstrate how to perform model evaluation under the model risk management framework.

1) Example Data and Model

from piml import Experiment

exp = Experiment()
exp.data_loader(data="SimuCredit")
exp.data_summary(feature_exclude=["Gender", "Race"], silent=True)
exp.data_prepare(test_ratio=0.2, random_state=0, silent=True)

SimuCredit data for predicting the response “Approved”, with the “Gender” and “Race” variables removed.

from xgboost import XGBClassifier

model = XGBClassifier(max_depth=3, n_estimators=1000,
                      learning_rate=0.01, random_state=0)
exp.model_train(model, name="XGB-default")
exp.model_diagnose(model="XGB-default", show='accuracy_table')

Let’s take the target model for evaluation to be an XGBoost classifier trained with hyperparameters max_depth 3, n_estimators (i.e., number of trees) 1000 and learning_rate 0.01. Among several accuracy metrics, the AUC (area under the ROC curve) evaluated on the 20% hold-out testing data is 0.7564.

2) Hyperparameter Tuning

To validate the key hyperparameters of XGBoost, such as max_depth, n_estimators and learning_rate, we can use the model_tune() function in PiML, which supports grid and random search methods for sklearn-style models. The tuning procedure splits the original training data into two parts (training and validation) and reports the metrics based on the validation sample.

a) Grid Search

parameters = {'max_depth': [1, 2, 3, 4, 5, 6, 7, 8]}
result = exp.model_tune(model="XGB-default",
                        method="grid",
                        parameters=parameters,
                        metric=['AUC', 'LogLoss'],
                        test_ratio=0.4,
                        random_state=0)
result.data

The grid search for max_depth verifies the original model choice of 3. Note that the AUC values for max_depth 2, 3 and 4 are very close.

b) Random Search

import numpy as np
import pandas as pd
import scipy.stats
import matplotlib.pyplot as plt

parameter_space = {'max_depth': [1, 2, 3, 4, 5, 6, 7, 8],
                   'n_estimators': scipy.stats.randint(100, 2000),
                   'learning_rate': scipy.stats.loguniform(0.001, 0.5)}
result = exp.model_tune(model="XGB-default",
                        method="randomized",
                        parameters=parameter_space,
                        metric=["AUC", "LogLoss"],
                        n_runs=100,
                        test_ratio=0.4,
                        random_state=0)

# Collect the tuning results and compare validation AUC across max_depth
df = pd.concat([pd.DataFrame(result.cv_results_['params']),
                pd.DataFrame(result.cv_results_["mean_test_AUC"], columns=["AUC"]),
                pd.DataFrame(result.cv_results_["mean_test_LogLoss"], columns=["LogLoss"])], axis=1)
df.boxplot(column=['AUC'], by='max_depth', grid=False, figsize=(6, 5))
plt.show()

A random search with 100 trials over random combinations of {max_depth, n_estimators, learning_rate} was performed, with the best parameters given by 3, 1107 and 0.0074, respectively. These values align well with the original model choice.

However, the boxplots of validation metrics unveil the phenomenon known as the “Rashomon Effect”, where multiple tuned models exhibit comparable prediction performance. Among these top-performing models, many are observed to have max_depth 2.

3) Model Explainability

Model explainability is of paramount importance for ensuring conceptual soundness in the banking industry, as emphasized by the SR11–7 MRM framework. For machine learning models, achieving explainability involves conducting feature importance analysis and verifying the input-output relationship. Additionally, it may involve assessing the necessity for imposing constraints such as monotonicity to further enhance interpretability.

The model_explain() function in PiML supports various post-hoc explanation methods that are model-agnostic and often rely on approximation. However, it is recommended to use these methods with caution and to validate the accuracy of the approximation with inherently interpretable benchmark models.

Feature Importance Analysis for Model Explainability

The PFI (permutation feature importance) analysis indicates the top four important features, namely “Delinquency”, “Utilization”, “Mortgage” and “Balance”. For each of these identified features, one can examine the input-output relationship using the PDP (partial dependence plot) and ALE (accumulated local effects) methods. These two methods serve to cross-validate the approximate explanations and ensure consistency. Notably, the feature “Utilization” demonstrates a decreasing effect, while “Balance” shows an increasing effect, yet neither feature strictly adheres to monotonicity. Therefore, imposing monotonicity constraints is recommended for enhanced model interpretability.
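For reference, these plots can be generated with model_explain(); below is a minimal sketch, assuming the show keywords "pfi", "pdp" and "ale" and the uni_feature argument as documented in the PiML user guide (exact options may vary by version).

# Sketch: post-hoc explanations for the registered "XGB-default" model.
# The show options and argument names below are assumed from the PiML user guide.

# Permutation feature importance
exp.model_explain(model="XGB-default", show="pfi", sample_size=2000, figsize=(5, 4))

# Partial dependence plot for a single feature
exp.model_explain(model="XGB-default", show="pdp", uni_feature="Utilization", figsize=(5, 4))

# Accumulated local effects for the same feature, to cross-check the PDP
exp.model_explain(model="XGB-default", show="ale", uni_feature="Utilization", figsize=(5, 4))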

4) Interpretable Benchmark Models

The previous evaluation results suggest that an XGBoost model of max_depth 2 (referred to as XGB2) may perform comparably well. XGB2 is among the list of inherently interpretable models in the PiML toolbox. With each tree splitting at most twice, XGB2 inherently captures up to two-way interactions. PiML incorporates a purification procedure to isolate the main effects from two-way interactions, thereby making XGB2 easy to interpret.

Meanwhile, let us train another interpretable benchmark model called GAMI-Net, which captures main effects and two-way interactions using ReLU neural networks. For further insights and details regarding GAMI-Net, one may refer to the original paper available at https://arxiv.org/abs/2003.07132. Stay tuned as we prepare to release another article introducing GAMI-Net in depth.

For both the XGB2 and GAMI-Net benchmark models, we impose monotonicity constraints on “Balance” (increasing) and “Utilization” (decreasing). PiML supports customization of each interpretable model, with the benchmark model configuration sketched below.
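The sketch below assumes the PiML model classes XGB2Classifier and GAMINetClassifier together with mono_increasing_list/mono_decreasing_list arguments; please check the PiML user guide for the exact parameter names in your installed version.

from piml.models import XGB2Classifier, GAMINetClassifier

# XGB2 benchmark: depth-2 trees capture at most two-way interactions.
# The monotonicity argument names below are assumptions based on the PiML user guide.
exp.model_train(XGB2Classifier(n_estimators=1000, learning_rate=0.01,
                               mono_increasing_list=["Balance"],
                               mono_decreasing_list=["Utilization"]),
                name="XGB2")

# GAMI-Net benchmark: main effects and two-way interactions via ReLU networks,
# with the same monotonicity constraints imposed.
exp.model_train(GAMINetClassifier(mono_increasing_list=["Balance"],
                                  mono_decreasing_list=["Utilization"]),
                name="GAMI-Net")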

The resulting performance leaderboard is ranked by testing AUC, with the original XGB-default model retaining its position as the best-performing model. However, it is important to note that if ranked by other metrics such as testing accuracy (ACC) or F1 score, GAMI-Net would become the top performer, albeit at the cost of much longer training time.

The model_interpret() function in PiML offers a panel that showcases the inherent interpretability of each benchmark model, serving as a valuable tool for validating the feature importance analysis and assessing the input-output relationships. Below, on the left, the XGB2 benchmarking result is shown with piecewise constant curves, while on the right, the GAMI-Net result is displayed with piecewise linear curves.
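As a sketch, such panels can also be requested programmatically; the show keywords "global_fi" and "global_effect_plot" below are assumed from the PiML user guide and may differ by version.

# Sketch: inherent interpretability of the benchmark models (show options assumed)
exp.model_interpret(model="XGB2", show="global_fi", figsize=(5, 4))
exp.model_interpret(model="XGB2", show="global_effect_plot",
                    uni_feature="Utilization", figsize=(5, 4))
exp.model_interpret(model="GAMI-Net", show="global_effect_plot",
                    uni_feature="Utilization", figsize=(5, 4))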

5) Weakness Detection

Model weakness can be detected locally by the error slicing technique in PiML. It segments data along the slicing features and then identifies the weak regions where the training or testing performance metrics exceed a pre-specified threshold. This approach enables the detection of areas where the model may be underperforming or exhibiting weaknesses, allowing for targeted improvements or adjustments.

The model_diagnose() function in PiML presents a panel of model diagnostics, where the WeakSpot tab provides options for weakness detection. Shown below on the left is the result of weak regions along the 1D slicing feature “Delinquency” (with ACC threshold 1.0), while on the right is the result for the 2D slicing features “Delinquency” and “Utilization” (with ACC threshold 1.1).
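A sketch of the corresponding calls is given below, assuming the slicing arguments (slice_method, slice_features, threshold, metric) follow the PiML user-guide naming.

# Sketch: WeakSpot detection (argument names assumed from the PiML user guide)
exp.model_diagnose(model="XGB-default", show="weakspot",
                   slice_method="histogram", slice_features=["Delinquency"],
                   threshold=1.0, metric="ACC", use_test=True)

exp.model_diagnose(model="XGB-default", show="weakspot",
                   slice_method="histogram",
                   slice_features=["Delinquency", "Utilization"],
                   threshold=1.1, metric="ACC", use_test=True)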

6) Benign Overfitting?

As introduced earlier, an XGBoost model is an ensemble of numerous decision trees built through iterative refinement. Similar to many other complex models, XGBoost is prone to overfitting, in the sense that there can be a significant gap between training and testing performance. However, a notion of “Benign Overfitting” has emerged within the context of overparametrized models like GBMs and DNNs (deep neural networks). This notion suggests that despite the presence of overfitting, complex models can achieve high accuracy on both training and testing datasets.

Let’s investigate the phenomenon of benign overfitting specifically for XGBoost, then examine whether this phenomenon truly has benign implications for model development in practical applications. The Python script provided below progressively evaluates the training and testing performance of XGBoost over the number of boosting iterations (i.e., n_iteration).

import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
train_x, train_y, _ = exp.get_data(train=True)
test_x, test_y, _ = exp.get_data(test=True)

def score(model, iteration_range):
    # Evaluate train/test AUC using only the trees within the given iteration range
    train_auc = roc_auc_score(train_y, model.predict_proba(train_x, iteration_range=iteration_range)[:, 1])
    test_auc = roc_auc_score(test_y, model.predict_proba(test_x, iteration_range=iteration_range)[:, 1])
    # Count leaves of the trees used, as a measure of model complexity
    tree_idx = np.logical_and(xgbinfo["Tree"] < iteration_range[1], xgbinfo["Tree"] >= iteration_range[0])
    n_leaves = np.sum(xgbinfo[tree_idx].loc[:, ["Feature"]].values == "Leaf")
    return {'n_leaves': n_leaves, 'n_iteration': iteration_range[1], 'train_auc': train_auc, 'test_auc': test_auc}

all_metrics = []
model = exp.get_model("XGB-default").estimator
xgbinfo = model.get_booster().trees_to_dataframe()
for i in range(0, model.get_num_boosting_rounds(), 2):
    metrics = score(model, (0, i + 1))
    all_metrics.append(metrics)
all_results = pd.DataFrame(all_metrics)

plt.figure(figsize=(8,5))
plt.scatter(all_results["n_iteration"], all_results["train_auc"], label="Train", color="blue", s=5)
plt.scatter(all_results["n_iteration"], all_results["test_auc"], label="Test", color="red", s=5)
plt.legend()
plt.ylabel("AUC")
plt.xlabel("n_iteration")
plt.title("Benign Overfitting: Tendency Toward Overparametrized Models")
plt.show()

The plot illustrates that as n_iteration goes beyond approximately 500, the training AUC continues to increase, indicating overfitting. However, unlike the typical U-shaped pattern observed in the classical bias-variance tradeoff, the testing AUC remains relatively stable, suggesting that the model maintains its prediction accuracy on the testing data.

The concept of benign overfitting resonates with the aforementioned Rashomon Effect in hyperparameter tuning. When all the XGBoost models with n_iteration greater than 500 exhibit similar testing performance, opting for a simpler model is preferable due to better interpretability. As we see later in the robustness test, the overly complex XGBoost models are less robust than the interpretable benchmark models.

Utilizing the slicing technique enables us to detect the overfitting regions and conduct a benchmark analysis. Shown below is a comparison of the train-test AUC gap for the three models (XGB-default, XGB2, and GAMI-Net) along the slicing feature “Delinquency”. It is evident that XGB-default exhibits more overfitting in the majority of the segmented regions.
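A sketch of this benchmark analysis is given below, assuming model_compare() supports an "overfit" view with a slicing feature; the argument names follow the PiML user-guide style and may differ by version.

# Sketch: overfitting comparison along a slicing feature (arguments assumed)
exp.model_compare(models=["XGB-default", "XGB2", "GAMI-Net"],
                  show="overfit",
                  slice_method="histogram",
                  slice_feature="Delinquency",
                  metric="AUC")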

7) Robustness

It is known that overfitted models usually perform poorly in dynamic or changing environments. The robustness test evaluates the performance degradation on the testing data under covariate noise perturbation. For simplicity, let’s consider raw perturbation by injecting normally distributed noise at varying levels into all the continuous features.

The model_compare() function in PiML provides a panel for conducting robustness benchmarking analysis. Specifically, let’s compare XGB-default with the two interpretable benchmark models (XGB2 and GAMI-Net), choosing a small noise step of 0.05 and examining the performance degradation in terms of testing AUC.
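A sketch of this robustness comparison is given below; the perturbation arguments (perturb_method, perturb_size) are assumed from the PiML user guide and may differ by version.

# Sketch: robustness benchmarking under raw covariate perturbation (arguments assumed)
exp.model_compare(models=["XGB-default", "XGB2", "GAMI-Net"],
                  show="robustness_perf",
                  perturb_method="raw",
                  perturb_size=0.05,
                  metric="AUC")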

The analysis result depicted by the plot reveals that GAMI-Net demonstrates superior robustness against covariate noise perturbation, maintaining a high level of performance across varying noise degrees. In contrast, the testing AUC for both XGB-default and XGB2 shows a significant decline even with small noise perturbation. It is noteworthy that both XGBoost models were configured with n_estimators 1000, a setting initially presumed to exhibit benign overfitting. However, the robustness test indicates that what might be misconstrued as benign overfitting could in fact be a symptom of underlying vulnerability to dynamically changing environments in practice.

🍭Python Notebook

The Python Jupyter notebook for this tutorial is available in the PiML GitHub repo and can be executed in Google Colab through the following link:

Thank you for reading!

If you are, like us, passionate about data science, interpretable machine learning, and/or model risk management, please feel free to contact us on LinkedIn.

All images without a source credit were created by the author or generated by PiML.

About PiML
PiML was born as an educational toolbox for interpretable machine learning in the Python community. It is designed for both model development and model diagnostics. Since its first launch on May 4, 2022, PiML has been continuously updated with new features and capabilities, together with a complete user guide.
