Model Diagnostics: Error & Resilience

PiML Tutorials
7 min read · Oct 9, 2023


PiML was born as an educational toolbox for interpretable machine learning in the Python community. It is designed for both model development and model diagnostics. Since its first launch on May 4th, 2022, PiML has been continuously updated with new features and capabilities, together with a complete user guide.

Source: https://github.com/SelfExplainML/PiML-Toolbox/

We’ve recently made a minor update to its latest version 0.5.1, incorporating features based on user requests raised through GitHub issues.

Sep 28, 2023: PiML v0.5.1 released with new features:
- Model-free diagnostic test APIs.
- Sliced overfitting test with ACC and AUC metrics.
- Other miscellaneous improvements and bug fixes.
- Support for a wider range of computing environments, including Mac ARM.

As it happens, Aijun from the PiML team is teaching a model validation class for Master of Data Science students at UNC Charlotte in Fall 2023. Three model diagnostics topics from the syllabus will be covered:

Topics of Model Diagnostics, UNCC Master of Data Science, Fall 2023

We cover Aijun’s lecture notes on these topics in the PiML tutorials, together with case studies and Python code shared here.

Machine Learning Model Diagnostics

🏆Model performance is not all you need.

ML model performance is often measured by accuracy, as examined via standard overall metrics (e.g., MSE, MAE, and R2 for regression; ACC, AUC, F1-score, Precision, and Recall for classification).

However, model risk assessment by single-valued metrics is insufficient. More granular diagnostics and outcome testing are needed, including

  • Resilience test: anticipate performance degradation due to input distribution drift
  • Reliability test: assess prediction confidence by uncertainty quantification
  • Robustness test: assess performance degradation due to small input perturbation

A primary goal of model diagnostics is weakness identification, i.e., identifying regions and drivers where the model performs poorly in various ways. A model may show good overall performance yet underfit or overfit in sub-regions, due to under- or over-representing local nonlinearity or interaction effects. It may also be exposed to unreliable regions with larger prediction uncertainty. Sometimes these weaknesses are detectable, explainable, and fixable. At other times the model may reveal inherent limitations due to sparse data or a low signal-to-noise ratio, i.e., a difficult-to-predict problem.

Bias and fairness is another important aspect of model diagnostics, for ensuring responsible practice of machine learning. The goal is to identify disparity or discrimination against demographic groups and to mitigate such bias.

All of these diagnostic tests are covered by the PiML toolbox.

In this tutorial, we start with error analysis and the resilience test. We demonstrate how PiML is used for model weakness identification in the context of training data, testing data, and distribution drift scenarios.

Part 1: Error and Resilience

Suppose we are given two example models: a regression model for the California Housing data and a binary classification model for the SimuCredit data. The Python notebooks can be found on the PiML GitHub or run directly in Google Colab:

  1. CaliforniaHousing Case (Regression)
  2. SimuCredit Case (Binary Classification)

In what follows, we focus on the regression case.

🤖Data and Model Pipelines

PiML provides convenient low-code interfaces for data, model, and test pipelines. After installing the package with “pip install piml”, one may initiate an experiment and load data:

from piml import Experiment
exp = Experiment()
exp.data_loader(data='CaliforniaHousing_trim2')

California Housing data (trim2 version in PiML Toolbox)

Data summary, preprocessing, and visualization can easily be carried out in PiML. The last variable, MedHouseVal, is the response; as usual, we split the data 80–20 into training and testing sets. Please refer to the example notebook.
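A minimal sketch of this step is shown below; the data_summary/data_prepare calls and their parameter names reflect our reading of the PiML low-code API and may differ slightly by version, so please check the example notebook for the exact code.

# Summarize the loaded data, declare the response, and split 80-20 into train/test
exp.data_summary()
exp.data_prepare(target='MedHouseVal', task_type='regression',
                 test_ratio=0.2, random_state=0)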

In the model pipeline, PiML provides a whole list of interpretable machine learning models that come with model-specific interpretation as well as post-hoc explanation. We will have a dedicated thread covering them; stay tuned.

In this tutorial, let’s take XGBoost (depth 5) and a DNN (ReLU activation) off the shelf, register them into the PiML pipeline, and check their prediction accuracy. It turns out that XGB5 performs better than ReLU-DNN, even though XGB5 is overfitting.

from xgboost import XGBRegressor
from sklearn.neural_network import MLPRegressor

# Register an XGBoost regressor of depth 5
XGB = XGBRegressor(max_depth=5, n_estimators=500)
exp.model_train(model=XGB, name='XGB5')

# Register a ReLU feedforward network with four hidden layers of 40 nodes
DNN = MLPRegressor(hidden_layer_sizes=[40] * 4,
                   activation="relu", random_state=0)
exp.model_train(model=DNN, name='ReLUDNN')

# Compare test-set accuracy of the two registered models
exp.model_compare(models=["XGB5", "ReLUDNN"],
                  show="accuracy_plot", metric="MSE",
                  figsize=(6, 4))

🔬Diagnostics by Residual Plot

Let’s diagnose ReLU-DNN with PiML exp.model_diagnose(). The accuracy tab shows not only performance metrics such as MSE, MAE, and R2, but also a residual plot for which the user can choose the training/testing data and the x-axis variable. From the result below, it is evident that the model does not perform well for large response values.
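For reference, the residual plot can also be produced directly with a low-code call along these lines; the show="accuracy_residual" option and the show_feature argument reflect our understanding of the PiML API and may differ slightly by version.

# Residual plot for ReLU-DNN, with the response on the x-axis
exp.model_diagnose(model="ReLUDNN", show="accuracy_residual",
                   show_feature="MedHouseVal", figsize=(6, 5))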

🔬Diagnostics by Error Slicing

To pinpoint model weaknesses at a granular level, PiML offers error slicing techniques.

  1. Specify an appropriate metric based on individual prediction residuals, e.g., MSE for regression, ACC for classification, the train-test performance gap, or prediction interval bandwidth;
  2. Specify one or two slicing features of interest;
  3. Evaluate the metric for each sample in the target data (training or testing) as pseudo responses;
  4. Segment the target data along the slicing features, by a) histogram slicing with equal-space binning, or b) fitting a decision tree or tree ensemble to generate sub-regions;
  5. Identify the sub-regions whose average metric exceeds a pre-specified threshold, subject to a minimum sample size condition.

Slicing can be applied to the training data to check for underfitting regions; this is called WeakSpot in PiML.

exp.model_diagnose(model="ReLUDNN", show="weakspot", metric="MSE",
                   slice_method="histogram", slice_features=["MedInc"],
                   threshold=1.2, min_samples=20, use_test=False, figsize=(6, 5))

Slicing can also be applied to the train-test performance gap to check for overfitting regions:

exp.model_diagnose(model="ReLUDNN", show="overfit", metric="MSE",
                   slice_method="histogram", slice_features=["MedInc", "HouseAge"],
                   threshold=1.2, min_samples=100, figsize=(6, 5))

🚀Resilience Test

The resilience test anticipates performance degradation under covariate distribution drift. A distributionally resilient model shows only mild performance degradation under drift, which can be framed as distributionally robust risk minimization:

min_f max_{Q ∈ S} E_{(X,Y)~Q} [ L(f(X), Y) ],

where S represents the set of distribution drift scenarios and L is the loss. There are four resilience scenarios offered in the PiML toolbox:

  1. Worst-sample: percentage of worst-performing test samples based on residual magnitude
  2. Worst-cluster: worst-performing cluster of test samples based on K-means clustering
  3. Outer-sample: percentage of boundary/outlying test samples distant from the sample mean
  4. Hard-sample: percentage of difficult-to-predict test samples based on an auxiliary model

Note that scenarios 1 and 2 are model-specific, while scenarios 3 and 4 are model-agnostic.

Run PiML exp.model_diagnose() for ReLU-DNN and choose the resilience tab. Let’s illustrate the resilience test with the worst-sample scenario. The last plot shows the curve of performance degradation as the percentage of worst-performing test samples varies. For a user-defined ratio (here 20%), it also shows the distribution shift between the 20% worst-performing samples and the remaining 80% of samples.
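The corresponding low-code calls are sketched below; the show options follow the PiML user guide, while the method and alpha arguments for selecting the scenario and the worst-sample ratio are assumptions that may need adjustment in your PiML version.

# Performance degradation curve as the worst-sample percentage varies
exp.model_diagnose(model="ReLUDNN", show="resilience_perf", figsize=(6, 4))

# Distribution drift of each feature between the 20% worst samples and the rest
exp.model_diagnose(model="ReLUDNN", show="resilience_distance",
                   method="worst-sample", alpha=0.2, figsize=(6, 4))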

The distribution drift can be measured by two-sample tests between empirical distributions, including the Kullback-Leibler (KL) divergence, Kolmogorov-Smirnov (KS) and Cramer-von Mises (CM) statistics, the Population Stability Index (PSI), and the Wasserstein distance. Among others, PSI tests one variable at a time,

PSI = Σ_i (p_i - q_i) ln(p_i / q_i),

where p_i and q_i are the proportions of samples falling into the i-th bucket of the target and base populations, respectively. As a rule of thumb, PSI ≥ 0.2 indicates a significant distribution change, while PSI between 0.1 and 0.2 indicates a modest change. The variables with notable drift are deemed sensitive or vulnerable in the resilience test.
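As a quick illustration of this formula, here is a small NumPy sketch that computes PSI for one variable with ten equal-width buckets. It is a simple helper written for this tutorial, not a PiML API.

import numpy as np

def psi(base, target, n_buckets=10, eps=1e-6):
    # Equal-width buckets over the pooled range; eps guards against empty buckets
    edges = np.histogram_bin_edges(np.concatenate([base, target]), bins=n_buckets)
    p = np.histogram(base, bins=edges)[0] / len(base) + eps
    q = np.histogram(target, bins=edges)[0] / len(target) + eps
    return float(np.sum((q - p) * np.log(q / p)))

rng = np.random.default_rng(0)
base = rng.normal(0.0, 1.0, 5000)      # base population
target = rng.normal(0.3, 1.0, 5000)    # drifted target population
print(round(psi(base, target), 3))     # PSI for a modest 0.3-sigma mean shift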

From the plots above, the variable AveOccup in the CaliforniaHousing data shows a modest distribution drift. Its sensitivity to worst-performing samples can be verified through PiML WeakSpot over the testing samples, where regions with smaller AveOccup tend to be more difficult to predict than those with larger values.
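This check reuses the WeakSpot call from earlier, now sliced along AveOccup on the testing data; the threshold value simply carries over the earlier setting and can be tuned.

# WeakSpot on testing samples, sliced along AveOccup
exp.model_diagnose(model="ReLUDNN", show="weakspot", metric="MSE",
                   slice_method="histogram", slice_features=["AveOccup"],
                   threshold=1.2, min_samples=20, use_test=True, figsize=(6, 5))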

⛳One More Thing

In the resilience test, we may choose other distribution drift scenarios, e.g., worst-cluster for the binary classification model XGB5 based on the SimuCredit data, with the result shown below.

It is found that a sub-region of Utilization is tied to the worst-performing cluster among the K-Means clusters with K = 5.
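A hedged sketch of the corresponding call for the SimuCredit classifier is given below; the method and n_clusters argument names are assumptions and may differ across PiML versions, so please refer to the SimuCredit notebook for the exact code.

# Resilience test under the worst-cluster scenario (K-Means with K = 5)
exp.model_diagnose(model="XGB5", show="resilience_perf",
                   method="worst-cluster", n_clusters=5, figsize=(6, 4))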

Following this spirit, we can conduct model diagnostics on a segment-by-segment basis. In the supplementary Python notebooks for both cases, an experimental segmented diagnostic was coded to check performance heterogeneity, utilizing the newly released PiML scored_test APIs:

from piml.scored_test import test_accuracy, residual_plot, slicing_weakspot
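As a rough, self-contained illustration of the segment-by-segment idea (using plain NumPy rather than the scored_test functions, whose exact signatures we leave to the notebooks), one can compute a per-segment metric and compare segments; the helper below is hypothetical and written for this tutorial only.

import numpy as np

def segmented_mse(y_true, y_pred, segments):
    # Hypothetical segmented check of performance heterogeneity:
    # compute MSE separately for each segment label
    results = {}
    for seg in np.unique(segments):
        mask = segments == seg
        results[seg] = float(np.mean((y_true[mask] - y_pred[mask]) ** 2))
    return results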

We invite you to join the discussion about best practices for segmented diagnostics. We hope to roll it out as a new feature in a future PiML release.

Thank you for reading!

All images without a source credit were created by the author or generated by PiML.


Written by PiML Tutorials

Python Interpretable Machine Learning Toolbox for Model Development & Diagnostics