Your standard algorithm for tabular data?
September 10, 2024
This lecture is based on the following open access materials:
Source code: https://github.com/anthology-of-data-science/lecture-gam-ebm
Daniel Kapitan, Generalized Additive Models and Explainable Boosting Machines.
This work is licensed under CC BY-SA 4.0
\[ y_{i} = \beta_{0} + \beta_{1} x_{i} + \epsilon_{i} \]
\[\begin{align} p(x_{i})& = \frac{e^{y_{i}}}{1 + e^{y_{i}}}\\ \log \left( {\frac{p(x_{i})}{1 - p(x_{i})}} \right) & = \beta_{0} + \beta_{1} x_{i} \end{align}\]
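To make both models concrete, a minimal scikit-learn sketch on simulated data (the variable names and simulated coefficients are illustrative, not from the lecture):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 1))

# Linear regression: y = beta_0 + beta_1 * x + noise
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)
lin = LinearRegression().fit(X, y)
print(lin.intercept_, lin.coef_)  # estimates of beta_0, beta_1

# Logistic regression: the same linear predictor, fit on the log-odds scale
p = 1 / (1 + np.exp(-(1.0 + 2.0 * X[:, 0])))
z = rng.binomial(1, p)
logit = LogisticRegression().fit(X, z)
print(logit.intercept_, logit.coef_)
```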
\[
y_{i} = \beta_{0} + \beta_{1} x_{i} + \beta_{2} x_{i}^{2} + \beta_{3} x_{i}^{3} + \ldots +\epsilon_{i}
\]
\[\begin{align} p(x_{i})& = \frac{e^{y_{i}}}{1 + e^{y_{i}}}\\ \log\left({\frac{p(x_{i})}{1 - p(x_{i})}}\right)& = \beta_{0} + \beta_{1} x_{i} + \beta_{2} x_{i}^{2} + \beta_{3} x_{i}^{3} + \ldots \end{align}\]
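Polynomial (logistic) regression is the same model fit on an expanded basis. A sketch using scikit-learn's `PolynomialFeatures`, again on simulated data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 1))
z = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] ** 3 - X[:, 0]))))

# degree=3 expands x into [x, x^2, x^3]; the logistic model is then
# linear in these basis functions, exactly as in the equation above.
poly_logit = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    LogisticRegression(),
).fit(X, z)
```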
Wage data
Piecewise cubic: 2 × 4 coefficients → 8 degrees of freedom (DoF)
Continuous cubic (no gaps): one extra constraint → 7 DoF
Cubic spline: require 1st and 2nd derivatives to be continuous → two extra constraints → 5 DoF
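To see those degrees of freedom in code: patsy's `bs()` builds a cubic spline basis with a chosen `df`, which ordinary least squares can then fit. A sketch on simulated data (the `age`/`wage` names are illustrative):

```python
import numpy as np
from patsy import dmatrix
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
age = rng.uniform(18, 80, size=300)
wage = 50 + 30 * np.sin(age / 15) + rng.normal(scale=5, size=300)

# Cubic spline basis with df=5, matching the 5 DoF above; patsy places
# the interior knots automatically at quantiles of `age`.
basis = dmatrix("bs(age, df=5, degree=3) - 1", {"age": age},
                return_type="dataframe")
fit = LinearRegression().fit(basis, wage)
```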
\[ {\color{green}\sum^{n}_{i=1} {\left(y_{i} - g(x_{i}) \right)}^{2}} + {\color{#ff4f5e} \lambda \int g''(t)^{2}dt} \]
Same principle as Lasso and Ridge regression: \({\color{green} loss} + {\color{#ff4f5e} penalty}\)
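SciPy (≥ 1.10) minimizes exactly this penalized objective; `make_smoothing_spline` takes the penalty weight `lam` (the \(\lambda\) above) directly. A small sketch on simulated data:

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)

# lam -> 0 interpolates the data; large lam shrinks g towards a
# straight line (zero second derivative everywhere).
g = make_smoothing_spline(x, y, lam=1.0)
y_smooth = g(x)
```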
\[ y_{i} = \beta_{0} + \beta_{1} x_{i1} + \beta_{2} x_{i2} + \ldots + \beta_{p} x_{ip} + \epsilon_{i} \]
\[\begin{align} y_{i} &= \beta_{0} + f_{1}(x_{i1}) + f_{2}(x_{i2}) + \ldots + f_{p}(x_{ip}) + \epsilon_{i} \\ y_{i} &= \beta_{0} + \sum^{p}_{j=1} f_{j}(x_{ij}) + \epsilon_{i} \end{align}\]
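In Python, pygam writes this model down almost literally: one smooth term per feature, summed. A minimal sketch on simulated data (assuming pygam is installed):

```python
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = (np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + X[:, 2]
     + rng.normal(scale=0.1, size=300))

# One smooth f_j per feature; the GAM is the sum of these plus an intercept.
gam = LinearGAM(s(0) + s(1) + s(2)).fit(X, y)
```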
Wage data
\(wage = \beta_0 + f_1(year) + f_2(age) + f_3(education)\)
\(f_1\): four degrees of freedom, \(f_2\): five degrees of freedom
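A pygam sketch of this specification; the Wage data itself is not bundled here, so `X` is assumed to hold year, age and an integer-coded education column, and `n_splines` is only a rough stand-in for the quoted degrees of freedom (pygam controls effective DoF via its `lam` penalty):

```python
from pygam import LinearGAM, s, f

# s(0): smooth of year, s(1): smooth of age,
# f(2): a step function per education level.
gam = LinearGAM(s(0, n_splines=4) + s(1, n_splines=5) + f(2))
# gam.fit(X, y)  # X columns: [year, age, education], y: wage
```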
Wage data
\(\log\left({{p(x)}/{1 - p(x)}}\right) = \beta_0 + \beta_1 \times year + f_2(age) + f_3(education)\)
\(f_2\): five degrees of freedom
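The logistic version swaps the link function and keeps year linear; in pygam that is a `LogisticGAM` with a linear term `l(0)`. Same assumed column layout as in the sketch above:

```python
from pygam import LogisticGAM, l, s, f

# logit link; year enters linearly, age as a smooth,
# education as a factor term.
gam = LogisticGAM(l(0) + s(1, n_splines=5) + f(2))
# gam.fit(X, high_earner)  # high_earner: a binary target derived from wage
```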
\(\color{green}\bigtriangleup\) You can fit a non-linear \(f_j\) to each \(X_j\), so non-linear relationships are modelled automatically (no need for manual transformation of each variable)
\(\color{green}\bigtriangleup\) Using non-linear functions potentially results in more accurate predictions
\(\color{green}\bigtriangleup\) Because the model is additive, you can examine the effect of each feature \(X_j\) on the response \(Y\) individually
\(\color{green}\bigtriangleup\) Smoothness of functions can be summarized via degrees of freedom
\(\color{orange}\bigtriangledown\) The additive model may be too restrictive: interactions between features are not included unless you add them manually
\(\color{orange}\bigtriangledown\) Can be computationally expensive for many features
\[ g(E[y]) = \beta_0 + {\color{#00458b} \sum f_i(x_{i})} + {\color{#6e008b} \sum f_{ij}(x_{i}, x_{j})} \]
\(g(E[y]):\)
link function, identity for regression, logit for logistic regression
\({\color{#00458b} \sum f_i(x_{i})}:\)
GAM, but now using shallow trees as basis functions
\({\color{#6e008b} \sum f_{ij}(x_{i}, x_{j})}:\)
pairwise interactions
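The interpret package implements this model as a glassbox estimator; its `interactions` parameter controls how many pairwise terms \(f_{ij}\) are added. A minimal sketch on synthetic data:

```python
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=6, random_state=0)

# Each shape function f_i is learned with shallow boosted trees on one
# feature; interactions=10 allows up to ten pairwise terms f_ij(x_i, x_j).
ebm = ExplainableBoostingClassifier(interactions=10).fit(X, y)
ebm.explain_global()  # per-feature shape functions and interaction heatmaps
```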