Generalized Additive Models and Explainable Boosting Machines

Your standard algorithm for tabular data?

Dr Daniel Kapitan

Eindhoven AI Systems Institute

September 10, 2024

Why this lecture?

Source: Adaptation of ISLP, figure 2.7

Learning objectives: moving beyond linearity


Generalized Additive Models

  • Know how to use additive models with a single feature
    • polynomial regression
    • regression splines
    • smoothing splines
  • Know how to use generalized additive models with multiple features
    • for regression
    • for classification

Explainable Boosting Machines

  • Know how to use Explainable Boosting Machines by
    • training smoothed splines
    • correcting the learned splines
    • interpreting EBMs
  • Reflect on the usefulness of EBMs for high-risk applications

Generalized Additive Models

Moving beyond linearity


From linear regression

\[ y_{i} = \beta_{0} + \beta_{1} x_{i} + \epsilon_{i} \]


From logistic regression

\[\begin{align} p(x_{i})& = \frac{e^{\beta_{0} + \beta_{1} x_{i}}}{1 + e^{\beta_{0} + \beta_{1} x_{i}}}\\ \log \left( {\frac{p(x_{i})}{1 - p(x_{i})}} \right) & = \beta_{0} + \beta_{1} x_{i} \end{align}\]

… to polynomial regression

\[ y_{i} = \beta_{0} + \beta_{1} x_{i} + \beta_{2} x_{i}^{2} + \beta_{3} x_{i}^{3} + \ldots + \beta_{d} x_{i}^{d} + \epsilon_{i} \]

… to logistic polynomial regression

\[\begin{align} p(x_{i})& = \frac{e^{\eta_{i}}}{1 + e^{\eta_{i}}}\\ \log\left({\frac{p(x_{i})}{1 - p(x_{i})}}\right)& = \eta_{i} = \beta_{0} + \beta_{1} x_{i} + \beta_{2} x_{i}^{2} + \beta_{3} x_{i}^{3} + \ldots + \beta_{d} x_{i}^{d} \end{align}\]

Polynomial regression


Generalization of linear regression

  • Use ordinary least squares for estimating coefficients
  • Rarely used for d > 3 or 4, because the curve becomes too flexible
  • For classification, use the logit (log-odds) form shown above


Concept of basis functions

  • Polynomials are one example of basis functions
  • The Fourier basis (sines and cosines) is another common choice, suited for periodic functions (see the sketch below)
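
A minimal sketch of polynomial regression as a basis expansion in scikit-learn; the data here is synthetic and purely illustrative:

```python
# Polynomial regression = ordinary least squares on a polynomial basis.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
age = rng.uniform(18, 80, size=(200, 1))                  # illustrative feature
wage = (50 + 2 * age.ravel() - 0.02 * age.ravel() ** 2
        + rng.normal(0, 5, size=200))                     # illustrative target

# Degree-3 basis [x, x^2, x^3]; OLS estimates beta_1..beta_3 on it.
model = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                      LinearRegression())
model.fit(age, wage)
print(model.predict([[40.0]]))                            # predicted wage at age 40
```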

Example: Wage data

Source: ISLP, figure 7.1

Regression splines

Not as many extra degrees of freedom as you may think

Source: ISLP, figure 7.3

Piecewise cubic: 2 × 4 coefficients → 8 degrees of freedom (DoF)


Continuous cubic (no gaps): one extra constraint → 7 DoF


Cubic spline: requiring the 1st and 2nd derivatives to be continuous → two extra constraints → 5 DoF
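
A minimal sketch of a cubic regression spline using scikit-learn's SplineTransformer; the knot count and data are illustrative:

```python
# Cubic regression spline: B-spline basis expansion followed by OLS.
# Continuity of the curve and of its 1st and 2nd derivatives at the
# knots is built into the cubic B-spline basis itself.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, size=200)).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.2, size=200)

spline = make_pipeline(
    SplineTransformer(degree=3, n_knots=6, include_bias=False),
    LinearRegression(),
)
spline.fit(x, y)
```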

Smoothing splines

Determine set of knots with regularization


\[ {\color{green}\sum^{n}_{i=1} {\left(y_{i} - g(x_{i}) \right)}^{2}} + {\color{#ff4f5e} \lambda \int g''(t)^{2}dt} \]


Same principle as Lasso and Ridge regression: \({\color{green} loss} + {\color{#ff4f5e} penalty}\)

  • Low \(\color{#ff4f5e}\lambda\): low penalty for ‘wildly oscillating’ function \(g(x)\)
  • High \(\color{#ff4f5e}\lambda\): high penalty forces \(g(x)\) to become smoother (hence the name)
  • Selection of \(\color{#ff4f5e}\lambda\) is done with cross-validation (usually LOOCV); see the sketch below
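
A minimal sketch with SciPy (≥ 1.10): make_smoothing_spline solves exactly this penalized least-squares problem, and when lam is omitted it picks \(\lambda\) by generalized cross-validation, a close cousin of the LOOCV mentioned above:

```python
# Smoothing spline: penalized least squares with roughness penalty lambda.
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)                   # abscissae must be increasing
y = np.sin(x) + rng.normal(0, 0.2, size=x.size)

g = make_smoothing_spline(x, y)               # lam=None -> chosen by GCV
y_smooth = g(x)                               # evaluate the fitted g(x)
```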

Smoothing splines

Example: triceps skinfold thickness as a function of age

Source: Biostatistics Collaboration of Australia.

Generalized additive models with multiple features


From multiple linear regression:

\[ y_{i} = \beta_{0} + \beta_{1} x_{i1} + \beta_{2} x_{i2} + \ldots + \beta_{p} x_{ip} + \epsilon_{i} \]

… to GAMs

\[\begin{align} y_{i} &= \beta_{0} + f_{1}(x_{i1}) + f_{2}(x_{i2}) + \ldots + f_{p}(x_{ip}) + \epsilon_{i} \\ y_{i} &= \beta_{0} + \sum^{p}_{j=1} f_{j}(x_{ij}) + \epsilon_{i} \end{align}\]

  • Generalized: for each function \(f_j\) you can choose which (non-)linear basis function you want to use
  • Additive: we assume the contributions of the separate \(f_j\) can simply be added (see the sketch below)
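
One way to fit such a model is the pygam library, where each s(j) is a penalized spline term for feature j; the data below is synthetic and illustrative:

```python
# GAM: y = beta_0 + f_1(x_1) + f_2(x_2) + f_3(x_3) + eps,
# with each f_j a penalized (smoothing) spline.
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 3))
y = (np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2]
     + rng.normal(0, 0.1, size=500))

gam = LinearGAM(s(0) + s(1) + s(2))
gam.gridsearch(X, y)      # cross-validated search for each term's penalty lam
gam.summary()             # per-term effective degrees of freedom, p-values
```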

GAM for Wage data

Regression using natural splines

\(wage = \beta_0 + f_1(year) + f_2(age) + f_3(education)\)

\(f_1\): four degrees of freedom, \(f_2\): five degrees of freedom

Source: ISLP, figure 7.11.

GAM for Wage data

Regression using smoothing splines

\(wage = \beta_0 + f_1(year) + f_2(age) + f_3(education)\)

\(f_1\): four degrees of freedom, \(f_2\): five degrees of freedom

Source: ISLP, figure 7.12.

GAM for Wage data

Probability of earning more than 250 thousand dollars per year

\(\log\left({{p(x)}/{1 - p(x)}}\right) = \beta_0 + \beta_1 \times year + f_2(age) + f_3(education)\)

\(f_2\): five degrees of freedom

Source: ISLP, figure 7.12.
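
A sketch of this model with pygam's LogisticGAM: a linear term for year, a smoothing spline for age, and a factor term for education (the stand-in data and column order are assumptions, not the actual Wage dataset):

```python
# Logistic GAM: logit p = beta_0 + beta_1*year + f_2(age) + f_3(education)
import numpy as np
from pygam import LogisticGAM, l, s, f

# Illustrative stand-in for the Wage data: columns 0=year, 1=age, 2=education.
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.integers(2003, 2010, n),        # year
    rng.uniform(18, 80, n),             # age
    rng.integers(0, 5, n),              # education level (coded 0-4)
])
logit = -4 + 0.05 * (X[:, 0] - 2003) + 0.02 * X[:, 1] + 0.5 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))   # 1 if wage > 250, else 0

gam = LogisticGAM(l(0) + s(1) + f(2)).fit(X, y)
prob = gam.predict_proba(X)                     # P(wage > 250 | x)
```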

Pros and Cons of GAMs

\(\color{green}\bigtriangleup\) You can fit a non-linear \(f_j\) to each \(X_j\), so non-linear relationships are modeled automatically (no need for manual transformations)

\(\color{green}\bigtriangleup\) Using non-linear functions potentially results in more accurate predictions

\(\color{green}\bigtriangleup\) Because the model is additive, you can examine the effect of each feature \(X_j\) on the response \(Y\) individually

\(\color{green}\bigtriangleup\) Smoothness of the functions can be summarized via degrees of freedom

\(\color{orange}\bigtriangledown\) The additive model may be too restrictive: interactions between features are not included automatically

\(\color{orange}\bigtriangledown\) Can be computationally expensive for many features

Explainable Boosting Machines

Remember how gradient boosting works?

Source: Python Geeks

Generalized additive models with pairwise interactions (\(GA^{2}M\))

Microsoft Research implemented it and branded it as EBM

\[ g(E[y]) = \beta_0 + {\color{#00458b} \sum f_i(x_{i})} + {\color{#6e008b} \sum f_{ij}(x_{i}, x_{j})} \]

\(g(E[y]):\)

link function, identity for regression, logit for logistic regression

\({\color{#00458b} \sum f_i(x_{i})}:\)

GAM, but now using shallow trees as basis function

\({\color{#6e008b} \sum f_{ij}(x_{i}, x_{j})}:\)

pairwise interactions

  • Combines different ideas into a single model (Lou et al. (2019))
    • Fast detection of pairwise interactions
    • Uses gradient-boosted trees, training one feature at a time
    • Cycles through the features with each iteration to mitigate the effect of collinearity
  • Implemented in InterpretML (see the sketch after this list)
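
A minimal sketch with InterpretML; the dataset is synthetic, and the interactions parameter caps how many pairwise terms are detected:

```python
# EBM: per-feature shape functions learned with boosted shallow trees,
# cycling through the features round-robin, plus detected pairwise terms.
import numpy as np
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(0, 0.1, size=1000) > 0).astype(int)

ebm = ExplainableBoostingClassifier(interactions=10)   # up to 10 pairwise terms
ebm.fit(X, y)

# Global explanation: one shape function per term (single feature or pair).
show(ebm.explain_global())
```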

Intuition how EBM algorithm works

Intuition how EBM algorithm works (intermediate result)

Fast detection of pairwise interactions

Lou et al. (2013)

Searching for cuts on the input space of \(x_i\) and \(x_j\). On the left, a heat map of the target for different values of \(x_i\) and \(x_j\); \(c_i\) and \(c_j\) are cuts for \(x_i\) and \(x_j\), respectively. On the right, an extremely simple predictor modeling the pairwise interaction.

Intuition how EBM algorithm works (final result)

Performance of EBM on some datasets

Chang et al. (2020)

Test-set AUCs (%) across ten datasets, averaged over five runs. The best number in each row is in bold.

Example: predicting pneumonia mortality risk

The Pneumonia Data with 46 features

Cooper et al. (1997)

The dataset contains 14,199 cases of pneumonia collected from 78 hospitals between July 1987 and December 1988.

Using EBMs to detect common flaws in data

Chen et al. (2021)


  1. EBM shape function graphs can be helpful in identifying various types of dataset flaws.
  2. In many cases, users with domain expertise are needed to examine what the model has learned.
  3. In some cases, EBMs provide simple tools for correcting problems in the models when correcting the data is not feasible or too difficult (see the sketch below).
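
A hedged sketch of such a correction, assuming the fitted EBM exposes its per-term lookup tables through interpret's term_scores_ attribute (an internal detail of recent versions that may change between releases):

```python
# Editing a learned shape function in place, when fixing the data itself
# is not feasible; `ebm` is a fitted ExplainableBoostingClassifier.
import numpy as np

term = 0                           # index of the term to correct (assumption)
scores = ebm.term_scores_[term]    # per-bin additive contributions (assumed API)

# Example correction: flatten an implausible spike by capping contributions.
scores[:] = np.clip(scores, -1.0, 1.0)
# Later calls to ebm.predict / ebm.explain_global use the edited curve.
```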

Missing values assumed normal

Chen et al. (2021)


EBM shape function of “heart rate” for predicting pneumonia mortality risk. Left: missing values result in an unrealistically high risk score. Right: corrected risk score.

Correction for confounders and treatment effects

Chen et al. (2021)


Left: confounder of retirement at age 67, resulting in a sharp increase in risk; the social effect of doctors trying harder to cure centenarians results in lower risk. Right: patients with a history of asthma have a lower pneumonia mortality risk than the general population, since they are admitted directly to the ICU and receive more aggressive care, thereby lowering their risk of death.

Discovering new protocols?

Chen et al. (2021)


Left: patients get treated when blood urea nitrogen (BUN) reaches ~50; when BUN goes over 100, dialysis is given. Right: patients in the ICU get treated at systolic blood pressures (SBP) of 175, 200 and 255.

Discovering new protocols?

Chen et al. (2021)


Left: possible improvement by moving the dialysis threshold to ~80 BUN. Right: adjusting “inappropriate” treatment thresholds with flattened red lines.

Where to go from here?

Try it yourself


Thanks for your attention.