Chapter 2: Interpretable Models

Some machine learning models are already inherently interpretable, e.g. simple linear models (LMs), generalized linear models (GLMs), generalized additive models (GAMs) and rule-based models. This chapter briefly summarizes these models and clarifies how to interpret them.

§2.01: Inherently Interpretable Models - Motivation



§2.02: Linear Regression Model (LM)


Assumptions of the Linear Model

  1. Linearity: the relationship between the features and the target is linear.
  2. The error $\epsilon$, and hence $y \mid x$, is normally distributed with homoscedastic (constant) variance, i.e. $\epsilon \sim N(0, \sigma^2) \Rightarrow y \mid x \sim N(x^T \theta, \sigma^2)$. If the homoscedasticity assumption is violated, inference-based quantities such as $p$-values or $t$-statistics are no longer valid/reliable.
  3. The features $x_j$ are independent of the error term $\epsilon$. Plotting a single feature against the residuals should therefore show a point cloud with no trend (see the sketch after this list).
  4. No or little multicollinearity, i.e. there are no strong correlations between the features.
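
A minimal sketch of how these assumptions can be checked in practice, assuming a NumPy feature matrix `X` and target `y` and using `statsmodels` (the toy data and variable names are illustrative, not from the lecture):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # toy feature matrix
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.8, size=200)

# Fit the linear model y = x^T theta + eps via OLS
X_design = sm.add_constant(X)
lm = sm.OLS(y, X_design).fit()
print(lm.summary())                           # t-statistics and p-values (assumption 2)

resid = lm.resid

# Assumption 3: residuals vs. each feature should show no trend
for j in range(X.shape[1]):
    corr = np.corrcoef(X[:, j], resid)[0, 1]
    print(f"corr(x_{j}, residuals) = {corr:.3f}")

# Assumption 4: pairwise feature correlations should not be too strong
print(np.corrcoef(X, rowvar=False).round(2))
```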

Interpretation of Weights (Feature Effects)

Inference


§2.03: LM - Interactions and LASSO


Regularization via LASSO


§2.04: Generalized Linear Models


Logistic Regression


§2.05: Rule-based Models


Decision Trees

CART (Classification and Regression Trees)

CART is a non-parametric decision tree learning technique that can produce classification or regression trees.
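
As an illustration (not part of the original lecture material), scikit-learn's decision trees implement an optimized CART-style algorithm; a minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# CART-style tree: greedy splits chosen by impurity reduction (Gini),
# with cost-complexity pruning controlled via ccp_alpha
tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01, random_state=0)
tree.fit(X, y)

# The fitted tree is a set of human-readable decision rules
print(export_text(tree, feature_names=load_iris().feature_names))
```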

ctree (Conditional Inference Tree)

Conditional Inference Trees (ctree) are a type of decision tree that uses a statistical framework based on conditional inference procedures for recursive partitioning. This approach aims to avoid the variable selection bias (favoring variables with more potential split points) present in algorithms like CART.
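
ctree is implemented in R's partykit package. The following is only a toy Python sketch of the underlying idea (test each feature's association with the response, split only if the smallest adjusted p-value is below a significance level); it uses a simple correlation test as a stand-in for the permutation-based conditional inference that ctree actually performs:

```python
import numpy as np
from scipy import stats

def ctree_like_split(X, y, alpha=0.05):
    """Toy version of conditional-inference splitting.

    For each feature, test independence from the response and
    Bonferroni-adjust the p-values; split only if significant.
    """
    p = X.shape[1]
    p_values = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(p)])
    p_adj = np.minimum(p_values * p, 1.0)      # Bonferroni correction
    j_best = int(np.argmin(p_adj))
    if p_adj[j_best] >= alpha:                 # statistical stopping rule
        return None                            # no significant split -> leaf
    split_point = np.median(X[:, j_best])      # simplistic split-point choice
    return j_best, split_point
```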

mob (Model-Based Recursive Partitioning)

Model-Based Recursive Partitioning (mob) is an extension of recursive partitioning that fits parametric models (e.g., linear models, GLMs) in the terminal nodes of the tree. The partitioning is based on finding subgroups in the data whose model parameters differ in a statistically significant way.
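
mob is implemented in R's partykit package (e.g., partykit::mob, lmtree). Purely as an illustration of the idea, the toy Python sketch below fits a linear model in the two subgroups induced by a candidate split of a partitioning variable and reports how much the coefficients differ; the real algorithm uses formal parameter-instability (M-fluctuation) tests instead:

```python
import numpy as np

def fit_lm(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

def mob_like_split(X, y, z, threshold=None):
    """Toy model-based split on a single partitioning variable z.

    Fits y ~ X separately in the subgroups z <= threshold and z > threshold
    and measures the largest coefficient difference between them.
    """
    if threshold is None:
        threshold = np.median(z)
    left, right = z <= threshold, z > threshold
    theta_left = fit_lm(X[left], y[left])
    theta_right = fit_lm(X[right], y[right])
    instability = np.abs(theta_left - theta_right).max()
    return theta_left, theta_right, instability
```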

Key Differences (LLM Generated)
| Feature | CART | ctree (Conditional Inference Trees) | mob (Model-Based Recursive Partitioning) |
|---|---|---|---|
| Primary Goal | Prediction (classification or regression) with simple node models. | Unbiased variable selection and partitioning based on statistical significance. | Identifying subgroups with structurally different parametric models. |
| Splitting Logic | Greedy search for purity/SSE improvement. | Statistical tests of independence between predictors and response. | Statistical tests for parameter instability across partitioning variables. |
| Variable Bias | Can be biased towards variables with more potential split points. | Aims to be unbiased in variable selection. | Focuses on variables that cause parameter changes in the node models. |
| Stopping Rule | Grow full tree, then prune using cost-complexity and cross-validation. | Stops when no statistically significant splits are found (e.g., p-value threshold). | Stops when no significant parameter instability is detected. |
| Pruning | Essential (cost-complexity pruning). | Often not needed due to statistical stopping criterion. | Pruning can be applied, or statistical stopping criteria used. |
| Node Models | Constant value (majority class for classification, mean for regression). | Constant value (majority class for classification, mean for regression). | Parametric models (e.g., linear models, GLMs, survival models). |
| Statistical Basis | Heuristic (impurity reduction). | Formal statistical inference (permutation tests). | Formal statistical inference (parameter instability tests, M-fluctuation tests). |
| Output Insight | Decision rules leading to a prediction. | Decision rules with statistical backing for splits. | Tree structure showing subgroups where different model parameters apply. |

Other Rule-Based Models


§2.06: Generalized Additive Models and Boosting

Generalized Additive Models (GAM)

Explainable Boosting Machines (EBMs)

| Feature | Generalized Additive Models (GAMs) | Model-Based Boosting |
|---|---|---|
| Strengths | Interpretable, flexible, additive structure | Automatic feature selection, regularization |
| Weaknesses | Manual parameter tuning, no automatic interactions | Greedy selection can miss important features |
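
A GAM models the target as a sum of smooth per-feature shape functions, \(g(\mathbb{E}[y \mid x]) = \beta_0 + \sum_j f_j(x_j)\). A minimal sketch of this additive structure, using scikit-learn's spline basis expansion plus a ridge-penalized linear model as a stand-in for a dedicated GAM package such as mgcv or pygam:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=500)

# Additive model: each feature gets its own spline basis, and a penalized
# linear model combines the basis functions -> per-feature shape functions f_j
gam_like = make_pipeline(
    SplineTransformer(n_knots=10, degree=3),   # expands every column separately
    Ridge(alpha=1.0),
)
gam_like.fit(X, y)
print("R^2 on training data:", gam_like.score(X, y))
```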

Stage 1: Main Effects

We first initialize all main effect functions \(f_j^{(0)} = 0 \;\forall j\) and \(\hat{y}^{(0)} = 0\), and compute the initial pseudo-residuals \(\tilde{r}^{(0)}\). Then, for \(M\) iterations \(m = 1, \dots, M\) (where each iteration cycles through ALL features \(j = 1, \dots, p\)):

  1. Fit a shallow bagged tree \(T_j^{(m)}\) using only feature \(j\) as input and the current pseudo-residuals \(\tilde{r}^{(m-1)}\) as target.
  2. Update the shape function \(f_j^{(m)}(x_j) = f_j^{(m-1)}(x_j) + \eta \, T_j^{(m)}(x_j)\)
  3. Update the prediction \(\hat{y}^{(m)} = \sum_{j=1}^p f_j^{(m)}(x_j)\)
  4. Recompute the pseudo-residuals \(\tilde{r}^{(m)}\)

The final model consists of \(M\) shallow trees per feature:

\[\text{EBM Model } = \sum_{j=1}^p \sum_{m=1}^M \eta T_j^{(m)}(x_j) = \sum_{j=1}^p \hat{f}_j(x_j)\]
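
The following is a from-scratch toy version of this cyclic boosting loop for squared-error loss (so the pseudo-residuals are simply \(y - \hat{y}\)); it is a sketch of the procedure described above, not the actual implementation in the interpret library, and it omits the bagging of the per-feature trees:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_ebm_main_effects(X, y, M=500, eta=0.01, max_leaf_nodes=3):
    """Cyclic (round-robin) boosting of shallow one-feature trees."""
    n, p = X.shape
    y_hat = np.zeros(n)
    trees = [[] for _ in range(p)]            # trees[j] collects T_j^(1..M)

    for m in range(M):
        for j in range(p):                    # every pass visits ALL features
            residuals = y - y_hat             # pseudo-residuals for L2 loss
            tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
            tree.fit(X[:, [j]], residuals)    # only feature j as input
            y_hat += eta * tree.predict(X[:, [j]])
            trees[j].append(tree)
    return trees

def shape_function(trees_j, xj, eta=0.01):
    """f_j(x_j) = sum_m eta * T_j^(m)(x_j) -> a step function in x_j."""
    xj = np.asarray(xj).reshape(-1, 1)
    return eta * sum(t.predict(xj) for t in trees_j)
```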

Stage 2: Interaction Effects

Comparison of EBM and Model-Based Boosting

| Feature | EBM (Explainable Boosting Machine) | MB-Boost (Model-Based Boosting) |
|---|---|---|
| Base Learner | Bagged 2–4-leaf trees, one feature per tree $\Rightarrow$ step-function shape $f_j$ | User chooses component-wise learner (linear term, P-spline, tree, random effect, …) |
| Iteration Policy | Round-robin ($\forall j$) each boosting pass; tiny learning rate $\eta \approx 0.01$. | Greedy; update the single component that yields the largest loss reduction. |
| Regularisation | Many iterations $M$ (5–10k); early stopping via internal CV on out-of-bag samples; bagging further lowers variance. | Shrinkage $\nu \in (0,1]$; early stop by CV/AIC; component selection acts like an $L_0/L_1$ penalty $\Rightarrow$ sparsity. |
| Interactions | FAST ranks and selects top-$K$ interaction pairs, fitted as bivariate trees $\Rightarrow$ GA2M | Interactions are modeled only when the user supplies them. |
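
For reference, the EBM described here is available in the interpret package (interpretml). A minimal usage sketch, assuming the package is installed; the pairwise interaction terms of Stage 2 are controlled via the `interactions` argument:

```python
from interpret import show
from interpret.glassbox import ExplainableBoostingRegressor
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True, as_frame=True)

# Stage 1 fits one shape function per feature; `interactions` controls how
# many FAST-ranked pairwise terms are added on top (GA2M)
ebm = ExplainableBoostingRegressor(interactions=5)
ebm.fit(X, y)

# Global explanation: the learned per-feature shape functions and the
# selected interaction terms
show(ebm.explain_global())
```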