Some machine learning models are already inherently interpretable, e.g. simple LMs, GLMs, GAMs and rule-based models. These models are briefly summarized and their interpretation clarified.
The most straightforward approach to achieving interpretability is to use inherently interpretable models. There are certain classes of models that are deemed inherently interpretable like linear models, additive models, decision trees, rule-based learning, model/component-based boosting, etc.
Pros: For such models, model-agnostic methods are often not required, which eliminates a source of error. Furthermore, they are often simple and fast to train. Some classes, like GLMs, can estimate monotonic effects. Since many people from different domains are familiar with interpretable models, using them increases trust and facilitates communication of results.
Cons: These models often require strong assumptions about the data (e.g. normally distributed errors in linear regression, or a linear structure when the underlying relationship is quadratic). When these assumptions are violated, the models may perform poorly. Inherently interpretable models may also be hard to interpret in practice, e.g. a linear model with many features and interactions, or a very deep decision tree. Furthermore, due to their limited flexibility they may struggle to model complex relationships.
An important thing to remember is that inherently interpretable models do not provide all types of explanations out of the box. For example, counterfactual explanations are still useful even for linear models and decision trees.
Whilst some argue that interpretable models should be preferred to models that require post-hoc analysis, they often demand considerable effort in data pre-processing and/or manual feature engineering. It is also hard to achieve good performance on data where end-to-end learning is crucial: manual feature extraction for image/text/audio data often loses information and therefore hurts performance.
The Linear Regression Model is given by:
\[y = \theta_0 + \theta_1 x_1 + ... + \theta_p x_p + \epsilon = x^T \theta + \epsilon\]where the model consists of $p+1$ weights due to the intercept $\theta_0$ (in the vector notation, $x$ is augmented with a constant 1 for the intercept).
The feature importance is given by the $t-$statistic value:
\[| t_{\hat{\theta_j}} | = | \frac{\hat{\theta_j}}{SE(\hat{\theta_j})}|\]where high $t-$ values indicate important (significant) features.
The $p-$value is the probability of obtaining a more extreme test statistic under $H_0: \theta_j = 0$, i.e. assuming feature $j$ is “useless” (not significant). A high $|t|$ corresponds to a small $p-$value.
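As a minimal sketch (simulated data, not part of the original example), the quantities above can be read off a fitted linear model with statsmodels:

```python
# Fit a linear regression and read off weights, standard errors,
# t-statistics and p-values for H_0: theta_j = 0.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))                              # two features x_1, x_2
y = 3.0 + 2.0 * X[:, 0] + rng.normal(scale=1.0, size=n)  # x_2 is "useless"

X_design = sm.add_constant(X)          # adds the intercept column (theta_0)
model = sm.OLS(y, X_design).fit()

print(model.params)    # estimated weights theta_hat
print(model.bse)       # standard errors SE(theta_hat)
print(model.tvalues)   # t = theta_hat / SE(theta_hat)
print(model.pvalues)   # p-values; small for x_1, large for x_2
```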
The standard linear regression model can be extended by including higher-order effects $(x_j^2)$ or interaction effects $(x_i \cdot x_j)$, each with its own weight. Both make the model more flexible but also less interpretable. Unlike with flexible ML models (e.g. neural networks), we need to perform the feature engineering ourselves and specify all effects we want to model. The marginal effect of a feature can then no longer be read off a single weight.
Adding a quadratic effect for temperature, for example, makes the interpretation non-linear, and the effect of temperature on bike rentals is now determined by two coefficients:
\[\theta_{temp} x_{temp} + \theta_{temp^2} x_{temp}^2\]Adding both quadratic and interaction effects makes the model even harder to interpret.
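A short sketch of how such effects have to be specified manually, here with the statsmodels formula API; the column names `temp` and `hum` and the simulated data are hypothetical stand-ins for the bike-rental example:

```python
# Explicitly add a quadratic effect (I(temp**2)) and an interaction effect
# (temp:hum); each term receives its own weight.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"temp": rng.uniform(0, 30, 300),
                   "hum": rng.uniform(20, 90, 300)})
df["rentals"] = (50 + 10 * df["temp"] - 0.3 * df["temp"] ** 2
                 + 0.05 * df["temp"] * df["hum"]
                 + rng.normal(scale=5, size=300))

fit = smf.ols("rentals ~ temp + I(temp**2) + temp:hum", data=df).fit()
print(fit.params)  # the effect of temp is now spread over several weights
```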
GLMs extend LMs by keeping the linear score but allowing response distributions from the exponential family, where a link function $g$ connects the linear score/predictor $x^T \theta$ to the expectation of the conditional distribution, i.e.:
\[g(E(y | x)) = x^T \theta \Leftrightarrow E(y|x) = g^{-1}(x^T \theta)\]For the LM, $g$ is simply the identity function.
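A minimal sketch of a GLM with a non-identity link, assuming simulated count data: a Poisson regression with the (default) log link, so that $E(y|x) = \exp(x^T \theta)$.

```python
# GLM = choice of exponential-family distribution + link function g.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(500, 1)))
mu = np.exp(X @ np.array([0.5, 1.2]))      # E(y|x) = g^{-1}(x^T theta), g = log
y = rng.poisson(mu)

glm = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(glm.params)   # weights act on the scale of the linear predictor x^T theta
```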
Logistic regression is equivalent to the GLM with a Bernoulli-distributed response and the logit link function (whose inverse is the sigmoid).
\[g(x) = \log(\frac{x}{1 - x}) \Rightarrow g^{-1}(x) = \frac{1}{1 + e^{-x}}\]The logistic regression models the probabilities for binary classification through:
\[\pi(x) = E(y | x) = P(y=1| x) = g^{-1}(x^T \theta) = \frac{1}{1 + e^{-x^T \theta}}\]If we rearrange the terms and solve for $x^T \theta$, we see that the log-odds are linear, i.e.
\[\log(\frac{\pi(x)}{1-\pi(x)}) = x^T \theta\]This means that changing $x_j$ by one unit changes the log-odds of class 1 versus class 0 by $\theta_j$. However, interpretation on the odds scale is more common: the odds are given by $\frac{\pi(x)}{1-\pi(x)} = e^{x^T \theta}$. If we now compare the odds when $x_j$ is incremented by 1 to the original odds, the ratio is:
\[\frac{odds_{x_j + 1}}{odds} = \frac{e^{\theta_0 + ... + \theta_j (x_j + 1) + ...}}{e^{\theta_0 + ... + \theta_j x_j + ...}}= e^{\theta_j}\]Thus, increasing $x_j$ by one unit multiplies the odds by a factor of $e^{\theta_j}$, i.e. the odds ratio is $e^{\theta_j}$.
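A small sketch (simulated data) of this interpretation in practice: fit a logistic regression and exponentiate the coefficients to move from the log-odds scale to the odds scale.

```python
# Coefficients theta_j live on the log-odds scale; exp(theta_j) is the
# multiplicative change in the odds when x_j increases by one unit.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(400, 2)))
p = 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.0, 0.3]))))  # pi(x) = sigmoid(x^T theta)
y = rng.binomial(1, p)

logit_fit = sm.Logit(y, X).fit(disp=0)
print(logit_fit.params)           # log-odds scale (theta_j)
print(np.exp(logit_fit.params))   # odds scale (e^{theta_j})
```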
Note how logistic regression always gives rise to a linear classifier: the decision boundary is a $p-$dimensional hyperplane. A threshold $t$ is set such that $x$ is classified as 1 if $\pi(x) > t$ and as 0 otherwise; since $g^{-1}$ is monotonic, $\pi(x) > t$ is equivalent to the linear condition $x^T \theta > \log(\frac{t}{1-t})$.
The main idea of decision trees is to partition the data into subsets based on cut-off values in the features (found by greedily minimizing a split criterion, e.g. in the CART algorithm) and to predict a constant $c_m$ (e.g. the mean outcome) in each leaf node $R_m$:
\[\hat{f}(x) = \sum_{m=1}^M c_m 1\{ x \in R_m \}\]CART is a non-parametric decision tree learning technique that can produce classification or regression trees.
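A brief sketch of a CART-style regression tree with scikit-learn on simulated data; printing the tree shows the learned cut-offs and the constant $c_m$ predicted in each leaf.

```python
# Fit a shallow regression tree and print its partition as decision rules.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.where(X[:, 0] > 0, 2.0, -1.0) + rng.normal(scale=0.3, size=300)

tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=20).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))  # cut-offs and leaf constants
```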
Conditional Inference Trees (ctree) are a type of decision tree that uses a statistical framework based on conditional inference procedures for recursive partitioning. This approach aims to avoid the variable selection bias (favoring variables with more potential split points) present in algorithms like CART.
Model-Based Recursive Partitioning (mob) is an extension of recursive partitioning that allows for fitting parametric models (like lm, glm, etc) in the terminal nodes of the tree. The partitioning is done based on finding subgroups in the data that exhibit statistically significant differences in the parameters of these models.
Feature | CART | ctree (Conditional Inference Trees) | mob (Model-Based Recursive Partitioning) |
---|---|---|---|
Primary Goal | Prediction (classification or regression) with simple node models. | Unbiased variable selection and partitioning based on statistical significance. | Identifying subgroups with structurally different parametric models. |
Splitting Logic | Greedy search for purity/SSE improvement. | Statistical tests of independence between predictors and response. | Statistical tests for parameter instability across partitioning variables. |
Variable Bias | Can be biased towards variables with more potential split points. | Aims to be unbiased in variable selection. | Focuses on variables that cause parameter changes in the node models. |
Stopping Rule | Grow full tree, then prune using cost-complexity and cross-validation. | Stops when no statistically significant splits are found (e.g., p-value threshold). | Stops when no significant parameter instability is detected. |
Pruning | Essential (cost-complexity pruning). | Often not needed due to statistical stopping criterion. | Pruning can be applied, or statistical stopping criteria used. |
Node Models | Constant value (majority class for classification, mean for regression). | Constant value (majority class for classification, mean for regression). | Parametric models (e.g., linear models, GLMs, survival models). |
Statistical Basis | Heuristic (impurity reduction). | Formal statistical inference (permutation tests). | Formal statistical inference (parameter instability tests, M-fluctuation tests). |
Output Insight | Decision rules leading to a prediction. | Decision rules with statistical backing for splits. | Tree structure showing subgroups where different model parameters apply. |
GAMs extend (G)LMs by allowing non-linear relationships between some or all features and the outcome variable, addressing the limitations of linear models. The fundamental idea behind GAMs is to replace the linear terms $\beta_j x_j$ found in (G)LMs with flexible, smooth functions $f_j(x_j)$ of the features. This allows the model to capture non-linear effects while maintaining an additive structure. The general form of a GAM can be expressed as:
\[g(E[y \mid x]) = \beta_0 + f_1(x_1) + f_2(x_2) + ...\]where $g$ is the link function and $f_j$ is a spline of the $j^{th}$ feature. These splines have smoothing parameters that control their flexibility.
This structure preserves the additive nature of the model, meaning the effect of each feature on the prediction is independent of the other features, which simplifies interpretation. However, GAMs can also be extended to include selected pairwise interactions, which must be specified manually. Allowing arbitrary interactions between features makes the model more complex and moves it into the territory of non-parametric ML models, which are no longer really interpretable.
The nice thing about the additive structure is that the component effects can be interpreted directly and do not require additional analysis (like PDPs).
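A minimal sketch, assuming the optional pygam package is available and using simulated data: a GAM with one smooth term per feature, where each fitted smooth $f_j$ can be inspected directly without any post-hoc method.

```python
# Fit a GAM with spline terms s(0), s(1) and evaluate each fitted smooth f_j
# on a grid; because the model is additive, these curves are the interpretation.
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(5)
X = rng.uniform(0, 1, size=(400, 2))
y = np.sin(2 * np.pi * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=400)

gam = LinearGAM(s(0) + s(1)).fit(X, y)     # s(j): spline term for feature j
for j, term in enumerate(gam.terms):
    if term.isintercept:
        continue
    XX = gam.generate_X_grid(term=j)
    print(j, gam.partial_dependence(term=j, X=XX)[:3])  # values of f_j on a grid
```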