Feature effects indicate the change in prediction due to changes in feature values. This chapter explains the feature effect methods ICE curves, PD plots, and ALE plots.
Individual Conditional Expectation (ICE) plots are a model-agnostic method used to visualize how the prediction for a single observation changes as a specific subset of features varies, while all other features of that observation are held constant. They offer a local interpretation, showing the effect of a feature for an individual instance.
The construction of ICE curves involves the following steps for each observation $i$ and set of features of interest $x_S$:
Grid Point Generation: A set of grid values $\{x_S^{(1)}, x_S^{(2)}, \ldots, x_S^{(g)}\}$ is created for the features of interest. These grid points span the range of $x_S$. Common methods for selecting grid values are equidistant grids, random sampling from observed feature values, and quantiles of observed feature values, with the latter two preserving the marginal distribution. However, even these can create unrealistic data points, especially if features are interacting or correlated (for example, the combination of summer and a temperature of -1 could be an impossible observation).
Prediction and Visualization: For each grid value $x_S^{(k)}$, the prediction $\hat{f}(x_S^{(k)}, x_{-S}^{(i)})$ is computed with the remaining features held at their observed values. These predictions are then plotted against the grid values and connected to form the ICE curve for the $i$-th observation.
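The following is a minimal sketch of this construction in Python, assuming a prediction function `predict` that maps an $(n, p)$ feature matrix to an array of $n$ predictions and a data matrix `X`; the helper name `ice_curves` and the quantile grid are illustrative choices, not a fixed API.

```python
import numpy as np

def ice_curves(predict, X, feature_idx, n_grid=30):
    """Compute ICE curves for a single feature of interest.

    predict     : callable mapping an (n, p) array to an (n,) array of predictions
    X           : (n, p) data matrix of observed feature values
    feature_idx : column index of the feature of interest
    n_grid      : number of grid points (quantile grid over observed values)
    """
    # Quantile-based grid preserves the marginal distribution of the feature
    grid = np.quantile(X[:, feature_idx], np.linspace(0, 1, n_grid))
    curves = np.empty((X.shape[0], n_grid))
    for k, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature_idx] = value      # set the feature of interest to the grid value
        curves[:, k] = predict(X_mod)      # all other features stay at their observed values
    return grid, curves
```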
Partial Dependence Plots (PDPs) are a model-agnostic method that visualizes the average marginal effect of one or two features on the predicted outcome of a machine learning model. They provide a global interpretation by showing how, on average, the model’s prediction changes as the feature(s) of interest vary, while averaging out the effects of all other features.
The PD function is formally defined as the expectation of the model’s prediction $\hat{f}(x_S, x_{-S})$ with respect to the marginal distribution of the features not in $S$, i.e. $x_{-S}$.
\[f_{S, PD}(x_S) = E_{x_{-S}}[\hat{f}(x_S, x_{-S})] = \int \hat{f}(x_S, x_{-S}) \, dP(x_{-S})\]

In practice, this is estimated at the generated grid values by averaging the ICE curves point-wise: \(\hat{f}_{S, PD}(x_S) = \frac{1}{n}\sum_{i=1}^n \hat{f}(x_S, x_{-S}^{(i)})\)
PD plots average ICE curves and may thus obscure heterogeneous effects. It is therefore important to plot the ICE curves and the PD curve together to detect such effects.
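As a rough sketch, the PD curve can be obtained as the point-wise mean of the ICE curves and overlaid on them; this reuses the hypothetical `ice_curves` helper from above and assumes a fitted model `model` with a scikit-learn-style `predict` method.

```python
import matplotlib.pyplot as plt

grid, curves = ice_curves(model.predict, X, feature_idx=0)  # hypothetical fitted model and data
pdp = curves.mean(axis=0)                                   # point-wise average of the ICE curves

plt.plot(grid, curves.T, color="grey", alpha=0.3)           # individual ICE curves
plt.plot(grid, pdp, color="red", linewidth=2)               # PD curve
plt.xlabel("feature value")
plt.ylabel("prediction")
plt.show()
```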
Centered Individual Conditional Expectation (c-ICE) plots are a visualization technique used to enhance the interpretability of standard ICE plots, particularly when trying to discern heterogeneous effects or interactions between features. They address a common issue with regular ICE plots: differences in the starting prediction levels (intercepts) of the individual curves can make it difficult to compare their shapes.
Simply centering each curve at a fixed reference point solves this issue; often $x' = \min(x_S)$ is used. \(\hat{f}^{(i)}_{S, cICE}(x_S) = \hat{f}(x_S, x_{-S}^{(i)}) - \hat{f}(x', x_{-S}^{(i)}) = \hat{f}^{(i)}(x_S) - \hat{f}^{(i)}(x')\)
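A minimal sketch of this centering, again reusing the `grid` and `curves` arrays from the hypothetical helper above and taking the smallest grid value as the reference point $x'$:

```python
# Center each ICE curve at the first (smallest) grid value, i.e. x' = min(x_S)
c_curves = curves - curves[:, [0]]   # subtract each curve's prediction at x'
```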
Marginal effects (MEs) quantify changes in model predictions resulting from changes in one or more features. They are particularly useful when parameter-based interpretations (like coefficients in linear models) are not straightforward due to model complexity or interactions.
There are two main ways to compute marginal effects: derivative marginal effects (dME) and forward marginal effects (fME). Both measure the local effect of a feature on the prediction, but they differ in their assumptions and robustness.
Definition: The derivative marginal effect of feature $x_j$ at point $x$ is the partial derivative of the prediction function $\hat{f}$ with respect to $x_j$:
\[dME_j(x) = \frac{\partial \hat{f}(x)}{\partial x_j} \approx \frac{\hat{f}(x_1, \ldots, x_j + h_j, \ldots, x_p) - \hat{f}(x_1, \ldots, x_j - h_j, \ldots, x_p)}{2 h_j}\]

Definition: The forward marginal effect of feature $x_j$ at point $x$ with step size $h_j$ is:
\[fME_j(x, h_j) = \hat{f}(x_1, \ldots, x_j + h_j, \ldots, x_p) - \hat{f}(x)\]

For categorical features, traditionally a reference category was fixed and the ME was calculated by keeping all other features constant while changing the category. The definition of the fME mirrors the continuous case:
\[fME_j(x; x_j^{new}) = \hat{f}(x_j^{new}, x_{-j}) - \hat{f}(x_j, x_{-j})\]

The Average Marginal Effect (AME) captures the global overall effect: it is simply the average of marginal effects across all individual observations. You can use either fMEs or dMEs (the formula below is for fMEs):
\[AME_s = \frac{1}{n} \sum_{i=1}^n \left[ \hat{f}(x^{(i)}_s + h_s, x^{(i)}_{-s}) - \hat{f}(x^{(i)}) \right]\]

MEs provide a single scalar number to quantify the effect, and simultaneously perturbing multiple features still yields a scalar. Moreover, the effect is measured at the actual data points, captures interactions without any assumptions, provides a non-linearity measure, and is computationally cheap.
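The sketch below illustrates dMEs (via central differences), fMEs, and the AME for a single numeric feature; the function names and the `predict` convention (an $(n, p)$ array mapped to $n$ predictions) are assumptions for illustration.

```python
import numpy as np

def dME(predict, x, j, h=1e-4):
    """Derivative marginal effect of feature j at point x (central differences)."""
    x_plus, x_minus = x.copy(), x.copy()
    x_plus[j] += h
    x_minus[j] -= h
    return (predict(x_plus[None, :])[0] - predict(x_minus[None, :])[0]) / (2 * h)

def fME(predict, x, j, h):
    """Forward marginal effect of feature j at point x with step size h."""
    x_step = x.copy()
    x_step[j] += h
    return predict(x_step[None, :])[0] - predict(x[None, :])[0]

def AME(predict, X, j, h):
    """Average marginal effect: mean of the forward MEs over all observations."""
    return np.mean([fME(predict, X[i], j, h) for i in range(X.shape[0])])
```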
Compare a neural network to a correctly specified polynomial model fit to the above data. Although both have similar performance, note that the prediction surface of the neural network looks very different from that of the correctly specified quadratic. The neural network will still produce predictions in the outside regions, but it may behave somewhat strangely there. This is common for neural networks, since they are highly complex and flexible enough to learn intricate relations within the data distribution.
It is important to note that PD plots are not wrong. PD plots describe the feature effect that the model has estimated. Since the model also makes predictions for out-of-distribution data, the feature effects captured here reflect that. However, if our interest is in interpreting the feature effects within the data distribution, then the PD plot is the wrong tool. PD plots give us an estimate of the feature effect that is true to the model, not to the data.
One might argue that if we look at a neighbourhood of the observed points instead of the full marginal distribution, i.e. at the conditional distribution, then that should suffice and give us the feature effect present in the data:
\[E_{x_2|x_1}[\hat{f}(x_1, x_2) \mid x_1] \approx \hat{f}_{1,M}(x_1) = \frac{1}{|N(x_1)|} \sum_{i \in N(x_1)} \hat{f}(x_1, x_2^{(i)})\]

where $N(x_1)$ is the index set of all points whose $x_1$ value lies within a fixed distance of $x_1$.
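A minimal sketch of this conditional ("M-plot"-style) estimate, under the same assumed `predict` convention as above; the neighbourhood radius is a free parameter chosen for illustration.

```python
import numpy as np

def m_plot(predict, X, j, grid, radius):
    """Average predictions over observations whose feature j lies near each grid value."""
    values = []
    for v in grid:
        in_nbhd = np.abs(X[:, j] - v) <= radius     # N(x_1): points within a fixed distance
        if not in_nbhd.any():
            values.append(np.nan)
            continue
        X_nbhd = X[in_nbhd].copy()
        X_nbhd[:, j] = v                            # fix the feature of interest to the grid value
        values.append(predict(X_nbhd).mean())
    return np.array(values)
```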
However, this creates an omitted variable bias. When we condition on $x_1$, we are not isolating its pure effect; we are capturing the combined effect of $x_1$ and all features that are correlated with it. When we fix $x_1 = c$ and look at nearby observations, we implicitly also constrain $x_2$ to values around $E[x_2 \mid x_1 = c]$.
ALE attempts to solve this problem by taking partial derivatives (local effects) of the prediction function with respect to the feature of interest and integrating (accumulating) them with respect to the same feature.
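A rough sketch of a first-order ALE estimate under the same assumptions as before (a `predict` function and a data matrix `X`); the quantile bin grid and the count-weighted centering used here are illustrative simplifications, not the only possible estimator.

```python
import numpy as np

def ale_1d(predict, X, j, n_bins=20):
    """First-order ALE estimate for feature j on a quantile-based bin grid."""
    z = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1))       # bin boundaries
    bins = np.clip(np.digitize(X[:, j], z[1:-1]), 0, n_bins - 1)  # bin index per observation
    local_effects = np.zeros(n_bins)
    counts = np.zeros(n_bins)
    for k in range(n_bins):
        in_bin = bins == k
        if not in_bin.any():
            continue
        X_lo, X_hi = X[in_bin].copy(), X[in_bin].copy()
        X_lo[:, j], X_hi[:, j] = z[k], z[k + 1]
        # local effect: average prediction change when moving across the bin,
        # computed only for observations that actually fall into the bin
        local_effects[k] = np.mean(predict(X_hi) - predict(X_lo))
        counts[k] = in_bin.sum()
    ale = np.concatenate(([0.0], np.cumsum(local_effects)))       # accumulate local effects
    # center: subtract the count-weighted mean of the bin-midpoint effects
    midpoints = (ale[:-1] + ale[1:]) / 2
    ale -= np.average(midpoints, weights=counts)
    return z, ale                                                 # effects at the boundaries z
```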