This chapter focuses on understanding how ML models make predictions by breaking down their behavior into simpler, interpretable components. This is achieved through the concept of Functional Decomposition, with specific methods like Classical Functional ANOVA (fANOVA) and Friedman's H-Statistic.
The idea of functional decomposition is to express the prediction function as a sum of components, where each component $g_S(x_S)$ depends only on the subset of features indexed by $S$. The empty subset corresponds to the constant, a.k.a. the intercept; singletons are the main effects; and larger sets with $|S| = n > 1$ correspond to the $n$-th order interaction effects. This yields an additive model that exactly equals the original function, so each term has an immediate interpretation.
A requirement for this approach is that the model prediction function $\hat{f}$ is square integrable. As with any functional decomposition, the functional ANOVA decomposes the function into components:
\[\hat{f}(x) = \sum_{S \subseteq \{1,...,p\}} g_S(x_S)\]where each component is defined as (with $V \subsetneqq S$, i.e. summing over all proper subsets of $S$):
\[\begin{align*} g_S(x_S) &= \hat{f}_{S, PD}(x_S) - \sum_{V \subsetneqq S} g_V(x_V)\\ &= E_{x_{-S}}[\hat{f}(x_S, x_{-S})] - \sum_{V \subsetneqq S} g_V(x_V) \\ &= \int \hat{f}(x_S, x_{-S}) \, dP(x_{-S}) - \sum_{V \subsetneqq S} g_V(x_V) \end{align*}\]In words: the expectation integrates $\hat{f}$ over all input features except $x_S$ (the partial dependence function), and then all lower-order components are subtracted. Now we can concretely obtain the following components.
For the intercept, the subset $S = \phi$ is the empty set and therefore $-S$ contains all the features. The intercept can also be interpreted as the expectation of the prediction function when we assume that all features are uniformly distributed:
\[g_{\phi} = \int_{X}\hat{f}(x) \, dP(X) = E_X [\hat{f}(x)]\]The first-order (main) effect of feature $j$ follows directly from the general definition, since the only proper subset of $\{j\}$ is the empty set:
\[g_j(x_j) = \hat{f}_{j, PD}(x_j) - g_\phi = E_{x_{-j}}[\hat{f}(x_j, x_{-j})] - g_\phi\]
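To make the definitions concrete, here is a small illustrative example (not from the original text): assume a toy prediction function $\hat{f}(x) = x_1 + x_2 + x_1 x_2$ with two independent features $x_1, x_2 \sim U(-1, 1)$. Then
\[\begin{align*} g_\phi &= E_X[\hat{f}(x)] = 0 \\ g_1(x_1) &= E_{x_2}[\hat{f}(x)] - g_\phi = x_1 \\ g_2(x_2) &= E_{x_1}[\hat{f}(x)] - g_\phi = x_2 \\ g_{1,2}(x_1, x_2) &= \hat{f}(x) - g_1(x_1) - g_2(x_2) - g_\phi = x_1 x_2 \end{align*}\]Each component recovers exactly the corresponding term of the toy function, and summing all components reproduces $\hat{f}$.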
In practice, calculating PD functions for all subsets is very expensive. Instead, they are estimated using Monte Carlo integration, i.e. a fixed grid of values $x_S^*$ is used and the integral is estimated as follows:
\[\hat{f}_{S, PD}(x_S^*)=E_{x_{-S}}[\hat{f}(x_S^*, X_{-S})] \approx \frac{1}{n} \sum_{i=1}^{n} \hat{f}(x_S^*, x_{-S}^{(i)})\]
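A minimal Python sketch of this Monte Carlo estimate; the model, data matrix, and function names are illustrative assumptions, not from the original text:

```python
import numpy as np

def pd_estimate(predict, X, S, x_S_star):
    """Monte Carlo estimate of the PD function f_hat_{S,PD} at a grid point x_S*."""
    # predict  : callable mapping an (n, p) array to an (n,) array of predictions
    # X        : (n, p) array of observed data, supplying the x_{-S}^{(i)} samples
    # S        : list of column indices forming the feature subset S
    # x_S_star : the values at which the features in S are fixed
    X_mod = X.copy()
    X_mod[:, S] = x_S_star          # fix x_S at the grid point in every row
    return predict(X_mod).mean()    # average over the sampled x_{-S}^{(i)}

# Example usage with a hypothetical fitted `model` and data matrix `X`:
# grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
# pd_curve = [pd_estimate(model.predict, X, [0], g) for g in grid]
```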
fANOVA has a few interesting properties that make it very attractive: the components satisfy zero-mean (vanishing) conditions and are mutually orthogonal. Variance Decomposition: As a consequence of orthogonality, we can decompose the variance into a sum without any covariance terms:
\[\begin{align*} Var[\hat{f}(x)] &= Var[g_\phi + g_1(x_1) + g_2(x_2) + g_{1,2}(x_1, x_2) + ... + g_{1,...,p}(x)] \\ &= Var[g_\phi] + Var[g_1(x_1)] + ... + Var[g_{1,...,p}(x)] \end{align*}\]If a feature $j$ has no interaction with any of the other features, we can express the prediction function as a sum of partial dependence functions, where the first summand depends only on $j$ and the second on all features except $j$. Similarly, a two-way PD function is just the sum of the two individual PD functions if there are no interactions (we use centered PD functions here):
\[\hat{f}_{jk, PD}^C(x_j, x_k) = \hat{f}_{j, PD}^C(x_j) + \hat{f}_{k, PD}^C(x_k)\]The H-statistic for a two-way interaction between features $j$ and $k$ (it can also be generalized to $n$-way interactions) is defined as:
\[\begin{align*} H_{jk}^2 &= \frac{Var[\hat{f}^C_{jk, PD}(x_j, x_k) - \hat{f}^C_{j, PD}(x_j) - \hat{f}^C_{k, PD}(x_k)]}{Var[\hat{f}^C_{jk, PD}(x_j, x_k)]} \\ &= \frac{\sum_{i=1}^n (\hat{f}^C_{jk, PD}(x_j^{(i)}, x_k^{(i)}) - \hat{f}^C_{j, PD}(x_j^{(i)})-\hat{f}^C_{k, PD}(x_k^{(i)}))^2}{\sum_{i=1}^n \hat{f}^C_{jk, PD}(x_j^{(i)}, x_k^{(i)})^2} \\ \end{align*}\]This quantifies the strength of the interaction: values close to 0 indicate a weak interaction, values close to 1 a strong one.
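A sketch of how $H^2_{jk}$ could be estimated in Python, using the observed data points both as the grid and as the background sample for the PD estimates (all names are illustrative assumptions):

```python
import numpy as np

def pd_at_points(predict, X, S):
    """Centered PD function of feature subset S, evaluated at each observed point."""
    n = X.shape[0]
    vals = np.empty(n)
    for i in range(n):
        X_mod = X.copy()
        X_mod[:, S] = X[i, S]            # fix x_S at observation i's own values
        vals[i] = predict(X_mod).mean()  # Monte Carlo average over the other features
    return vals - vals.mean()            # centering corresponds to the "C" superscript

def h_squared_2way(predict, X, j, k):
    """Sample estimate of H^2_jk for the interaction between features j and k."""
    pd_jk = pd_at_points(predict, X, [j, k])
    pd_j = pd_at_points(predict, X, [j])
    pd_k = pd_at_points(predict, X, [k])
    # numerator: variation of the two-way PD not explained by the two main effects
    # denominator: total variation of the two-way PD function
    return np.sum((pd_jk - pd_j - pd_k) ** 2) / np.sum(pd_jk ** 2)
```

Note that each call to `pd_at_points` makes $n$ model calls on $n$ rows each, so the estimate costs $O(n^2)$ predictions.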
Analogous to the two-way interaction strength, we can also measure the overall strength of interactions between feature $j$ and all other features, where we use the centered prediction function:
\[\begin{align*} H_{j}^2 &= \frac{Var[\hat{f}^C(x) - \hat{f}^C_{j, PD}(x_j) - \hat{f}^C_{-j, PD}(x_{-j})]}{Var[\hat{f}^C(x)]} \\ &= \frac{\sum_{i=1}^n (\hat{f}^C(x^{(i)}) - \hat{f}^C_{j, PD}(x_j^{(i)})-\hat{f}^C_{-j, PD}(x_{-j}^{(i)}))^2}{\sum_{i=1}^n \hat{f}^C(x^{(i)})^2} \end{align*}\]The H-statistic provides a general definition of interactions and an algorithm to compute them. However, for an interaction of order $k$ it needs approximately $2^k$ PD functions, which makes it very compute-intensive.
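The overall interaction strength $H_j^2$ can be sketched the same way, reusing the `pd_at_points` helper from the previous snippet (again an illustrative sketch under the same assumptions, not a reference implementation):

```python
def h_squared_overall(predict, X, j):
    """Sample estimate of H_j^2: interaction between feature j and all other features."""
    rest = [c for c in range(X.shape[1]) if c != j]
    preds = predict(X)
    f_c = preds - preds.mean()                # centered predictions f_hat^C(x^(i))
    pd_j = pd_at_points(predict, X, [j])      # centered PD of feature j
    pd_rest = pd_at_points(predict, X, rest)  # centered PD of all remaining features
    return np.sum((f_c - pd_j - pd_rest) ** 2) / np.sum(f_c ** 2)
```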
Standard fANOVA builds on PD functions, and since PD functions are not robust to correlated features, neither is fANOVA. If we integrate over a uniform distribution when in reality the features are dependent, we effectively create a new dataset that deviates from the joint distribution and extrapolates to unlikely combinations of feature values.
Generalized functional ANOVA is a decomposition that also works for dependent features. It relaxes the vanishing-conditions requirement, so the conditions no longer imply mutual orthogonality of all components, but only hierarchical orthogonality:
\[E_X[g_V(x_V) g_S(x_S)] = 0 \quad \forall V \subsetneqq S\]i.e. a component is only orthogonal to the components lower in the hierarchy (its proper subsets).
While it also provides a variance decomposition, it is difficult to estimate, computationally expensive, and requires a manual choice of weight function.
We can also construct a functional decomposition from ALE plots, which handle dependent features well and are computationally fast. However, the approach is theoretically more involved, its orthogonality properties are more complicated, and it provides no variance decomposition; still, it is a good alternative in practice.
Functional decompositions, if computed, offer a lot of insight into the model, including a complete analysis of all interactions. In practice, however, computing the full decomposition is often infeasible, since the number of components grows exponentially with the number of features.