This chapter focuses on understanding how ML models make predictions by breaking down their behavior into simpler, interpretable components. This is achieved through the concept of Functional Decomposition, with specific methods like Classical Functional ANOVA (fANOVA) and Friedman's H-Statistic.
The idea of functional decomposition is to express the prediction function as a sum of components, where each component $g_S(x_S)$ depends only on the subset of features indexed by $S$. The empty subset corresponds to the constant, a.k.a. the intercept; singletons are the main effects; and larger sets with $|S| = n > 1$ correspond to the $n$-th order interaction effects. This yields an additive model that exactly equals the original function, so each term has an immediate interpretation.
A requirement for this approach is that the model prediction function $\hat{f}$ is square integrable. As with any functional decomposition, the functional ANOVA decomposes the function into components:
\[\hat{f}(x) = \sum_{S \subseteq \{1,...,p\}} g_S(x_S)\]where each component is defined as (with $V \subsetneqq S$, i.e. summing over all proper subsets of $S$):
\[\begin{align*} g_S(x_S) &= \hat{f}_{S, PD}(x_S) - \sum_{V \subsetneqq S} g_V(x_V)\\ &= E_{x_{-S}}[\hat{f}(x_S, x_{-S})] - \sum_{V \subsetneqq S} g_V(x_V) \\ &= \int \hat{f}(x_S, x_{-S}) \, dP(x_{-S}) - \sum_{V \subsetneqq S} g_V(x_V) \end{align*}\]In words: the expectation integrates $\hat{f}$ over all input features except $x_S$ (the partial dependence function), and then all lower-order components are subtracted. Now we can concretely obtain the following components.
For the intercept, the subset $S = \phi$ is the empty set and therefore $-S$ contains all the features. The intercept can also be interpreted as the expectation of the prediction function when we assume that all features are uniformly distributed:
\[g_{\phi} = \int_{X}\hat{f}(x) \, dP(X) = E_X [\hat{f}(x)]\]The first-order (main) effect of feature $j$ follows directly from the general definition, since the only proper subset of $\{j\}$ is the empty set:
\[g_j(x_j) = \hat{f}_{j, PD}(x_j) - g_\phi = E_{x_{-j}}[\hat{f}(x_j, x_{-j})] - g_\phi\]
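To make the definitions concrete, here is a small illustrative example (not from the original text): assume a toy prediction function $\hat{f}(x) = x_1 + x_2 + x_1 x_2$ with two independent features $x_1, x_2 \sim U(-1, 1)$. Then
\[\begin{align*} g_\phi &= E_X[\hat{f}(x)] = 0 \\ g_1(x_1) &= E_{x_2}[\hat{f}(x)] - g_\phi = x_1 \\ g_2(x_2) &= E_{x_1}[\hat{f}(x)] - g_\phi = x_2 \\ g_{1,2}(x_1, x_2) &= \hat{f}(x) - g_1(x_1) - g_2(x_2) - g_\phi = x_1 x_2 \end{align*}\]Each component recovers exactly the corresponding term of the toy function, and summing all components reproduces $\hat{f}$.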
In practice, calculating PD functions for all subsets is very expensive. Instead, they are estimated using Monte Carlo integration, i.e. a fixed grid of values $x_S^*$ is used and the integral is estimated as follows:
\[\hat{f}_{S, PD}(x_S^*)=E_{x_{-S}}[\hat{f}(x_S^*, X_{-S})] \approx \frac{1}{n} \sum_{i=1}^{n} \hat{f}(x_S^*, x_{-S}^{(i)})\]
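A minimal Python sketch of this Monte Carlo estimate; the model, data matrix, and function names are illustrative assumptions, not from the original text:

```python
import numpy as np

def pd_estimate(predict, X, S, x_S_star):
    """Monte Carlo estimate of the PD function f_hat_{S,PD} at a grid point x_S*."""
    # predict  : callable mapping an (n, p) array to an (n,) array of predictions
    # X        : (n, p) array of observed data, supplying the x_{-S}^{(i)} samples
    # S        : list of column indices forming the feature subset S
    # x_S_star : the values at which the features in S are fixed
    X_mod = X.copy()
    X_mod[:, S] = x_S_star          # fix x_S at the grid point in every row
    return predict(X_mod).mean()    # average over the sampled x_{-S}^{(i)}

# Example usage with a hypothetical fitted `model` and data matrix `X`:
# grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
# pd_curve = [pd_estimate(model.predict, X, [0], g) for g in grid]
```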
fANOVA has a few interesting properties that make it very attractive: the components satisfy zero-mean (vanishing) conditions and are mutually orthogonal. Variance Decomposition: As a consequence of orthogonality, we can decompose the variance into a sum without any covariance terms:
\[\begin{align*} Var[\hat{f}(x)] &= Var[g_\phi + g_1(x_1) + g_2(x_2) + g_{1,2}(x_1, x_2) + ... + g_{1,...,p}(x)] \\ &= Var[g_\phi] + Var[g_1(x_1)] + ... + Var[g_{1,...,p}(x)] \end{align*}\]If a feature $j$ has no interaction with any of the other features, we can express the prediction function as a sum of partial dependence functions, where the first summand depends only on $j$ and the second on all features except $j$. Similarly, a two-way PD function is just the sum of the two individual PD functions if there are no interactions (we use centered PD functions here):
\[\hat{f}_{jk, PD}^C(x_j, x_k) = \hat{f}_{j, PD}^C(x_j) + \hat{f}_{k, PD}^C(x_k)\]The H-statistic for a two-way interaction between features $j$ and $k$ (it can also be generalized to $n$-way interactions) is defined as:
\[\begin{align*} H_{jk}^2 &= \frac{Var[\hat{f}^C_{jk, PD}(x_j, x_k) - \hat{f}^C_{j, PD}(x_j) - \hat{f}^C_{k, PD}(x_k)]}{Var[\hat{f}^C_{jk, PD}(x_j, x_k)]} \\ &= \frac{\sum_{i=1}^n (\hat{f}^C_{jk, PD}(x_j^{(i)}, x_k^{(i)}) - \hat{f}^C_{j, PD}(x_j^{(i)})-\hat{f}^C_{k, PD}(x_k^{(i)}))^2}{\sum_{i=1}^n \hat{f}^C_{jk, PD}(x_j^{(i)}, x_k^{(i)})^2} \\ \end{align*}\]This quantifies the strength of the interaction: values close to 0 indicate a weak interaction, values close to 1 a strong one.
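A sketch of how $H^2_{jk}$ could be estimated in Python, using the observed data points both as the grid and as the background sample for the PD estimates (all names are illustrative assumptions):

```python
import numpy as np

def pd_at_points(predict, X, S):
    """Centered PD function of feature subset S, evaluated at each observed point."""
    n = X.shape[0]
    vals = np.empty(n)
    for i in range(n):
        X_mod = X.copy()
        X_mod[:, S] = X[i, S]            # fix x_S at observation i's own values
        vals[i] = predict(X_mod).mean()  # Monte Carlo average over the other features
    return vals - vals.mean()            # centering corresponds to the "C" superscript

def h_squared_2way(predict, X, j, k):
    """Sample estimate of H^2_jk for the interaction between features j and k."""
    pd_jk = pd_at_points(predict, X, [j, k])
    pd_j = pd_at_points(predict, X, [j])
    pd_k = pd_at_points(predict, X, [k])
    # numerator: variation of the two-way PD not explained by the two main effects
    # denominator: total variation of the two-way PD function
    return np.sum((pd_jk - pd_j - pd_k) ** 2) / np.sum(pd_jk ** 2)
```

Note that each call to `pd_at_points` makes $n$ model calls on $n$ rows each, so the estimate costs $O(n^2)$ predictions.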
Analogous to the two-way interaction strength, we can also measure the overall strength of interactions between feature $j$ and all other features, where we use the centered prediction function:
\[\begin{align*} H_{j}^2 &= \frac{Var[\hat{f}^C(x) - \hat{f}^C_{j, PD}(x_j) - \hat{f}^C_{-j, PD}(x_{-j})]}{Var[\hat{f}^C(x)]} \\ &= \frac{\sum_{i=1}^n (\hat{f}^C(x^{(i)}) - \hat{f}^C_{j, PD}(x_j^{(i)})-\hat{f}^C_{-j, PD}(x_{-j}^{(i)}))^2}{\sum_{i=1}^n \hat{f}^C(x^{(i)})^2} \end{align*}\]The H-statistic provides a general definition of interactions and an algorithm to compute them. However, for an interaction of order $k$ it needs approximately $2^k$ PD functions, which makes it very compute-intensive.
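The overall interaction strength $H_j^2$ can be sketched the same way, reusing the `pd_at_points` helper from the previous snippet (again an illustrative sketch under the same assumptions, not a reference implementation):

```python
def h_squared_overall(predict, X, j):
    """Sample estimate of H_j^2: interaction between feature j and all other features."""
    rest = [c for c in range(X.shape[1]) if c != j]
    preds = predict(X)
    f_c = preds - preds.mean()                # centered predictions f_hat^C(x^(i))
    pd_j = pd_at_points(predict, X, [j])      # centered PD of feature j
    pd_rest = pd_at_points(predict, X, rest)  # centered PD of all remaining features
    return np.sum((f_c - pd_j - pd_rest) ** 2) / np.sum(f_c ** 2)
```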
Standard fANOVA builds on PD functions, and since PD functions are not robust to correlated features, neither is fANOVA. If we integrate over a uniform distribution when in reality the features are dependent, we effectively create a new dataset that deviates from the joint distribution and extrapolates to unlikely combinations of feature values.
Generalized functional ANOVA is a decomposition that also works for dependent features. It relaxes the vanishing-conditions requirement, so the conditions no longer imply mutual orthogonality of all components, but only hierarchical orthogonality:
\[E_X[g_V(x_V) g_S(x_S)] = 0 \quad \forall V \subsetneqq S\]i.e. a component is only orthogonal to the components lower in the hierarchy (its proper subsets).
While it also provides a variance decomposition, it is difficult to estimate, computationally expensive, and requires a manual choice of weight function.
We can also construct a functional decomposition from ALE plots, which handle dependent features well and are computationally fast. However, the approach is theoretically more involved, its orthogonality properties are more complicated, and it provides no variance decomposition; still, it is a good alternative in practice.
Functional decompositions, if computed, offer a lot of insight into the model, including a complete analysis of all interactions. In practice, however, computing the full decomposition is often infeasible, since the number of components grows exponentially with the number of features.