Chapter 4: Functional Decomposition

This chapter focuses on understanding how ML models make predictions by breaking down their behavior into simpler, interpretable components. This is achieved through the concept of Functional Decomposition, with specific methods like Classical Functional ANOVA (fANOVA) and Friedman's H-Statistic.

§4.01: Introduction to Functional Decomposition

\[\hat{f}(x) = \sum_{S \subseteq \{1,...,p\}} g_S(x_S)\]

where each $g_S(x_S)$ depends only on the subset of features indexed by $S$. The term for the empty subset is the constant, aka the intercept; singletons are the main effects; and larger sets with $|S| = k > 1$ correspond to the $k$-th order interaction effects. This yields an additive model that exactly equals the original function, so each term has an immediate interpretation.

Example: Additive Decomposition. Consider the function $\hat{f}(x_1, x_2) = 4 - 2x_1 + 0.3 e^{x_2} + |x_1| x_2$. This function can be decomposed as follows (and verified numerically in the sketch below):
  • Intercept: $g_{\phi} = 4$
  • Main effect of $x_1$: $g_1(x_1) = -2x_1$
  • Main effect of $x_2$: $g_2(x_2) = 0.3e^{x_2}$
  • Interaction effect: $g_{1,2}(x_1, x_2) = |x_1| x_2$
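
To make the example concrete, here is a minimal Python sketch that evaluates the four components and checks that they sum exactly to $\hat{f}$ at random inputs. The function names (f_hat, g_empty, g_1, g_2, g_12) are our own, chosen purely for illustration.

```python
import numpy as np

# Components of the example decomposition (illustrative names, not a library API)
def g_empty(x1, x2):
    return np.full_like(np.asarray(x1, dtype=float), 4.0)  # intercept

def g_1(x1, x2):
    return -2.0 * x1                                        # main effect of x1

def g_2(x1, x2):
    return 0.3 * np.exp(x2)                                 # main effect of x2

def g_12(x1, x2):
    return np.abs(x1) * x2                                  # pure interaction

def f_hat(x1, x2):
    return 4.0 - 2.0 * x1 + 0.3 * np.exp(x2) + np.abs(x1) * x2

# The four components reproduce the original prediction function exactly.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=1_000), rng.normal(size=1_000)
total = g_empty(x1, x2) + g_1(x1, x2) + g_2(x1, x2) + g_12(x1, x2)
assert np.allclose(total, f_hat(x1, x2))
```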

Problems

  1. While powerful for interpretability, calculating these decompositions can be extremely difficult and often infeasible, especially for models with many features $p$, as there are $2^p$ possible terms.
  2. Furthermore, the definition of functional decomposition is not unique, meaning multiple valid decompositions can exist for the same function, making it challenging to find a “meaningful” one. For example, the entire function $\hat{f}(x) $ could trivially be assigned to the highest-order interaction term $g_{1,…,p}(x_1,…,x_p)$, with all other terms being zero.
  3. Decomposing a decision tree illustrates this: all lower-order effects can end up hidden inside higher-order terms, because the tree can be represented purely through the interactions defined at its final leaf nodes.

§4.02: Functional ANOVA (fANOVA)

For classical functional ANOVA, the components are defined by integrating out the remaining features. Concretely, we obtain the following:

  1. For the intercept, the subset $S = \phi$ is the empty set, and therefore $-S$ contains all the features. The intercept can also be interpreted as the expectation of the prediction function when we assume that all features are uniformly distributed.

    \[g_{\phi} = \int_{X} \hat{f}(x)\, dP(x) = E_X[\hat{f}(x)]\]
  2. The first-order effect:

    \[g_1(x_1) = \int_{x_{-1}} \hat{f}(x_1, x_{-1})\, dP(x_{-1}) - g_{\phi} = E_{X_{-1}}[\hat{f}(x) \mid X_1 = x_1] - g_{\phi}\]
  3. The second-order effect (a Monte Carlo sketch of these integrals follows below):

    \[g_{1,2}(x_1, x_2) = \int_{x_{-\{1,2\}}} \hat{f}(x_{\{1,2\}}, x_{-\{1,2\}})\, dP(x_{-\{1,2\}}) - g_1(x_1) - g_2(x_2) - g_{\phi}\]
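
These integrals can be approximated by Monte Carlo averages. The following is a minimal sketch, assuming two independent features sampled uniformly from $[0, 1]$ and reusing the example function from §4.01; the sample size and evaluation points are arbitrary illustration choices, not part of the fANOVA definition.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
# Assumption for this sketch: independent features, uniform on [0, 1]
# (the classical fANOVA setting).
X1, X2 = rng.uniform(size=n), rng.uniform(size=n)

def f_hat(x1, x2):
    return 4.0 - 2.0 * x1 + 0.3 * np.exp(x2) + np.abs(x1) * x2

# Intercept: expectation of the prediction over the full feature distribution.
g_empty = f_hat(X1, X2).mean()

# First-order effects: integrate out the other feature, then center.
def g_1(x1):
    return f_hat(x1, X2).mean() - g_empty

def g_2(x2):
    return f_hat(X1, x2).mean() - g_empty

# Second-order effect: with p = 2 there is nothing left to integrate out,
# so we only subtract the intercept and both main effects.
def g_12(x1, x2):
    return f_hat(x1, x2) - g_1(x1) - g_2(x2) - g_empty

print(g_empty, g_1(0.5), g_2(0.5), g_12(0.5, 0.5))
```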

Advantages of fANOVA

fANOVA has several properties that make it very attractive:

  1. Zero Mean / Vanishing Conditions: $\int g_S(x_S)\, dP(x_S) = 0$ for each $S \neq \phi$.
    • All effects or interactions are centered around zero. As a consequence, the interpretation at a position $x$ is relative to the centered prediction and not the absolute prediction.
    • $E_{X_V}[g_S(x_S)] = \int g_S(x_S)\, dP(x_V) = 0$ for every non-empty $V \subsetneq S$, i.e., $g_S$ contains no lower-order effects but only the pure interaction.
    • These vanishing conditions characterize fANOVA: the fANOVA components satisfy them, and conversely, any functional decomposition that satisfies the vanishing conditions coincides with the fANOVA decomposition.
  2. Orthogonality: $\int g_S(x_S)\, g_V(x_V)\, dP(x) = 0$ for $S \neq V$.
    • This implies that components do not share information. For example, the first-order effect of feature $x_1$ and the interaction component $g_{1,2}(x_1, x_2)$ are uncorrelated. Because of orthogonality, all components are “pure” in the sense that they do not mix effects.
    • The more interesting consequence arises for the orthogonality of hierarchical components, where the features of one component are a subset of those of another, for example the main effect of $x_1$ and the interaction between $x_1$ and $x_2$. In contrast, a two-dimensional partial dependence plot for $x_1$ and $x_2$ would contain four effects: the intercept, the two main effects of $x_1$ and $x_2$, and the interaction between them.
    • The functional ANOVA component $g_{1,2}(x_1, x_2)$, by contrast, contains only the pure interaction.
  3. Variance Decomposition: As a consequence of orthogonality, we can decompose the variance of $\hat{f}(x)$ into a sum of component variances without any covariance terms:

    \[\begin{align*} Var[\hat{f}(x)] &= Var[g_\phi + g_1(x_1) + g_2(x_2) + g_{1,2}(x_1, x_2) + ... + g_{1,...,p}(x)] \\ &= Var[g_\phi] + Var[g_1(x_1)] + ... + Var[g_{1,...,p}(x)] \end{align*}\]
  4. Sobol Index: Dividing the above equation by $Var[\hat{f}(x)]$ yields a sum of fractions, where each term is the fraction of the prediction variance explained by that component. This gives an importance measure called the Sobol index: $S_V = \frac{Var[g_V(x_V)]}{Var[\hat{f}(x)]}$ (estimated numerically in the sketch below).
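
Continuing the same illustrative setup as before (independent uniform features, the example function from §4.01, our own variable names), the variance decomposition and the resulting Sobol indices can be estimated as below; the outer evaluation sample is kept small only to keep the nested Monte Carlo loop cheap.

```python
import numpy as np

rng = np.random.default_rng(1)
n_inner, n_outer = 20_000, 2_000
X1, X2 = rng.uniform(size=n_inner), rng.uniform(size=n_inner)  # for inner expectations
x1, x2 = rng.uniform(size=n_outer), rng.uniform(size=n_outer)  # evaluation sample

def f_hat(a, b):
    return 4.0 - 2.0 * a + 0.3 * np.exp(b) + np.abs(a) * b

g_empty = f_hat(X1, X2).mean()
g1 = np.array([f_hat(v, X2).mean() for v in x1]) - g_empty  # main effect of x1
g2 = np.array([f_hat(X1, v).mean() for v in x2]) - g_empty  # main effect of x2
g12 = f_hat(x1, x2) - g1 - g2 - g_empty                     # pure interaction

# Orthogonality lets us split the prediction variance into component variances;
# each Sobol index is the fraction of variance explained by one component.
var_f = f_hat(x1, x2).var()
sobol = {"S_1": g1.var() / var_f, "S_2": g2.var() / var_f, "S_12": g12.var() / var_f}
print(sobol)  # the indices sum to approximately 1 (up to Monte Carlo error)
```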

§4.03: Friedman's H-Statistic


§4.04: Further Methods