Chapter 5: Shapley

Shapley values originate from classical game theory and aim to fairly divide a payout among players. This section gives a brief explanation of Shapley values in game theory, followed by an adaptation to IML that results in the method SHAP.

§5.01: Shapley Values
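
The game-theoretic idea can be sketched in a few lines: a player's Shapley value is their marginal contribution to the payout, averaged over all orders in which the players can join the coalition. The two-player game `v` below is a made-up example, not from the text.

```python
from itertools import permutations

def shapley_values(players, payout):
    """Shapley value of each player: marginal contribution to the payout,
    averaged over all orderings in which the coalition can form."""
    values = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = set()
        for p in order:
            # Marginal contribution of p to the coalition built so far.
            marginal = payout(coalition | {p}) - payout(coalition)
            values[p] += marginal / len(orders)
            coalition.add(p)
    return values

# Hypothetical game: each player is worth 10 alone, 30 together (synergy).
def v(S):
    if S == {"A", "B"}:
        return 30.0
    return 10.0 * len(S)

print(shapley_values(["A", "B"], v))
```

By symmetry each player receives 15, and the two values sum to the total payout of 30, illustrating the fair division Shapley values are designed for.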



§4.02: Functional ANOVA (fANOVA)

Now we can concretely obtain the following components of the fANOVA decomposition:

  1. The intercept: the subset $S = \phi$ is the empty set and therefore $-S$ contains all the features. The intercept can also be interpreted as the expectation of the prediction function when we assume that all features are uniformly distributed.

    \[g_{\phi} = \int_{X} \hat{f}(x) \, dP(X) = E_X[\hat{f}(x)]\]
  2. The first-order effect:

\[g_1(x_1) = \int_{X_{-1}} \hat{f}(x_1, x_{-1}) \, dP(X_{-1}) - g_{\phi} = E_{X_{-1}}[\hat{f}(x) \mid X_1 = x_1] - g_{\phi}\]
  3. The second-order effect:
\[g_{1,2}(x_1, x_2) = \int_{X_{-\{1,2\}}} \hat{f}(x_{1,2}, x_{-\{1,2\}}) \, dP(X_{-\{1,2\}}) - g_1(x_1) - g_2(x_2) - g_{\phi}\]
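
These integrals can be estimated by simple Monte Carlo integration when the features are independent and uniformly distributed, as assumed for the intercept above. The prediction function `f_hat` below is a made-up two-feature example chosen so the components can be checked by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical prediction function (an assumption for illustration):
# a linear main effect, a quadratic main effect, and an interaction.
def f_hat(x1, x2):
    return 2.0 * x1 + x2**2 + x1 * x2

# Monte Carlo sample, assuming independent uniform features on [-1, 1].
n = 200_000
s1 = rng.uniform(-1, 1, n)
s2 = rng.uniform(-1, 1, n)

# Intercept g_phi: expectation of the prediction over all features.
g_phi = f_hat(s1, s2).mean()

# First-order effects: integrate out the other feature, subtract g_phi.
def g1(x1):
    return f_hat(x1, s2).mean() - g_phi

def g2(x2):
    return f_hat(s1, x2).mean() - g_phi

# Second-order effect: subtract both main effects and the intercept.
def g12(x1, x2):
    return f_hat(x1, x2) - g1(x1) - g2(x2) - g_phi

print(g_phi)        # ~ 1/3, contributed by the x2^2 term
print(g1(0.5))      # ~ 1.0, the pure effect 2 * x1
print(g12(0.5, 0.5))  # ~ 0.25, the pure interaction x1 * x2
```

For this `f_hat` the decomposition can be verified analytically: $g_1(x_1) = 2x_1$, $g_2(x_2) = x_2^2 - 1/3$, and $g_{1,2}(x_1, x_2) = x_1 x_2$, so the second-order component really contains only the pure interaction.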

Advantages of fANOVA

fANOVA has a few interesting properties that make it very attractive:

  1. Zero Mean / Vanishing Conditions: $\int g_S(x_S) \, dP(x_S) = 0$ for each $S \neq \phi$.
    • All effects or interactions are centered around zero. As a consequence, the interpretation at a position $x$ is relative to the centered prediction and not the absolute prediction.
    • $E_{X_V}[g_S(x_S)] = \int g_S(x_S) \, dP(x_V) = 0$ where $V \subsetneq S$, i.e., $g_S$ contains no lower-order effects but only pure interaction terms.
    • The definition of fANOVA implies the vanishing conditions; conversely, any functional decomposition that satisfies the vanishing conditions is equivalent to the fANOVA decomposition.
  2. Orthogonality: $\int g_S(x_S) g_V(x_V) \, dP(X) = 0$ for $S \neq V$.
    • This implies that components do not share information. For example, the first-order effect of feature $x_1$ and the interaction term $g_{1,2}(x_1, x_2)$ are uncorrelated. Because of orthogonality, all components are “pure” in the sense that they do not mix effects.
    • The more interesting consequence arises for orthogonality of hierarchical components, where one component contains features of another, for example, the interaction between $x_1$ and $x_2$, and the main effect of feature $x_1$. In contrast, a two-dimensional partial dependence plot for $x_1$ and $x_2$ would contain four effects: the intercept, the two main effects of $x_1$ and $x_2$, and the interaction between them.
    • The functional ANOVA component $g_{1,2}(x_1, x_2)$ contains only the pure interaction.
  3. Variance Decomposition: As a consequence of orthogonality, the variance of the prediction function decomposes into a sum without any covariance terms:

    \[\begin{align*} Var[\hat{f}(x)] &= Var[g_\phi + g_1(x_1) + g_2(x_2) + g_{1,2}(x_1, x_2) + \dots + g_{1,\dots,p}(x)] \\ &= Var[g_\phi] + Var[g_1(x_1)] + \dots + Var[g_{1,\dots,p}(x)] \end{align*}\]
  4. Sobol Index: Dividing the above equation by $Var[\hat{f}(x)]$ yields a sum of fractions, where each term is the fraction of variance explained by the corresponding component. This can be used as an importance measure called the Sobol index: $S_V = \frac{Var[g_V(x_V)]}{Var[\hat{f}(x)]}$
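
All four properties can be checked numerically. The components below are the hand-derived fANOVA decomposition of a made-up prediction function (an assumption for illustration), under independent uniform features on $[-1, 1]$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)

# Hypothetical prediction function and its fANOVA components,
# derived by hand for this illustrative example:
f = 2 * x1 + x2**2 + x1 * x2
c1 = 2 * x1            # pure main effect of x1
c2 = x2**2 - 1 / 3     # pure, centered main effect of x2
c12 = x1 * x2          # pure interaction

# 1) Zero mean: every non-intercept component is centered.
print(c1.mean(), c2.mean(), c12.mean())           # all ~ 0

# 2) Orthogonality: components share no information.
print(np.mean(c1 * c2), np.mean(c1 * c12))        # ~ 0

# 3) Variance decomposition: variances add up without covariances.
print(f.var(), c1.var() + c2.var() + c12.var())   # ~ equal

# 4) Sobol indices: fraction of variance explained per component.
sobol = [c.var() / f.var() for c in (c1, c2, c12)]
print(sobol)                                      # sums to ~ 1
```

Here most of the variance is explained by the strong linear main effect of $x_1$, and the Sobol indices sum to one because the components cover the full decomposition.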

§4.03: Friedman's H-Statistic


§4.04: Further Methods