Functional decomposition breaks a prediction function into an intercept, main effects, and interaction components. In this section we introduce the functional ANOVA (fANOVA) decomposition, show how it leads to the H-statistic for quantifying interaction strength, and discuss extensions that handle dependent features.
A requirement for this approach is that the model prediction function $\hat{f}$ is square integrable. As with any functional decomposition, the functional ANOVA decomposes the prediction function into components:
\[\hat{f}(x) = \sum_{S \subseteq \{1,...,p\}} g_S(x_S)\]where each component is defined recursively (with $V \subsetneqq S$, i.e. only considering the proper subsets of $S$):
\[\begin{align*} g_S(x_S) &= \hat{f}_{S, PD}(x_S) - \sum_{V \subsetneqq S} g_V(x_V)\\ &= E_{x_{-S}}[\hat{f}(x_S, x_{-S})] - \sum_{V \subsetneqq S} g_V(x_V) \\ &= \int \hat{f}(x_S, x_{-S}) \, dP(X_{-S}) - \sum_{V \subsetneqq S} g_V(x_V) \end{align*}\]The expectation integrates $\hat{f}$ over all input features except $x_S$; subtracting the lower-order components removes all effects already attributed to proper subsets of $S$. Concretely, we obtain the following components.
For the intercept, the subset $S = \emptyset$ is the empty set and therefore $-S$ contains all the features. The intercept can also be interpreted as the expectation of the prediction function when we assume that all features are uniformly distributed.
\[g_\emptyset = \int_{X}\hat{f}(x) \, dP(X) = E_X [\hat{f}(x)]\]The first-order effect of a feature $j$ follows directly from the recursion, since the only proper subset of $\{j\}$ is the empty set:
\[g_j(x_j) = \hat{f}_{j, PD}(x_j) - g_\emptyset = E_{x_{-j}}[\hat{f}(x_j, x_{-j})] - g_\emptyset\]
In practice, calculating PD functions for all subsets is very expensive. Instead, they are estimated using Monte Carlo integration, i.e. a fixed grid of points $x_S^*$ is used and the integral at each grid point is estimated as follows:
\[\hat{f}_{S, PD}(x_S^*)=E_{x_{-S}}[\hat{f}(x_S^*, X_{-S})] \approx \frac{1}{n} \sum_{i=1}^{n} \hat{f}(x_S^*, x_{-S}^{(i)})\]
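To make the estimator concrete, here is a minimal NumPy sketch of the Monte Carlo estimate above. The function name `pd_estimate` and the toy model are illustrative assumptions, not part of any particular library; `predict` stands in for any model's prediction function.

```python
import numpy as np

def pd_estimate(predict, X, S, x_S_star):
    """Monte Carlo estimate of f_{S,PD}(x_S*) = E_{x_{-S}}[f(x_S*, X_{-S})].

    predict  -- model prediction function mapping an (n, p) array to an (n,) array
    X        -- (n, p) data matrix whose rows provide the x_{-S} samples
    S        -- list of column indices that are held fixed
    x_S_star -- the grid point at which the PD function is evaluated
    """
    X_mod = X.copy()
    X_mod[:, S] = x_S_star        # fix the features in S at the grid point
    return predict(X_mod).mean()  # average over the n background samples

# Toy additive model: f(x) = x_0 + 2 * x_1 (no interactions, hypothetical example)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
predict = lambda X: X[:, 0] + 2 * X[:, 1]

g_intercept = predict(X).mean()          # estimate of the intercept g_emptyset
f_1 = pd_estimate(predict, X, [0], 1.5)  # f_{1,PD}(1.5)
g_1 = f_1 - g_intercept                  # first-order effect g_1(1.5), approx. 1.5
```

fANOVA has a few interesting properties that make it very attractive.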
Variance Decomposition: As a consequence of orthogonality, we can decompose the variance into a sum without any covariance terms:
\[\begin{align*} Var[\hat{f}(x)] &= Var[g_\emptyset + g_1(x_1) + g_2(x_2) + g_{1,2}(x_1, x_2) + ... + g_{1,...,p}(x)] \\ &= Var[g_\emptyset] + Var[g_1(x_1)] + ... + Var[g_{1,...,p}(x)] \end{align*}\]If a feature $j$ has no interaction with any of the other features, we can express the prediction function as a sum of two partial dependence functions, where the first summand depends only on $j$ and the second on all features except $j$. Similarly, a two-way PD function is just the sum of the two individual PD functions if there are no interactions (we use centered PD functions here):
\[\hat{f}_{jk, PD}^C(x_j, x_k) = \hat{f}_{j, PD}^C(x_j) + \hat{f}_{k, PD}^C(x_k)\]The H-statistic for a two-way interaction between features $j$ and $k$ (it can also be generalized to an $n$-way interaction) is defined as:
\[\begin{align*} H_{jk}^2 &= \frac{Var[\hat{f}^C_{jk, PD}(x_j, x_k) - \hat{f}^C_{j, PD}(x_j) - \hat{f}^C_{k, PD}(x_k)]}{Var[\hat{f}^C_{jk, PD}(x_j, x_k)]} \\ &= \frac{\sum_{i=1}^n (\hat{f}^C_{jk, PD}(x_j^{(i)}, x_k^{(i)}) - \hat{f}^C_{j, PD}(x_j^{(i)})-\hat{f}^C_{k, PD}(x_k^{(i)}))^2}{\sum_{i=1}^n \hat{f}^C_{jk, PD}(x_j^{(i)}, x_k^{(i)})^2} \end{align*}\]This quantifies the strength of the interaction: a value close to 0 indicates a weak interaction, a value close to 1 a strong one.
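A direct translation of this estimator into NumPy could look as follows. This is a sketch under the assumption that `predict` is the model's prediction function; it is deliberately naive, since every PD value is itself a Monte Carlo average over the data, making the cost quadratic in $n$. For a purely additive model $H_{jk}^2$ should be close to 0; for a pure interaction such as `predict = lambda X: X[:, 0] * X[:, 1]` it should be close to 1.

```python
import numpy as np

def h_statistic_2way(predict, X, j, k):
    """Naive estimate of the squared two-way H-statistic H_{jk}^2."""
    n = X.shape[0]

    def pd(features, values):
        # PD estimate at one grid point: fix `features` at `values`
        # and average predictions over all background rows.
        X_mod = X.copy()
        X_mod[:, features] = values
        return predict(X_mod).mean()

    # Evaluate the PD functions at every observed data point ...
    pd_jk = np.array([pd([j, k], X[i, [j, k]]) for i in range(n)])
    pd_j = np.array([pd([j], X[i, j]) for i in range(n)])
    pd_k = np.array([pd([k], X[i, k]) for i in range(n)])

    # ... and center them, since the formula uses centered PD functions.
    pd_jk -= pd_jk.mean()
    pd_j -= pd_j.mean()
    pd_k -= pd_k.mean()

    return ((pd_jk - pd_j - pd_k) ** 2).sum() / (pd_jk ** 2).sum()
```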
Analogous to the two-way interaction strength, we can also measure the overall strength of interaction between feature $j$ and all other features, where we use the centered prediction function:
\[\begin{align*} H_{j}^2 &= \frac{Var[\hat{f}^C(x) - \hat{f}^C_{j, PD}(x_j) - \hat{f}^C_{-j, PD}(x_{-j})]}{Var[\hat{f}^C(x)]} \\ &= \frac{\sum_{i=1}^n (\hat{f}^C(x^{(i)}) - \hat{f}^C_{j, PD}(x_j^{(i)})-\hat{f}^C_{-j, PD}(x_{-j}^{(i)}))^2}{\sum_{i=1}^n \hat{f}^C(x^{(i)})^2} \end{align*}\]The H-statistic provides a general definition of interactions and an algorithm to compute them. However, for an interaction of order $k$ it needs approximately $2^k$ PD functions, which makes it very compute-intensive.
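The overall statistic can be sketched the same way; again, `predict`, the helper `pd`, and the naive all-data averaging are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def h_statistic_overall(predict, X, j):
    """Naive estimate of H_j^2, the interaction strength between
    feature j and all remaining features."""
    n, p = X.shape
    rest = [c for c in range(p) if c != j]

    def pd(features, values):
        # PD estimate at one grid point (same pattern as above).
        X_mod = X.copy()
        X_mod[:, features] = values
        return predict(X_mod).mean()

    f_hat = predict(X)
    f_c = f_hat - f_hat.mean()  # centered prediction function
    pd_j = np.array([pd([j], X[i, j]) for i in range(n)])
    pd_rest = np.array([pd(rest, X[i, rest]) for i in range(n)])
    pd_j -= pd_j.mean()
    pd_rest -= pd_rest.mean()

    return ((f_c - pd_j - pd_rest) ** 2).sum() / (f_c ** 2).sum()
```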
Standard fANOVA builds on PD functions, and since PD functions are not robust to correlated features, neither is fANOVA. If we integrate over the uniform distribution when in reality the features are dependent, we create a new dataset that deviates from the joint distribution and extrapolates to unlikely combinations of feature values.
Generalized functional ANOVA is a decomposition that also works for dependent features. It relaxes the vanishing-conditions requirement, so the conditions no longer imply mutual orthogonality of all components, but only hierarchical orthogonality:
\[E_x[g_V(x_V)\, g_S(x_S)] = 0 \quad \forall V \subsetneqq S\]i.e. a component is only orthogonal to the components lower in the hierarchy.
While it also provides a variance decomposition, it is difficult to estimate, computationally expensive, and requires a manual choice of weight function.
We can also build a functional decomposition from ALE plots, which handles dependent features well and is computationally fast. However, it is theoretically more involved, its orthogonality properties are more complicated, and it provides no variance decomposition; nevertheless, it is a good alternative in practice.
Functional decompositions, if computed, offer a lot of insight into the model, including a complete analysis of all interactions. In practice, however, computing the full decomposition is often infeasible, since the number of components grows exponentially with the number of features.