Shapley values originate from cooperative game theory and provide a method to fairly distribute a total payout among collaborating players. This section gives a brief explanation of Shapley values in game theory, followed by their adaptation to IML, resulting in the method SHAP: individual model predictions are explained by treating features as “players” that cooperate to produce a prediction.
Some players may contribute more than others to the coalition, so we need a way to fairly distribute the payout, i.e. find the marginal contribution of the \(j^{th}\) player:
\[\nabla(j, S) = v(S \cup \{j\} ) - v(S)\]Averaging the marginal contributions across different subsets does not always recover the total payout. This is because the value a player adds depends not only on who else is in the coalition, but also on the joining order.
Formally, the order definition of the Shapley value for player \(j\) is given by \(\phi_j\):
\[\phi_j = \frac{1}{\mid P \mid !} \sum_{\tau \in \Pi} (v(S_j^\tau \cup \{j\}) - v(S_j^\tau))\]where \(\Pi\) is the set of all permutations and \(S_j^\tau\) is the set of players before \(j\) joins the ordering.
This is usually computationally expensive, as the number of orderings is \(\mid P \mid !\). Instead we can sample a fixed number of orderings \(M \ll \mid P \mid !\) and average over them:
\[\phi_j \approx \frac{1}{M} \sum_{\tau \in \Pi_M} (v(S_j^\tau \cup \{j\}) - v(S_j^\tau))\]where \(\Pi_M \subset \Pi\) is a random sampling of \(M\) player orderings.
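The Monte Carlo approximation above can be sketched as follows. This is an illustrative implementation, not from the original text: `shapley_monte_carlo` and the toy value function `v` are hypothetical names, and the 3-player game (coalition value = squared coalition size) is made up for demonstration.

```python
import random

def shapley_monte_carlo(players, v, j, M, seed=0):
    """Approximate the Shapley value of player j by averaging the marginal
    contribution v(S ∪ {j}) - v(S) over M random orderings (permutations)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(M):
        order = list(players)
        rng.shuffle(order)              # one random ordering tau
        pos = order.index(j)
        S = frozenset(order[:pos])      # players that joined before j
        total += v(S | {j}) - v(S)
    return total / M

# Toy 3-player game: the value of a coalition is the square of its size.
# By symmetry every player's Shapley value is v(P)/3 = 9/3 = 3.
players = [1, 2, 3]
v = lambda S: len(S) ** 2
print(shapley_monte_carlo(players, v, j=3, M=2000))  # ≈ 3.0
```

With \(M\) large enough the estimate converges to the exact value; the trade-off is purely between accuracy and the number of sampled orderings.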
Note however that the set \(S_j^\tau\) can occur in multiple permutations. If our player of interest is 3, then for the orderings \(\{1,2,3\}, \{2,1,3\}\), the set \(S_j^\tau = \{1,2\}\). The marginal contribution of the set \(\{1,2\}\) is counted twice because there are two orderings in which 3 joins last. More generally, a set \(S\) appears as the predecessor set of \(j\) in \(\mid S \mid ! \, (\mid P \mid - \mid S \mid - 1)!\) orderings: the players in \(S\) can join in any order before \(j\), and the remaining \(\mid P \mid - \mid S \mid - 1\) players in any order after.
The set definition is equivalent to the order definition, but is much quicker in computation. In the order definition, each term (order) had a weight of \(\frac{1}{\mid P \mid !}\). Instead now we look at each set which has a weight of \(\frac{\mid S \mid ! (\mid P \mid - \mid S \mid - 1)!}{\mid P \mid !}\).
\[\phi_j = \sum_{S \subseteq P \setminus \{j\}}\frac{\mid S \mid ! (\mid P \mid - \mid S \mid - 1)!}{\mid P \mid !} (v(S \cup \{j\}) - v(S))\]The Shapley value provides a fair payout \(\phi_j\) for each player \(j \in P\) and uniquely satisfies the following axioms for any value function \(v\):
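The set definition enumerates subsets rather than orderings, so it can be computed exactly for small \(\mid P \mid\). A minimal sketch (function name and toy game are illustrative, not from the text):

```python
from itertools import combinations
from math import factorial

def shapley_exact(players, v, j):
    """Exact Shapley value of player j via the set definition: each subset
    S of P \\ {j} is weighted by |S|! (|P| - |S| - 1)! / |P|!."""
    others = [p for p in players if p != j]
    n = len(players)
    phi = 0.0
    for size in range(len(others) + 1):
        for S in combinations(others, size):
            S = frozenset(S)
            w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += w * (v(S | {j}) - v(S))
    return phi

# Same toy game: v(S) = |S|^2, so each player's Shapley value is 3.
players = [1, 2, 3]
v = lambda S: len(S) ** 2
print(round(shapley_exact(players, v, j=3), 10))                   # 3.0
# Efficiency check: values sum to v(P) = 9.
print(round(sum(shapley_exact(players, v, j) for j in players), 10))  # 9.0
```

The loop visits \(2^{\mid P \mid - 1}\) subsets instead of \(\mid P \mid !\) orderings, which is the speed-up the text refers to.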
Efficiency: The total payout \(v(P)\) is fully allocated to players:
\[\sum_j \phi_j = v(P)\]i.e. the individual payouts sum exactly to the total payout of the grand coalition
The Shapley value \(\phi_j\) quantifies the contribution of \(x_j\) via:
\[\phi_j(x) = \frac{1}{\mid P \mid !} \sum_{\tau \in \Pi} \left( f_{S_j^\tau \cup \{j\}}(x_{S_j^\tau \cup \{j\}}) - f_{S_j^\tau}(x_{S_j^\tau}) \right)\]i.e. \(\phi_j\) quantifies how much feature \(x_j\) contributes to the difference between the prediction \(\hat{f}(x)\) and the mean prediction \(\hat{f}_\emptyset\)
An important thing to note here: for one marginal-contribution sample, two synthetic points are constructed. In the first, the values of all features preceding \(j\) in the ordering, plus feature \(j\) itself, are taken from \(x^{(i)}\) and the remaining values from a random data point. In the second, only the features preceding \(j\) are taken from \(x^{(i)}\), and the rest, including the \(j^{th}\) feature value, come from the random point. The difference between the two predictions is the marginal contribution.
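The construction of these two synthetic points can be sketched as below. The model `f`, the instance `x`, and the background point `z` are stand-ins (here a hand-written linear function), and `marginal_contribution` is a hypothetical helper name:

```python
import numpy as np

def marginal_contribution(f, x, z, order, j):
    """One marginal-contribution sample for feature j at instance x.
    `order` is a random feature ordering; features before j in the ordering
    (plus j itself, in the first point) come from x, the rest from the
    random background point z."""
    pos = list(order).index(j)
    preceding = list(order)[:pos]
    with_j, without_j = z.copy(), z.copy()
    with_j[preceding + [j]] = x[preceding + [j]]  # x-values up to and incl. j
    without_j[preceding] = x[preceding]           # x-values up to, excl., j
    return f(with_j) - f(without_j)

# Stand-in linear "model": for it, feature 2's contribution is exactly
# 0.5 * (x[2] - z[2]) = 0.5 * (17.0 - 10.0) = 3.5, regardless of ordering.
f = lambda v: 2 * v[0] + v[1] + 0.5 * v[2]
x = np.array([51.6, 5.1, 17.0])
z = np.array([40.0, 3.0, 10.0])
print(marginal_contribution(f, x, z, order=[0, 2, 1], j=2))
```

For a non-additive model the result would depend on the ordering and the background point, which is exactly why the averaging over orderings is needed.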
All in all, Shapley values have a strong theoretical foundation in cooperative game theory, produce fair attributions, and provide contrastive explanations by quantifying each feature's role in deviating from the average prediction. However, they come at a high computational cost and are not robust to correlated features.
SHAP expresses the prediction of \(x\) as a sum of contributions from each feature:
\[g(z') = \phi_0 + \sum_{j=1}^p \phi_j z^{'}_j\]where \(z' \in \{0,1\}^p\) i.e. it is a \(p-\)dimensional vector with \(z^{'}_j = 1\) if feature \(j\) is present in the coalition, 0 otherwise (influence of \(x_j\) is to be marginalized out).
We fit a surrogate model \(g(z')\) satisfying Shapley axioms and recovering \(\hat{f}(x)\) when all features are present (i.e. \(\mathbf{z'} = \mathbf{1}\)). Evaluation of \(g(z')\) can be done in one of two ways:
Suppose our point of interest is \(\mathbf{x} = (51.6, 5.1, 17.0)\)
Step 1: From all possible coalitions (\(2^p\)), we sample \(K\) from the simplified binary feature space. \(z^{'(k)} \in \{0,1\}^p\) indicates which features are present in the \(k^{th}\) coalition. This then needs to be mapped to the original space:
Step 2: For each coalition vector (say \(\mathbf{z'} = (1,0,0)\)), the features that are not part of the coalition need to be marginalized out. We draw multiple background samples, keep the original value of the coalition feature \(x_0\), and take the rest from the background data. This is done \(B\) times: \(E_{X_{-S}}[f(x_S, X_{-S})] \approx \frac{1}{B} \sum_{b=1}^B \hat{f}(h_{x,x^{'(b)}}(z'))\) where \(h_{x,x^{'(b)}}(z')\) keeps the original feature values for those in the coalition and takes a background sample for the rest.
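Step 2 can be sketched as follows. The function name `marginalize`, the stand-in model, and the synthetic background data are all illustrative assumptions, not part of the original text:

```python
import numpy as np

def marginalize(f, x, z_prime, background, B=100, seed=0):
    """Estimate E[f(x_S, X_{-S})] for coalition z': coalition features keep
    their values from x, the rest are filled from B sampled background rows,
    and the model outputs are averaged."""
    rng = np.random.default_rng(seed)
    mask = np.asarray(z_prime, dtype=bool)
    rows = background[rng.integers(0, len(background), size=B)]  # h_x inputs
    rows[:, mask] = x[mask]   # overwrite coalition features with x's values
    return float(np.mean([f(r) for r in rows]))

f = lambda v: 2 * v[0] + v[1] + 0.5 * v[2]        # stand-in model
x = np.array([51.6, 5.1, 17.0])                    # point of interest
background = np.random.default_rng(1).normal(size=(500, 3))
print(marginalize(f, x, (1, 0, 0), background))    # ≈ 2 * 51.6 = 103.2
```

With standard-normal background features, the expected output for this linear model is \(2 \cdot 51.6 + E[X_1] + 0.5\,E[X_2] \approx 103.2\), up to sampling noise in the \(B\) draws.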
Step 3: Compute the kernel weights for the surrogate model. The SHAP kernel \[\pi_x(z') = \frac{p - 1}{\binom{p}{\mid z' \mid} \mid z' \mid (p - \mid z' \mid)}\] (with \(\mid z' \mid\) the number of present features) assigns the highest weights to very small and very large coalitions, since we learn the most about a feature's effect when it either appears in isolation or in near-complete context.
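The kernel weighting can be computed directly; `kernel_weight` is a hypothetical helper name for this sketch:

```python
from math import comb

def kernel_weight(p, s):
    """SHAP kernel weight for a coalition of size s out of p features.
    Sizes 0 and p get infinite weight (handled as constraints instead)."""
    if s == 0 or s == p:
        return float("inf")
    return (p - 1) / (comb(p, s) * s * (p - s))

# For p = 4 features, coalitions of size 1 or 3 (isolation / near-complete
# context) get twice the weight of size-2 coalitions:
print([kernel_weight(4, s) for s in range(5)])
# [inf, 0.25, 0.125, 0.25, inf]
```

The U-shape of these weights is what makes small and near-full coalitions most informative for the surrogate fit.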
Step 4: Fit a weighted linear model:
\[g(z') = \phi_0 + \sum_{j=1}^p \phi_j z_j^{'}\]The weighted least-squares objective is then given by:
\[\min_\phi \sum_{k=1}^K \pi_x(z^{'(k)})\left[\hat{f}(h_x(z^{'(k)})) - g(z^{'(k)})\right]^2\]where
\(\phi_0 = E[\hat{f}(x)]\) and \(\sum_{j=1}^p \phi_j = \hat{f}(x) - \phi_0\)
The coalitions with \(\infty\) weights (i.e. \(\mathbf{z'}=1\) or \(\mathbf{z'} = 0\)) are instead used as constraints ensuring local accuracy and missingness.
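Steps 4–5 amount to a constrained weighted least squares. One common way to handle the efficiency constraint is to eliminate the last coefficient by substitution; the sketch below does that. All names (`fit_kernel_shap`, the toy coalition matrix, the "true" attributions) are illustrative assumptions:

```python
import numpy as np

def fit_kernel_shap(Z, y, weights, phi0, fx):
    """Weighted least squares for g(z') = phi0 + sum_j phi_j z'_j, with the
    constraint sum_j phi_j = fx - phi0 folded in by expressing the last
    coefficient through the others. Z is a (K, p) binary coalition matrix,
    y the marginalized model outputs, weights the kernel weights."""
    y_adj = y - phi0 - Z[:, -1] * (fx - phi0)   # fold constraint into targets
    X = Z[:, :-1] - Z[:, [-1]]                  # eliminated design matrix
    W = np.diag(weights)
    phi_rest = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_adj)
    phi_last = (fx - phi0) - phi_rest.sum()     # recover the last coefficient
    return np.append(phi_rest, phi_last)

# Toy check: outputs generated by an exactly additive model are recovered.
Z = np.array([[1,0,0],[0,1,0],[0,0,1],[1,1,0],[1,0,1],[0,1,1]], float)
phi_true, phi0 = np.array([2.0, -1.0, 4.0]), 5.0
y = phi0 + Z @ phi_true
phi = fit_kernel_shap(Z, y, np.ones(len(Z)), phi0, fx=phi0 + phi_true.sum())
print(np.round(phi, 6))   # recovers [2., -1., 4.]
```

Because the all-zeros and all-ones coalitions enter as hard constraints rather than weighted rows, local accuracy (\(g(\mathbf{1}) = \hat{f}(x)\)) holds exactly by construction.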
Step 5: Return SHAP values: the Kernel SHAP estimates are equivalent to Shapley values. An important thing to note is that the Kernel SHAP weights are NOT the same as the weights from the Shapley set definition.
We can run SHAP for every observation and get an \(n \times p\) matrix of Shapley values. Averaging the absolute Shapley values for each feature over all observations (i.e. column-wise averages) gives us a “Feature Importance” metric:
\[I_j = \frac{1}{n} \sum_{i=1}^n \mid \phi_j^{(i)} \mid\]An important thing to note here is that Shapley feature importance is based only on model predictions, whereas other importance measures are often based on the model’s performance/loss. This means that if you have a poor model, Shapley feature importance describes that model, while performance-based measures use the ground truth.
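The aggregation is a one-liner once the matrix is available. The SHAP matrix below is synthetic (random values with feature 0 scaled largest by construction), purely to illustrate the column-wise averaging:

```python
import numpy as np

# Hypothetical (n, p) matrix of SHAP values, one row per observation.
# Feature 0 is given the largest attribution scale by construction.
rng = np.random.default_rng(0)
shap_values = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.2])

# Global importance I_j: column-wise mean of absolute Shapley values.
importance = np.abs(shap_values).mean(axis=0)
ranking = np.argsort(importance)[::-1]
print(importance.round(3), ranking)   # feature 0 ranks first by construction
```

Since \(E\mid N(0,\sigma)\mid = \sigma\sqrt{2/\pi}\), the importances here land near \([2.4, 0.8, 0.16]\), and the ranking reflects the scales we injected.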