Model independent feature attributions: Shapley values that uncover nonlinear dependencies
 Academic Editor
 Charles Elkan
 Subject Areas
 Artificial Intelligence, Data Mining and Machine Learning, Data Science
 Keywords
 Shapley values, Statistics, Distance correlation, Explainable AI, Multiple correlation
 Copyright
 © 2021 Fryer et al.
 Licence
 This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
 Cite this article
 2021. Model independent feature attributions: Shapley values that uncover nonlinear dependencies. PeerJ Computer Science 7:e582 https://doi.org/10.7717/peerjcs.582
Abstract
Shapley values have become increasingly popular in the machine learning literature, thanks to their attractive axiomatisation, flexibility, and uniqueness in satisfying certain notions of ‘fairness’. The flexibility arises from the myriad potential forms of the Shapley value game formulation. Amongst the consequences of this flexibility is that there are now many types of Shapley values being discussed, with such variety being a source of potential misunderstanding. To the best of our knowledge, all existing game formulations in the machine learning and statistics literature fall into a category, which we name the model-dependent category of game formulations. In this work, we consider an alternative and novel formulation which leads to the first instance of what we call model-independent Shapley values. These Shapley values use a measure of nonlinear dependence as the characteristic function. The strength of these Shapley values is in their ability to uncover and attribute nonlinear dependencies amongst features. We introduce and demonstrate the use of the energy distance correlation, the affine-invariant distance correlation, and the Hilbert–Schmidt independence criterion as Shapley value characteristic functions. In particular, we demonstrate their potential value for exploratory data analysis and model diagnostics. We conclude with an interesting expository application to a medical survey data set.
Introduction
There are many different meanings of the term “feature importance”, even in the context of Shapley values. Indeed, the meaning of a Shapley value depends on the underlying game formulation, referred to by Merrick & Taly (2019) as the explanation game, although this is rarely discussed explicitly in the existing literature. In general, Shapley value explanation games can be distinguished as belonging to either the model-dependent category or the model-independent category. The latter is distinguished by an absence of assumptions regarding the data generating process (DGP). Here, the term model-dependent refers to when the Shapley value depends on a choice of fitted model (such as the output of a machine learning algorithm), or on a set of fitted models (such as the set of submodels of a linear model).
Shapley values that uncover nonlinear dependencies (Sunnies) is, to the best of our knowledge, the only Shapley-based feature importance method that falls into the model-independent category. In this category, feature importance scores attempt to determine what is a priori important, in the sense of understanding the partial dependence structures within the joint distribution describing the DGP. We show that model-independent feature importance scores can appropriately be used as model diagnostic procedures, as well as procedures for exploratory data analysis.
Existing methods in the model-dependent category, on the other hand, seek to uncover what is perceived as important by the model (or class of models), either with regard to a performance measure (e.g., a goodness-of-fit measure) or for measuring local influences on model predictions. Model-dependent definitions of feature importance scores can be distinguished further according to whether they depend on a fitted (i.e., trained) model or on an unfitted class of models. We refer to these as within-model scores and between-model scores, respectively. This distinction is important, since the objectives are markedly different.
Within-model Shapley values seek to describe how the model reacts to a variety of inputs, while, e.g., accounting for correlated features in the training data by systematically setting “absent” features to a reference input value, such as a conditional expectation. There are many use cases for within-model Shapley values, such as providing transparency to model predictions, e.g., for explaining a specific credit decision or detecting algorithmic discrimination (Datta, Sen & Zick, 2016), as well as understanding model structure, measuring interaction effects and detecting concept drift (Lundberg et al., 2020).
All within-model Shapley values that we are aware of fall into the class of single reference games, described by Merrick & Taly (2019). These include SAGE (Covert, Lundberg & Lee, 2020); SHAP (Lundberg & Lee, 2017); Shapley Sampling Values (Štrumbelj & Kononenko, 2013); Quantitative Input Influence (Datta, Sen & Zick, 2016); the Interactions-based Method for Explanation (IME) (Štrumbelj, Kononenko & Šikonja, 2009); and TreeExplainer (Lundberg et al., 2020). Note that some within-model feature importance methods, such as SHAP, can be described as model agnostic, since they may be applied to any trained model. Regardless, such values are dependent on a prior choice of fitted model.
In contrast to within-model Shapley values, between-model Shapley values seek to determine which features influence an outcome of the model fitting procedure itself, by repeatedly refitting the model to compute each marginal contribution. Such scores have been applied, for example, as a means for feature importance ranking in regression models. These include Shapley Regression Values (Lipovetsky & Conklin, 2001), ANOVA Shapley values (Owen & Prieur, 2017), and our prior work (Fryer, Strumke & Nguyen, 2020). The existing between-model feature importance scores are all global feature importance scores, since they return a single Shapley value for each feature over the entire data set. Sunnies is also a global score, though not a between-model score.
A number of publications and associated software packages have recently been produced to efficiently estimate or calculate SHAP values. Tree SHAP, Kernel SHAP, Shapley Sampling Values, Max SHAP, Deep SHAP, Linear SHAP and Low-Order SHAP are all methods for either approximating or calculating SHAP values. However, these efficient model-dependent methods are developed for local within-model scores, and are not suitable for Sunnies, which is a global and model-independent score. While Sunnies does not fit within these model-dependent frameworks for efficient estimation, Shapley values in general can be approximated via a consistent Monte Carlo algorithm introduced by Song, Nelson & Staum (2016). Computational details, however, are not the focus of this paper; we concentrate on the concept and relevance of Sunnies.
In “Shapley Decomposition”, we introduce the concept of the Shapley value and its decomposition. We then introduce the notion of attributed dependence on labels (ADL), and briefly demonstrate the behaviour of the R^{2} characteristic function on a data set with nonlinear dependence, to motivate our alternative measures of nonlinear dependence in place of R^{2}. In “Measures of nonlinear dependence”, we describe three such measures: the Hilbert–Schmidt Independence Criterion (HSIC), the Distance Correlation (DC) and the Affine-Invariant Distance Correlation (AIDC). We use these as characteristic functions throughout the remainder of the work, although we focus primarily on the DC.
The DC, HSIC and AIDC do not constitute an exhaustive list of the available measures of nonlinear dependence, and we do not compare their strengths and weaknesses here. Instead, our objective is to propose and demonstrate a variety of use cases for the general technique of computing Shapley values for model-independent measures of statistical dependence.
In “Exploration”, we demonstrate the value of ADL for exploratory data analysis, using a simulated DGP that exhibits mutual dependence without pairwise dependence. We also leverage this example to compare ADL to popular pairwise and model-dependent measures of dependence, highlighting a drawback of the pairwise methods and of the popular XGBoost built-in “feature importance” score. We also show that SHAP performs favourably here. In “Diagnostics”, we introduce the concepts of attributed dependence on predictions (ADP) and attributed dependence on residuals (ADR). Using simulated DGPs, we demonstrate the potential for ADL, ADP and ADR to uncover and diagnose model misspecification and concept drift. For the concept drift demonstration (“Demonstration with concept drift”), we see that ADL provides comparable results to SAGE and SHAP, but without the need for a fitted model. Conclusions are drawn in “Discussion and Future Work”.
Shapley Decomposition
In approaching the question “How do the different features X = (X_{1}, …, X_{d}) in this data set affect the outcome Y?”, the concept of a Shapley value is useful. The Shapley value has a long history in the theory of cooperative games, since its introduction in Shapley (1953), attracting the attention of various Nobel prize-winning economists (cf. Roth, 1988), and enjoying a recent surge of interest in the statistics and machine learning literature. Shapley (1953) formulated the Shapley value as the unique game theoretic solution concept satisfying a set of four simple and apparently desirable axioms: efficiency, additivity, symmetry and the null player axiom. For a recent monograph defining these four axioms and introducing solution concepts in cooperative games, consult Algaba, Fragnelli & Sánchez-Soriano (2019).
As argued by Lipovetsky & Conklin (2001); Israeli (2007); Huettner & Sunder (2012), we can think of the outcome C(S) of a prediction or regression task as the outcome of a cooperative game, in which the set S = {X_{1}, …, X_{d}} of data features represents a coalition of d players in the game. The function C is known as the characteristic function of the game. It maps elements S, in the power set 2^{[d]} of players, to a set of payoffs (or outcomes) and thus fully describes the game. The marginal contribution of a player v ∉ S to a team S is defined as C(S ∪ {v}) − C(S). The average marginal contribution of player v, over the set S_{k} of all teams of size k that exclude v, is (1) $\overline{C}_k(v) = \frac{1}{|S_k|} \sum_{S \in S_k} \left[ C(S \cup \{v\}) - C(S) \right],$ where $|S_k| = \binom{d-1}{k}$. The Shapley value of player v, then, is given by (2) $\varphi_v(C) = \frac{1}{d} \sum_{k=0}^{d-1} \overline{C}_k(v),$ i.e., $\varphi_v(C)$ is the average of $\overline{C}_k(v)$ over all team sizes k.
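For small d, Eqs. (1)–(2) can be computed exactly by enumerating all teams. A minimal sketch in Python (the characteristic function used in the demonstration is an illustrative toy game, not from the paper):

```python
from itertools import combinations
from math import comb

def shapley_values(C, d):
    """Exact Shapley decomposition of characteristic function C over d players.

    C maps a frozenset of player indices (a subset of range(d)) to a payoff.
    Implements Eqs. (1)-(2): for each player v, average the marginal
    contribution over all teams of each size k, then average over sizes.
    """
    players = list(range(d))
    phi = []
    for v in players:
        others = [p for p in players if p != v]
        total = 0.0
        for k in range(d):  # team sizes 0 .. d-1
            # marginal contributions of v over all size-k teams excluding v
            mc = [C(frozenset(S) | {v}) - C(frozenset(S))
                  for S in combinations(others, k)]
            total += sum(mc) / comb(d - 1, k)  # average over teams of size k
        phi.append(total / d)                  # average over team sizes
    return phi

# Toy 3-player game: payoff 1 only when players 0 and 1 cooperate.
C = lambda S: 1.0 if {0, 1} <= S else 0.0
print(shapley_values(C, 3))  # players 0 and 1 each get 0.5, player 2 gets 0
```

By efficiency, the three values sum to C({0, 1, 2}) = 1, and the null player (player 2) receives 0, as the axioms require.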
Attributed dependence on labels
The characteristic function C(S) in Eq. (1) produces a single payoff for the features with indices in S. In the context of statistical modelling, the characteristic function will depend on Y and X. To express this we introduce the notation X_{S} = (X_{j})_{j∈S} as the projection of the feature vector onto the coordinates specified by S, and we write the characteristic function C_{Y}(S) with subscript Y to clarify its dependence on Y as well as X (via S). Now, we can define a new characteristic function R_{Y} in terms of the popular coefficient of multiple correlation R^{2}, as (3) $R_Y(S) = R^2(Y, X_S) = 1 - \frac{|\mathrm{Cor}(Y, X_S)|}{|\mathrm{Cor}(X_S)|},$ where $|\cdot|$ and Cor(⋅) are the determinant operator and correlation matrix, respectively (cf. Fryer, Strumke & Nguyen, 2020).
The set of Shapley values of all features in X, using characteristic function C, is known as the Shapley decomposition of C amongst the features in X. For example, the Shapley decomposition of R_{Y}, from Eq. (3), is the set {ϕ_{v}(R_{Y}):v ∈ [d]}, calculated via Eq. (2).
In practice, the joint distribution of (Y, X^{T}) is unknown, so the Shapley decomposition of C is estimated via substitution of an empirical characteristic function $\hat{C}$ in Eq. (1). In this context, we work with an n × |S| data matrix $\mathbf{X}_S$, whose ith row is the vector $x_{i,S} = (x_{ij})_{j \in S}$, representing a single observation from X_{S}. As a function of this observed data, along with the vector of observed labels $\mathbf{y} = (y_i)_{i \in [n]}$, the empirical characteristic function $\hat{C}_{\mathbf{y}}$ produces an estimate of C_{Y} that, with Eq. (1), gives the estimate $\varphi_v(\hat{C}_{\mathbf{y}})$, which we refer to as the Attributed Dependence on Labels (ADL) for feature v.
Recognising dependence: Example 1
For example, the empirical R^{2} characteristic function $\hat{R}_{\mathbf{y}}$ is given by (4) $\hat{R}_{\mathbf{y}}(S) = 1 - \frac{|\rho(\mathbf{y}, \mathbf{X}_S)|}{|\rho(\mathbf{X}_S)|},$ where ρ is the empirical Pearson correlation matrix.
Regardless of whether we use a population measure or an estimate, the R^{2} measures only the linear relationship between the response (i.e., labels) Y and features X. This implies the R^{2} may perform poorly as a measure of dependence in the presence of nonlinearity. The following example from a nonlinear DGP demonstrates this point.
Suppose the features X_{j}, j ∈ [d], are independently uniformly distributed on [−1, 1]. Given a diagonal matrix A = diag(a_{1}, …, a_{d}), let the response variable Y be determined by the quadratic form (5) $Y = X^T A X = a_1 X_1^2 + \dots + a_d X_d^2.$ Then, the covariance Cov(Y, X_{j}) = 0 for all j ∈ [d]. This is because $\mathrm{Cov}(X^T A X, X_j) = \sum_{k=1}^{d} a_k \mathrm{Cov}(X_k^2, X_j) = a_j \mathrm{Cov}(X_j^2, X_j) = 0,$ since the features are independent with $\mathrm{E}[X_j] = 0$ and $\mathrm{E}[X_j^3] = 0$. In Fig. 1, we display the X_{4} cross section of 10,000 observations generated from Eq. (5) with d = 5 and A = diag(0, 2, 4, 6, 8), along with the least squares line of best fit and associated R^{2} value. We visualize the results for the corresponding Shapley decomposition in Fig. 2. As expected, we see that the R^{2} is not able to capture the nonlinear dependence structure of Eq. (5), and thus neither is its Shapley decomposition.
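The vanishing correlations above can be checked by simulation. A minimal sketch of the quadratic DGP in Eq. (5) (seed and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
a = np.array([0, 2, 4, 6, 8], dtype=float)  # A = diag(0, 2, 4, 6, 8)

X = rng.uniform(-1, 1, size=(n, d))
Y = (X**2 * a).sum(axis=1)  # Y = X^T A X for diagonal A, as in Eq. (5)

# Pearson correlation of Y with each feature is ~0, despite Y being a
# deterministic function of X_2, ..., X_5.
pearson = [np.corrcoef(Y, X[:, j])[0, 1] for j in range(d)]
print(np.round(pearson, 3))

# Correlating Y with the squared features (a linearising transformation,
# available here only because the DGP is known) reveals the dependence.
pearson_sq = [np.corrcoef(Y, X[:, j]**2)[0, 1] for j in range(d)]
print(np.round(pearson_sq, 3))
```

The second set of correlations illustrates why a suitable transformation recovers the signal, while the first shows how thoroughly the linear view hides it.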
We note that improvements on the results in Figs. 1 and 2 can be obtained by choosing a suitable linearising transformation of the features or response prior to calculating R^{2}, but such a transformation is generally not discernible from the data, except in the simplest cases.
Measures of nonlinear dependence
In the following, we describe three measures of nonlinear dependence that, when used as a characteristic function C, have the following properties.

1. Independence is detectable (in theory), i.e., if C(S) = 0, then the variables Y and X_{S} are independent. Equivalently, dependence is visible, i.e., if Y and X_{S} are dependent, then C(S) ≠ 0.

2. C is model-independent. Thus, no assumptions are made about the DGP, and no associated feature engineering or transformation of X or Y is necessary. Note that, since the Shapley values sum, by efficiency, to the characteristic function value C([d]) (e.g., the distance correlation of Y with the full feature set), dependence on [d] is visible in the sum of the Shapley values. However, each individual Shapley value is not a distance correlation, but a linear combination of distance correlations, and thus cannot itself generally be interpreted as a distance correlation. The same is true for any sum over a strict subset of the Shapley values, since efficiency applies only to the sum over all of them.
Distance correlation and affine-invariant distance correlation
The distance correlation, and its affine-invariant adaptation, were both introduced by Székely, Rizzo & Bakirov (2007). Unlike the Pearson correlation, the distance correlation between Y and X is zero if and only if Y and X are statistically independent. However, the distance correlation is equal to 1 only if the dimensions of the linear spaces spanned by Y and X are equal, almost surely, and Y is a linear function of X.
First, the population distance covariance between the response Y and feature vector X is defined as a weighted L_{2} norm of the difference between the joint characteristic function f_{YX} and the product of marginal characteristic functions f_{Y}f_{X}. In essence, this is a measure of squared deviation from the assumption of independence, i.e., from the hypothesis that f_{YX} = f_{Y}f_{X}.
The empirical distance covariance $\hat{V}^2$ is based on Euclidean distances between sample elements, and can be computed from data matrices Y, X as (6) $\hat{V}^2(\mathbf{Y}, \mathbf{X}) = \frac{1}{n^2} \sum_{i,j=1}^{n} A(\mathbf{Y})_{ij} A(\mathbf{X})_{ij},$ where the matrix function A(W), for W ∈ {Y, X}, is given by $A(\mathbf{W})_{ij} = B(\mathbf{W})_{ij} - \frac{1}{n} \sum_{i=1}^{n} B(\mathbf{W})_{ij} - \frac{1}{n} \sum_{j=1}^{n} B(\mathbf{W})_{ij} + \frac{1}{n^2} \sum_{i,j=1}^{n} B(\mathbf{W})_{ij},$ and B(W) is the n × n distance matrix with $B(\mathbf{W})_{ij} = \|w_i - w_j\|$, where $\|\cdot\|$ denotes the Euclidean norm and w_{i} denotes the ith observation (row) of W. Here, Y is in general a matrix of observations, with potentially multiple features. Notice the difference between Y and y, where the latter is the (single column) label vector introduced in “Attributed dependence on labels”.
The empirical distance correlation $\hat{\mathcal{R}}$ is given by (7) $\hat{\mathcal{R}}^2(\mathbf{Y}, \mathbf{X}) = \frac{\hat{V}^2(\mathbf{Y}, \mathbf{X})}{\sqrt{\hat{V}^2(\mathbf{Y}, \mathbf{Y})\, \hat{V}^2(\mathbf{X}, \mathbf{X})}},$ for $\hat{V}^2(\mathbf{Y}, \mathbf{Y})\, \hat{V}^2(\mathbf{X}, \mathbf{X}) \ne 0$, and $\hat{\mathcal{R}}(\mathbf{Y}, \mathbf{X}) = 0$ otherwise. For our purposes, we define the distance correlation characteristic function estimator (8) $\hat{D}_{\mathbf{y}}(S) = \hat{\mathcal{R}}^2(\mathbf{y}, \mathbf{X}_S).$ A transformation of the form x ↦ Ax + b, for a matrix A and vector b, is called affine. Affine invariance of the distance correlation is desirable, particularly in the context of hypothesis testing, since statistical independence is preserved under the group of affine transformations. When Y and X are first scaled as $\mathbf{Y}' = \mathbf{Y} S_{\mathbf{Y}}^{-1/2}$ and $\mathbf{X}' = \mathbf{X} S_{\mathbf{X}}^{-1/2}$, where $S_{\mathbf{Y}}$ and $S_{\mathbf{X}}$ denote the sample covariance matrices, the distance correlation computed from Y′ and X′ becomes invariant under any affine transformation of Y and X (Székely, Rizzo & Bakirov, 2007, Section 3.2). Thus, the empirical affine-invariant distance correlation is defined by (9) $\hat{\mathcal{R}}'(\mathbf{Y}, \mathbf{X}) = \hat{\mathcal{R}}(\mathbf{Y} S_{\mathbf{Y}}^{-1/2}, \mathbf{X} S_{\mathbf{X}}^{-1/2}),$ and we define the associated characteristic function estimator $\hat{D}'_{\mathbf{y}}$ in the same manner as Eq. (8). Monte Carlo studies regarding the properties of these measures are given by Székely, Rizzo & Bakirov (2007).
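Eqs. (6)–(7) amount to double-centring the pairwise distance matrices and taking a normalised inner product. A sketch implementation (the biased V-statistic version; function and variable names are ours):

```python
import numpy as np

def dcor2(Y, X):
    """Squared empirical distance correlation, Eqs. (6)-(7).

    Y, X: (n, p) and (n, q) data matrices; 1-D arrays are treated as
    single-column matrices.
    """
    def centred_dist(W):
        W = np.atleast_2d(np.asarray(W, dtype=float))
        if W.shape[0] == 1:
            W = W.T  # a 1-D input becomes an (n, 1) column
        # B_ij = ||w_i - w_j||, then double-centre: subtract row and
        # column means, add back the grand mean (the A(W) of Eq. (6))
        B = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=2)
        return B - B.mean(0, keepdims=True) - B.mean(1, keepdims=True) + B.mean()

    A_Y, A_X = centred_dist(Y), centred_dist(X)
    n = A_Y.shape[0]
    v_yx = (A_Y * A_X).sum() / n**2  # dCov^2(Y, X), Eq. (6)
    v_yy = (A_Y * A_Y).sum() / n**2
    v_xx = (A_X * A_X).sum() / n**2
    denom = np.sqrt(v_yy * v_xx)
    return v_yx / denom if denom > 0 else 0.0  # Eq. (7)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 2000)
y = x**2  # nonlinear dependence; Pearson correlation ~ 0
print(np.corrcoef(x, y)[0, 1], dcor2(y, x))
```

On the quadratic example the Pearson correlation is near zero while the distance correlation is clearly positive, which is precisely the behaviour motivating its use as a characteristic function.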
Hilbert–Schmidt independence criterion
The Hilbert–Schmidt Independence Criterion (HSIC) is a kernel-based independence criterion, first introduced by Gretton et al. (2005a). Kernel-based independence detection methods have been adopted in a wide range of areas, such as independent component analysis (Gretton et al., 2007). The link between energy distance-based measures, such as the distance correlation, and kernel-based measures, such as the HSIC, was established by Sejdinovic et al. (2013). There, it is shown that the HSIC is a certain formal extension of the distance correlation.
The HSIC makes use of the cross-covariance operator, C_{YX}, between random vectors Y and X, which generalises the notion of a covariance. The response Y and feature vector X are each mapped to functions in a Reproducing Kernel Hilbert Space (RKHS), and the HSIC is defined as the squared Hilbert–Schmidt (HS) norm $\|C_{YX}\|_{HS}^2$ of the cross-covariance operator between these two spaces (Gretton et al., 2005b; Gretton et al., 2007; Gretton et al., 2005a). Given two kernels ℓ, k, associated to the RKHS of Y and X, respectively, and their empirical evaluation matrices L, K with entries ℓ_{ij} = ℓ(y_{i}, y_{j}) and k_{ij} = k(x_{i}, x_{j}), where y_{i} and x_{i} denote the ith observations (rows) of the data matrices Y and X, respectively, the empirical HSIC can be calculated as (10) $\widehat{\mathrm{HSIC}}(\mathbf{Y}, \mathbf{X}) = \frac{1}{n^2} \sum_{i,j}^{n} k_{ij} \ell_{ij} + \frac{1}{n^4} \sum_{i,j,q,r}^{n} k_{ij} \ell_{qr} - \frac{2}{n^3} \sum_{i,j,q}^{n} k_{ij} \ell_{iq}.$ As in “Distance correlation and affine-invariant distance correlation”, notice the difference between Y and y, where the latter is the (single column) label vector introduced in “Attributed dependence on labels”. Intuitively, this approach endows the cross-covariance operator with the ability to detect nonlinear dependence, and the HS norm measures the combined magnitude of the resulting dependence. For a thorough discussion of positive definite kernels, with a machine learning emphasis, see the work of Hein & Bousquet (2004).
Calculating the HSIC requires selecting a kernel. The Gaussian kernel is a popular choice that has been tested extensively against other kernel methods (see, e.g., Gretton et al., 2005a). For our purposes, we define the empirical HSIC characteristic function by (11) $\hat{H}_{\mathbf{y}}(S) = \widehat{\mathrm{HSIC}}(\mathbf{y}, \mathbf{X}_S),$ and use a Gaussian kernel. Figure 2 shows the Shapley decomposition of $\hat{H}_{\mathbf{y}}$ amongst the features generated from Eq. (5), again with d = 5 and A = diag(0, 2, 4, 6, 8). The decomposition has been normalised for comparability with the other measures of dependence presented in the figure. The HSIC can also be generalised to provide a measure of mutual dependence between any finite number of random vectors (Pfister et al., 2016).
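A direct transcription of Eq. (10) with Gaussian kernels might look as follows (the fixed bandwidth is illustrative; in practice it is often chosen by a heuristic such as the median pairwise distance):

```python
import numpy as np

def hsic(Y, X, sigma=1.0):
    """Empirical HSIC, Eq. (10), with Gaussian kernels of bandwidth sigma."""
    def gram(W):
        W = np.asarray(W, dtype=float)
        if W.ndim == 1:
            W = W[:, None]
        sq = ((W[:, None, :] - W[None, :, :])**2).sum(axis=2)
        return np.exp(-sq / (2 * sigma**2))  # Gaussian kernel matrix

    K, L = gram(X), gram(Y)
    n = K.shape[0]
    # the three sums of Eq. (10); this is the biased V-statistic estimator
    term1 = (K * L).sum() / n**2
    term2 = K.sum() * L.sum() / n**4
    term3 = 2 * (K @ L).sum() / n**3
    return term1 + term2 - term3

rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = np.sin(x) + 0.1 * rng.normal(size=300)  # nonlinear dependence on x
z = rng.normal(size=300)                    # independent of x
print(hsic(y, x), hsic(z, x))  # the dependent pair scores higher
```

The third sum uses the identity $\sum_{i,j,q} k_{ij}\ell_{iq} = \sum_{i,q}(KL)_{iq}$ for symmetric kernel matrices, which keeps the computation at matrix-product cost.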
Exploration
In machine learning problems, complete formal descriptions of the DGP are often impractical. However, there are advantages to gaining some understanding of the dependence structure. In particular, such an understanding is useful when inference about the data generating process is desired, such as in the contexts of causal inference, scientific inquiries (in general), or in qualitative investigations (cf. Navarro, 2018). In a regression or classification setting, the dependence structure between the features and response is an immediate point of focus. As we demonstrate in “Recognising dependence: Example 2”, the dependence structure cannot always be effectively probed by computing measures of dependence between labels and feature subsets, even when the number of marginal contributions is relatively small. In such cases, the Shapley value may not only allow us to summarise the interactions from many marginal contributions, but also to fairly distribute strength of dependence to the features.
Attributed dependence on labels (ADL) can be used for exploration in the absence of, or prior to, a choice of model; but, ADL can also be used in conjunction with a model—for example, to support, and even validate, model explanations. Even when a machine learning model is not parsimonious enough to be considered explainable, stakeholders in high risk settings may depend on the statement that “feature X_{i} is important for determining Y” in general. However, it is not always clear, in practice, whether such a statement about feature importance is being used to describe a property of the model, or a property of the DGP. In the following example, we demonstrate that ADL can be used to make statements about the DGP and to help qualify statements about a model.
Recognising dependence: Example 2
Consider a DGP involving the XOR function of two binary random variables X_{1}, X_{2}, with distributions given by P(X_{1} = 1) = P(X_{2} = 1) = 1∕2. The response is given by (12) $Y = \mathrm{XOR}(X_1, X_2) = X_1(1 - X_2) + X_2(1 - X_1).$ Notice that P(Y = i ∣ X_{k} = j) = P(Y = i), for all i, j ∈ {0, 1} and k ∈ {1, 2}. Thus, in this example, Y is completely statistically independent of each individual feature. However, since Y is determined entirely by (X_{1}, X_{2}), it is clear that Y is statistically dependent on the pair. Thus, the features individually appear to have no impact on the response, yet together they have a strong impact when their mutual influence is considered.
Faced with a sample from (Y, X_{1}, X_{2}), when the DGP is unknown, a typical exploratory practice is to compute a sample correlation matrix to estimate Cor(Y, X), producing all pairwise sample correlations as estimates of Cor(Y, X_{i}), for i ∈ [d]. A similar approach, in the presence of suspected nonlinearity, is to produce all pairwise distance correlations, or all pairwise HSIC values, rather than all pairwise correlations. Both of the above approaches are model-independent. For comparison, consider a pairwise model-dependent approach: fitting individual single-feature models M_{i}, for i ∈ [d], that each predict Y as a function of one feature X_{i}; and reporting a measure of model performance for each of the d models, standardised by the result of a null feature model—that is, a model with no features (that may, for example, guess labels completely at random, or may use empirical moments of the response distribution to inform its guesses, ignoring X entirely).
As demonstrated by the results in Table 1, pairwise methods cannot capture interaction effects and mutual dependencies between features. Shapley feature attributions can overcome this limitation, both in the case of Sunnies and in the case of SHAP: by taking an exhaustive, permutations-based approach, Shapley values effectively account for partial dependencies and interaction effects amongst features. Note that all the Sunnies marginal contributions can, in this example, be derived from Table 1: the pairwise results state that $\hat{D}_{\mathbf{y}}(\{1\}) = \hat{D}_{\mathbf{y}}(\{2\}) = \hat{D}'_{\mathbf{y}}(\{1\}) = \hat{D}'_{\mathbf{y}}(\{2\}) = \hat{H}_{\mathbf{y}}(\{1\}) = \hat{H}_{\mathbf{y}}(\{2\}) = 0,$ and from Table 1, by efficiency, we can also derive $\hat{D}_{\mathbf{y}}(\{1, 2\}) = \hat{D}'_{\mathbf{y}}(\{1, 2\}) = 2 \times 0.265 = 0.53$ and $\hat{H}_{\mathbf{y}}(\{1, 2\}) = 2 \times 0.16 = 0.32$.
Table 1:

Method                    Result X_1    Result X_2
SHAP                      3.19          3.19
Shapley DC                0.265         0.265
Shapley AIDC              0.265         0.265
Shapley HSIC              0.16          0.16
Pairwise XGB              0             0
Pairwise dependence       0             0
XGB feature importance    1             0
The discrete XOR example demonstrates that ADL captures important symmetry between features, while pairwise methods fail to do so. The results in the final two rows of Table 1 are produced as follows: we train an XGBoost classifier on the discrete XOR problem in Eq. (12). Then, to ascertain the importance of each of the features X_{1} and X_{2}, in determining the target class, we use the XGBoost “feature importance” method, which defines a feature’s gain as “the improvement in accuracy brought by a feature to the branches it is on” (see https://xgboost.readthedocs.io/en/latest/Rpackage/discoverYourData.html).
User reports suggest that the XGBoost feature importance method can be unstable for less important features and in the presence of strong correlations between features (see, e.g., https://stats.stackexchange.com/questions/279730/). However, in the current XOR example, features X_{1} and X_{2} are statistically independent (thus uncorrelated) and have the maximum importance that two equally important features can share (that is, together they produce the response deterministically).
Although the XGBoost classifier easily achieves perfect classification accuracy on a validation set, the associated XGBoost gain for X_{1} is Gain(X_{1}) ≈ 0, while Gain(X_{2}) ≈ 1, or vice versa. In other words, the full weight of the XGBoost feature importance under XOR is given to one feature or the other. This is misleading, as both features are equally important in determining XOR, and either feature alone is insufficient to achieve a classification accuracy greater than random guessing. In practice, ADL can help identify such flaws in other model explanation methods.
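The pairwise blindness to XOR is easy to verify empirically. A minimal sketch of the DGP in Eq. (12) (seed and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
X1 = rng.integers(0, 2, n)
X2 = rng.integers(0, 2, n)
Y = X1 ^ X2  # XOR, Eq. (12)

# Pairwise: Y is independent of each feature alone ...
print(np.corrcoef(Y, X1)[0, 1], np.corrcoef(Y, X2)[0, 1])  # both ~ 0
print(Y[X1 == 1].mean(), Y.mean())  # conditional and marginal means agree

# ... but jointly the pair determines Y exactly.
print(Y[(X1 == 1) & (X2 == 0)].mean())  # 1.0
```

Any method built solely from the first two quantities, including pairwise correlation, pairwise distance correlation and single-feature models, reports zero signal here; only joint (coalition-based) measures see the third.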
Diagnostics
In the following diagnostics sections, we present results using the distance correlation. However, similar results can also be obtained using the HSIC and the AIDC.
Model attributed dependence
Given a fitted model f, with associated predictions $\hat{Y} = f(X)$, we seek to attribute shortcomings of the fitted model to individual features. We can do this by calculating the Shapley decomposition of the estimated strength of dependence between the model residuals $\varepsilon = Y - \hat{Y}$ and the features X. In other words, feature v receives the attribution $\varphi_v(C_{\varepsilon})$, estimated by $\varphi_v(\hat{C}_{\mathbf{e}})$, where $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$. We refer to this as the Attributed Dependence on Residuals (ADR) for feature v.
A different technique for diagnosing model misspecification is to calculate the Shapley decomposition of the estimated strength of dependence between $\hat{Y}$ and X, so that each feature v receives attribution $\varphi_v(\hat{C}_{\hat{\mathbf{y}}})$. We call this the Attributed Dependence on Predictions (ADP) for feature v. This picture of the model-generated dependence structure may then be compared, for example, to the observed dependence structure given by the ADL $\{\varphi_v(\hat{C}_{\mathbf{y}}) : v \in [d]\}$. The diagnostic goal, then, may be to check that, for all v, (13) $\left| \varphi_v(\hat{C}_{\hat{\mathbf{y}}}) - \varphi_v(\hat{C}_{\mathbf{y}}) \right| < \delta,$ for some tolerance δ. In other words, a diagnostic strategy making use of ADP is to compare estimates of feature importance under the model’s representation of the joint distribution to estimates of feature importance under the empirical joint distribution, and thus to inspect each feature individually for an apparent change in predictive relevance.
We note that these techniques, ADP and ADR, are agnostic to the chosen model. All that is needed is the model outputs and the corresponding model inputs—the inner workings of the model are irrelevant for attributing dependence on predictions and residuals to individual features in this way.
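As an illustrative sketch of the ADR idea, consider an ordinary least squares model fit to a DGP that is nonlinear in one feature. For brevity we report the per-feature distance correlations with the residuals rather than the full Shapley decomposition; the DGP, seed and sample size are ours, not from the paper:

```python
import numpy as np

def dcor2(y, x):
    """Squared empirical distance correlation (biased estimator), 1-D inputs."""
    def cen(w):
        B = np.abs(w[:, None] - w[None, :])  # pairwise distances
        # double centring, as in the A(W) matrix of Eq. (6)
        return B - B.mean(0) - B.mean(1, keepdims=True) + B.mean()
    A, C = cen(np.asarray(y, float)), cen(np.asarray(x, float))
    n = len(y)
    v = lambda P, Q: (P * Q).sum() / n**2
    den = np.sqrt(v(A, A) * v(C, C))
    return v(A, C) / den if den > 0 else 0.0

rng = np.random.default_rng(4)
n = 1000
X = rng.normal(size=(n, 3))
Y = X[:, 0] + X[:, 1] + X[:, 2]**2  # true DGP: nonlinear in X_3

# Misspecified model: linear in all features.
Z = np.c_[np.ones(n), X]
beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
e = Y - Z @ beta  # residuals

# ADR-style check: residual dependence concentrates on the
# misspecified feature X_3.
adr = [dcor2(e, X[:, j]) for j in range(3)]
print(np.round(adr, 3))
```

The elevated score for X_3 flags the feature whose contribution the model failed to capture, without any reference to the model's internals.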
Demonstration with concept drift
We illustrate the ADR and ADL techniques together with a simple and intuitive synthetic demonstration involving concept drift, where the DGP changes over time, impacting the mean squared prediction error (MSE) of a deployed XGBoost model. The model is originally trained with the assumption that the DGP is static, and the performance of the model is monitored over time with the intention of detecting violations of this assumption, as well as attributing any such violation to one or more features. A subset of the deployed features can be selected for scrutiny, by considering removal only of those selected features from the model. To highlight this, our simulated DGP has 50 features, and we perform diagnostics on 4 out of those 50 features.
For comparison, we compute SAGE values of the model mean squared error (Covert, Lundberg & Lee, 2020) and we compute the mean SHAP values of the logarithm of the model loss function (Lundberg et al., 2020). We will refer to the latter as SHAP-loss. For SAGE and SHAP-loss values, we employ a DGP similar to Eq. (14), with sample size 1000, but with X_{i} ≡ 0 for all i > 4. These features were nullified for tractability of the SAGE computation, since, unlike for Sunnies, the authors are not aware of any established method for selectively computing SAGE values of a subset of the full feature set. SAGE and SHAP-loss were chosen for their popularity and ability to provide global feature importance scores.
At the initial time t = 0, we define the DGP as a function of temporal increments t ∈ ℕ ∪ {0}, (14) $Y = X_1 + X_2 + \left(1 + \frac{t}{10}\right) X_3 + \left(1 - \frac{t}{10}\right) X_4 + \sum_{i=5}^{50} X_i,$ where X_{i} ∼ N(0, 4), for i = 1, 2, 3, 4, and X_{i} ∼ N(0, 0.05), for 5 ≤ i ≤ 50. Features 1 through 4 are the most effectual to begin with, and we can imagine that these were flagged as important during model development, justifying the additional diagnostic attention they enjoy after deployment. We see from Eq. (14) that, after deployment, i.e., during periods 1 ≤ t ≤ 10, the effect of X_{4} decreases linearly to 0, while the effect of X_{3} increases proportionately over time. In what follows, these changes are clearly captured by the residual and response dependence attributions of those features, using the DC characteristic function.
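For readers wishing to replicate the setup, the DGP of Eq. (14) can be simulated as below. This is our own illustrative code (the function name `simulate_dgp` is ours), and we interpret the second parameter of N(·, ·) as a variance, so X_1, …, X_4 have standard deviation 2:

```python
import numpy as np

def simulate_dgp(t, n=1000, rng=None):
    """Draw n samples from the drifting DGP of Eq. (14) at time t."""
    if rng is None:
        rng = np.random.default_rng(0)
    X = np.empty((n, 50))
    X[:, :4] = rng.normal(0.0, 2.0, size=(n, 4))             # N(0, 4): sd = 2
    X[:, 4:] = rng.normal(0.0, np.sqrt(0.05), size=(n, 46))  # N(0, 0.05)
    y = (X[:, 0] + X[:, 1]
         + (1 + t / 10) * X[:, 2]   # effect of X_3 grows with t
         + (1 - t / 10) * X[:, 3]   # effect of X_4 shrinks, vanishing at t = 10
         + X[:, 4:].sum(axis=1))
    return X, y
```

At t = 10 the coefficient of X_4 is exactly 0, which is the drift that the residual and response dependence attributions are expected to detect.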
The results, with a sample size of n = 1000, from the DGP in Eq. (14), are presented in Fig. 3. According to the ADL (top), X_{4} shows early signs of significantly reduced importance $\varphi_4(\hat{C}_{\mathbf{y}})$, as X_{3} shows an increase in importance $\varphi_3(\hat{C}_{\mathbf{y}})$, which is roughly symmetrical to the decrease in $\varphi_4(\hat{C}_{\mathbf{y}})$. The ADR (bottom) shows early significant signs that X_{3} is disproportionately affecting the residuals, with high $\varphi_3(\hat{C}_{\mathbf{e}})$. The increase in residual attribution $\varphi_4(\hat{C}_{\mathbf{e}})$ is also evident, though the observation $\varphi_4(\hat{C}_{\mathbf{e}}) < \varphi_3(\hat{C}_{\mathbf{e}})$ suggests that the drift impact from X_{3} is the larger of the two.
The resulting SAGE and mean SHAP-loss values are presented in Fig. 4. Interestingly, the behaviours of SHAP-loss and SAGE are (up to scale and translation) analogous to the behaviour of ADL, rather than ADR, despite the model-independence of ADL. A reason for this, in this example, is that the feature with higher (resp. lower) dependence on Y contributes less (resp. more) to the residuals. When interpreting the SHAP-loss and SAGE outputs, it is important to note that the model loss increases with t, since the true planar trend in (Y, X_{3}, X_{4}) rotates away from the learned trend. So, while the SAGE and SHAP-loss results may appear to make the paradoxical suggestion that X_{3} is utilised better by the model at t = 10 than at t = 0, this is not the case: SAGE and SHAP-loss do not account for the change in model loss over time. The model loss may decrease more when marginalising a feature under a misspecified model than under a model with lower overall loss.
Demonstration with misspecified model
To illustrate the ADL, ADP and ADR techniques, we demonstrate a case where the model is misspecified on the training set, due to model bias. The inadequacy of this misspecified model is then detected on the validation set. Unlike the example given in “Model attributed dependence”, the DGP is unchanged between the two data sets. The key technique used in this demonstration is the comparison of differences between ADL (calculated in the absence of any model) and ADP (calculated using the output of a fitted model), in order to identify any differences between the attributions of dependence on the labels and of dependence on the predictions produced by the misspecified model. Such a comparison, between model absence and model outputs, is not possible using purely model-dependent Shapley values.
To make this example intuitive, we avoid using a complex model such as XGBoost, in favour of a linear regression model. Since the simulated DGP is also linear, this example allows a simple comparison between the correct model and the misspecified model. The DGP in this example is (15) $Y = X_1 + X_2 + 5 X_3 X_4 X_5 + \epsilon,$ where X_{1}, X_{2}, X_{3} ∼ N(0, 1) are continuous, ε ∼ N(0, 0.1) is a small random error, and X_{4}, X_{5} ∼ Bernoulli(1∕2) are binary. Hence, we can make the interpretation that the effect of X_{3} is modulated by X_{4} and X_{5}, such that X_{3} is effective only if X_{4} = X_{5} = 1. For this demonstration, we fit a misspecified linear model EY = β_0 + βX, where $X^{\mathsf{T}} = (X_i)_{i=1}^{d}$ is the vector of features, and $\beta_0, \beta = (\beta_i)_{i=1}^{d}$ are real coefficients. This is a simple case where the true DGP is unknown to the analyst, who therefore seeks to summarise the 80 marginal contributions from 5 features into 5 Shapley values.
Figure 5 shows the outputs for attributed dependence on labels, residuals and predictions, via ordinary least squares estimation. From these results, we make the following observations:

(i) For X_{3}, the ADP is significantly higher than the ADL.

(ii) For X_{4} and X_{5}, the ADP is significantly lower than the ADL.

(iii) For X_{1} and X_{2}, there is no significant difference between ADP and ADL.

(iv) For X_{1} and X_{2}, the ADR is negative, while X_{3}, X_{4} and X_{5} have positive ADR.
Observations (i) and (ii) suggest that the model EY = β_{0} + βX overestimates the importance of X_{3} and underestimates the importance of X_{4} and X_{5}. Observations (iii) and (iv) suggest that the model may adequately represent X_{1}, X_{2}, X_{3}, but that X_{3}, X_{4} and X_{5} are significantly more important for determining structure in the residuals than X_{1} and X_{2}. A residuals versus fits plot may be useful for confirming that this structure is present and of large enough magnitude to be considered relevant.
Having observed the result in Fig. 5 for the misspecified linear model EY = β_0 + βX, we now fit the correct model, EY = β_0 + βX + X_{3}X_{4}X_{5}, which includes the three-way interaction effect X_{3}X_{4}X_{5}. The results, shown in Fig. 6, indicate no significant difference between the ADL and ADP for any of the features, and no significant difference in ADR between the features.
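The two fits can be reproduced in a few lines of NumPy. This is an illustrative sketch under our own variable names, and we interpret N(0, 0.1) as an error variance of 0.1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# DGP of Eq. (15): X1, X2, X3 continuous standard normal, X4, X5 binary
X = np.column_stack([rng.normal(size=(n, 3)),
                     rng.integers(0, 2, size=(n, 2))])
y = (X[:, 0] + X[:, 1] + 5 * X[:, 2] * X[:, 3] * X[:, 4]
     + rng.normal(0.0, np.sqrt(0.1), size=n))

# Misspecified model: E[Y] = b0 + b.X, with no interaction term
A1 = np.column_stack([np.ones(n), X])
b1, *_ = np.linalg.lstsq(A1, y, rcond=None)
resid_misspec = y - A1 @ b1

# Correct model: append the three-way interaction X3 * X4 * X5
A2 = np.column_stack([A1, X[:, 2] * X[:, 3] * X[:, 4]])
b2, *_ = np.linalg.lstsq(A2, y, rcond=None)
resid_correct = y - A2 @ b2
```

The misspecified fit leaves the interaction structure in its residuals, which is what the positive ADR values for X_3, X_4 and X_5 in Fig. 5 pick up; once the interaction column is included, the residuals reduce to the noise term ε.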
Application to Detecting Gender Bias
We analyse a mortality data set produced by the US Centers for Disease Control (CDC) via the National Health and Nutrition Examination Survey (NHANES I) and the NHANES I Epidemiologic Follow-up Study (NHEFS) (Cox, 1998). The data set consists of 79 features from medical examinations of 14,407 individuals, aged between 25 and 75 years, followed between 1971 and 1992. Amongst these people, 4,785 deaths were recorded before 1992. A version of this data set was also recently made available in the SHAP package (Lundberg & Lee, 2017). The same data were recently analysed in Section 2.7 of Lundberg et al. (2020) (see also https://github.com/suinleelab/treeexplainerstudy/tree/master/notebooks/mortality).
We use a Cox proportional hazards objective function in XGBoost, with learning rate (eta) 0.002, maximum tree depth 3, subsampling ratio 0.5, and 5,000 trees. Our training set contains 3,370 observations, balanced via random sampling to contain an equal number of males and females. We then test the model on three different data sets: an all-male test set of size 1,686, containing all males not in the training data; an all-female test set of size 3,547, containing all females not in the training data; and a gender-balanced test set of size 3,372. The data are labelled with the observed time-to-death of each patient during the follow-up study. For model fitting, we use the 16 features given in Table 2.
Feature name               Feature name
Age                        Sex
Race                       Serum albumin
Serum cholesterol          Serum iron
Serum magnesium            Serum protein
Poverty index              Physical activity
Red blood cells            Diastolic blood pressure
Systolic blood pressure    Total iron binding capacity
Transferrin saturation     Body mass index
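For reference, the training configuration described above corresponds to XGBoost parameters along the following lines. This is a sketch of the stated settings, not the authors' exact training script; `dtrain` would be a DMatrix built from the 16 features in Table 2 with the survival labels.

```python
# Hyperparameters as they would be passed to xgboost.train.
params = {
    "objective": "survival:cox",  # Cox proportional hazards objective
    "eta": 0.002,                 # learning rate
    "max_depth": 3,               # maximum tree depth
    "subsample": 0.5,             # subsampling ratio
}
num_boost_round = 5000            # number of trees
# model = xgboost.train(params, dtrain, num_boost_round=num_boost_round)
```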
Of the features in Table 2, we focus on the Shapley values for a subset of well-established risk factors for mortality: age, physical activity, systolic blood pressure, cholesterol and BMI. Note that the results presented here are purely intended as a proof of concept—the results have not been investigated in a controlled study and none of the authors are experts in medicine. We do not intend for our results to be treated as a work of medical literature.
We decompose dependence on the labels, model predictions and residuals amongst the three features age, systolic blood pressure (SBP) and physical activity (PA), displaying the resulting ADL, ADP and ADR for each of the three test data sets in Fig. 7 (using the DC characteristic function). From this analysis we make the following observations.

(i) Age has a significantly higher attributed dependence on residuals compared with each of the other features, across all three test sets. This suggests that age may play an important role in the structure of the model’s residuals. This observation is supported by the dumbbells for age, which suggest a significant and sizeable difference between attributed dependence on predictions and attributed dependence on labels; that is, we have evidence that the model’s predictions show a greater attributed dependence on age than the labels do.

(ii) For SBP, we observe no significant difference between ADL and ADP for the balanced and all-male test sets. However, in the all-female test set, we do see a significant and moderately sized reduction in the attributed dependence on SBP for the model’s predictions compared with that of the labels. This suggests that the model may represent the relationship between SBP and log relative risk of mortality less effectively on the all-female test set than on the other two test sets. This observation is supported by the attributed dependence on residuals for SBP, which is significantly higher in the all-female test set than in the other two sets.

(iii) For PA, we see a low attributed dependence on residuals, and a non-significant difference between ADL and ADP, for all three test sets. Thus we do not have any reason, from this investigation, to suspect that the effect of physical activity is being poorly represented by the model.
The results regarding potential heterogeneity due to gender and systolic blood pressure are not surprising, given that we expect, a priori, there to be a relationship between systolic blood pressure and risk of mortality (Port et al., 2000), and that studies also indicate this relationship to be nonlinear (Boutitie et al., 2002), as well as dependent on age and gender (Port, Garfinkel & Boyle, 2000). Furthermore, mortality risk also depends on age and gender, independently of blood pressure (Port, Garfinkel & Boyle, 2000). We also expect physical activity to be important in predicting mortality risk (Mok et al., 2019).
Discussion and Future Work
After distinguishing between model-dependent and model-independent Shapley values, in “Measures of nonlinear dependence” we introduce energy distance-based and kernel-based characteristic functions, for the Shapley game formulation, as measures of nonlinear dependence. We assign the name ‘Sunnies’ to Shapley values that arise from such measures.
In “Recognising dependence: Example 1” and “Exploration”, we demonstrate that the resulting modelindependent Shapley values provide reasonable results compared to a number of alternatives on certain DGPs. The alternatives investigated are the XGBoost builtin feature importance score, pairwise measures of nonlinear dependence, and the R^{2} characteristic function. The investigated DGPs are a quadratic form, for its simple nonlinearity; and an XOR functional dependence, for its absence of pairwise statistical dependence. These examples are simple but effective, as they act as counterexamples to the validity of the targeted measures of dependence to which we draw comparison.
In “Diagnostics”, we demonstrate how the Shapley value decomposition, with these nonlinear dependence measures as the characteristic function, can be used for model diagnostics. In particular, we see a variety of interesting examples where model misspecification and concept drift can be identified and attributed to specific features. We approach model diagnostics from two angles, by scrutinising two values: the dependence attributed to features via the model’s predictions (ADP), and the dependence between the model residuals and the input features (ADR). These are proofs of concept, and the techniques of attributed dependence on labels (ADL), ADP and ADR require development to become standard tools. However, the examples highlight the techniques’ potential, and we hope that this encourages greater interest in them.
We provide two demonstrations of the diagnostic methods: in “Demonstration with concept drift”, we use a data generating process which changes over time, and where the deployed model was trained at one initial point in time. Here, Sunnies successfully uncovers changes in the dependence structures of interest, and attributes them to the correct features, early in the dynamic process. The second demonstration, in “Demonstration with misspecified model”, shows how we use the attributed dependence on labels, model predictions and residuals to detect which features’ dependencies or interactions are not being correctly captured by the model. Implicit in these demonstrations is the notion that the information from many marginal contributions is being summarised into a human-digestible number of quantities. For example, in “Demonstration with misspecified model”, the 80 marginal contributions from 5 features are summarised as 5 Shapley values, in each of ADL, ADP and ADR, facilitating the simple graphical comparison in Fig. 5.
There is a practical difference between model-independent and model-dependent methods, highlighted in “Demonstration with misspecified model”, when comparing the dependence structure in a data set to the dependence structure captured by a model. Model-independent methods can be applied not only to model predictions and residuals, but also to data labels. Thus, techniques using model-independent Shapley values will be markedly different from model-dependent methods in both design and interpretation. Indeed, consider that there is a different interpretation between (a) the decomposition of a measure of statistical dependence, e.g., as a measure of distance between the joint distribution functions with and without the independence assumption, and (b) the attribution of a measure of the functional dependence of a model on the value of its inputs.
While the DC does provide a population-level (asymptotic) guarantee that dependence will be detected, it must be noted that, as discussed in “Distance correlation and affine invariant distance correlation”, the DC tends to be greater for a linear association than for a nonlinear association. Such properties are strengths or weaknesses not of using a measure of nonlinear statistical dependence as the Shapley value characteristic function (i.e., the method we call Sunnies), but of the particular choice of characteristic function within this method. Work is needed to investigate other measures of statistical dependence in place of DC, HSIC or AIDC, and to provide a comparison between these methods, including a detailed analysis of strengths, limitations and computational efficiency. In this paper, we have focused not on such a detailed experimental evaluation and comparison, but on the exposition of the Sunnies method itself. A potential alternative to our use of energy correlation and HSIC is the class of maximal information-based nonparametric exploration (MINE) statistics, or other mutual information-based measures (Kinney & Atwal, 2014; Reshef et al., 2011).
Finally, in “Application to Detecting Gender Bias”, we apply Sunnies to a study on mortality data, with the aim of detecting effects caused by gender differences. We find that, when the model is trained on a gender-balanced data set, a significant difference is detected between the model’s representation of the dependence structure via its predictions (ADP) and the dependence structure on the labels (ADL); a difference which is significant for females and not for males, even though the training data were gender-balanced. Although we do not claim that our result is causal, it does provide evidence of the potential of Sunnies to uncover and attribute discrepancies that may otherwise go unnoticed in real data.
A well-known limitation when working with Shapley values is their exponential computational time complexity. Ideally, in “Application to Detecting Gender Bias”, we would have calculated Shapley values of all 17 features. However, it is important to note that we do not need to calculate Shapley values of all features if there is prior knowledge available regarding interesting or important features, or if features can be partitioned into independent blocks. To illustrate the idea of taking advantage of independent blocks, suppose we have a model with 15 features. If we know in advance that these features partition into 3 independent blocks of 5 features, then we can decompose the pairwise dependence of each block into 5 Shapley values. In this way, 15 Shapley values are computed from 240 within-block marginal contributions, rather than the full number of 245,760 marginal contributions. In the future, it may be interesting to also consider the computational efficiencies that may arise in scenarios where the sparsity structures of marginal contributions can be directly exploited, as well as the potential for examining such marginal contributions directly (e.g., via visualisation).
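The savings from independent blocks can be checked with a quick count, using the same convention as in “Demonstration with misspecified model” (each of d features has one marginal contribution per subset of the remaining d − 1 features, e.g., 5 · 2⁴ = 80 for 5 features):

```python
def n_marginal_contributions(d):
    # each of d features has 2**(d - 1) marginal contributions,
    # one per subset of the remaining d - 1 features
    return d * 2 ** (d - 1)

blocked = 3 * n_marginal_contributions(5)   # three independent blocks of 5
full = n_marginal_contributions(15)         # all 15 features jointly
```

The blocked computation thus requires roughly a thousandth of the marginal contributions of the full decomposition.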
Finally, note that we have made the distinction that Shapley feature importance methods may or may not be model-dependent, but this distinction holds for model explanation methods in general. We believe that complete and satisfactory model explanations should ideally include a description from both categories.
All code and data necessary to produce the results in this manuscript are available on github.com/ex2o/sunnies.