shap.TreeExplainer

class shap.TreeExplainer(model, data=None, model_output='raw', feature_perturbation='interventional', **deprecated_options)

Uses Tree SHAP algorithms to explain the output of ensemble tree models.

Tree SHAP is a fast and exact method to estimate SHAP values for tree models and ensembles of trees, under several different possible assumptions about feature dependence. It depends on fast C++ implementations either inside an external model package or in the local compiled C extension.

Parameters
model : model object

The tree based machine learning model that we want to explain. XGBoost, LightGBM, CatBoost, Pyspark and most tree-based scikit-learn models are supported.

data : numpy.array or pandas.DataFrame

The background dataset to use for integrating out features. This argument is optional when feature_perturbation=”tree_path_dependent”, since in that case we can use the number of training samples that went down each tree path as our background dataset (this is recorded in the model object).

feature_perturbation : “interventional” (default) or “tree_path_dependent” (default when data=None)

Since SHAP values rely on conditional expectations we need to decide how to handle correlated (or otherwise dependent) input features. The “interventional” approach breaks the dependencies between features according to the rules dictated by causal inference (Janzing et al. 2019). Note that the “interventional” option requires a background dataset and its runtime scales linearly with the size of the background dataset you use. Anywhere from 100 to 1000 random background samples are good sizes to use. The “tree_path_dependent” approach is to just follow the trees and use the number of training examples that went down each leaf to represent the background distribution. This approach does not require a background dataset and so is used by default when no background dataset is provided.

model_output : “raw”, “probability”, “log_loss”, or model method name

What output of the model should be explained. If “raw” then we explain the raw output of the trees, which varies by model. For regression models “raw” is the standard output, while for binary classification in XGBoost it is the log odds ratio. If model_output is the name of a supported prediction method on the model object then we explain the output of that method. For example model_output=”predict_proba” explains the result of calling model.predict_proba. If “probability” then we explain the output of the model transformed into probability space (note that this means the SHAP values now sum to the probability output of the model). If “log_loss” then we explain the natural log of the model loss function, so that the SHAP values sum up to the log loss of the model for each sample. This is helpful for breaking down model performance by feature. Currently the “probability” and “log_loss” options are only supported when feature_perturbation=”interventional”.

Examples

See Tree Explainer Examples
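
As a minimal sketch (not taken from the linked examples, and assuming shap, numpy, and scikit-learn are installed), the typical workflow looks like this:

import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Train a small model on synthetic data
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = X[:, 0] + 2 * X[:, 1] + 0.1 * rng.randn(200)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Passing a background dataset selects the "interventional" approach
explainer = shap.TreeExplainer(model, data=X[:100],
                               feature_perturbation="interventional")
shap_values = explainer.shap_values(X[:10])

# Each row of SHAP values sums to (model output - expected value)
print(np.allclose(shap_values.sum(axis=1) + explainer.expected_value,
                  model.predict(X[:10])))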

__init__(model, data=None, model_output='raw', feature_perturbation='interventional', **deprecated_options)

Uses Shapley values to explain any machine learning model or python function.

This is the primary explainer interface for the SHAP library. It takes any combination of a model and masker and returns a callable subclass object that implements the particular estimation algorithm that was chosen.

Parameters
model : object or function

User supplied function or model object that takes a dataset of samples and computes the output of the model for those samples.

masker : function, numpy.array, pandas.DataFrame, tokenizer, or a list of these for each model input

The function used to “mask” out hidden features of the form masked_args = masker(*model_args, mask=mask). It takes input in the same form as the model, but for just a single sample with a binary mask, then returns an iterable of masked samples. These masked samples will then be evaluated using the model function and the outputs averaged, i.e. model(*masker(*args, mask=mask)).mean(). As a shortcut for the standard masking used by SHAP you can pass a background data matrix instead of a function and that matrix will be used for masking. Domain specific masking functions are available in shap such as shap.ImageMasker for images and shap.TokenMasker for text. In addition to determining how to replace hidden features, the masker can also constrain the rules of the cooperative game used to explain the model. For example shap.TabularMasker(data, hclustering=”correlation”) will enforce a hierarchical clustering of coalitions for the game (in this special case the attributions are known as the Owen values).

link : function

The link function used to map between the output units of the model and the SHAP value units. By default it is shap.links.identity, but shap.links.logit can be useful so that expectations are computed in probability units while explanations remain in the (more naturally additive) log-odds units. For more details on how link functions work see any overview of link functions for generalized linear models.
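
For instance, a quick numeric check of the two built-in links (a sketch assuming only that shap is installed):

import shap

p = 0.8
print(shap.links.identity(p))  # 0.8: identity leaves model units unchanged
print(shap.links.logit(p))     # log(0.8 / 0.2) ≈ 1.386: probability to log-odds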

algorithm : “auto”, “permutation”, “partition”, “tree”, “kernel”, “sampling”, “linear”, “deep”, or “gradient”

The algorithm used to estimate the Shapley values. There are many different algorithms that can be used to estimate the Shapley values (and the related value for constrained games); each of these algorithms has various tradeoffs and is preferable in different situations. By default the “auto” option attempts to make the best choice given the passed model and masker, but this choice can always be overridden by passing the name of a specific algorithm. The type of algorithm used will determine what type of subclass object is returned by this constructor, and you can also build those subclasses directly if you prefer or need more fine-grained control over their options.

output_names : None or list of strings

The names of the model outputs. For example if the model is an image classifier, then output_names would be the names of all the output classes. This parameter is optional. When output_names is None then the Explanation objects produced by this explainer will not have any output_names, which could affect downstream plots.
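
For example, with the default algorithm=”auto” the constructor inspects the model and masker and returns a matching subclass. A small sketch (assuming scikit-learn is installed; the printed class name may vary by shap version):

import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.randn(100, 4)
y = (X[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# A background data matrix serves as the masker; "auto" picks the Tree algorithm
explainer = shap.Explainer(model, X)
print(type(explainer))  # e.g. a Tree explainer subclass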

Methods

__init__(model[, data, model_output, …])

Uses Shapley values to explain any machine learning model or python function.

assert_additivity(phi, model_output)

explain_row(*row_args, max_evals, …)

Explains a single row and returns the tuple (row_values, row_expected_values, row_mask_shapes, main_effects).

shap_interaction_values(X[, y, tree_limit])

Estimate the SHAP interaction values for a set of samples.

shap_values(X[, y, tree_limit, approximate, …])

Estimate the SHAP values for a set of samples.

supports_model(model)

Determines if this explainer can handle the given model.

explain_row(*row_args, max_evals, main_effects, error_bounds, outputs, silent, **kwargs)

Explains a single row and returns the tuple (row_values, row_expected_values, row_mask_shapes, main_effects).

This is an abstract method meant to be implemented by each subclass.

Returns
tuple

A tuple of (row_values, row_expected_values, row_mask_shapes, main_effects), where row_values is an array of the attribution values for each sample, row_expected_values is an array (or single value) representing the expected value of the model for each sample (which is the same for all samples unless there are fixed inputs present, like labels when explaining the loss), row_mask_shapes is a list of all the input shapes (since the row_values is always flattened), and main_effects is the set of main effect attributions for each feature (when main effects are computed).

shap_interaction_values(X, y=None, tree_limit=None)

Estimate the SHAP interaction values for a set of samples.

Parameters
X : numpy.array, pandas.DataFrame or catboost.Pool (for catboost)

A matrix of samples (# samples x # features) on which to explain the model’s output.

y : numpy.array

An array of label values for each sample. Used when explaining loss functions (not yet supported).

tree_limit : None (default) or int

Limit the number of trees used by the model. None (the default) means use the tree limit stored in the original model, and -1 means no limit.

Returns
array or list

For models with a single output this returns a tensor of SHAP values (# samples x # features x # features). The matrix (# features x # features) for each sample sums to the difference between the model output for that sample and the expected value of the model output (which is stored in the expected_value attribute of the explainer). Each row of this matrix sums to the SHAP value for that feature for that sample. The diagonal entries of the matrix represent the “main effect” of that feature on the prediction and the symmetric off-diagonal entries represent the interaction effects between all pairs of features for that sample. For models with vector outputs this returns a list of tensors, one for each output.
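
A sketch of these summing properties (the explainer here is built without a background dataset, since interaction values are computed with the “tree_path_dependent” approach; assumes scikit-learn is installed):

import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = X[:, 0] * X[:, 1] + X[:, 2]  # contains a genuine interaction
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)  # tree_path_dependent (no data given)
inter = explainer.shap_interaction_values(X[:5])
print(inter.shape)  # (5, 5, 5): samples x features x features

# The full matrix for each sample sums to (prediction - expected value)
print(np.allclose(inter.sum(axis=(1, 2)) + explainer.expected_value,
                  model.predict(X[:5])))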

shap_values(X, y=None, tree_limit=None, approximate=False, check_additivity=True, from_call=False)

Estimate the SHAP values for a set of samples.

Parameters
X : numpy.array, pandas.DataFrame or catboost.Pool (for catboost)

A matrix of samples (# samples x # features) on which to explain the model’s output.

y : numpy.array

An array of label values for each sample. Used when explaining loss functions.

tree_limit : None (default) or int

Limit the number of trees used by the model. None (the default) means use the tree limit stored in the original model, and -1 means no limit.

approximate : bool

Run fast, but only roughly approximate the Tree SHAP values. This runs a method previously proposed by Saabas which only considers a single feature ordering. Take care since this does not have the consistency guarantees of Shapley values and places too much weight on lower splits in the tree.

check_additivity : bool

Run a validation check that the sum of the SHAP values equals the output of the model. This check takes only a small amount of time, and will catch potential unforeseen errors. Note that this check only runs right now when explaining the margin of the model.

Returns
array or list

For models with a single output this returns a matrix of SHAP values (# samples x # features). Each row sums to the difference between the model output for that sample and the expected value of the model output (which is stored in the expected_value attribute of the explainer when it is constant). For models with vector outputs this returns a list of such matrices, one for each output.
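
For instance, a sketch that checks this additivity property directly and contrasts it with the approximate mode described above (assumes scikit-learn is installed):

import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.randn(150, 4)
y = X @ np.array([1.0, -2.0, 0.5, 0.0])
model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
sv = explainer.shap_values(X[:5])

# Exact Tree SHAP: rows sum to (prediction - expected value)
print(np.allclose(sv.sum(axis=1) + explainer.expected_value,
                  model.predict(X[:5])))

# approximate=True uses the faster Saabas-style single-ordering method,
# which trades away the consistency guarantees of exact Shapley values
sv_fast = explainer.shap_values(X[:5], approximate=True)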

static supports_model(model)

Determines if this explainer can handle the given model.

This is an abstract static method meant to be implemented by each subclass.