Many to One Explainer

The Many to One Explainer creates rule-based explanations for many-to-one relationships. It provides insights into how the input features define groups of output features.

Method Signature

ExpDataFrame.explain(
 explainer: Literal['fedex', 'outlier', 'many_to_one', 'shapley', 'metainsight']='fedex',
 attributes: List = None,
 use_sampling: bool | None = None,
 sample_size: int | float = 5000,
 labels=None,
 coverage_threshold: float = 0.7,
 max_explanation_length: int = 3,
 separation_threshold: float = 0.3,
 p_value: int = 1,
 explanation_form: Literal['conj', 'disj', 'conjunction', 'disjunction'] = 'conj',
 prune_if_too_many_labels: bool = True,
 max_labels: int = 10,
 pruning_method='largest',
 bin_numeric: bool = False,
 num_bins: int = 10,
 binning_method: str = 'quantile',
 label_name: str = 'label',
 explain_errors=True,
 error_explanation_threshold: float = 0.05,
)

Many to One Explainer Usage Example

# Import the necessary libraries
import pandas as pd
import pd_explain

# Load the "adult" dataset
adult = pd.read_csv(r'C:\adult.csv')

# Call the many to one explainer
adult.explain(explainer='many_to_one', labels='class')

Output:

+-----------------+----------------------------------------------------------+----------+------------------+--------------------------+
| Group / Cluster | Explanation                                              | Coverage | Separation Error | Separation Error Origins |
+=================+==========================================================+==========+==================+==========================+
| <=50K           | 1 <= education-num <= 10                                 | 0.75     | 0.15             | 100.00% from group >50K  |
| <=50K           | 0 <= capital-gain <= 5095.5                              | 1.0      | 0.21             | 100.00% from group >50K  |
| <=50K           | 0 <= capital-gain <= 5095.5 AND 1 <= education-num <= 10 | 0.75     | 0.13             | 100.00% from group >50K  |
| <=50K           | 0 <= capital-gain <= 4243.5                              | 0.99     | 0.2              | 100.00% from group >50K  |
| >50K            | No explanation found                                     | NaN      | NaN              | NaN                      |
+-----------------+----------------------------------------------------------+----------+------------------+--------------------------+

Coverage is the fraction of the group that is covered by the explanation. Separation Error is the fraction of the data outside the group that is also covered by the explanation, so lower values indicate a rule that separates the group more cleanly.
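
The two metrics can be sketched in plain pandas. This is a minimal illustration that follows the definitions above, assuming a small toy DataFrame and a hypothetical helper named rule_metrics; it is not pd_explain's internal implementation.

```python
import pandas as pd

# Toy data (illustrative only): one numeric feature and a two-valued label column.
df = pd.DataFrame({
    "education-num": [5, 9, 10, 13, 14, 8, 16, 7],
    "class": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K", ">50K"],
})

def rule_metrics(rule_mask, labels, group):
    """Hypothetical helper: coverage and separation error of a rule for one group."""
    in_group = labels == group
    # Fraction of the group's rows that satisfy the rule.
    coverage = (rule_mask & in_group).sum() / in_group.sum()
    # Fraction of the rows outside the group that also satisfy the rule.
    separation_error = (rule_mask & ~in_group).sum() / (~in_group).sum()
    return coverage, separation_error

# The rule "1 <= education-num <= 10" as a boolean mask.
rule = df["education-num"].between(1, 10)
coverage, separation_error = rule_metrics(rule, df["class"], "<=50K")
print(coverage, separation_error)
```

In this toy data the rule covers every row of the <=50K group and one of the four rows outside it.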

Parameters

  • explainer (str): The explainer to use. This is shared with other explainers, but for the many to one explainer, it must be set to many_to_one.

  • attributes (list, optional): The attributes to consider when generating explanations. Default is None.

  • use_sampling (bool | None, optional): Whether to use sampling to speed up the computation. Default is to use the global setting.

  • sample_size (int | float, optional): The number of samples to use. Default is 5000. Using a float between 0 and 1 will use that fraction of the data.

  • labels (str | list | Series | DataFrame | ndarray | None): The labels defining the many to one relationship. Can be a name (or list of names) of a column in the DataFrame, a Series, a DataFrame, a numpy array, or None. None is only applicable when the explainer is called on the result of a GroupBy operation, in which case the GroupBy groups will be inferred automatically. Otherwise, the labels must be provided. Defaults to None.

  • coverage_threshold (float, optional): The minimum coverage required for an explanation to be considered. Default is 0.7.

  • separation_threshold (float, optional): The maximum separation error allowed for an explanation to be considered. Default is 0.3.

  • max_explanation_length (int, optional): The maximum number of conditions in an explanation. Default is 3.

  • p_value (float, optional): A scaling parameter for the number of top attributes to consider when generating explanations. Number of attributes to consider = p_value * max_explanation_length. Default is 1.

  • explanation_form (str, optional): The form of the explanation. Default is conj (conjunction). The other options are disj (disjunction), or the long forms conjunction and disjunction.

  • prune_if_too_many_labels (bool, optional): Whether to prune the labels to a smaller subset if there are too many. Default is True.

  • max_labels (int, optional): The number of labels to keep if prune_if_too_many_labels is True. If there are fewer labels, no pruning will be performed. Default is 10.

  • pruning_method (str, optional): The method to use for pruning labels. The options are:

    • largest: Keeps the k most frequent labels.

    • smallest: Keeps the k least frequent labels.

    • random: Keeps k random labels.

    • max_dist: Keeps the k labels with the largest mean distance between their centroids and the centroids of other labels.

    • min_dist: Keeps the k labels with the smallest mean distance between their centroids and the centroids of other labels.

    • max_silhouette: Keeps k labels with the largest silhouette score.

    • min_silhouette: Keeps k labels with the smallest silhouette score.

    Default is largest.

  • bin_numeric (bool, optional): If the labels are numeric, whether to bin them into categories. Default is False.

  • num_bins (int, optional): The number of bins to use if bin_numeric is True. If there are fewer unique values than num_bins, no binning will be performed. Default is 10.

  • binning_method (str, optional): The method to use for binning. The options are:

    • uniform: Bins are of equal width.

    • quantile: Bins are of equal frequency.

    Default is quantile.

  • label_name (str, optional): The name to give the labels if they are binned. Default is label. Only needed if the labels do not come from a Series / DataFrame with a name, and it only affects how the labels are displayed in the explanation. For example, you may see x < label <= y as a group name.

  • explain_errors (bool, optional): Whether to provide explanations for the origins of the separation error. Default is True.

  • error_explanation_threshold (float, optional): The threshold for how much a group must individually contribute to the separation error to appear in the explanation. Groups that contribute less than this will be grouped together. Default is 0.05.
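
To make the frequency-based pruning methods more concrete, here is a rough sketch of what largest and smallest could look like in plain pandas. The helper prune_labels is hypothetical, and dropping the rows of pruned labels is an assumption made for this sketch; the library may handle them differently.

```python
import pandas as pd

def prune_labels(labels: pd.Series, max_labels: int, method: str = "largest") -> pd.Series:
    """Hypothetical helper: keep only the rows whose label survives pruning."""
    counts = labels.value_counts()
    if len(counts) <= max_labels:
        return labels  # few enough labels, so nothing is pruned
    if method == "largest":
        kept = counts.nlargest(max_labels).index   # the k most frequent labels
    elif method == "smallest":
        kept = counts.nsmallest(max_labels).index  # the k least frequent labels
    else:
        raise ValueError(f"method not covered by this sketch: {method}")
    return labels[labels.isin(kept)]

labels = pd.Series(["a", "a", "a", "b", "b", "c", "d", "d", "d", "d"])
pruned = prune_labels(labels, max_labels=2)
print(sorted(pruned.unique()))
```

With max_labels=2, the largest method keeps the labels a and d here, while smallest would keep b and c.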

Other Usage Examples

We will now show other examples of how to use the many to one explainer with different parameters.

Example 1: Explaining Clustering Results

The many to one explainer works on any many-to-one relationship, including clustering results.

# Import the necessary libraries
import pandas as pd
import pd_explain
from sklearn.cluster import KMeans

# Load the adult dataset
adult = pd.read_csv(r'C:\adult.csv')

# Perform a clustering operation on the numeric columns only (KMeans cannot handle string columns)
clusters = KMeans(n_clusters=3).fit_predict(adult.select_dtypes(include='number'))

# Call the many to one explainer
adult.explain(explainer='many_to_one', labels=clusters)

Output:

+-----------------+----------------------------------------------------------------+----------+------------------+-------------------------------+
| Group / Cluster | Explanation                                                    | Coverage | Separation Error | Separation Error Origins      |
+=================+================================================================+==========+==================+===============================+
| 0               | 149278.5 <= fnlwgt <= 1490400                                  | 1.0      | 0.22             | 100.00% from group 1          |
| 0               | 149278.5 <= fnlwgt <= 1490400 AND 8.5 <= education-num <= 16.0 | 0.87     | 0.21             | 100.00% from group 1          |
| 1               | 291277.5 <= fnlwgt <= 1490400                                  | 1.0      | 0.0              | Rule has no separation error. |
| 2               | 13769 <= fnlwgt <= 149278.5                                    | 1.0      | 0.0              | Rule has no separation error. |
+-----------------+----------------------------------------------------------------+----------+------------------+-------------------------------+

Example 2: Explaining GroupBy Groups

If you perform a group-by operation, you can then call the many to one explainer on the result to get insights into the groups. Simply leave the labels parameter as None to infer the groups from the DataFrame. Note that labels can be left as None only after a group-by operation; in every other case you must provide the labels.

# Import the necessary libraries
import pandas as pd
import pd_explain

# Load the adult dataset
adult = pd.read_csv(r'C:\adult.csv')

# Perform a group by operation
gb_res = adult.groupby(['workclass', 'marital-status']).mean(numeric_only=True)

# Call the many to one explainer, with some additional optional parameters to customize the output
gb_res.explain(explainer='many_to_one', pruning_method='random', max_labels=3)

Output:

+---------------------------------------------+-----------------------------------------+----------+------------------+-----------------------------------------------------------------------------------------------------------------------------+
| Group / Cluster                             | Explanation                             | Coverage | Separation Error | Separation Error Origins                                                                                                    |
+=============================================+=========================================+==========+==================+=============================================================================================================================+
| (' Self-emp-inc', ' Separated')             | 26 <= age <= 69                         | 1.0      | 0.23             | 83.33% from group (' Self-emp-inc', ' Married-spouse-absent'), 16.67% from group (' Without-pay', ' Married-spouse-absent') |
| (' Self-emp-inc', ' Separated')             | occupation !=  Farming-fishing          | 0.95     | 0.17             | 100.00% from group (' Self-emp-inc', ' Married-spouse-absent')                                                              |
| (' Self-emp-inc', ' Married-spouse-absent') | sex !=  Female AND occupation ==  Sales | 0.8      | 0.0              | Rule has no separation error.                                                                                               |
| (' Self-emp-inc', ' Married-spouse-absent') | sex ==  Male AND occupation ==  Sales   | 0.8      | 0.0              | Rule has no separation error.                                                                                               |
| (' Without-pay', ' Married-spouse-absent')  | age == 68                               | 1.0      | 0.0              | Rule has no separation error.                                                                                               |
+---------------------------------------------+-----------------------------------------+----------+------------------+-----------------------------------------------------------------------------------------------------------------------------+

Example 3: Disjunctive Explanations

The many to one explainer can provide explanations based on either conjunctive or disjunctive rules. To get disjunctive explanations, set the explanation_form parameter to disj or disjunction.

# Import the necessary libraries
import pandas as pd
import pd_explain

# Load the adult dataset
adult = pd.read_csv(r'C:\adult.csv')

# Call the many to one explainer with disjunctive explanations,
# considering only the categorical attributes and disabling sampling for more accurate (but slower) results.
adult.explain(explainer='many_to_one', explanation_form='disj', labels='label',
                attributes=['workclass', 'education', 'marital-status', 'occupation', 'relationship'], use_sampling=False)

Output:

+-----------------+--------------------------------------------------------+----------+------------------+--------------------------+
| Group / Cluster | Explanation                                            | Coverage | Separation Error | Separation Error Origins |
+=================+========================================================+==========+==================+==========================+
| <=50K           | occupation != Prof-specialty OR education != Bachelors | 0.96     | 0.23             | 100.00% from group >50K  |
| <=50K           | occupation != Prof-specialty                           | 0.91     | 0.21             | 100.00% from group >50K  |
| >50K            | No explanation found                                   | NaN      | NaN              | NaN                      |
+-----------------+--------------------------------------------------------+----------+------------------+--------------------------+
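
The difference between the two explanation forms can be illustrated with boolean masks. The sketch below uses a toy DataFrame and only shows how a conjunction and a disjunction of the same two conditions select rows; it says nothing about how the explainer searches for rules.

```python
import pandas as pd

# Toy data (illustrative only).
df = pd.DataFrame({
    "occupation": ["Sales", "Prof-specialty", "Sales", "Prof-specialty", "Craft-repair"],
    "education": ["HS-grad", "HS-grad", "Bachelors", "Bachelors", "Bachelors"],
})

cond_a = df["occupation"] != "Prof-specialty"
cond_b = df["education"] != "Bachelors"

conjunction = cond_a & cond_b  # a row matches only if both conditions hold
disjunction = cond_a | cond_b  # a row matches if at least one condition holds

print(conjunction.sum(), disjunction.sum())
```

A disjunction always matches at least as many rows as the conjunction of the same conditions, so it tends to raise coverage, possibly at the cost of a higher separation error.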

Example 4: Passing a DataFrame as Labels

You can pass a DataFrame with more than one column as the labels, and not just a single column. When you do so, each unique combination of column values is treated as a separate label, much like in a group-by operation.

# Import the necessary libraries
import pandas as pd
import pd_explain

# Load the "adult" dataset
adult = pd.read_csv(r'C:\adult.csv')

# Select the labels
labels = adult[['workclass', 'marital-status']]

adult.drop(columns=['workclass', 'marital-status']).explain(explainer='many_to_one', labels=labels, pruning_method='min_dist', max_labels=3)

Output:

+---------------------------------------+--------------------------------------------------+----------+------------------+--------------------------------------------------------------------------------------------------------+
| Group / Cluster                       | Explanation                                      | Coverage | Separation Error | Separation Error Origins                                                                               |
+=======================================+==================================================+==========+==================+========================================================================================================+
| ('State-gov', 'Never-married')        | relationship != Husband AND relationship != Wife | 1.0      | 0.05             | 85.71% from group ('?', 'Married-civ-spouse'), 14.29% from group ('Federal-gov', 'Married-civ-spouse') |
| ('Federal-gov', 'Married-civ-spouse') | occupation != ? AND relationship == Husband      | 0.91     | 0.0              | Rule has no separation error.                                                                          |
| ('?', 'Married-civ-spouse')           | occupation == ?                                  | 1.0      | 0.0              | Rule has no separation error.                                                                          |
+---------------------------------------+--------------------------------------------------+----------+------------------+--------------------------------------------------------------------------------------------------------+
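
The group names in the table above are tuples of column values. As a rough sketch of the idea, assuming a toy labels DataFrame, the unique combinations of a multi-column DataFrame can be obtained like this; the library's own conversion may differ.

```python
import pandas as pd

# Toy labels (illustrative only).
labels = pd.DataFrame({
    "workclass": ["State-gov", "Federal-gov", "State-gov", "?"],
    "marital-status": ["Never-married", "Married-civ-spouse", "Never-married", "Married-civ-spouse"],
})

# Each row becomes one plain tuple; rows with identical tuples share a label.
combined = list(labels.itertuples(index=False, name=None))
unique_groups = sorted(set(combined))

print(unique_groups)
```

Here the four rows collapse into three distinct labels, because the first and third rows share the same combination.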

Example 5: Binning Numeric Labels

If your labels are numeric, you can bin them into categories to get more meaningful explanations. To do this, set the bin_numeric parameter to True, and optionally set the num_bins parameter to control the number of bins.

# Import the necessary libraries
import pandas as pd
import pd_explain

# Load the "adult" dataset
adult = pd.read_csv(r'C:\adult.csv')

# Call the many to one explainer, setting the bin_numeric parameter to True, and using a custom number of bins
adult.explain(explainer='many_to_one', labels='education-num', bin_numeric=True, num_bins=4)

Output:

+------------------------------+------------------------------------------------------+----------+------------------+--------------------------------------------------------------------------------+
| Group / Cluster              | Explanation                                          | Coverage | Separation Error | Separation Error Origins                                                       |
+==============================+======================================================+==========+==================+================================================================================+
| 0.999 < education-num <= 9.0 | education != Some-college AND education != Bachelors | 1.0      | 0.27             | 52.16% from group 13.0 < label <= 16.0, 47.84% from group 10.0 < label <= 13.0 |
| 9.0 < education-num <= 10.0  | education == Some-college                            | 1.0      | 0.0              | Rule has no separation error.                                                  |
| 10.0 < education-num <= 13.0 | No explanation found                                 | NaN      | NaN              | NaN                                                                            |
| 13.0 < education-num <= 16.0 | No explanation found                                 | NaN      | NaN              | NaN                                                                            |
+------------------------------+------------------------------------------------------+----------+------------------+--------------------------------------------------------------------------------+

In this example, since the education-num column came from our dataframe, it had a name to display. Let’s instead provide it as a numpy array, and see how the output changes.

# Import the necessary libraries
import pandas as pd
import pd_explain

# Load the "adult" dataset
adult = pd.read_csv(r'C:\adult.csv')

# Call the many to one explainer with the labels passed as a numpy array (which carries no name), binned into 4 bins
adult.drop(columns='education-num').explain(explainer='many_to_one', labels=adult['education-num'].values, bin_numeric=True, num_bins=4)

Output:

+----------------------+------------------------------------------------------+----------+------------------+--------------------------------------------------------------------------------+
| Group / Cluster      | Explanation                                          | Coverage | Separation Error | Separation Error Origins                                                       |
+======================+======================================================+==========+==================+================================================================================+
| 0.999 < label <= 9.0 | education != Some-college AND education != Bachelors | 1.0      | 0.27             | 52.16% from group 12.0 < label <= 16.0, 47.84% from group 10.0 < label <= 12.0 |
| 9.0 < label <= 10.0  | education == Some-college                            | 1.0      | 0.0              | Rule has no separation error.                                                  |
| 10.0 < label <= 12.0 | No explanation found                                 | NaN      | NaN              | NaN                                                                            |
| 12.0 < label <= 16.0 | No explanation found                                 | NaN      | NaN              | NaN                                                                            |
+----------------------+------------------------------------------------------+----------+------------------+--------------------------------------------------------------------------------+

As you can see, the output now displays the label as label instead of education-num. If we want to change this, we can use the label_name parameter.

# Import the necessary libraries
import pandas as pd
import pd_explain

# Load the "adult" dataset
adult = pd.read_csv(r'C:\adult.csv')

# Same call as before, now using label_name to set the name displayed for the binned labels
adult.drop(columns='education-num').explain(explainer='many_to_one', labels=adult['education-num'].values, bin_numeric=True, num_bins=4, label_name='Education number')

Output:

+---------------------------------+------------------------------------------------------+----------+------------------+------------------------------------------------------------------------------------------------------+
| Group / Cluster                 | Explanation                                          | Coverage | Separation Error | Separation Error Origins                                                                             |
+=================================+======================================================+==========+==================+======================================================================================================+
| 0.999 < Education number <= 9.0 | education != Some-college AND education != Bachelors | 1.0      | 0.27             | 52.16% from group 12.0 < Education number <= 16.0, 47.84% from group 10.0 < Education number <= 12.0 |
| 9.0 < Education number <= 10.0  | education == Some-college                            | 1.0      | 0.0              | Rule has no separation error.                                                                        |
| 10.0 < Education number <= 12.0 | No explanation found                                 | NaN      | NaN              | NaN                                                                                                  |
| 12.0 < Education number <= 16.0 | No explanation found                                 | NaN      | NaN              | NaN                                                                                                  |
+---------------------------------+------------------------------------------------------+----------+------------------+------------------------------------------------------------------------------------------------------+
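
The uniform and quantile options of binning_method correspond to equal-width and equal-frequency binning. The sketch below shows that difference with pandas' cut and qcut on toy values; whether pd_explain uses these exact functions internally is an assumption.

```python
import pandas as pd

# Toy values with one large outlier (illustrative only).
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

uniform_bins = pd.cut(values, bins=2)  # equal-width intervals
quantile_bins = pd.qcut(values, q=2)   # equal-frequency intervals

print(uniform_bins.value_counts().sort_index())
print(quantile_bins.value_counts().sort_index())
```

With equal-width bins the outlier ends up alone in its bin, while equal-frequency bins split the values evenly, which is one reason quantile binning can be a sensible default for skewed labels.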