shap_ops
This module defines the functions needed to apply SHAP to ML models and datasets.
Functions
This module defines the following functions:
- preprocess_df_train: Preprocess the dataframe.
- filter_status_column: Filter out columns from the dataframe.
- filter_production_range: Filter the dataframe using the production range.
- select_best_model: Select the best ridge regression model based on a metric.
- compute_shap_values: Compute SHAP values for a given Ridge model and dataset.
- process_columns: Process and extract relevant columns based on certain conditions.
- apply_shap: Process and apply SHAP (SHapley Additive exPlanations) to datasets.
The main function called inside wip.otm.py is apply_shap.
- wip.datatools.shap_ops.apply_shap(datasets: Dict[str, pd.DataFrame], models_results: Dict[str, List[Dict[str, ...]]], scalers: Dict[str, sklearn.preprocessing.MinMaxScaler], shap_cols: List[str] | None = None) -> pd.DataFrame
Process and apply SHAP (SHapley Additive exPlanations) to the provided datasets.
Given datasets, model results, and scalers, the function computes SHAP values to interpret the output of machine learning models. It returns a DataFrame with information regarding feature importance in relation to the target feature.
- Parameters
datasets (Dict[str, pd.DataFrame]) – Dictionary containing the data for different models. Each key corresponds to a model name, and each value is a pandas.DataFrame.
models_results (Dict[str, List[Dict[str, ...]]]) – Dictionary containing model results for different ridge regression models. Each key corresponds to a model name, and each value is a list of dictionaries with the keys "conf", "model", and "metrics".
scalers (Dict[str, sklearn.preprocessing.MinMaxScaler]) – Dictionary of scalers, one per tag.
shap_cols (List[str] | None, optional) – List of column names in datasets to which SHAP is applied. By default, SHAP is applied to ["compressao", "SE PR", "umidade", "SE PP"].
- Returns
DataFrame containing the columns ‘Range_max’, ‘TAG’, ‘Valor_Real’, ‘Valor_Norm’, and ‘Ascending’ that provide information regarding the SHAP values and their relationship with the target features.
- Return type
pd.DataFrame
Notes
The function applies SHAP specifically to linear models and uses the LinearExplainer from the SHAP library. Given a set of models and datasets, it selects the best model (based on MAPE) for each quality and calculates the SHAP values. Based on the SHAP values and certain conditions, a DataFrame is returned with information about the importance of features for different quality categories.
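The optional shap_cols argument follows the common "None means use the defaults" pattern. A minimal sketch of that pattern is shown here; the helper name resolve_shap_cols and the constant DEFAULT_SHAP_COLS are hypothetical and not part of the module, and only the default values themselves come from the parameter description.

```python
from typing import List, Optional

# Default qualities taken from the shap_cols parameter description.
DEFAULT_SHAP_COLS = ["compressao", "SE PR", "umidade", "SE PP"]


def resolve_shap_cols(shap_cols: Optional[List[str]] = None) -> List[str]:
    """Hypothetical helper: return the caller's list, or a copy of the defaults."""
    if shap_cols is None:
        # Copy so callers cannot mutate the shared default list.
        return list(DEFAULT_SHAP_COLS)
    return list(shap_cols)
```

Using None as the default, rather than the list itself, avoids sharing one mutable list across calls.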
- wip.datatools.shap_ops.compute_shap_values(model: Ridge, dataset: DataFrame) -> ndarray
Compute SHAP values for a given Ridge model and dataset.
This function uses the shap.LinearExplainer to compute the SHAP values for the provided Ridge regression model and dataset. The dataset is expected to have the response variable in the last column.
- Parameters
model (sklearn.linear_model.Ridge) – The Ridge regression model for which SHAP values are to be computed.
dataset (pd.DataFrame) – Dataset with feature columns and response variable. The response variable is assumed to be in the last column of the DataFrame.
- Returns
Array of SHAP values for each sample in the dataset.
- Return type
np.ndarray
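To make the returned array easier to interpret, the sketch below computes SHAP values for a linear model in closed form with NumPy. It does not call shap.LinearExplainer and is not the module's implementation. It assumes the features are treated as independent and the dataset itself serves as the background data, in which case the SHAP value of feature j for a sample is coef[j] * (x[j] - mean(x[j])). The function name linear_shap_values and the example coefficients are hypothetical.

```python
import numpy as np
import pandas as pd


def linear_shap_values(coef: np.ndarray, dataset: pd.DataFrame) -> np.ndarray:
    """Closed-form SHAP values of a linear model, assuming independent features.

    As in compute_shap_values, the response variable is assumed to be the
    last column of `dataset` and is excluded from the features.
    """
    features = dataset.iloc[:, :-1].to_numpy(dtype=float)
    # phi[i, j] = coef[j] * (x[i, j] - mean of column j)
    return (features - features.mean(axis=0)) * np.asarray(coef, dtype=float)


# Hypothetical data and coefficients, for illustration only.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [0.0, 2.0, 10.0], "y": [3.0, 4.0, 2.0]})
coef = np.array([2.0, -0.5])
intercept = 1.0

phi = linear_shap_values(coef, df)
predictions = df[["a", "b"]].to_numpy() @ coef + intercept
base_value = predictions.mean()
```

The result has one row per sample and one column per feature. Each row of phi, added to base_value, reproduces that sample's prediction, which is the additivity property SHAP values are built on.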
- wip.datatools.shap_ops.filter_production_range(df_train: DataFrame, range_min: int, range_max: int, prod_pq: bool) -> DataFrame
Filter the dataframe based on a production range and optionally drop a column.
This function filters the provided dataframe based on a specified range of production values from the column “PROD_PQ_Y@08US”. Additionally, it can drop the “PROD_PQ_Y@08US” column from the resultant dataframe if the prod_pq parameter is set to True.
- Parameters
df_train (pd.DataFrame) – Input dataframe to apply the filter to.
range_min (int) – Minimum value of the production range for filtering.
range_max (int) – Maximum value of the production range for filtering.
prod_pq (bool) – If True, drop the “PROD_PQ_Y@08US” column from the filtered dataframe.
- Returns
Filtered dataframe.
- Return type
pd.DataFrame
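A minimal pandas sketch of this kind of range filter is shown below. The documentation does not say whether the bounds are inclusive, so the sketch assumes both are; the function name filter_production_range_sketch is hypothetical and this is not the module's actual code.

```python
import pandas as pd

PROD_COL = "PROD_PQ_Y@08US"


def filter_production_range_sketch(
    df_train: pd.DataFrame, range_min: int, range_max: int, prod_pq: bool
) -> pd.DataFrame:
    """Keep rows whose production value lies in [range_min, range_max]."""
    # Assumption: both bounds are inclusive (Series.between default).
    mask = df_train[PROD_COL].between(range_min, range_max)
    filtered = df_train.loc[mask]
    if prod_pq:
        # Optionally drop the production column after filtering on it.
        filtered = filtered.drop(columns=[PROD_COL])
    return filtered
```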
- wip.datatools.shap_ops.filter_status_column(df_train: DataFrame) -> DataFrame
Filter out columns with names containing ‘status’ or ‘ProducaoPQ_Moagem’.
This function identifies and drops columns from the input dataframe if the column names contain the substring ‘status’ or ‘ProducaoPQ_Moagem’.
- Parameters
df_train (pd.DataFrame) – Input dataframe from which columns need to be filtered out.
- Returns
A pandas.DataFrame without the columns containing the substrings ‘status’ or ‘ProducaoPQ_Moagem’.
- Return type
pd.DataFrame
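Substring-based column filtering of this kind can be sketched as follows. The match is assumed to be case-sensitive, which the documentation does not specify, and the function name drop_status_columns is hypothetical.

```python
import pandas as pd


def drop_status_columns(df_train: pd.DataFrame) -> pd.DataFrame:
    """Drop columns whose name contains 'status' or 'ProducaoPQ_Moagem'."""
    substrings = ("status", "ProducaoPQ_Moagem")
    # Assumption: case-sensitive substring match on the column name.
    to_drop = [
        col for col in df_train.columns
        if any(sub in str(col) for sub in substrings)
    ]
    return df_train.drop(columns=to_drop)
```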
- wip.datatools.shap_ops.preprocess_df_train(df_train: DataFrame) -> DataFrame
Preprocess the dataframe by filtering and handling missing and infinite values.
This function filters out rows where “PROD_PQ_Y@08US” is non-positive, replaces infinite values with NaN, interpolates NaN values using linear interpolation, and then fills any remaining NaN values with 0.
- Parameters
df_train (pd.DataFrame) – Input dataframe to be preprocessed.
- Returns
Preprocessed dataframe with non-positive “PROD_PQ_Y@08US” rows removed, and missing and infinite values handled.
- Return type
pd.DataFrame
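The four steps described above (filter, replace infinities, interpolate, fill) can be sketched in pandas as below. This is an illustration of the described sequence under default pandas settings, not the module's source; the function name preprocess_sketch is hypothetical.

```python
import numpy as np
import pandas as pd

PROD_COL = "PROD_PQ_Y@08US"


def preprocess_sketch(df_train: pd.DataFrame) -> pd.DataFrame:
    """Apply the documented cleaning steps in order."""
    # 1. Keep only rows with strictly positive production.
    df = df_train[df_train[PROD_COL] > 0]
    # 2. Turn +/- infinity into NaN so it can be interpolated.
    df = df.replace([np.inf, -np.inf], np.nan)
    # 3. Linear interpolation; leading NaNs have no left neighbour and stay NaN.
    df = df.interpolate(method="linear")
    # 4. Fill whatever is still missing with 0.
    return df.fillna(0)
```

The final fillna(0) matters mainly for leading gaps, which forward interpolation cannot fill.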
- wip.datatools.shap_ops.process_columns(dataset: DataFrame, train_shap_values: ndarray, scalers: Dict[str, MinMaxScaler], range_max: int, qualidade: str) -> DataFrame
Process and extract relevant columns based on certain conditions.
This function processes the input dataframe columns based on certain conditions and then extracts information about them, such as the actual and normalized values of certain metrics, whether the values are ascending, etc. It then returns a new dataframe with this extracted information.
- Parameters
dataset (pd.DataFrame) – Input dataframe to process.
train_shap_values (np.ndarray) – SHAP values for each feature in the dataset.
scalers (Dict[str, sklearn.preprocessing.MinMaxScaler]) – Dictionary containing the MinMaxScaler for each column in the dataset.
range_max (int) – The maximum range value for the data.
qualidade (str) – String indicating the quality parameter to process. Possible values are "compressao", "relacao gran", "SE PR", "SE PP", and "umidade".
- Returns
A pandas.DataFrame containing the extracted information with columns: "Range_max", "TAG", "Valor_Real", "Valor_Norm", and "Ascending".
- Return type
pd.DataFrame
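The documentation does not describe how process_columns derives "Valor_Real" and "Valor_Norm". The sketch below only illustrates the relationship that a MinMaxScaler with its default feature_range of (0, 1) defines between a real value and its normalized counterpart, written in plain NumPy. The function names and the example bounds are hypothetical.

```python
import numpy as np


def minmax_normalize(values, data_min: float, data_max: float) -> np.ndarray:
    """Map real values onto [0, 1], as MinMaxScaler does by default."""
    values = np.asarray(values, dtype=float)
    return (values - data_min) / (data_max - data_min)


def minmax_denormalize(norm_values, data_min: float, data_max: float) -> np.ndarray:
    """Inverse mapping: recover real values from normalized ones."""
    norm_values = np.asarray(norm_values, dtype=float)
    return norm_values * (data_max - data_min) + data_min
```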
- wip.datatools.shap_ops.select_best_model(models_results: list, metric: str = 'MAPE') -> int
Select the best model based on the specified evaluation metric.
This function takes a list of dictionaries containing model results, constructs a DataFrame to collate the performance metrics, and then sorts the models based on the specified metric.
The function returns the index of the best model.
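The collate-and-sort approach described above can be sketched as follows. The sketch assumes each entry carries a "metrics" dictionary keyed by lowercase names such as "mape", and that the chosen metric is an error measure where smaller is better. How the module maps the uppercase metric argument to those keys, and how it ranks R-squared metrics, is not documented and is left out. The function name select_lowest_error_index is hypothetical.

```python
import pandas as pd


def select_lowest_error_index(models_results: list, metric_key: str = "mape") -> int:
    """Return the position of the entry with the smallest value of `metric_key`."""
    # One row per model, one column per metric.
    metrics = pd.DataFrame([result["metrics"] for result in models_results])
    # Assumption: the metric is an error measure, so ascending order puts the best first.
    return int(metrics[metric_key].sort_values().index[0])


# Hypothetical results; strings stand in for trained model objects.
results = [
    {"conf": "alpha=0.1", "model": "model_a", "metrics": {"mse": 4.0, "mape": 0.12}},
    {"conf": "alpha=1.0", "model": "model_b", "metrics": {"mse": 3.5, "mape": 0.08}},
    {"conf": "alpha=10", "model": "model_c", "metrics": {"mse": 9.0, "mape": 0.30}},
]
best_index = select_lowest_error_index(results)
```

The returned index can then be used to look up the corresponding entry, for example results[best_index]["model"].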
- Parameters
models_results (list) – A list of dictionaries where each dictionary has the keys:
  - "conf": Configuration or name of the model.
  - "model": Trained model object.
  - "metrics": A dictionary of performance metrics that includes:
    - "mse": Mean squared error.
    - "mape": Median absolute percentage error.
    - "r2": R-squared for the test set.
    - "r": R-squared for the train set.
    - "r2_train": R-squared for the train set.
    - "r2_train_adj": Adjusted R-squared.
metric (str {"MSE", "MAPE", "R2", "R", "R2 Train", "R2 Train Adj"}, default "MAPE") – The evaluation metric by which the models are ranked. Possible options are "MSE", "MAPE", "R2", "R", "R2 Train", and "R2 Train Adj".
- Returns
Index of the best model based on the specified metric.
- Return type