shap_ops#

This module defines the functions needed to apply SHAP to ML models and datasets.

Functions#

This module defines the following functions:

  • preprocess_df_train: Preprocess the dataframe.

  • filter_status_column: Drop columns whose names contain ‘status’ or ‘ProducaoPQ_Moagem’.

  • filter_production_range: Filter the dataframe using the production range.

  • select_best_model: Select the best ridge regression model based on a metric.

  • compute_shap_values: Compute SHAP values for a given Ridge model and dataset.

  • process_columns: Process and extract relevant columns based on certain conditions.

  • apply_shap: Process and apply SHAP (SHapley Additive exPlanations) to datasets.

The main function called inside wip.otm.py is apply_shap.

wip.datatools.shap_ops.apply_shap(datasets: Dict[str, pd.DataFrame], models_results: Dict[str, List[Dict[str, ...]]], scalers: Dict[str, sklearn.preprocessing.MinMaxScaler], shap_cols: List[str] | None = None) → pd.DataFrame[source]#

Process and apply SHAP (SHapley Additive exPlanations) to the provided datasets.

Given datasets, model results, and scalers, the function computes SHAP values to interpret the output of machine learning models. It returns a DataFrame with information regarding feature importance in relation to the target feature.

Parameters
  • datasets (Dict[str, pd.DataFrame]) – Dictionary containing the data for different models. Each key corresponds to a model name, and each value is a pandas.DataFrame.

  • models_results (Dict[str, List[Dict[str, ...]]]) – Dictionary containing model results for different ridge regression models. Each key corresponds to a model name, and each value is a list of dictionaries with the keys: “conf”, “model”, and “metrics”.

  • scalers (Dict[str, sklearn.preprocessing.MinMaxScaler]) – Dictionary mapping each tag to its MinMaxScaler.

  • shap_cols (List[str] | None, optional) – List of column names in datasets to which SHAP is applied. By default, SHAP is applied to [“compressao”, “SE PR”, “umidade”, “SE PP”].

Returns

DataFrame containing the columns ‘Range_max’, ‘TAG’, ‘Valor_Real’, ‘Valor_Norm’, and ‘Ascending’ that provide information regarding the SHAP values and their relationship with the target features.

Return type

pd.DataFrame

Notes

The function applies SHAP to linear models only and uses the LinearExplainer from the SHAP library. For each quality, it selects the best model (based on MAPE) and calculates its SHAP values. Based on those SHAP values and certain conditions, it returns a DataFrame describing the importance of features for the different quality categories.

wip.datatools.shap_ops.compute_shap_values(model: Ridge, dataset: DataFrame) → ndarray[source]#

Compute SHAP values for a given Ridge model and dataset.

This function uses the shap.LinearExplainer to compute the SHAP values for the provided Ridge regression model and dataset. The dataset is expected to have the response variable in the last column.

Parameters
  • model (sklearn.linear_model.Ridge) – The Ridge regression model for which SHAP values are to be computed.

  • dataset (pd.DataFrame) – Dataset with feature columns and response variable. The response variable is assumed to be in the last column of the DataFrame.

Returns

Array of SHAP values for each sample in the dataset.

Return type

np.ndarray
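For a linear model whose features are treated as independent, SHAP values have a closed form: the contribution of feature j to sample i is coef_j * (x_ij - mean(x_j)). The sketch below applies that formula directly with NumPy. It is not the module's implementation, which per the description above calls shap.LinearExplainer; the helper `linear_shap_values`, the toy data, and the closed-form ridge fit are all hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd


def linear_shap_values(coef: np.ndarray, dataset: pd.DataFrame) -> np.ndarray:
    """Closed-form SHAP values for a linear model with independent features.

    Assumes, as the documented function does, that the response variable
    is the last column of ``dataset``.
    """
    features = dataset.iloc[:, :-1].to_numpy(dtype=float)
    # Contribution of feature j for sample i: coef_j * (x_ij - mean_j).
    return (features - features.mean(axis=0)) * coef


# Toy data: two features and a response in the last column (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0
toy = pd.DataFrame({"feat_a": X[:, 0], "feat_b": X[:, 1], "target": y})

# Ridge coefficients in closed form on centered data (alpha is arbitrary here).
alpha = 1.0
Xc = X - X.mean(axis=0)
coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(2), Xc.T @ (y - y.mean()))
intercept = y.mean() - X.mean(axis=0) @ coef

shap_values = linear_shap_values(coef, toy)
predictions = X @ coef + intercept
```

The result has one row per sample and one column per feature. Each row sums to that sample's prediction minus the mean prediction, which is the additivity property that makes SHAP values usable as per-feature contributions.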

wip.datatools.shap_ops.filter_production_range(df_train: DataFrame, range_min: int, range_max: int, prod_pq: bool) → DataFrame[source]#

Filter the dataframe based on a production range and optionally drop a column.

This function filters the provided dataframe based on a specified range of production values from the column “PROD_PQ_Y@08US”. Additionally, it can drop the “PROD_PQ_Y@08US” column from the resultant dataframe if the prod_pq parameter is set to True.

Parameters
  • df_train (pd.DataFrame) – Input dataframe to apply the filter to.

  • range_min (int) – Minimum value of the production range for filtering.

  • range_max (int) – Maximum value of the production range for filtering.

  • prod_pq (bool) – If True, drop the “PROD_PQ_Y@08US” column from the filtered dataframe.

Returns

Filtered dataframe.

Return type

pd.DataFrame
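The behaviour described above can be sketched with pandas as follows. This is an illustrative reconstruction, not the module's code: the name `filter_production_range_sketch` is hypothetical, and treating both bounds as inclusive is an assumption, since the description does not say whether range_min and range_max are included.

```python
import pandas as pd

PROD_COL = "PROD_PQ_Y@08US"


def filter_production_range_sketch(
    df_train: pd.DataFrame, range_min: int, range_max: int, prod_pq: bool
) -> pd.DataFrame:
    # Assumption: both bounds are inclusive; the source does not specify this.
    mask = df_train[PROD_COL].between(range_min, range_max)
    filtered = df_train.loc[mask]
    # Per the parameter description, prod_pq=True drops the production column.
    if prod_pq:
        filtered = filtered.drop(columns=PROD_COL)
    return filtered


df = pd.DataFrame({PROD_COL: [650, 720, 810], "temp": [1.0, 2.0, 3.0]})
result = filter_production_range_sketch(df, 700, 800, prod_pq=True)
```

Here only the row with a production value of 720 falls inside the range, and the production column is dropped from the result.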

wip.datatools.shap_ops.filter_status_column(df_train: DataFrame) → DataFrame[source]#

Filter out columns with names containing ‘status’ or ‘ProducaoPQ_Moagem’.

This function identifies and drops columns from the input dataframe if the column names contain the substring ‘status’ or ‘ProducaoPQ_Moagem’.

Parameters

df_train (pd.DataFrame) – Input dataframe from which columns need to be filtered out.

Returns

A pandas.DataFrame without the columns containing the substrings ‘status’ or ‘ProducaoPQ_Moagem’.

Return type

pd.DataFrame
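A minimal sketch of this column filter, assuming a case-sensitive substring match (the description does not state how case is handled). The function name `filter_status_column_sketch` and the sample column names are hypothetical.

```python
import pandas as pd


def filter_status_column_sketch(df_train: pd.DataFrame) -> pd.DataFrame:
    # Assumption: the substring match is case-sensitive.
    substrings = ("status", "ProducaoPQ_Moagem")
    to_drop = [
        col for col in df_train.columns if any(s in str(col) for s in substrings)
    ]
    return df_train.drop(columns=to_drop)


df = pd.DataFrame(
    {"pump_status": [1, 0], "ProducaoPQ_Moagem_1": [5.0, 6.0], "temp": [20.1, 20.4]}
)
result = filter_status_column_sketch(df)
```

Only the "temp" column survives, because the other two names contain one of the filtered substrings.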

wip.datatools.shap_ops.preprocess_df_train(df_train: DataFrame) → DataFrame[source]#

Preprocess the dataframe by filtering and handling missing and infinite values.

This function filters out rows where “PROD_PQ_Y@08US” is non-positive, replaces infinite values with NaN, interpolates NaN values using linear interpolation, and then fills any remaining NaN values with 0.

Parameters

df_train (pd.DataFrame) – Input dataframe to be preprocessed.

Returns

Preprocessed dataframe with non-positive “PROD_PQ_Y@08US” rows removed, and missing and infinite values handled.

Return type

pd.DataFrame
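The four steps described above (filter, replace infinities, interpolate, fill) might look like this in pandas. This is a sketch under the assumption that the steps run in the listed order on the whole dataframe; `preprocess_df_train_sketch` is a hypothetical name, not the module's function.

```python
import numpy as np
import pandas as pd

PROD_COL = "PROD_PQ_Y@08US"


def preprocess_df_train_sketch(df_train: pd.DataFrame) -> pd.DataFrame:
    # 1. Keep only rows with strictly positive production.
    df = df_train[df_train[PROD_COL] > 0]
    # 2. Treat infinite values as missing.
    df = df.replace([np.inf, -np.inf], np.nan)
    # 3. Fill gaps by linear interpolation between neighbouring rows.
    df = df.interpolate(method="linear")
    # 4. Anything still missing (e.g. leading NaN values) becomes 0.
    return df.fillna(0)


df = pd.DataFrame(
    {
        PROD_COL: [700.0, 0.0, 710.0, 720.0, 730.0],
        "temp": [np.nan, 5.0, 1.0, np.inf, 3.0],
    }
)
result = preprocess_df_train_sketch(df)
```

In this toy input the zero-production row is removed, the infinite value is interpolated between its neighbours, and the leading NaN, which interpolation cannot reach, is filled with 0.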

wip.datatools.shap_ops.process_columns(dataset: DataFrame, train_shap_values: ndarray, scalers: Dict[str, MinMaxScaler], range_max: int, qualidade: str) → DataFrame[source]#

Process and extract relevant columns based on certain conditions.

This function processes the input dataframe columns based on certain conditions and then extracts information about them, such as the actual and normalized values of certain metrics, whether the values are ascending, etc. It then returns a new dataframe with this extracted information.

Parameters
  • dataset (pd.DataFrame) – Input dataframe to process.

  • train_shap_values (np.ndarray) – SHAP values for each feature in the dataset.

  • scalers (Dict[str, sklearn.preprocessing.MinMaxScaler]) – Dictionary containing the MinMaxScaler for each column in the dataset.

  • range_max (int) – The maximum range value for the data.

  • qualidade (str) – String indicating the quality parameter to process. Possible values are "compressao", "relacao gran", "SE PR", "SE PP", and "umidade".

Returns

A pandas.DataFrame containing the extracted information with columns:

  • “Range_max”

  • “TAG”

  • “Valor_Real”

  • “Valor_Norm”

  • “Ascending”

Return type

pd.DataFrame

wip.datatools.shap_ops.select_best_model(models_results: list, metric: str = 'MAPE') → int[source]#

Select the best model based on the specified evaluation metric.

This function takes a list of dictionaries containing model results, constructs a DataFrame to collate the performance metrics, and then sorts the models based on the specified metric.

The function returns the index of the best model.

Parameters
  • models_results (list) –

    A list of dictionaries where each dictionary has keys:

    • "conf": Configuration or name of the model.

    • "model": Trained model object.

    • "metrics": A dictionary of performance metrics that includes:

      • "mse": Mean squared error.

      • "mape": Median absolute percentage error.

      • "r2": R-squared for the test set.

      • "r": R-squared for the train set.

      • "r2_train": R-squared for the train set.

      • "r2_train_adj": Adjusted R-squared.

  • metric (str {"MSE", "MAPE", "R2", "R", "R2 Train", "R2 Train Adj"}, default "MAPE") – The evaluation metric by which the models are ranked. Possible options are: "MSE", "MAPE", "R2", "R", "R2 Train", and "R2 Train Adj".

Returns

Index of the best model based on the specified metric.

Return type

int
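The selection logic described above can be sketched as follows. Several details are assumptions rather than documented behaviour: that a metric name such as "R2 Train Adj" maps to the key "r2_train_adj", that error metrics ("MSE", "MAPE") are minimized while R-type scores are maximized, and that the returned index is the model's position in the input list. The name `select_best_model_sketch` and the sample results are hypothetical.

```python
import pandas as pd


def select_best_model_sketch(models_results: list, metric: str = "MAPE") -> int:
    # Collate each model's metrics into one row per model.
    table = pd.DataFrame([result["metrics"] for result in models_results])
    # Assumption: "R2 Train Adj" corresponds to the key "r2_train_adj", etc.
    column = metric.lower().replace(" ", "_")
    # Assumption: error metrics are minimized, R-type scores are maximized.
    ascending = column in {"mse", "mape"}
    return int(table.sort_values(column, ascending=ascending).index[0])


results = [
    {"conf": "alpha=0.1", "model": None, "metrics": {"mse": 4.0, "mape": 0.12, "r2": 0.80}},
    {"conf": "alpha=1.0", "model": None, "metrics": {"mse": 3.5, "mape": 0.09, "r2": 0.84}},
    {"conf": "alpha=10", "model": None, "metrics": {"mse": 5.1, "mape": 0.15, "r2": 0.75}},
]
best = select_best_model_sketch(results)
```

With these sample metrics the second entry has the lowest MAPE, so `best` is its list position and can be used to look up `results[best]["model"]`.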