mltrainer#

Module for training the ridge regression models.

This module defines the functions needed to train a ridge regression model for each stage of the pelletizing process.

Usage#

To train the ML models, run the following command:

$ python mltrainer.py

The above command will train the models and save them in the outputs/us8/predictive_models directory.

Developer Notes#

If you’re tasked with refactoring or making any changes to this module, please read these comments before starting your work.

Scalers#

One of the first aspects of this module you might be inclined to change is the scaler, from sklearn.preprocessing.MinMaxScaler to a more advanced alternative like sklearn.preprocessing.RobustScaler. However, beware that this change is more complex than it seems at first.

These scalers are created column-wise and are used later by the otm.py module and in several other places. Making this change would require refactoring several modules, adapting them to perform the upscale and downscale operations without the data_min_ and data_range_ attributes, which only sklearn.preprocessing.MinMaxScaler provides.

Try counting the number of times the scalers are used inside the modules subpackage and the otm.py module. If you do, you’ll probably realize how large a task it would be to change normalization algorithms, something that under normal circumstances would be a simple change to make.
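The dependency runs through those MinMaxScaler-specific attributes. A minimal sketch of the column-wise upscale/downscale pattern, on toy data standing in for one DataFrame column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-column data standing in for one DataFrame column.
data = np.array([[10.0], [20.0], [40.0]])
scaler = MinMaxScaler().fit(data)

# Downscale (normalize) and upscale (denormalize) manually, relying on the
# data_min_ and data_range_ attributes that only MinMaxScaler exposes:
scaled = (data - scaler.data_min_) / scaler.data_range_
restored = scaled * scaler.data_range_ + scaler.data_min_

print(scaled.ravel())    # same result as scaler.transform(data).ravel()
print(restored.ravel())
```

RobustScaler exposes center_ and scale_ instead, so every manual operation like the one above would need rewriting, which is exactly why the change ripples through so many modules.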

wip.mltrainer.apply_model_new(model_name: str, df_train: DataFrame, df_target: Series, self_train: bool = True, **kwargs) List[Dict[str, Union[str, float, ndarray, Index]]][source]#

Applies a machine learning model to the provided data.

This function applies a model to the data using a pipeline and performs grid search with cross-validation. The pipeline and parameter grid are hard-coded in the function. The function returns the best model, metrics, and other relevant information.

Parameters
  • model_name (str) – Name of the process that represents the pelletizing stage (e.g., “abrasao”, “basicidade”, “compressao”, “finos”, “gas”).

  • df_train (pd.DataFrame) – Training data as a pandas DataFrame.

  • df_target (pd.Series) – Target values as a pandas Series.

  • self_train (bool, optional, default True) – If True, fits the best estimator to the entire training data.

  • **kwargs (dict) –

    Additional parameters can include:

    • "param_combination".

    • "param_validation".

    • "param_plotting".

Returns

A list containing a dictionary with the following keys:

  • "conf": Model configuration.

  • "metrics": A dictionary containing the metrics (mse, mape, r2, r, r2_train, r2_train_adj).

  • "model": The best estimator’s last step.

  • "grid": The fitted grid search object.

  • "columns": The columns of the input DataFrame.

  • "params": None (not used in the current implementation).

  • "indexes": Indexes of the folds used during cross-validation.

  • "ys": True target values during cross-validation.

  • "yhats": Predicted target values during cross-validation.

  • "predicted": The predictions made by the best estimator on the training data (if selfTrain=True).

Return type

list
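The core of apply_model_new — a preprocessing-plus-Ridge pipeline wrapped in a grid search with cross-validation — can be sketched as follows. The pipeline steps, parameter grid, data, and column names here are illustrative stand-ins, not the values hard-coded in the function:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

# Illustrative pipeline and grid; the real ones are hard-coded in apply_model_new.
pipeline = Pipeline([("scaler", MinMaxScaler()), ("ridge", Ridge())])
param_grid = {"ridge__alpha": [0.1, 1.0, 10.0]}

grid = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_mean_squared_error")
grid.fit(X, y)

# Mirrors a subset of the keys in the returned dictionary.
result = {
    "model": grid.best_estimator_[-1],  # the best estimator's last step
    "grid": grid,
    "columns": ["feat_a", "feat_b", "feat_c"],
    "metrics": {
        "mse": mean_squared_error(y, grid.predict(X)),
        "r2_train": r2_score(y, grid.predict(X)),
    },
}
print(result["metrics"])
```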

wip.mltrainer.apply_naive_model(y_values, test_set_date)[source]#

Compute metrics for a naive model using the mean of training data.

This function applies a naive model where predictions on the test set are based on the average of the training data. It then computes and returns the Mean Squared Error (MSE), Mean Absolute Percentage Error (MAPE), and the R-squared (R2) score of the predictions against the actual values.

Parameters
  • y_values (pandas.Series) – Time series data where the index is of a datetime type.

  • test_set_date (datetime-like) – The date used to split the data into training and test sets. Data after this date is considered the test set, and data before this date the training set.

Returns

  • mse (float) – Mean Squared Error of the predictions.

  • mape (float) – The Mean Absolute Percentage Error of the predictions.

  • r2 (float) – R-squared score of the predictions.

Notes

This naive model solely relies on the average of the training data for predictions on the test set. This means the model does not capture any temporal trends or seasonality present in the time series data.

Examples

>>> y = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('20200101', periods=5))
>>> _test_set_date = '2020-01-04'
>>> apply_naive_model(y, _test_set_date)
(9.0, 0.6, nan)
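The example output can be reproduced with the minimal sketch below. The split semantics (the cutoff date itself falling in neither set) and the metric formulas are assumptions made to be consistent with the example above, not the function’s actual implementation:

```python
import math
import pandas as pd

def naive_model_sketch(y_values, test_set_date):
    """Minimal sketch of the naive model: predict the training-set mean."""
    cutoff = pd.Timestamp(test_set_date)
    train = y_values[y_values.index < cutoff]   # assumed strict split
    test = y_values[y_values.index > cutoff]
    pred = train.mean()
    errors = test - pred
    mse = (errors ** 2).mean()
    mape = (errors.abs() / test.abs()).mean()
    # R2 is undefined when the test set has no variance (e.g., a single point).
    ss_tot = ((test - test.mean()) ** 2).sum()
    r2 = 1 - (errors ** 2).sum() / ss_tot if ss_tot > 0 else float("nan")
    return mse, mape, r2

y = pd.Series([1, 2, 3, 4, 5], index=pd.date_range("20200101", periods=5))
print(naive_model_sketch(y, "2020-01-04"))  # (9.0, 0.6, nan)
```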
wip.mltrainer.get_output_filepaths(outputs_folder: str | Path = None) Tuple[Path, ...][source]#

Return a tuple of output file paths.

This function operates in two modes, depending on the value of outputs_folder:

  1. If outputs_folder is None, the default file paths are used. The default file paths are found inside wip.constants.

  2. If outputs_folder is provided, new file paths in that folder are returned. The function creates the outputs_folder directory if it doesn’t exist.

Parameters

outputs_folder (str or Path, optional) – Path to the folder where the output files will be created. If not provided (default is None), the default file paths will be returned.

Returns

A tuple of Path objects representing the output file paths. These are the output file paths returned, in order:

  • models_results

  • scalers

  • models_coefficients

  • models_features

  • datasets

Return type

Tuple[Path, ...]
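The two modes can be sketched with the pattern below. The default folder and the file names are hypothetical placeholders; the real defaults live in wip.constants:

```python
import tempfile
from pathlib import Path

# Hypothetical defaults; the real ones are defined in wip.constants.
DEFAULT_OUTPUTS_FOLDER = Path("outputs/us8/predictive_models")
FILENAMES = ("models_results.joblib", "scalers.joblib", "models_coefficients.joblib",
             "models_features.joblib", "datasets.joblib")

def get_output_filepaths_sketch(outputs_folder=None):
    folder = DEFAULT_OUTPUTS_FOLDER if outputs_folder is None else Path(outputs_folder)
    if outputs_folder is not None:
        # Mode 2: create the provided folder if it doesn't exist yet.
        folder.mkdir(parents=True, exist_ok=True)
    return tuple(folder / name for name in FILENAMES)

with tempfile.TemporaryDirectory() as tmp:
    paths = get_output_filepaths_sketch(Path(tmp) / "run1")
    print([p.name for p in paths])
```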

wip.mltrainer.get_prediction(datasets: dict) Tuple[Dict, Dict, Dict, DataFrame, Dict][source]#

Get prediction results for the given dataset type.

Trains and evaluates a model on the given dataset type, and returns the prediction results, scalers, column limits, concatenated NIVE data, and target data dictionary.

Parameters

datasets (Dict[str, pd.DataFrame]) – Dictionary containing the datasets.

Returns

  • results (dict) – Prediction results for each model group.

  • scalers (dict) – sklearn.preprocessing.MinMaxScalers for each column.

  • limits (dict) – Limits for each column in the datasets dictionary.

  • nive_concat (pd.DataFrame) – Pandas DataFrame, with concatenated data from columns that start with “NIVE”.

  • df_target_dict (dict) – Dictionary containing target data for each model group.

Return type

Tuple[Dict, Dict, Dict, DataFrame, Dict]

Notes

The script has some hard-coded column names and specific operations that might not be suitable for other use cases.

The script also has some unused code commented out. It is recommended to remove or review the commented code before using this function in a production environment.

New in version 0.1.0: Added datasets as an input parameter to the function. Previously, the function referred to the datasets dictionary indirectly, assuming it existed in the global variables. That made its usage less transparent and could cause problems when the function was used in a different context.
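Two of the returned objects — the column-wise scalers and the NIVE concatenation — follow patterns that can be sketched on toy data. The DataFrame, column names, and the (min, max) tuple shape of limits are assumptions for illustration:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for one entry of the `datasets` dictionary.
df = pd.DataFrame({
    "NIVE1": [1.0, 2.0, 3.0],
    "NIVE2": [4.0, 5.0, 6.0],
    "TEMP1": [100.0, 110.0, 120.0],
})

# Column-wise scalers, as in the returned `scalers` (one MinMaxScaler per column).
scalers = {col: MinMaxScaler().fit(df[[col]]) for col in df.columns}

# Per-column limits, as in the returned `limits` (tuple shape assumed).
limits = {col: (df[col].min(), df[col].max()) for col in df.columns}

# Concatenation of the columns whose names start with "NIVE".
nive_concat = df.loc[:, df.columns.str.startswith("NIVE")]
print(list(nive_concat.columns))  # ['NIVE1', 'NIVE2']
```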

wip.mltrainer.train_ml_models(datasets_filepath=None, df_sql_filepath=None, outputs_folder=None)[source]#

Train machine learning models and save results.

This function trains the machine learning models for each step of the pelletizing process, using datasets stored in the given joblib files or default file paths. The function applies preprocessing steps, filters the data, performs model training, and calculates metrics.

Finally, it saves the results as joblib files in the specified output folder or default folder.

Parameters
  • datasets_filepath (str | Path | None) – Path to the file containing datasets for training. If not provided (default is None), the default dataset file path from wip.constants.DATASETS_FILEPATH is used.

  • df_sql_filepath (str or Path, optional) – Path to the file containing the original data used for generating the datasets dictionary. If not provided (default is None), the default data file path from wip.constants.DF_SQL_FILEPATH is used.

  • outputs_folder (str or Path, optional) – Path to the folder where model results and outputs will be saved. If not provided (default is None), the default output folder from wip.constants is used.

Notes

  • Preprocessing steps are applied to clean the data and replace specific values.

  • The machine learning models are trained using the Ridge regression algorithm.

  • Cross-validation is performed with cv_n folds and cv_size as the test-set size.

  • Specific filters are applied to the datasets based on model names.
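A time-ordered split with cv_n folds and a fixed cv_size test-set size can be sketched with scikit-learn’s TimeSeriesSplit; whether the function uses this exact splitter is an assumption, and the values below are illustrative:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

cv_n, cv_size = 4, 10  # illustrative values for the fold count and test-set size
X = np.arange(60).reshape(-1, 1)

# Each fold tests on the next cv_size time-ordered samples,
# training only on the samples that precede them.
splitter = TimeSeriesSplit(n_splits=cv_n, test_size=cv_size)
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in splitter.split(X)]
print(fold_sizes)  # [(20, 10), (30, 10), (40, 10), (50, 10)]
```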

New in version 0.1.0: Added the option to specify the inputs and outputs filepaths.

New in version 2.3.0: Added the skip_transformations parameter to allow skipping the initial transformations and outlier-removal processes, provided that the files datasets_after_cleaning.joblib and df_sql_after_cleaning.joblib exist and that the code is being executed outside DataBricks.

Changed in version 2.4.0:

  • Made changes to the function to allow for running the code inside DataBricks.

Changed in version 2.8.4:

  • Included all binary columns found inside the datasets to the parameter skip_columns passed to the function auto_clean_datasets.

  • Removed training of the baseline regression models, since they’re not being used by any later processes.

Changed in version 2.10.0:

  • Moved the data transformation steps to a new function called clean_data inside wip.datatools.ml_filters. This function is now called at the end of the “preprocessamento” workflow.

  • Added a timestamp column to the models_df DataFrame, which contains the KPIs of the trained models. New versions of the models_df DataFrame are now appended to the existing KPIs CSV file, to allow tracking of the models’ performance over time.