mltrainer#
Module for training the ridge regression models.
This module defines the functions needed to train a ridge regression model for each stage of the pelletizing process.
Usage#
To train the ML models, run the following command:
$ python mltrainer.py
The above command trains the models and saves them in the outputs/us8/predictive_models directory.
Developer Notes#
If you’re tasked with refactoring or making any changes to this module, please read these comments before starting your work.
Scalers#
One of the first aspects of this module you might be inclined to change is the scaler, swapping sklearn.preprocessing.MinMaxScaler for a more advanced alternative such as sklearn.preprocessing.RobustScaler.
However, be aware that this change is more complex than it seems at first.
These scalers are created column-wise and are used later by the otm.py module, as well as in several other places.
Making this change will require you to refactor several modules, adapting them to perform the upscale and downscale operations without the data_min_ and data_range_ attributes that only sklearn.preprocessing.MinMaxScaler has.
Try counting the number of times the scalers are used inside the modules subpackage and the otm.py module.
If you do so, you'll probably realize the size of the task it would be to change normalization algorithms; something that, under normal circumstances, should be a simple change to make.
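To make the coupling concrete, the snippet below (an illustrative sketch, not code from this repository) shows the manual upscale/downscale pattern that depends on attributes only MinMaxScaler exposes:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

data = np.array([[10.0], [20.0], [40.0]])

# MinMaxScaler exposes the fitted attributes that downstream modules
# rely on for manual upscale/downscale operations.
scaler = MinMaxScaler().fit(data)
scaled = (data - scaler.data_min_) / scaler.data_range_
restored = scaled * scaler.data_range_ + scaler.data_min_
print(np.allclose(scaled, scaler.transform(data)))  # True
print(np.allclose(restored, data))                  # True

# RobustScaler has `center_` and `scale_` instead, so every manual
# computation based on `data_min_`/`data_range_` would need refactoring.
robust = RobustScaler().fit(data)
print(hasattr(robust, "data_min_"))                 # False
```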
- wip.mltrainer.apply_model_new(model_name: str, df_train: DataFrame, df_target: Series, self_train: bool = True, **kwargs) List[Dict[str, Union[str, float, ndarray, Index]]] [source]#
Applies a machine learning model to the provided data.
This function applies a model to the data using a pipeline and performs grid search with cross-validation. The pipeline and parameter grid are hard-coded in the function. The function returns the best model, metrics, and other relevant information.
- Parameters
model_name (str) – Name of the process that represents the pelletizing stage (e.g., “abrasao”, “basicidade”, “compressao”, “finos”, “gas”).
df_train (pd.DataFrame) – Training data as a pandas DataFrame.
df_target (pd.Series) – Target values as a pandas Series.
self_train (bool, optional, default True) – If True, fits the best estimator to the entire training data.
**kwargs (dict) – Additional parameters can include: "param_combination", "param_validation", "param_plotting".
- Returns
A list containing a dictionary with the following keys:
"conf": Model configuration.
"metrics": A dictionary containing the metrics (mse, mape, r2, r, r2_train, r2_train_adj).
"model": The best estimator’s last step.
"grid": The fitted grid search object.
"columns": The columns of the input DataFrame.
"params": None (not used in the current implementation).
"indexes": Indexes of the folds used during cross-validation.
"ys": True target values during cross-validation.
"yhats": Predicted target values during cross-validation.
"predicted": The predictions made by the best estimator on the training data (if self_train=True).
- Return type
List[Dict[str, Union[str, float, ndarray, Index]]]
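The actual pipeline and parameter grid are hard-coded inside the function and not documented here; the sketch below only illustrates the general pattern (scaler + Ridge pipeline, grid search with cross-validation) using placeholder data, column names, and grid values:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Illustrative training data; the column names are made up.
rng = np.random.default_rng(0)
df_train = pd.DataFrame(rng.normal(size=(60, 3)), columns=["a", "b", "c"])
df_target = df_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

# Placeholder pipeline and grid; the real ones live inside `apply_model_new`.
pipeline = Pipeline([("scaler", MinMaxScaler()), ("model", Ridge())])
param_grid = {"model__alpha": [0.1, 1.0, 10.0]}

grid = GridSearchCV(pipeline, param_grid, cv=TimeSeriesSplit(n_splits=3), scoring="r2")
grid.fit(df_train, df_target)

best_model = grid.best_estimator_[-1]  # the "model" key of the returned dict
print(grid.best_params_["model__alpha"], round(grid.best_score_, 3))
```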
- wip.mltrainer.apply_naive_model(y_values, test_set_date)[source]#
Compute metrics for a naive model using the mean of training data.
This function applies a naive model where predictions on the test set are based on the average of the training data. It then computes and returns the Mean Squared Error (MSE), Mean Absolute Percentage Error (MAPE), and the R-squared (R2) score of the predictions against the actual values.
- Parameters
y_values (pandas.Series) – Time series data where the index is of a datetime type.
test_set_date (datetime-like) – The date used to split the data into training and test sets. Data after this date is considered the test set, and data before this date is considered the training set.
- Returns
A tuple containing the MSE, MAPE, and R2 score of the naive predictions against the actual test values.
Notes
This naive model solely relies on the average of the training data for predictions on the test set. This means the model does not capture any temporal trends or seasonality present in the time series data.
Examples
>>> y = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('20200101', periods=5))
>>> _test_set_date = '2020-01-04'
>>> apply_naive_model(y, _test_set_date)
(9.0, 0.6, nan)
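A minimal sketch consistent with the docstring and the example above (the function name is local to this sketch, and the handling of the split date itself, excluded from both sets, is an assumption chosen to match the example's output):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (mean_absolute_percentage_error,
                             mean_squared_error, r2_score)


def naive_model_metrics(y_values: pd.Series, test_set_date) -> tuple:
    """Naive baseline: predict the training-set mean for every test point."""
    split = pd.Timestamp(test_set_date)
    train = y_values[y_values.index < split]
    test = y_values[y_values.index > split]
    preds = np.full(len(test), train.mean())
    return (mean_squared_error(test, preds),
            mean_absolute_percentage_error(test, preds),
            r2_score(test, preds) if len(test) > 1 else float("nan"))


y = pd.Series([1, 2, 3, 4, 5], index=pd.date_range("20200101", periods=5))
print(naive_model_metrics(y, "2020-01-04"))  # (9.0, 0.6, nan)
```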
- wip.mltrainer.get_output_filepaths(outputs_folder: str | Path = None) Tuple[Path, ...] [source]#
Return a tuple of output file paths.
This function operates in two modes, depending on the value of outputs_folder:
1. If outputs_folder is None, the default file paths are used. The default file paths are found inside wip.constants.
2. If outputs_folder is provided, new file paths in that folder are returned. The function creates the outputs_folder directory if it doesn’t exist.
- Parameters
outputs_folder (str or Path, optional) – Path to the folder where the output files will be created. If not provided (default is None), the default file paths will be returned.
- Returns
A tuple of Path objects representing the output file paths. These are the output file paths returned, in order:
models_results
scalers
models_coefficients
models_features
datasets
- Return type
Tuple[Path, ...]
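The two modes can be sketched as follows (the default folder, the .joblib extension, and the helper name are assumptions for illustration; the real defaults live in wip.constants):

```python
from pathlib import Path
from typing import Tuple, Union


def output_filepaths_sketch(
    outputs_folder: Union[str, Path, None] = None,
) -> Tuple[Path, ...]:
    """Sketch of the two modes of get_output_filepaths."""
    if outputs_folder is None:
        # Mode 1: fall back to the defaults (really defined in wip.constants).
        folder = Path("outputs/us8/predictive_models")
    else:
        # Mode 2: build paths inside the given folder, creating it if needed.
        folder = Path(outputs_folder)
        folder.mkdir(parents=True, exist_ok=True)
    names = ("models_results", "scalers", "models_coefficients",
             "models_features", "datasets")
    return tuple(folder / f"{name}.joblib" for name in names)


paths = output_filepaths_sketch()  # mode 1: no folder given
print(paths[0].name)  # models_results.joblib
```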
- wip.mltrainer.get_prediction(datasets: dict) Tuple[Dict, Dict, Dict, DataFrame, Dict] [source]#
Get prediction results for the given dataset type.
Trains and evaluates a model on the given dataset type, and returns the prediction results, scalers, column limits, concatenated NIVE data, and target data dictionary.
- Parameters
datasets (Dict[str, pd.DataFrame]) – Dictionary containing the datasets.
- Returns
results (dict) – Prediction results for each model group.
scalers (dict) – sklearn.preprocessing.MinMaxScaler objects for each column.
limits (dict) – Limits for each column in the datasets dictionary.
nive_concat (pd.DataFrame) – Pandas DataFrame with concatenated data from columns that start with “NIVE”.
df_target_dict (dict) – Dictionary containing target data for each model group.
- Return type
Tuple[Dict, Dict, Dict, DataFrame, Dict]
Notes
The script has some hard-coded column names and specific operations that might not be suitable for other use cases.
The script also has some unused code commented out. It is recommended to remove or review the commented code before using this function in a production environment.
New in version 0.1.0: Added datasets as an input parameter. Previously, the function referred to the datasets dictionary indirectly, assuming it existed in the global scope. That made its usage less transparent and could cause problems when the function was used in a different context.
- wip.mltrainer.train_ml_models(datasets_filepath=None, df_sql_filepath=None, outputs_folder=None)[source]#
Train machine learning models and save results.
This function trains the machine learning models for each step of the pelletizing process, using datasets stored in the given joblib files or default file paths. The function applies preprocessing steps, filters the data, performs model training, and calculates metrics.
Finally, it saves the results as joblib files in the specified output folder or default folder.
- Parameters
datasets_filepath (str or Path, optional) – Path to the file containing datasets for training. If not provided (default is None), the default dataset file path from wip.constants.DATASETS_FILEPATH is used.
df_sql_filepath (str or Path, optional) – Path to the file containing the original data used for generating the datasets dictionary. If not provided (default is None), the default data file path from wip.constants.DF_SQL_FILEPATH is used.
outputs_folder (str or Path, optional) – Path to the folder where model results and outputs will be saved. If not provided (default is None), the default output folder from wip.constants is used.
Notes
Preprocessing steps are applied to clean the data and replace specific values.
Machine learning models are trained using the Ridge regression algorithm.
Cross-validation is performed with cv_n folds and cv_size as the test-set size.
Specific filters are applied to the datasets based on model names.
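As an illustration of that setup (the ShuffleSplit splitter and the synthetic data are assumptions; only the Ridge algorithm and the cv_n/cv_size names come from the notes above):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit, cross_val_score

cv_n, cv_size = 5, 0.2  # fold count and test-set size, as named in the notes
X, y = make_regression(n_samples=100, n_features=4, noise=0.1, random_state=0)

# One R2 score per fold; each fold holds out `cv_size` of the data.
cv = ShuffleSplit(n_splits=cv_n, test_size=cv_size, random_state=0)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv, scoring="r2")
print(len(scores))  # 5
```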
New in version 0.1.0: Added the option to specify the inputs and outputs filepaths.
New in version 2.3.0: Added the skip_transformations parameter to allow skipping the initial transformations and the outlier-removal process, provided that the files datasets_after_cleaning.joblib and df_sql_after_cleaning.joblib exist and that the code is being executed outside Databricks.
Changed in version 2.4.0: Made changes to the function to allow running the code inside Databricks.
Changed in version 2.8.4: Included all binary columns found inside the datasets in the skip_columns parameter passed to the function auto_clean_datasets. Removed training of the baseline regression models, since they are not used by any later processes.
Changed in version 2.10.0: Moved the data transformation steps to a new function called clean_data inside wip.datatools.ml_filters. This function is now called at the end of the “preprocessamento” workflow. Added a timestamp column to the models_df DataFrame, which contains the KPIs of the trained models. New versions of the models_df DataFrame are now appended to the existing KPIs CSV file, to allow tracking of model performance over time.