ml_train¶
BigQuery Model Training and Execution Module.
This module provides functionality for creating, executing, and managing ‘ARIMA_PLUS_XREG’ models using Google BigQuery. The module includes functions to generate SQL queries for creating models, executing these queries with retries, and evaluating model performance.
Functions¶
create_model_query
: Generate an SQL query to create an ARIMA_PLUS_XREG model for a specified item and its associated data.execute_query_with_retries
: Execute a given SQL query with retry logic in case of failure.create_models_for_items
: Create ‘ARIMA_PLUS_XREG’ models for a list of items by executing generated SQL queries.train_arima_models
: Train ARIMA models for specified columns, executes the corresponding SQL queries, and stores the model metrics in BigQuery tables.
Notes
This module is designed to work with Google BigQuery and requires a valid BigQuery client instance. The models generated by this module are intended for forecasting time series data, with options to handle holiday effects, step changes, and data cleaning.
See also
Google
, BigQuery
- iowa_forecast.ml_train.create_model_query(item_name: str, timestamp_col: str = 'date', time_series_data_col: str = 'total_amount_sold', model_name: str = 'bqmlforecast.arima_plus_xreg_model', train_table_name: str = 'bqmlforecast.training_data', test_table_name: str = 'bqmlforecast.test_data', **kwargs) str [source]¶
Generate a BigQuery ‘CREATE MODEL’ query for a specified item.
This function constructs an SQL query to create an ARIMA_PLUS_XREG model in BigQuery, tailored for the provided item and its associated data.
- Parameters:
item_name (
str
) – Name of the item for which the model is to be created.timestamp_col (
str
, default"date"
) – The column name representing the timestamp in the dataset.time_series_data_col (
str
, default"total_amount_sold"
) – The column name representing the time series data.model_name (
str
, default"bqmlforecast.arima_plus_xreg_model"
) – The base name for the model.train_table_name (
str
, default"bqmlforecast.training_data"
) – The name of the table containing training data.test_table_name (
str | None
, default"bqmlforecast.test_data"
) – The name of the table containing test data.**kwargs (
Any
) –Additional keyword arguments such as:
- holiday_regionstr, default=”US”
The holiday region to be used by the model.
- auto_arimabool, default=True
Whether to enable AUTO_ARIMA.
- adjust_step_changesbool, default=True
Whether to adjust for step changes in the data.
- clean_spikes_and_dipsbool, default=True
Whether to clean spikes and dips in the data.
- Returns:
str
– A SQL query string for creating the specified model.- Return type:
- iowa_forecast.ml_train.include_test_on_model_train(item_name: str, timestamp_col: str, train_table_name: str, test_table_name: str | None = None) str [source]¶
Include test data in the model training process.
This function generates an SQL query component to union test data with training data if a test table is specified.
- Parameters:
item_name (
str
) – The name of the item being modeled.timestamp_col (
str
) – The column name representing the timestamp in the dataset.train_table_name (
str
) – The name of the table containing training data.test_table_name (
str
orNone
, optional) – The name of the table containing test data. If None, no test data is included.
- Returns:
str
– An SQL query string component to include test data.- Return type:
- iowa_forecast.ml_train.include_test_on_arima_model_train(column: str, time_series_timestamp_col: str, time_series_id_col: str, train_table_name: str, test_table_name: str | None = None) str [source]¶
Include test data in the uni-variate ARIMA model training process.
This function generates an SQL query component to union test data with training data if a test table is specified.
- Parameters:
column (
str
) – The name of the feature being modeled.time_series_timestamp_col (
str
) – The column name representing the timestamp in the dataset.time_series_id_col (
str
) – The column name representing the identifier.train_table_name (
str
) – The name of the table containing training data.test_table_name (
str
orNone
, optional) – The name of the table containing test data. If None, no test data is included.
- Returns:
str
– An SQL query string component to include test data.- Return type:
- iowa_forecast.ml_train.execute_query_with_retries(client: Client, query: str, max_retries: int = 3) None [source]¶
Execute a BigQuery SQL query with retries on failure.
This function executes a given SQL query using a BigQuery client. If the query fails, it will automatically retry up to
max_retries
times, with an increasing delay between each attempt.- Parameters:
- Raises:
Exception – Raises an exception if all retry attempts fail.
- Return type:
None
Notes
The delay between retries increases linearly by 120 seconds multiplied by the current attempt number.
Examples
Execute a query with the default number of retries:
>>> client = bigquery.Client() >>> query = "SELECT * FROM `my_dataset.my_table`" >>> execute_query_with_retries(client, query)
- iowa_forecast.ml_train.create_models_for_items(client: bigquery.Client, items_list: List[str], max_items: int | None = None, timestamp_col: str = 'date', time_series_data_col: str = 'total_amount_sold', model_name: str = 'bqmlforecast.arima_plus_xreg_model', train_table_name: str = 'bqmlforecast.training_data', test_table_name: str | None = 'bqmlforecast.test_data', **kwargs) None [source]¶
Create
'ARIMA_PLUS_XREG'
models for a list of items.This function generates and executes a
'CREATE MODEL'
query for each item in the provided list. The models are created using the specified training and test tables in BigQuery.- Parameters:
client (
bigquery.Client
) – Instance of the BigQuery client used to execute queries.items_list (
List[str]
) – List of item names for which models are to be created.max_items (
int
orNone
, defaultNone
) – Maximum number of items to process. If None, all items are processed. See the ‘Notes’ section for more information.timestamp_col (
str
, default"date"
) – The column name representing the timestamp in the dataset.time_series_data_col (
str
, default"total_amount_sold"
) – The column name representing the time series data.model_name (
str
, default"bqmlforecast.arima_plus_xreg_model"
) – The base name for the models.train_table_name (
str
, default"bqmlforecast.training_data"
) – The name of the table containing training data.test_table_name (
str | None
, default"bqmlforecast.test_data"
) – The name of the table containing test data. IfNone
, then only the data fromtrain_table_name
is used for training the model. See the ‘Notes’ section for more information.**kwargs (
Any
) –Additional keyword arguments such as:
- holiday_regionstr, default=”US”
The holiday region to be used by the models.
- auto_arimabool, default=True
Whether to enable
'AUTO_ARIMA'
.- adjust_step_changesbool, default=True
Whether to adjust for step changes in the data.
- clean_spikes_and_dipsbool, default=True
Whether to clean spikes and dips in the data.
- Return type:
None
Notes
Not specifying a value for
max_items
requires you to use a Google Cloud account with billing enabled. If you’re not using a Google Cloud account with billing enabled, then you should limit the number of items to a value smaller than or equal to 4.Important
If using a Google Cloud account with billing enabled, running this code might incur charges.
If you are evaluating the model, you shouldn’t use all available data to train the model. Therefore, if you’re evaluating the model, consider setting the parameter
test_table_name
toNone
. Doing so will cause the model to be trained using only the specified data from thetrain_table_name
which in turn will allow you to use the data fromtest_table_name
for evaluation.
- iowa_forecast.ml_train.train_arima_models(client: bigquery.Client, columns: List[str], model: str = 'bqmlforecast.arima_model', train_table_name: str = 'bqmlforecast.training_data', test_table_name: str | None = 'bqmlforecast.test_data', model_metrics_table_name: str | None = 'bqmlforecast.arima_model_metrics', time_series_timestamp_col: str = 'date', time_series_id_col: str = 'item_name', confidence_level: float = 0.9, horizon: int = 7, use_test_data_on_train: bool = True, **kwargs)[source]¶
Train ARIMA models for a list of columns and store their metrics.
This function generates and executes
'CREATE MODEL'
queries for ARIMA models using the specified columns, and evaluates their performance by creating tables of model metrics.These ARIMA models will then be used to generate the future feature values used for forecasting the liquor sales.
- Parameters:
client (
bigquery.Client
) – Instance of the BigQuery client used to execute queries.columns (
List[str]
) – List of columns to be used for creating ARIMA models.model (
str
, default"bqmlforecast.arima_model"
) – The base name for the ARIMA models.train_table_name (
str
, default"bqmlforecast.training_data"
) – The name of the table containing training data.test_table_name (
str | None
, default"bqmlforecast.test_data"
) – The name of the table containing test data.model_metrics_table_name (
str
orNone
, default"bqmlforecast.arima_model_metrics"
) – The base name for the tables where model metrics will be stored.time_series_timestamp_col (
str
, default"date"
) – The column name representing the timestamp in the dataset.time_series_id_col (
str
, default"item_name"
) – The column name representing the identifier for the time series.confidence_level (
float
, default0.9
) – The confidence level used in the model evaluation.horizon (
int
, default7
) – The number of time steps (days) to forecast.use_test_data_on_train (
bool
, defaultTrue
) – Whether to use test data during model training.