load_data¶
BigQuery Data Loading and Feature Engineering Module.
This module provides functions to load, process, and prepare data for time series forecasting using Google BigQuery. The module includes utilities for creating training datasets, generating forecast features, and handling date offsets and item filters.
Functions¶
date_offset
: generate a pandas DateOffset based on the given frequency and value.get_item_names_filter
: generate a “WHERE” clause component to filter values from column"item_name"
.get_min_datapoints_filter
: generate a “WHERE” clause to filter items that have at leastmin_size
observations.get_training_data
: retrieve data from BigQuery and create a training data view.get_year_weather_query
: generate an SQL query to retrieve weather data for a specific year and state.get_weather_query
: generate an SQL query to retrieve weather data for a given date range.create_forecast_features_query
: generate an SQL query to create forecast features by joining forecast tables.create_future_data
: create future data features using training, test, and forecast data.create_future_feature_table
: create a forecast table for a specific model and save the forecast results.create_future_feature_tables
: create forecast tables for multiple columns and save the forecast results.
Notes
This module is designed to work with Google BigQuery and requires a valid BigQuery client instance. The functions in this module focus on preparing data for time series forecasting in various business contexts.
See also
Google
, BigQuery
- iowa_forecast.load_data.get_item_names_filter(items_list: List[str] | str) str [source]¶
Generate a
"WHERE"
clause component to filter values from column"item_name"
.- Parameters:
items_list (
List[str] | str
) – Item name or names to add to the"WHERE"
clause component.- Returns:
str
– The"WHERE"
clause component that can be used to filter values from column"item_name"
.- Return type:
Examples
>>> print(get_item_names_filter("FIVE O'CLOCK VODKA")) (item_name = "FIVE O'CLOCK VODKA") >>> print(get_item_names_filter(['FIREBALL CINNAMON WHISKEY', 'BLACK VELVET'])) (item_name = "FIREBALL CINNAMON WHISKEY" OR item_name = "BLACK VELVET")
- iowa_forecast.load_data.get_min_datapoints_filter(min_size: int) str [source]¶
Generate a
"WHERE"
clause to filter items that have at leastmin_size
observations.
- iowa_forecast.load_data.get_training_data(client: bigquery.Client, table_name: str = 'bqmlforecast.training_data', start_date: str | None = None, end_date: str | None = None, offset: int | None = None, freq: str = 'years', min_datapoints_rate: float = 0.75, items_list: str | List[str] | None = None, base_table: str = 'bigquery-public-data.iowa_liquor_sales.sales') pd.DataFrame [source]¶
Retrieve data from BigQuery and create a training data view.
This function constructs an SQL query to create a view in BigQuery, filters sales data based on the specified date range and conditions, and retrieves the filtered data as a pandas DataFrame.
- Parameters:
client (
bigquery.Client
) – BigQuery client object.table_name (
str
, default'bqmlforecast.training_data'
) – The name of the table to store the training data view.start_date (
str
, optional) – The start date for filtering data in ‘YYYY-MM-DD’ format. IfNone
, the start date is determined by one of the following ways:end_date (
str
, optional) –The end date for filtering data in ‘YYYY-MM-DD’ format. If
None
, the end date is determined by one of the following ways:If
offset
is notNone
: thenend_date
equalsstart_date
+offset
If
offset
isNone
: thenend_date
is today’s date.
offset (
int
, optional) – The offset value to calculate the start or end date.freq (
str
, default'years'
) – The frequency type for the offset calculation.min_datapoints_rate (
float
, default0.75
) – The fraction of days betweenend_date
andstart_date
that each item from ‘item_name’ column should have of data points to be considered in the created table.items_list (
str | List[str]
, optional) – A list of item names or single item name used for filtering the data.base_table (
str
, default"bigquery-public-data.iowa_liquor_sales.sales"
) – Base table to extract data from.
- Returns:
pd.DataFrame
– Apandas.DataFrame
containing the filtered sales data.- Raises:
ValueError – If none of
start_date
,end_date
, oroffset
are specified. Ifstart_date
>end_date
.- Return type:
pd.DataFrame
- iowa_forecast.load_data.get_year_weather_query(year: int, state: str = 'IA') str [source]¶
Generate an SQL query to retrieve weather data for a specific year and state.
- iowa_forecast.load_data.get_weather_query(start_date: str, end_date: str, state: str = 'IA') str [source]¶
Generate an SQL query to retrieve weather data for a given date range.
- Parameters:
- Returns:
str
– SQL query string to retrieve the weather data.- Return type:
- iowa_forecast.load_data.create_forecast_features_query(client: Client, dataset_id: str = 'bqmlforecast', forecast_tables_pattern: str = 'forecast_*') str [source]¶
Generate an SQL query to create forecast features by joining forecast tables.
- Parameters:
- Returns:
str
– SQL query string to create forecast features.- Return type:
Examples
>>> client = bigquery.Client() >>> query = create_forecast_features_query(client) >>> print(query) SELECT CAST(t1.forecast_timestamp AS DATE) as date, t1.item_name, 0 AS total_amount_sold, t2.forecast_value AS temp, t3.forecast_value AS rainfall, t4.forecast_value AS snowfall, fw.rainfall, fw.snowfall, fw.temperature FROM `forecast_temp` AS t1 INNER JOIN `forecast_rainfall` AS t2 ON t1.forecast_timestamp = t2.forecast_timestamp AND t1.item_name = t2.item_name INNER JOIN `forecast_snowfall` AS t3 ON t2.forecast_timestamp = t3.forecast_timestamp AND t2.item_name = t3.item_name LEFT JOIN future_weather_data AS fw ON date = fw.date
Notes
This function constructs an SQL query by joining multiple forecast tables and a weather data table to create a comprehensive forecast features dataset.
- iowa_forecast.load_data.create_future_data(client: Client, train_table_name: str = 'bqmlforecast.training_data', test_table_name: str = 'bqmlforecast.test_data', forecast_table_name: str = 'bqmlforecast.forecast_data', horizon: int = 7, state: str = 'IA', dataset_id: str = 'bqmlforecast')[source]¶
Create future data features using training, test, and forecast data.
This function generates an SQL query to create a future dataset based on the training, test, and forecast data. It includes weather information and lag features.
- Parameters:
client (
bigquery.Client
) – BigQuery client used to connect to the service.train_table_name (
str
, default"bqmlforecast.training_data"
) – The name of the training data table.test_table_name (
str
, default"bqmlforecast.test_data"
) – The name of the test data table.forecast_table_name (
str
, default"bqmlforecast.forecast_data"
) – The name of the table to store the forecast data.horizon (
int
, default7
) – The number of days into the future to create data for.state (
str
, default"IA"
) – The state code for weather data retrieval.dataset_id (
str
, default"bqmlforecast"
) – The dataset ID where the forecast tables are located.
- iowa_forecast.load_data.create_future_feature_table(client: Client, table_name: str, model_name: str, confidence_level: float = 0.9, horizon: int = 7)[source]¶
Create a forecast table for a specific model and save the forecast results.
- Parameters:
client (
bigquery.Client
) – BigQuery client used to connect to the service.table_name (
str
) – The name of the table to save the forecast results.model_name (
str
) – The name of the model used for forecasting.confidence_level (
float
, default0.9
) – The confidence level for the forecast.horizon (
int
, default7
) – The number of days into the future to forecast.
- iowa_forecast.load_data.create_future_feature_tables(client: Client, columns: List[str], model: str = 'bqmlforecast.arima_model', table_base_name: str = 'bqmlforecast.forecast', confidence_level: float = 0.9, horizon: int = 7)[source]¶
Create forecast tables for multiple columns and save the forecast results.
- Parameters:
client (
bigquery.Client
) – BigQuery client used to connect to the service.columns (
List[str]
) – A list of column names to forecast.model (
str
, default"bqmlforecast.arima_model"
) – The base name of the model used for forecasting.table_base_name (
str
, default"bqmlforecast.forecast"
) – The base name for the tables to store the forecast results.confidence_level (
float
, default0.9
) – The confidence level for the forecast.horizon (
int
, default7
) – The number of days into the future to forecast.