load_data

BigQuery Data Loading and Feature Engineering Module.

This module provides functions to load, process, and prepare data for time series forecasting using Google BigQuery. The module includes utilities for creating training datasets, generating forecast features, and handling date offsets and item filters.

Functions

  • date_offset: generate a pandas DateOffset based on the given frequency and value.

  • get_item_names_filter: generate a “WHERE” clause component to filter values from column "item_name".

  • get_min_datapoints_filter: generate a “WHERE” clause to filter items that have at least min_size observations.

  • get_training_data: retrieve data from BigQuery and create a training data view.

  • get_year_weather_query: generate an SQL query to retrieve weather data for a specific year and state.

  • get_weather_query: generate an SQL query to retrieve weather data for a given date range.

  • create_forecast_features_query: generate an SQL query to create forecast features by joining forecast tables.

  • create_future_data: create future data features using training, test, and forecast data.

  • create_future_feature_table: create a forecast table for a specific model and save the forecast results.

  • create_future_feature_tables: create forecast tables for multiple columns and save the forecast results.

Notes

This module is designed to work with Google BigQuery and requires a valid BigQuery client instance. The functions in this module focus on preparing data for time series forecasting in various business contexts.

See also

Google, BigQuery

iowa_forecast.load_data.get_item_names_filter(items_list: List[str] | str) str[source]

Generate a "WHERE" clause component to filter values from column "item_name".

Parameters:

items_list (List[str] | str) – Item name or names to add to the "WHERE" clause component.

Returns:

str – The "WHERE" clause component that can be used to filter values from column "item_name".

Return type:

str

Examples

>>> print(get_item_names_filter("FIVE O'CLOCK VODKA"))
(item_name = "FIVE O'CLOCK VODKA")
>>> print(get_item_names_filter(['FIREBALL CINNAMON WHISKEY', 'BLACK VELVET']))
(item_name = "FIREBALL CINNAMON WHISKEY" OR item_name = "BLACK VELVET")
iowa_forecast.load_data.get_min_datapoints_filter(min_size: int) str[source]

Generate a "WHERE" clause to filter items that have at least min_size observations.

Parameters:

min_size (int) – Minimum number of observations to use as value for the "WHERE" clause.

Returns:

str – The "WHERE" clause component.

Return type:

str

iowa_forecast.load_data.get_training_data(client: bigquery.Client, table_name: str = 'bqmlforecast.training_data', start_date: str | None = None, end_date: str | None = None, offset: int | None = None, freq: str = 'years', min_datapoints_rate: float = 0.75, items_list: str | List[str] | None = None, base_table: str = 'bigquery-public-data.iowa_liquor_sales.sales') pd.DataFrame[source]

Retrieve data from BigQuery and create a training data view.

This function constructs an SQL query to create a view in BigQuery, filters sales data based on the specified date range and conditions, and retrieves the filtered data as a pandas DataFrame.

Parameters:
  • client (bigquery.Client) – BigQuery client object.

  • table_name (str, default 'bqmlforecast.training_data') – The name of the table to store the training data view.

  • start_date (str, optional) – The start date for filtering data in ‘YYYY-MM-DD’ format. If None, the start date is determined by one of the following ways:

  • end_date (str, optional) –

    The end date for filtering data in ‘YYYY-MM-DD’ format. If None, the end date is determined by one of the following ways:

    • If offset is not None: then end_date equals start_date + offset

    • If offset is None: then end_date is today’s date.

  • offset (int, optional) – The offset value to calculate the start or end date.

  • freq (str, default 'years') – The frequency type for the offset calculation.

  • min_datapoints_rate (float, default 0.75) – The fraction of days between end_date and start_date that each item from ‘item_name’ column should have of data points to be considered in the created table.

  • items_list (str | List[str], optional) – A list of item names or single item name used for filtering the data.

  • base_table (str, default "bigquery-public-data.iowa_liquor_sales.sales") – Base table to extract data from.

Returns:

pd.DataFrame – A pandas.DataFrame containing the filtered sales data.

Raises:

ValueError – If none of start_date, end_date, or offset are specified. If start_date > end_date.

Return type:

pd.DataFrame

iowa_forecast.load_data.get_year_weather_query(year: int, state: str = 'IA') str[source]

Generate an SQL query to retrieve weather data for a specific year and state.

Parameters:
  • year (int) – The year for which to retrieve weather data.

  • state (str, default "IA") – The state code for which to retrieve weather data.

Returns:

str – SQL query string to retrieve the weather data.

Return type:

str

iowa_forecast.load_data.get_weather_query(start_date: str, end_date: str, state: str = 'IA') str[source]

Generate an SQL query to retrieve weather data for a given date range.

Parameters:
  • start_date (str) – The start date for the weather data in ‘YYYY-MM-DD’ format.

  • end_date (str) – The end date for the weather data in ‘YYYY-MM-DD’ format.

  • state (str, default "IA") – The state code for which to retrieve weather data.

Returns:

str – SQL query string to retrieve the weather data.

Return type:

str

iowa_forecast.load_data.create_forecast_features_query(client: Client, dataset_id: str = 'bqmlforecast', forecast_tables_pattern: str = 'forecast_*') str[source]

Generate an SQL query to create forecast features by joining forecast tables.

Parameters:
  • client (bigquery.Client) – BigQuery client used to connect to the service.

  • dataset_id (str, optional) – ID of the dataset where the forecast tables are located, by default “bqmlforecast”.

  • forecast_tables_pattern (str, optional) – Pattern to match forecast tables, by default “forecast_*”.

Returns:

str – SQL query string to create forecast features.

Return type:

str

Examples

>>> client = bigquery.Client()
>>> query = create_forecast_features_query(client)
>>> print(query)
SELECT
    CAST(t1.forecast_timestamp AS DATE) as date,
    t1.item_name,
    0 AS total_amount_sold,
    t2.forecast_value AS temp,
    t3.forecast_value AS rainfall,
    t4.forecast_value AS snowfall,
    fw.rainfall,
    fw.snowfall,
    fw.temperature
FROM
    `forecast_temp` AS t1
INNER JOIN `forecast_rainfall` AS t2
    ON t1.forecast_timestamp = t2.forecast_timestamp
    AND t1.item_name = t2.item_name
INNER JOIN `forecast_snowfall` AS t3
    ON t2.forecast_timestamp = t3.forecast_timestamp
    AND t2.item_name = t3.item_name
LEFT JOIN future_weather_data AS fw
    ON date = fw.date

Notes

This function constructs an SQL query by joining multiple forecast tables and a weather data table to create a comprehensive forecast features dataset.

iowa_forecast.load_data.create_future_data(client: Client, train_table_name: str = 'bqmlforecast.training_data', test_table_name: str = 'bqmlforecast.test_data', forecast_table_name: str = 'bqmlforecast.forecast_data', horizon: int = 7, state: str = 'IA', dataset_id: str = 'bqmlforecast')[source]

Create future data features using training, test, and forecast data.

This function generates an SQL query to create a future dataset based on the training, test, and forecast data. It includes weather information and lag features.

Parameters:
  • client (bigquery.Client) – BigQuery client used to connect to the service.

  • train_table_name (str, default "bqmlforecast.training_data") – The name of the training data table.

  • test_table_name (str, default "bqmlforecast.test_data") – The name of the test data table.

  • forecast_table_name (str, default "bqmlforecast.forecast_data") – The name of the table to store the forecast data.

  • horizon (int, default 7) – The number of days into the future to create data for.

  • state (str, default "IA") – The state code for weather data retrieval.

  • dataset_id (str, default "bqmlforecast") – The dataset ID where the forecast tables are located.

iowa_forecast.load_data.create_future_feature_table(client: Client, table_name: str, model_name: str, confidence_level: float = 0.9, horizon: int = 7)[source]

Create a forecast table for a specific model and save the forecast results.

Parameters:
  • client (bigquery.Client) – BigQuery client used to connect to the service.

  • table_name (str) – The name of the table to save the forecast results.

  • model_name (str) – The name of the model used for forecasting.

  • confidence_level (float, default 0.9) – The confidence level for the forecast.

  • horizon (int, default 7) – The number of days into the future to forecast.

iowa_forecast.load_data.create_future_feature_tables(client: Client, columns: List[str], model: str = 'bqmlforecast.arima_model', table_base_name: str = 'bqmlforecast.forecast', confidence_level: float = 0.9, horizon: int = 7)[source]

Create forecast tables for multiple columns and save the forecast results.

Parameters:
  • client (bigquery.Client) – BigQuery client used to connect to the service.

  • columns (List[str]) – A list of column names to forecast.

  • model (str, default "bqmlforecast.arima_model") – The base name of the model used for forecasting.

  • table_base_name (str, default "bqmlforecast.forecast") – The base name for the tables to store the forecast results.

  • confidence_level (float, default 0.9) – The confidence level for the forecast.

  • horizon (int, default 7) – The number of days into the future to forecast.