data_checks#

Module contains data quality check functions.

Data quality functions:

  • constant_tags_summary: save a summary of tags with constant values to an Excel file.

  • find_last_non_missing_columns: for every column in a dictionary of dataframes, find the first non-null element of each column, in respect to the last existing index.

The other functions present in this module are intended to be used internally by the functions listed above.

wip.datatools.data_checks.constant_tags_summary(df_sql: pd.DataFrame, datasets: Dict[str, pd.DataFrame], save_path: str | Path, max_constant_columns: int | None = 100, window: str = '30D')[source]#

Identify and save summary of tags with constant values to an Excel or CSV file.

This function finds the tags in the df_sql dataframe which have constant values over the specified rolling window. It then saves a summary to an Excel or CSV file, depending on whether the function is being executed inside Databricks or locally. If the number of constant tags exceeds the limit specified by the parameter :param:`max_constant_columns`, a ValueError exception is raised and the regression models training process gets interrupted.

This function is intended to be used as a quality gate to the process. If there’s something wrong with the input data, it’s better to interrupt the execution and inform the error, rather than keep going and generate results that are completely wrong.

Parameters
  • df_sql (pd.DataFrame) – Input dataframe which contains time series data.

  • datasets (Dict[str, pd.DataFrame]) – Dictionary of dataframes. The last column of each dataframe represents target columns to be checked for constancy in df_sql.

  • save_path (str | Path) – The filepath where the Excel file with the summary of constant tags will be saved.

  • max_constant_columns (int | None, default 100) – The maximum number of constant columns allowed. If the number of constant columns exceeds this limit, a ValueError is raised. If set to None, this function only raises an exception if all columns in df_sql have constant or missing values.

  • window (str, default "30D") – Rolling window specification for which standard deviation is computed, default is “30D” (30 days).

Raises

ValueError – If the number of constant tags exceeds the max_constant_columns limit.

Notes

This function uses the get_constant_tags function to determine the tags with constant values. The resulting summary contains the number of constant values, the percentage of constant values, the number of days the value remains constant, and a boolean flag indicating if the tag is a target.

wip.datatools.data_checks.create_errors_dict(datasets: Dict[str, pd.DataFrame], df_sql: pd.DataFrame, window: str = '30D') Dict[str, List[int | str | int | bool]][source]#

Generate a dictionary summary of tags with constant values.

This function evaluates tags in the df_sql which have constant values over the specified rolling window and creates a summary dictionary. Each key in the dictionary corresponds to a tag, and its associated value is a list containing the number of constant values, the percentage of constant values in relation to the total, the number of days the value remains constant, and a boolean indicating if the tag is a target based on datasets.

Parameters
  • datasets (Dict[str, pd.DataFrame]) – Dictionary of dataframes. The last column of each dataframe represents target columns to be checked for constancy in df_sql.

  • df_sql (pd.DataFrame) – Input dataframe which contains time series data.

  • window (str, default "30D") – Rolling window specification for which standard deviation is computed.

Returns

Dictionary where keys are tags from df_sql and the values are lists containing: - Index of the tag’s last change. - Percentage position of the index within the dataframe. - Number of days since the last change. - Boolean indicating if the tag is in the target columns of datasets.

Return type

dict

Notes

The function leverages the get_constant_tags function to identify the tags with constant values.

wip.datatools.data_checks.find_last_non_missing_column(tag_name: str, df_sql_raw: DataFrame) Tuple[datetime, datetime, Timedelta][source]#

Find the last non-missing value in a specific column of a pandas DataFrame.

Parameters
  • tag_name (str) – The name of the column to search for the last non-missing value.

  • df_sql_raw (pd.DataFrame) – The pandas DataFrame to search within.

Returns

Function returns the following information:

  • The last non-missing date in the specified column.

  • The maximum datetime value in the DataFrame index.

  • The difference between last_datetime and last_non_null_date.

Return type

Tuple[datetime, datetime, pd.Timedelta]

Examples

>>> tag_name = "column_name"
>>> df_sql_raw = pd.DataFrame({"column_name": [1, 2, None, 4, None, 6]})
>>> last_non_null_date, last_datetime, date_diff = find_last_non_missing_column(
...     tag_name, df_sql_raw
... )
>>> print(last_non_null_date)  # Output: 4
4
>>> print(last_datetime)  # Output: 6
6
>>> print(date_diff)  # Output: 2 days
2
wip.datatools.data_checks.find_last_non_missing_columns(datasets: Dict[str, DataFrame], df_sql_raw: DataFrame) DataFrame[source]#

Find the last non-missing dates for each column from datasets dataframes.

The function returns a new DataFrame that contains each tag name, last non-null date, and the difference between the last non-null date and the last datetime for each column.

Parameters
  • datasets (Dict[str, pd.DataFrame]) – A dictionary where the keys are dataset names and the values are pandas DataFrames. Each DataFrame represents a dataset and contains columns with tag names.

  • df_sql_raw (pd.DataFrame) – A pandas DataFrame that represents the raw SQL data. It contains columns with tag names.

Returns

A pandas DataFrame that contains the tag name, last non-null date, and date difference for each tag name in df_sql_raw that is also present in the columns of the DataFrames in datasets.

Return type

pd.DataFrame

wip.datatools.data_checks.get_constant_tags(df_sql: DataFrame, datasets: Dict[str, DataFrame], window: str = '30D') List[str][source]#

Return tags with constant values over a rolling window.

For the input df_sql, this function identifies columns/tags which have standard deviation of zero over the specified rolling window. It then returns those tags which are also present in the provided datasets.

Parameters
  • df_sql (pd.DataFrame) – Input dataframe which contains time series data.

  • datasets (Dict[str, pd.DataFrame]) – Dictionary of dataframes. Each dataframe contains columns which are potential tags to be checked for constancy in df_sql.

  • window (str, optional) – Rolling window specification for which standard deviation is computed, default is “30D” (30 days).

Returns

List of tags that have constant values over the specified rolling window in df_sql and are present in datasets.

Return type

List[str]