data_checks#
Module contains data quality check functions.
Data quality functions:
constant_tags_summary
: save a summary of tags with constant values to an Excel file.find_last_non_missing_columns
: for every column in a dictionary of dataframes, find the first non-null element of each column, in respect to the last existing index.
The other functions present in this module are intended to be used internally by the functions listed above.
- wip.datatools.data_checks.constant_tags_summary(df_sql: pd.DataFrame, datasets: Dict[str, pd.DataFrame], save_path: str | Path, max_constant_columns: int | None = 100, window: str = '30D')[source]#
Identify and save summary of tags with constant values to an Excel or CSV file.
This function finds the tags in the
df_sql
dataframe which have constant values over the specified rolling window. It then saves a summary to an Excel or CSV file, depending on whether the function is being executed inside Databricks or locally. If the number of constant tags exceeds the limit specified by the parameter :param:`max_constant_columns`, aValueError
exception is raised and the regression models training process gets interrupted.This function is intended to be used as a quality gate to the process. If there’s something wrong with the input data, it’s better to interrupt the execution and inform the error, rather than keep going and generate results that are completely wrong.
- Parameters
df_sql (
pd.DataFrame
) – Input dataframe which contains time series data.datasets (
Dict[str
,pd.DataFrame]
) – Dictionary of dataframes. The last column of each dataframe represents target columns to be checked for constancy indf_sql
.save_path (
str | Path
) – The filepath where the Excel file with the summary of constant tags will be saved.max_constant_columns (
int | None
, default100
) – The maximum number of constant columns allowed. If the number of constant columns exceeds this limit, aValueError
is raised. If set to None, this function only raises an exception if all columns indf_sql
have constant or missing values.window (
str
, default"30D"
) – Rolling window specification for which standard deviation is computed, default is “30D” (30 days).
- Raises
ValueError – If the number of constant tags exceeds the
max_constant_columns
limit.
Notes
This function uses the
get_constant_tags
function to determine the tags with constant values. The resulting summary contains the number of constant values, the percentage of constant values, the number of days the value remains constant, and a boolean flag indicating if the tag is a target.
- wip.datatools.data_checks.create_errors_dict(datasets: Dict[str, pd.DataFrame], df_sql: pd.DataFrame, window: str = '30D') Dict[str, List[int | str | int | bool]] [source]#
Generate a dictionary summary of tags with constant values.
This function evaluates tags in the
df_sql
which have constant values over the specified rolling window and creates a summary dictionary. Each key in the dictionary corresponds to a tag, and its associated value is a list containing the number of constant values, the percentage of constant values in relation to the total, the number of days the value remains constant, and a boolean indicating if the tag is a target based ondatasets
.- Parameters
datasets (
Dict[str
,pd.DataFrame]
) – Dictionary of dataframes. The last column of each dataframe represents target columns to be checked for constancy indf_sql
.df_sql (
pd.DataFrame
) – Input dataframe which contains time series data.window (
str
, default"30D"
) – Rolling window specification for which standard deviation is computed.
- Returns
Dictionary where keys are tags from
df_sql
and the values are lists containing: - Index of the tag’s last change. - Percentage position of the index within the dataframe. - Number of days since the last change. - Boolean indicating if the tag is in the target columns ofdatasets
.- Return type
Notes
The function leverages the
get_constant_tags
function to identify the tags with constant values.
- wip.datatools.data_checks.find_last_non_missing_column(tag_name: str, df_sql_raw: DataFrame) Tuple[datetime, datetime, Timedelta] [source]#
Find the last non-missing value in a specific column of a pandas DataFrame.
- Parameters
tag_name (
str
) – The name of the column to search for the last non-missing value.df_sql_raw (
pd.DataFrame
) – The pandas DataFrame to search within.
- Returns
Function returns the following information:
The last non-missing date in the specified column.
The maximum datetime value in the DataFrame index.
The difference between last_datetime and last_non_null_date.
- Return type
Tuple[datetime
,datetime
,pd.Timedelta]
Examples
>>> tag_name = "column_name" >>> df_sql_raw = pd.DataFrame({"column_name": [1, 2, None, 4, None, 6]}) >>> last_non_null_date, last_datetime, date_diff = find_last_non_missing_column( ... tag_name, df_sql_raw ... ) >>> print(last_non_null_date) # Output: 4 4 >>> print(last_datetime) # Output: 6 6 >>> print(date_diff) # Output: 2 days 2
- wip.datatools.data_checks.find_last_non_missing_columns(datasets: Dict[str, DataFrame], df_sql_raw: DataFrame) DataFrame [source]#
Find the last non-missing dates for each column from
datasets
dataframes.The function returns a new DataFrame that contains each tag name, last non-null date, and the difference between the last non-null date and the last datetime for each column.
- Parameters
datasets (
Dict[str
,pd.DataFrame]
) – A dictionary where the keys are dataset names and the values are pandas DataFrames. Each DataFrame represents a dataset and contains columns with tag names.df_sql_raw (
pd.DataFrame
) – A pandas DataFrame that represents the raw SQL data. It contains columns with tag names.
- Returns
A pandas DataFrame that contains the tag name, last non-null date, and date difference for each tag name in
df_sql_raw
that is also present in the columns of the DataFrames in datasets.- Return type
pd.DataFrame
- wip.datatools.data_checks.get_constant_tags(df_sql: DataFrame, datasets: Dict[str, DataFrame], window: str = '30D') List[str] [source]#
Return tags with constant values over a rolling window.
For the input
df_sql
, this function identifies columns/tags which have standard deviation of zero over the specified rolling window. It then returns those tags which are also present in the provided datasets.- Parameters
df_sql (
pd.DataFrame
) – Input dataframe which contains time series data.datasets (
Dict[str
,pd.DataFrame]
) – Dictionary of dataframes. Each dataframe contains columns which are potential tags to be checked for constancy indf_sql
.window (
str
, optional) – Rolling window specification for which standard deviation is computed, default is “30D” (30 days).
- Returns
List of tags that have constant values over the specified rolling window in
df_sql
and are present indatasets
.- Return type
List[str]