outlier_removal

The outlier_removal module provides functions for identifying and treating outliers and for imputing missing values.

Functions:

identify_and_replace_outliers: Identify and replace outliers in a DataFrame’s numeric columns.
convert_columns_to_numeric: Convert DataFrame columns to numeric values.
knn_impute: Impute missing values in a DataFrame using KNN imputation.
winsorize_outliers_except_last: Winsorize outliers in all columns except the last column.
cleaner: Handle outliers and perform KNN imputation.

Notes

cleaner is the main function of this module. The other functions are helpers that cleaner calls internally.

wip.datatools.outlier_removal.cleaner(datasets: Dict[str, DataFrame], threshold: float = 1.5, threshold_winsorize: float = 0.05, threshold_remove: float = 1.5, n_neighbors: int = 30) → Dict[str, DataFrame]

Handle outliers and perform KNN imputation.

Parameters

datasets (Dict[str, pd.DataFrame]) – Dictionary of dataset names to DataFrames.
threshold (float, default 1.5) – IQR multiplier for outlier identification.
threshold_winsorize (float, default 0.05) – Data fraction for winsorization at both tails.
threshold_remove (float, default 1.5) – IQR multiplier to remove outliers from the last column.
n_neighbors (int, default 30) – Number of neighbors for KNN imputation.

Returns

Dictionary of cleaned DataFrames.

Return type

Dict[str, pd.DataFrame]

wip.datatools.outlier_removal.convert_columns_to_numeric(df: DataFrame) → DataFrame

Convert all columns in the DataFrame to numeric, coercing when necessary.

Non-convertible values are set to NaN, then all NaNs in a column are filled with 0. This ensures the DataFrame is suitable for numerical operations and algorithms that require numeric input.

Parameters

df (pd.DataFrame) – The DataFrame to convert.

Returns

The DataFrame with all columns converted to numeric types.

Return type

pd.DataFrame
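
The coerce-then-fill behavior described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the helper name to_numeric_sketch and the sample data are hypothetical, and it assumes the conversion relies on pandas.to_numeric with errors="coerce".

```python
import pandas as pd


def to_numeric_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for convert_columns_to_numeric."""
    # Coerce every column; values that cannot be parsed become NaN.
    converted = df.apply(pd.to_numeric, errors="coerce")
    # Replace the resulting NaNs (and any pre-existing ones) with 0.
    return converted.fillna(0)


raw = pd.DataFrame({"a": ["1", "2.5", "n/a"], "b": [10, None, 30]})
numeric = to_numeric_sketch(raw)
print(numeric)
```

Note that under this approach a value that was missing and a value that was genuinely 0 become indistinguishable after the conversion.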

wip.datatools.outlier_removal.identify_and_replace_outliers(df: pd.DataFrame, columns: List[str] | None = None, exclude_columns: List[str] | None = None, threshold: float = 1.5, q1: float = 0.25, q2: float = 0.75) → pd.DataFrame

Identify and replace outliers in a DataFrame’s numeric columns.

This function goes through each numeric column in a pandas DataFrame and replaces every value that falls outside the IQR-based bounds with the nearest bound. The interquartile range (IQR) is calculated for each column from the specified quantiles, by default Q1 (25th percentile) and Q3 (75th percentile), as IQR = Q3 - Q1. Values below Q1 - (IQR * threshold) or above Q3 + (IQR * threshold) are considered outliers and are replaced with the lower or upper bound, respectively.

Parameters

df (pd.DataFrame) – The DataFrame containing the data to process.
columns (List[str] | None, default None) – A list of column names to include in the outlier removal process.
exclude_columns (List[str] | None, default None) – A list of column names to exclude from the outlier removal process.
threshold (float, default 1.5) – The multiplier for IQR to define the cut-off beyond which values are considered outliers.
q1 (float, default 0.25) – The lower quantile used to calculate the IQR (Q1, the 25th percentile by default).
q2 (float, default 0.75) – The upper quantile used to calculate the IQR (Q3, the 75th percentile by default).

Returns

DataFrame with outliers replaced by the nearest value within the acceptable range as defined by the IQR threshold.

Return type

pd.DataFrame

Raises

ValueError – If q1 or q2 is not a numeric value between 0 and 1, or if q1 is greater than q2.

Examples

>>> import pandas as pd
>>> from wip.datatools.outlier_removal import identify_and_replace_outliers
>>> data = {'value': [1, 2, 3, 4, 5, 100]}
>>> df = pd.DataFrame(data)
>>> cleaned_df = identify_and_replace_outliers(df)
>>> print(cleaned_df)
   value
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    8.5

In the above example, the last value of df was replaced from 100 to 8.5, the upper bound Q3 + (IQR * threshold) for this column.
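
The arithmetic behind that result can be checked by hand. The sketch below assumes the quantiles are computed with pandas' default linear interpolation, which reproduces the documented output; the variable names are illustrative and are not part of the module.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 100], dtype=float)
threshold = 1.5

# Quantiles with pandas' default (linear) interpolation.
q1 = values.quantile(0.25)  # 2.25
q3 = values.quantile(0.75)  # 4.75
iqr = q3 - q1               # 2.5

# Bounds beyond which a value counts as an outlier.
lower = q1 - iqr * threshold  # -1.5
upper = q3 + iqr * threshold  # 8.5

# Replace each outlier with the nearest bound.
clipped = values.clip(lower=lower, upper=upper)
print(clipped.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 8.5]
```

Only 100 lies outside the interval [-1.5, 8.5], so it is the only value that changes.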

wip.datatools.outlier_removal.knn_impute(df: DataFrame, n_neighbors: int = 30) → DataFrame

Impute missing values in a DataFrame using KNN imputation.

Parameters

df (pd.DataFrame) – The pandas.DataFrame to impute.
n_neighbors (int, default 30) – The number of neighboring samples to use for imputation.

Returns

A DataFrame with missing values imputed.

Return type

pd.DataFrame
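
As a rough illustration of KNN imputation, the sketch below uses scikit-learn's KNNImputer. Whether knn_impute is built on that class is an assumption, and the helper name knn_impute_sketch and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def knn_impute_sketch(df: pd.DataFrame, n_neighbors: int = 30) -> pd.DataFrame:
    """Hypothetical stand-in for knn_impute (assumes scikit-learn)."""
    imputer = KNNImputer(n_neighbors=n_neighbors)
    # fit_transform returns a NumPy array, so restore the labels afterwards.
    imputed = imputer.fit_transform(df)
    return pd.DataFrame(imputed, columns=df.columns, index=df.index)


frame = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, np.nan, 30.0, 40.0]}
)
filled = knn_impute_sketch(frame, n_neighbors=2)
print(filled)
```

Each missing entry is filled with the mean of that column over the n_neighbors nearest rows that have a value for it, so every column should be numeric before this step.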

wip.datatools.outlier_removal.winsorize_outliers_except_last(df: DataFrame, threshold: float = 0.05) → DataFrame

Winsorize outliers in all columns except the last. The ‘floticor’ and ‘status’ columns are also excluded.

Parameters

df (pd.DataFrame) – The pandas.DataFrame to process.
threshold (float, default 0.05) – A fraction of the data to winsorize on both tails.

Returns

The pandas.DataFrame with the columns winsorized.

Return type

pd.DataFrame
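
Winsorization caps extreme values at chosen percentiles instead of dropping them. The sketch below approximates it by clipping each column at the threshold and 1 - threshold quantiles. It is an assumption that the module works this way, and the helper name, the skip parameter, and the sample data are hypothetical.

```python
import pandas as pd


def winsorize_except_last_sketch(
    df: pd.DataFrame,
    threshold: float = 0.05,
    skip: tuple = ("floticor", "status"),
) -> pd.DataFrame:
    """Hypothetical stand-in for winsorize_outliers_except_last."""
    out = df.copy()
    # Every column except the last one is a candidate.
    for col in out.columns[:-1]:
        if col in skip:
            continue  # assumed meaning of the documented exclusions
        low = out[col].quantile(threshold)
        high = out[col].quantile(1 - threshold)
        # Cap both tails at the chosen quantiles.
        out[col] = out[col].clip(lower=low, upper=high)
    return out


sample = pd.DataFrame(
    {
        "feature": [float(v) for v in range(20)] + [1000.0],
        "status": [0.0] * 20 + [500.0],
        "target": [float(v) for v in range(20)] + [999.0],
    }
)
winsorized = winsorize_except_last_sketch(sample)
print(winsorized.tail(3))
```

In this sample only the feature column is capped; status is skipped by name and target is left alone because it is the last column.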