outlier_removal
The outlier_removal module provides functions for removing outliers from data.
Functions:
identify_and_replace_outliers: Identify and replace outliers in a DataFrame’s numeric columns.
convert_columns_to_numeric: Convert DataFrame columns to numeric values.
knn_impute: Impute missing values in a DataFrame using KNN imputation.
winsorize_outliers_except_last: Winsorize outliers in all columns except the last column.
cleaner: Handle outliers and perform KNN imputation.
Notes
The main function of this module is the cleaner function. The other functions are auxiliary functions that are used inside the cleaner function.
- wip.datatools.outlier_removal.cleaner(datasets: Dict[str, DataFrame], threshold: float = 1.5, threshold_winsorize: float = 0.05, threshold_remove: float = 1.5, n_neighbors: int = 30) -> Dict[str, DataFrame]
Handle outliers and perform KNN imputation.
- Parameters
datasets (Dict[str, pd.DataFrame]) – Dictionary of dataset names to DataFrames.
threshold (float, default 1.5) – IQR multiplier for outlier identification.
threshold_winsorize (float, default 0.05) – Data fraction for winsorization at both tails.
threshold_remove (float, default 1.5) – IQR multiplier to remove outliers from the last column.
n_neighbors (int, default 30) – Number of neighbors for KNN imputation.
- Returns
Dictionary of cleaned DataFrames.
- Return type
Dict[str, pd.DataFrame]
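The threshold_remove parameter is described as an IQR multiplier for removing outliers from the last column. The source does not show how cleaner applies it, so the following is only a minimal sketch of what such a row filter could look like. The helper name remove_last_column_outliers and the fixed 25th/75th percentiles are assumptions, not part of the wip package.

```python
import pandas as pd


def remove_last_column_outliers(df: pd.DataFrame, threshold: float = 1.5) -> pd.DataFrame:
    """Hypothetical helper: drop rows whose last-column value lies outside the IQR bounds."""
    target = df.iloc[:, -1]
    q1, q3 = target.quantile(0.25), target.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - threshold * iqr, q3 + threshold * iqr
    # Keep rows whose target value is inside [lower, upper] (bounds included).
    # Rows with a missing target are dropped too, because between() is False for NaN.
    return df[target.between(lower, upper)]


df = pd.DataFrame(
    {"feature": [10, 11, 12, 13, 14, 15], "target": [1, 2, 3, 4, 5, 100]}
)
filtered = remove_last_column_outliers(df)
print(filtered)
```

Unlike replacement or winsorization, this approach shortens the DataFrame, which is why it is usually reserved for the target column.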
- wip.datatools.outlier_removal.convert_columns_to_numeric(df: DataFrame) -> DataFrame
Convert all columns in the DataFrame to numeric, coercing when necessary.
Non-convertible values are set to NaN, then all NaNs in a column are filled with 0. This ensures the DataFrame is suitable for numerical operations and algorithms that require numeric input.
- Parameters
df (pd.DataFrame) – The DataFrame to convert.
- Returns
The DataFrame with all columns converted to numeric types.
- Return type
pd.DataFrame
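The coerce-then-fill behaviour described above can be illustrated with plain pandas. This is a sketch under the assumption that the conversion relies on pd.to_numeric with errors="coerce"; the source does not confirm the actual implementation, and to_numeric_sketch is a hypothetical name.

```python
import pandas as pd


def to_numeric_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for convert_columns_to_numeric."""
    # errors="coerce" turns values that cannot be parsed into NaN.
    converted = df.apply(pd.to_numeric, errors="coerce")
    # Every NaN (coerced or originally missing) is then replaced with 0.
    return converted.fillna(0)


raw = pd.DataFrame({"a": ["1", "2.5", "n/a"], "b": [3, None, "x"]})
numeric = to_numeric_sketch(raw)
print(numeric)
```

One consequence worth keeping in mind: after the fill, coerced or missing entries can no longer be told apart from genuine zeros.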
- wip.datatools.outlier_removal.identify_and_replace_outliers(df: pd.DataFrame, columns: List[str] | None = None, exclude_columns: List[str] | None = None, threshold: float = 1.5, q1: float = 0.25, q2: float = 0.75) -> pd.DataFrame
Identify and replace outliers in a DataFrame’s numeric columns.
This function goes through each numeric column in a pandas DataFrame and replaces values that fall outside the bounds defined by the interquartile range (IQR) and the threshold with the nearest value inside those bounds. The IQR is calculated for each column from the quantiles given by q1 and q2, by default Q1 (25th percentile) and Q3 (75th percentile). Values below Q1 - (IQR * threshold) or above Q3 + (IQR * threshold) are considered outliers and are replaced.
- Parameters
df (pd.DataFrame) – The DataFrame containing the data to process.
columns (List[str] | None, default None) – A list of column names to include in the outlier removal process.
exclude_columns (List[str] | None, default None) – A list of column names to exclude from the outlier removal process.
threshold (float, default 1.5) – The multiplier for IQR to define the cut-off beyond which values are considered outliers.
q1 (float, default 0.25) – The lower quartile to calculate IQR. Default is 0.25 (25th percentile).
q2 (float, default 0.75) – The upper quartile to calculate IQR. Default is 0.75 (75th percentile).
- Returns
DataFrame with outliers replaced by the nearest value within the acceptable range as defined by the IQR threshold.
- Return type
pd.DataFrame
- Raises
ValueError – If either q1 or q2 is not a numeric value between 0 and 1, or if the value of q1 is greater than the value of q2.
Examples
>>> import pandas as pd
>>> data = {'value': [1, 2, 3, 4, 5, 100]}
>>> df = pd.DataFrame(data)
>>> cleaned_df = identify_and_replace_outliers(df)
>>> print(cleaned_df)
   value
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    8.5
In the example above, the last value of df was replaced: 100 became 8.5.
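The value 8.5 follows from the documented formula. With pandas' default linear quantile interpolation, Q1 = 2.25 and Q3 = 4.75, so IQR = 2.5 and the upper bound is 4.75 + 1.5 * 2.5 = 8.5 (the lower bound is -1.5, so no small value is touched). The sketch below reproduces that arithmetic for a single column. It assumes the replacement is equivalent to clipping at the bounds, which matches the example but is not confirmed by the source; clip_to_iqr_bounds is a hypothetical name.

```python
import pandas as pd


def clip_to_iqr_bounds(
    series: pd.Series, threshold: float = 1.5, q1: float = 0.25, q2: float = 0.75
) -> pd.Series:
    """Hypothetical helper: clip values to [Q1 - t * IQR, Q3 + t * IQR]."""
    lower_q = series.quantile(q1)
    upper_q = series.quantile(q2)
    iqr = upper_q - lower_q
    return series.clip(
        lower=lower_q - threshold * iqr,
        upper=upper_q + threshold * iqr,
    )


values = pd.Series([1, 2, 3, 4, 5, 100], dtype=float)
clipped = clip_to_iqr_bounds(values)
print(clipped.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 8.5]
```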
- wip.datatools.outlier_removal.knn_impute(df: DataFrame, n_neighbors: int = 30) -> DataFrame
Impute missing values in a DataFrame using KNN imputation.
- Parameters
df (pd.DataFrame) – The pandas.DataFrame to impute.
n_neighbors (int, default 30) – The number of neighboring samples to use for imputation.
- Returns
A DataFrame with missing values imputed.
- Return type
pd.DataFrame
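For readers unfamiliar with KNN imputation, here is a minimal sketch built on scikit-learn's KNNImputer. Whether knn_impute uses scikit-learn internally is an assumption; the source only states the technique. The name knn_impute_sketch is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def knn_impute_sketch(df: pd.DataFrame, n_neighbors: int = 30) -> pd.DataFrame:
    """Hypothetical stand-in for knn_impute, built on scikit-learn."""
    imputer = KNNImputer(n_neighbors=n_neighbors)
    # fit_transform returns a NumPy array, so the labels are restored afterwards.
    # Caveat: by default KNNImputer drops columns that are entirely NaN,
    # which would make the column labels below mismatch.
    imputed = imputer.fit_transform(df)
    return pd.DataFrame(imputed, columns=df.columns, index=df.index)


df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, np.nan, 30.0, 40.0]}
)
filled = knn_impute_sketch(df, n_neighbors=2)
print(filled)
```

Each missing entry is filled with the mean of that column over the nearest rows, where distance is computed on the features both rows have in common. The default of 30 neighbors averages over a fairly wide neighborhood, which smooths the imputed values.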
- wip.datatools.outlier_removal.winsorize_outliers_except_last(df: DataFrame, threshold: float = 0.05) -> DataFrame
Winsorize outliers in all columns except the last, excluding ‘floticor’ and ‘status’.
- Parameters
df (pd.DataFrame) – The pandas.DataFrame to process.
threshold (float, default 0.05) – A fraction of the data to winsorize on both tails.
- Returns
The pandas.DataFrame with the columns winsorized.
- Return type
pd.DataFrame
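Winsorization caps extreme values instead of dropping them. The sketch below approximates it by clipping each selected column to its threshold and 1 - threshold quantiles. This is an assumption about the mechanics: the real function may instead use something like scipy.stats.mstats.winsorize, which replaces extremes with the nearest retained observation rather than an interpolated quantile. The name winsorize_sketch and the skip argument are hypothetical.

```python
import pandas as pd


def winsorize_sketch(
    df: pd.DataFrame,
    threshold: float = 0.05,
    skip: tuple = ("floticor", "status"),
) -> pd.DataFrame:
    """Hypothetical stand-in: clip selected columns to their tail quantiles."""
    result = df.copy()
    # Every column except the last one, minus the explicitly skipped names.
    columns = [c for c in df.columns[:-1] if c not in skip]
    for column in columns:
        lower = df[column].quantile(threshold)
        upper = df[column].quantile(1 - threshold)
        result[column] = df[column].clip(lower=lower, upper=upper)
    return result


df = pd.DataFrame(
    {
        "x": [float(v) for v in range(1, 20)] + [1000.0],
        "status": [0.0] * 19 + [500.0],
        "target": [float(v) for v in range(1, 20)] + [9999.0],
    }
)
winsorized = winsorize_sketch(df)
print(winsorized.tail(3))
```

In this toy input only x is capped; status is skipped by name and target is left alone because it is the last column.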