binarize

binarize(data, n_bins=4, thresh_infreq=0.01, name_infreq='-OTHER', one_hot=True)

The binarize function prepares data for correlate, which computes the correlations shown in correlation funnel plots.

Binarization does the following:

  1. Takes in a pandas DataFrame or DataFrameGroupBy object and converts non-numeric columns to categorical,
  2. replaces boolean columns with integers,
  3. checks data types and missing values,
  4. fixes low-cardinality numeric data,
  5. fixes high-skew numeric data, and
  6. applies a transformation to create a new DataFrame with binarized data.
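
The steps above can be sketched with plain pandas. This is a minimal illustration under stated assumptions, not pytimetk's implementation: the function binarize_sketch and its max_unique cutoff are hypothetical, skew handling (step 5) is omitted, and missing values simply raise an error.

```python
import numpy as np
import pandas as pd


def binarize_sketch(df: pd.DataFrame, n_bins: int = 4, max_unique: int = 5) -> pd.DataFrame:
    """Hypothetical re-creation of the listed steps (not pytimetk's code)."""
    out = df.copy()

    # Step 2: boolean columns become 0/1 integers.
    for col in out.select_dtypes(include="bool").columns:
        out[col] = out[col].astype(int)

    # Step 3: this sketch refuses missing values instead of repairing them.
    if out.isna().any().any():
        raise ValueError("missing values are not handled in this sketch")

    pieces = []
    for col in out.columns:
        s = out[col]
        if pd.api.types.is_numeric_dtype(s) and s.nunique() > max_unique:
            # Numeric column with many distinct values: quantile-bin it.
            s = pd.qcut(s, q=n_bins, duplicates="drop")
        else:
            # Steps 1 and 4: non-numeric and low-cardinality numeric
            # columns are treated as categorical (max_unique is an assumption).
            s = s.astype("category")
        # Step 6: one 0/1 indicator column per bin or category.
        pieces.append(pd.get_dummies(s, prefix=col, prefix_sep="__", dtype="uint8"))
    return pd.concat(pieces, axis=1)
```

Each original column expands into a group of indicator columns, and every row has exactly one 1 within each group.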

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| data | Union[pd.DataFrame, pd.core.groupby.generic.DataFrameGroupBy] | The input data to binarize: a pandas DataFrame or a DataFrameGroupBy object. | required |
| n_bins | int | The number of bins to use when binarizing numeric data. It is used in the create_recipe function to set the number of bins for each numeric column; pd.qcut() performs the binning. | 4 |
| thresh_infreq | float | The threshold for infrequent categories. Categories with a frequency below this threshold are grouped together and labeled with the name given in name_infreq. | 0.01 |
| name_infreq | str | The name assigned to the category that collects infrequent values in a non-numeric column. | '-OTHER' |
| one_hot | bool | Whether to one-hot encode the categorical variables after binarization. If True, a binary column is created for each unique category. | True |
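
To make thresh_infreq and name_infreq concrete, here is a minimal sketch of lumping rare categories. The helper lump_infrequent is hypothetical, and it assumes a strict "below the threshold" comparison on relative frequency; pytimetk's internal logic may differ.

```python
import pandas as pd


def lump_infrequent(
    s: pd.Series, thresh_infreq: float = 0.01, name_infreq: str = "-OTHER"
) -> pd.Series:
    """Hypothetical helper: relabel categories whose share of rows is below the threshold."""
    freq = s.value_counts(normalize=True)   # relative frequency per category
    rare = freq[freq < thresh_infreq].index  # categories strictly below the threshold
    return s.where(~s.isin(rare), name_infreq)


# Invented data: two common cities and two that each appear once in 100 rows.
city = pd.Series(["NY"] * 60 + ["LA"] * 38 + ["Boise"] + ["Reno"])
lumped = lump_infrequent(city, thresh_infreq=0.05)
print(lumped.value_counts().to_dict())
# {'NY': 60, 'LA': 38, '-OTHER': 2}
```

With the threshold raised to 0.05 for the demonstration, the two single-row cities collapse into one '-OTHER' category, so one-hot encoding would produce three columns instead of four.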

Returns

| Type | Description |
|---|---|
| pd.DataFrame | The transformed data: a new DataFrame produced by converting non-numeric columns to categorical, replacing boolean columns with integers, fixing low-cardinality and high-skew numeric data, and applying the binarization recipe. |

See Also

  • correlate() : Calculates the correlation between a target variable and all other variables in a pandas DataFrame.
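
As a rough illustration of what correlating binarized columns against a target involves, the sketch below uses pandas' DataFrame.corrwith (Pearson by default) on a tiny hand-made binary table. The column names are invented for the example, and correlate() itself may compute or shape its result differently.

```python
import pandas as pd

# Invented binary (0/1) columns in the "feature__bin" naming style.
binary = pd.DataFrame({
    "Member__Gold":   [1, 0, 1, 0, 1, 0],
    "Member__Silver": [0, 1, 0, 1, 0, 1],
    "Owns_Car__1":    [1, 0, 1, 1, 1, 0],
})

target = "Member__Gold"

# Pearson correlation of every column with the target,
# ordered by absolute strength (strongest first).
corr = binary.corrwith(binary[target]).sort_values(
    key=lambda v: v.abs(), ascending=False
)
print(corr)
```

The target correlates perfectly with itself, and the complementary bin of the same feature correlates at -1, which is why the other bins of the target feature tend to sit at the top of a correlation table.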

Examples

# NON-TIMESERIES EXAMPLE ----

import pandas as pd
import numpy as np
import pytimetk as tk

# Set a random seed for reproducibility
np.random.seed(0)

# Define the number of rows for your DataFrame
num_rows = 200

# Create fake data for the columns
data = {
    'Age': np.random.randint(18, 65, size=num_rows),
    'Gender': np.random.choice(['Male', 'Female'], size=num_rows),
    'Marital_Status': np.random.choice(['Single', 'Married', 'Divorced'], size=num_rows),
    'City': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami'], size=num_rows),
    'Years_Playing': np.random.randint(0, 30, size=num_rows),
    'Average_Income': np.random.randint(20000, 100000, size=num_rows),
    'Member_Status': np.random.choice(['Bronze', 'Silver', 'Gold', 'Platinum'], size=num_rows),
    'Number_Children': np.random.randint(0, 5, size=num_rows),
    'Own_House_Flag': np.random.choice([True, False], size=num_rows),
    'Own_Car_Count': np.random.randint(0, 3, size=num_rows),
    'PersonId': range(1, num_rows + 1),  # Add a PersonId column as a row count
    'Client': np.random.choice(['A', 'B'], size=num_rows)  # Add a Client column with random values 'A' or 'B'
}

# Create a DataFrame
df = pd.DataFrame(data)

# Binarize the data
df_binarized = df.binarize(n_bins=4, thresh_infreq=0.01, name_infreq="-OTHER", one_hot=True)

df_binarized.glimpse()    
<class 'pandas.core.frame.DataFrame'>: 200 rows of 42 columns
Age__18.0_29.0:                   uint8             [0, 1, 1, 1, 0, 1, 0 ...
Age__29.0_39.0:                   uint8             [0, 0, 0, 0, 0, 0, 1 ...
Age__39.0_53.0:                   uint8             [0, 0, 0, 0, 0, 0, 0 ...
Age__53.0_64.0:                   uint8             [1, 0, 0, 0, 1, 0, 0 ...
Years_Playing__0.0_7.0:           uint8             [0, 1, 0, 0, 0, 0, 0 ...
Years_Playing__7.0_15.0:          uint8             [0, 0, 1, 0, 1, 0, 1 ...
Years_Playing__15.0_22.0:         uint8             [1, 0, 0, 0, 0, 1, 0 ...
Years_Playing__22.0_29.0:         uint8             [0, 0, 0, 1, 0, 0, 0 ...
Average_Income__20131.0_40110.2:  uint8             [0, 0, 1, 0, 0, 0, 0 ...
Average_Income__40110.2_60649.5:  uint8             [0, 0, 0, 1, 1, 0, 1 ...
Average_Income__60649.5_79904.8:  uint8             [0, 1, 0, 0, 0, 0, 0 ...
Average_Income__79904.8_99131.0:  uint8             [1, 0, 0, 0, 0, 1, 0 ...
PersonId__1.0_50.8:               uint8             [1, 1, 1, 1, 1, 1, 1 ...
PersonId__50.8_100.5:             uint8             [0, 0, 0, 0, 0, 0, 0 ...
PersonId__100.5_150.2:            uint8             [0, 0, 0, 0, 0, 0, 0 ...
PersonId__150.2_200.0:            uint8             [0, 0, 0, 0, 0, 0, 0 ...
Gender__Female:                   uint8             [1, 0, 0, 0, 1, 0, 1 ...
Gender__Male:                     uint8             [0, 1, 1, 1, 0, 1, 0 ...
Marital_Status__Divorced:         uint8             [0, 0, 0, 0, 0, 0, 0 ...
Marital_Status__Married:          uint8             [1, 1, 0, 0, 1, 0, 0 ...
Marital_Status__Single:           uint8             [0, 0, 1, 1, 0, 1, 1 ...
City__Chicago:                    uint8             [0, 0, 1, 0, 0, 1, 0 ...
City__Houston:                    uint8             [0, 0, 0, 0, 0, 0, 1 ...
City__Los Angeles:                uint8             [0, 0, 0, 0, 0, 0, 0 ...
City__Miami:                      uint8             [0, 1, 0, 0, 0, 0, 0 ...
City__New York:                   uint8             [1, 0, 0, 1, 1, 0, 0 ...
Member_Status__Bronze:            uint8             [1, 0, 1, 0, 0, 0, 0 ...
Member_Status__Gold:              uint8             [0, 0, 0, 0, 0, 1, 1 ...
Member_Status__Platinum:          uint8             [0, 0, 0, 1, 0, 0, 0 ...
Member_Status__Silver:            uint8             [0, 1, 0, 0, 1, 0, 0 ...
Number_Children__0:               uint8             [0, 0, 1, 0, 0, 0, 0 ...
Number_Children__1:               uint8             [0, 0, 0, 0, 0, 0, 1 ...
Number_Children__2:               uint8             [0, 0, 0, 1, 0, 0, 0 ...
Number_Children__3:               uint8             [0, 1, 0, 0, 0, 1, 0 ...
Number_Children__4:               uint8             [1, 0, 0, 0, 1, 0, 0 ...
Own_House_Flag__0:                uint8             [1, 1, 0, 0, 1, 0, 1 ...
Own_House_Flag__1:                uint8             [0, 0, 1, 1, 0, 1, 0 ...
Own_Car_Count__0:                 uint8             [0, 1, 0, 0, 1, 0, 0 ...
Own_Car_Count__1:                 uint8             [0, 0, 0, 1, 0, 1, 1 ...
Own_Car_Count__2:                 uint8             [1, 0, 1, 0, 0, 0, 0 ...
Client__A:                        uint8             [1, 1, 1, 1, 1, 1, 1 ...
Client__B:                        uint8             [0, 0, 0, 0, 0, 0, 0 ...
df_correlated = df_binarized.correlate(target='Member_Status__Platinum')
df_correlated.head(10)
    feature          bin              correlation
28  Member_Status    Platinum            1.000000
26  Member_Status    Bronze             -0.341351
29  Member_Status    Silver             -0.332799
27  Member_Status    Gold               -0.298637
30  Number_Children  0                   0.205230
8   Average_Income   20131.0_40110.2    -0.156593
0   Age              18.0_29.0          -0.135522
11  Average_Income   79904.8_99131.0     0.115743
33  Number_Children  3                  -0.112216
7   Years_Playing    22.0_29.0          -0.106763
# Interactive
df_correlated.plot_correlation_funnel(
    engine='plotly',
    height=600
)
# Static
df_correlated.plot_correlation_funnel(
    engine='plotnine',
    height=900
)

<Figure Size: (700 x 900)>