binarize

binarize(data, n_bins=4, thresh_infreq=0.01, name_infreq='-OTHER', one_hot=True)

The binarize function prepares data for correlate, which computes the correlations used to build correlation funnel plots.

Binarization does the following:

Takes in a pandas DataFrame or DataFrameGroupBy object,
converts non-numeric columns to categorical,
replaces boolean columns with integers,
checks data types and missing values,
fixes low-cardinality numeric data,
fixes high-skew numeric data, and
finally applies a transformation to create a new DataFrame with binarized data.
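The core of these steps can be approximated with plain pandas. The following is a minimal sketch, not pytimetk's actual implementation: the `toy` DataFrame and the `binarized_sketch` name are made up for illustration, and the low-cardinality and high-skew fixes are omitted.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.integers(18, 65, size=100),
    "Gender": rng.choice(["Male", "Female"], size=100),
    "Own_House_Flag": rng.choice([True, False], size=100),
})

# Replace boolean columns with integers
toy["Own_House_Flag"] = toy["Own_House_Flag"].astype(int)

# Bin the numeric column into (up to) 4 quantile bins, as n_bins=4 would
toy["Age"] = pd.qcut(toy["Age"], q=4, duplicates="drop")

# One-hot encode every column into 0/1 indicator columns
binarized_sketch = pd.get_dummies(
    toy,
    columns=["Age", "Gender", "Own_House_Flag"],
    prefix_sep="__",
    dtype=int,
)

print(binarized_sketch.columns.tolist())
```

Each row ends up with exactly one `1` among the indicator columns derived from a given source column, which is the shape of output that `binarize` produces in the example further down.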

Parameters

Name	Type	Description	Default
data	Union[pd.DataFrame, pd.core.groupby.generic.DataFrameGroupBy]	The input data to binarize: either a pandas DataFrame or a DataFrameGroupBy object.	required
n_bins	int	The number of bins to use when binarizing numeric data. It is used in the `create_recipe` function to set the number of bins for each numeric column; `pd.qcut()` performs the binning.	`4`
thresh_infreq	float	The threshold for infrequent categories. Categories whose frequency falls below this threshold are grouped together and labeled with the name given in `name_infreq`.	`0.01`
name_infreq	str	The name assigned to the category that collects infrequent values in a column. Applies when binarizing non-numeric columns.	`'-OTHER'`
one_hot	bool	Whether to one-hot encode the categorical variables after binarization. If `True`, a binary column is created for each unique category.	`True`
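To make `thresh_infreq` and `name_infreq` concrete, here is a small sketch of how infrequent categories could be lumped together in plain pandas. It assumes the threshold is a proportion of rows and that the comparison is strict (`<`); the source does not confirm either detail, and the `city` Series is invented for the example.

```python
import pandas as pd

# Hypothetical column: "Reno" appears in 1 of 200 rows (0.5%)
city = pd.Series(["New York"] * 120 + ["Chicago"] * 79 + ["Reno"] * 1, name="City")

thresh_infreq = 0.01
name_infreq = "-OTHER"

# Relative frequency of each category
freq = city.value_counts(normalize=True)

# Categories below the threshold (assumed strict comparison)
rare = freq[freq < thresh_infreq].index

# Replace rare categories with the catch-all label
lumped = city.where(~city.isin(rare), name_infreq)

print(lumped.value_counts())
```

With one-hot encoding applied afterwards, such a column would yield a `City__-OTHER` indicator instead of one indicator per rare city.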

Returns

Name	Type	Description
	pd.DataFrame	The transformed data after applying preprocessing steps such as converting non-numeric columns to categorical, replacing boolean columns with integers, fixing low-cardinality numeric data, fixing high-skew numeric data, and creating a recipe for binarization.

Examples

# NON-TIMESERIES EXAMPLE ----

import pandas as pd
import numpy as np
import pytimetk as tk

# Set a random seed for reproducibility
np.random.seed(0)

# Define the number of rows for your DataFrame
num_rows = 200

# Create fake data for the columns
data = {
    'Age': np.random.randint(18, 65, size=num_rows),
    'Gender': np.random.choice(['Male', 'Female'], size=num_rows),
    'Marital_Status': np.random.choice(['Single', 'Married', 'Divorced'], size=num_rows),
    'City': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami'], size=num_rows),
    'Years_Playing': np.random.randint(0, 30, size=num_rows),
    'Average_Income': np.random.randint(20000, 100000, size=num_rows),
    'Member_Status': np.random.choice(['Bronze', 'Silver', 'Gold', 'Platinum'], size=num_rows),
    'Number_Children': np.random.randint(0, 5, size=num_rows),
    'Own_House_Flag': np.random.choice([True, False], size=num_rows),
    'Own_Car_Count': np.random.randint(0, 3, size=num_rows),
    'PersonId': range(1, num_rows + 1),  # Add a PersonId column as a row count
    'Client': np.random.choice(['A', 'B'], size=num_rows)  # Add a Client column with random values 'A' or 'B'
}

# Create a DataFrame
df = pd.DataFrame(data)

# Binarize the data
df_binarized = df.binarize(n_bins=4, thresh_infreq=0.01, name_infreq="-OTHER", one_hot=True)

df_binarized.glimpse()

<class 'pandas.core.frame.DataFrame'>: 200 rows of 42 columns
Age__18.0_29.0:                   int64             [0, 1, 1, 1, 0, 1, 0 ...
Age__29.0_39.0:                   int64             [0, 0, 0, 0, 0, 0, 1 ...
Age__39.0_53.0:                   int64             [0, 0, 0, 0, 0, 0, 0 ...
Age__53.0_64.0:                   int64             [1, 0, 0, 0, 1, 0, 0 ...
Years_Playing__0.0_7.0:           int64             [0, 1, 0, 0, 0, 0, 0 ...
Years_Playing__7.0_15.0:          int64             [0, 0, 1, 0, 1, 0, 1 ...
Years_Playing__15.0_22.0:         int64             [1, 0, 0, 0, 0, 1, 0 ...
Years_Playing__22.0_29.0:         int64             [0, 0, 0, 1, 0, 0, 0 ...
Average_Income__20131.0_40110.2:  int64             [0, 0, 1, 0, 0, 0, 0 ...
Average_Income__40110.2_60649.5:  int64             [0, 0, 0, 1, 1, 0, 1 ...
Average_Income__60649.5_79904.8:  int64             [0, 1, 0, 0, 0, 0, 0 ...
Average_Income__79904.8_99131.0:  int64             [1, 0, 0, 0, 0, 1, 0 ...
PersonId__1.0_50.8:               int64             [1, 1, 1, 1, 1, 1, 1 ...
PersonId__50.8_100.5:             int64             [0, 0, 0, 0, 0, 0, 0 ...
PersonId__100.5_150.2:            int64             [0, 0, 0, 0, 0, 0, 0 ...
PersonId__150.2_200.0:            int64             [0, 0, 0, 0, 0, 0, 0 ...
Gender__Female:                   int64             [1, 0, 0, 0, 1, 0, 1 ...
Gender__Male:                     int64             [0, 1, 1, 1, 0, 1, 0 ...
Marital_Status__Divorced:         int64             [0, 0, 0, 0, 0, 0, 0 ...
Marital_Status__Married:          int64             [1, 1, 0, 0, 1, 0, 0 ...
Marital_Status__Single:           int64             [0, 0, 1, 1, 0, 1, 1 ...
City__Chicago:                    int64             [0, 0, 1, 0, 0, 1, 0 ...
City__Houston:                    int64             [0, 0, 0, 0, 0, 0, 1 ...
City__Los Angeles:                int64             [0, 0, 0, 0, 0, 0, 0 ...
City__Miami:                      int64             [0, 1, 0, 0, 0, 0, 0 ...
City__New York:                   int64             [1, 0, 0, 1, 1, 0, 0 ...
Member_Status__Bronze:            int64             [1, 0, 1, 0, 0, 0, 0 ...
Member_Status__Gold:              int64             [0, 0, 0, 0, 0, 1, 1 ...
Member_Status__Platinum:          int64             [0, 0, 0, 1, 0, 0, 0 ...
Member_Status__Silver:            int64             [0, 1, 0, 0, 1, 0, 0 ...
Number_Children__0:               int64             [0, 0, 1, 0, 0, 0, 0 ...
Number_Children__1:               int64             [0, 0, 0, 0, 0, 0, 1 ...
Number_Children__2:               int64             [0, 0, 0, 1, 0, 0, 0 ...
Number_Children__3:               int64             [0, 1, 0, 0, 0, 1, 0 ...
Number_Children__4:               int64             [1, 0, 0, 0, 1, 0, 0 ...
Own_House_Flag__0:                int64             [1, 1, 0, 0, 1, 0, 1 ...
Own_House_Flag__1:                int64             [0, 0, 1, 1, 0, 1, 0 ...
Own_Car_Count__0:                 int64             [0, 1, 0, 0, 1, 0, 0 ...
Own_Car_Count__1:                 int64             [0, 0, 0, 1, 0, 1, 1 ...
Own_Car_Count__2:                 int64             [1, 0, 1, 0, 0, 0, 0 ...
Client__A:                        int64             [1, 1, 1, 1, 1, 1, 1 ...
Client__B:                        int64             [0, 0, 0, 0, 0, 0, 0 ...

df_correlated = df_binarized.correlate(target='Member_Status__Platinum')
df_correlated.head(10)

	feature	bin	correlation
28	Member_Status	Platinum	1.000000
26	Member_Status	Bronze	-0.341351
29	Member_Status	Silver	-0.332799
27	Member_Status	Gold	-0.298637
30	Number_Children	0	0.205230
8	Average_Income	20131.0_40110.2	-0.156593
0	Age	18.0_29.0	-0.135522
11	Average_Income	79904.8_99131.0	0.115743
33	Number_Children	3	-0.112216
7	Years_Playing	22.0_29.0	-0.106763
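For intuition about the table above, correlating binary indicator columns against a binary target can be sketched with `DataFrame.corrwith`. This assumes a Pearson correlation and is not necessarily how `correlate` is implemented; the `binary` frame below is a hand-made toy, not output from `binarize`.

```python
import pandas as pd

# Hypothetical 0/1 indicator columns, for illustration only
binary = pd.DataFrame({
    "Member_Status__Platinum": [1, 0, 0, 1, 0, 1, 0, 0],
    "Member_Status__Gold":     [0, 1, 0, 0, 1, 0, 0, 1],
    "Number_Children__0":      [1, 0, 0, 1, 0, 0, 1, 0],
})

target = "Member_Status__Platinum"

# Pearson correlation of every column with the target column
corr = binary.corrwith(binary[target]).rename("correlation")

# Order by absolute correlation, strongest first
corr = corr.reindex(corr.abs().sort_values(ascending=False).index)

print(corr)
```

The target correlates perfectly with itself, which is why `Member_Status__Platinum` sits at the top of the real output with a correlation of 1.0.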

# Interactive
df_correlated.plot_correlation_funnel(
    engine='plotly',
    height=600
)

# Static
df_correlated.plot_correlation_funnel(
    engine='plotnine',
    height=900
)

<Figure Size: (700 x 900)>
