The binarize function prepares data for correlate, which is used for analyzing correlationfunnel plots.
Binarization does the following:
Takes in a pandas DataFrame or DataFrameGroupBy object, converts non-numeric columns to categorical,
Replaces boolean columns with integers,
Checks for data type and missing values,
fixes low cardinality numeric data,
fixes high skew numeric data, and
finally applies a transformation to create a new DataFrame with binarized data.
Parameters
Name
Type
Description
Default
data
DataFrame or GroupBy(pandas or polars)
The data parameter is the input data that you want to binarize. It can be either a pandas/polars DataFrame or a grouped object.
required
n_bins
int
The n_bins parameter specifies the number of bins to use when binarizing numeric data. It is used in the create_recipe function to determine the number of bins for each numeric column. pd.qcut() is used to bin the numeric data.
4
thresh_infreq
float
The thresh_infreq parameter is a float that represents the threshold for infrequent categories. Categories that have a frequency below this threshold will be grouped together and labeled with the name specified in the name_infreq parameter. By default, the threshold is set to 0.01.
0.01
name_infreq
str
The name_infreq parameter is used to specify the name that will be assigned to the category representing infrequent values in a column. This is applicable when performing binarization on non-numeric columns. By default, the name assigned is β-OTHERβ.
'-OTHER'
one_hot
bool
The one_hot parameter is a boolean flag that determines whether or not to perform one-hot encoding on the categorical variables after binarization. If one_hot is set to True, the categorical variables will be one-hot encoded, creating binary columns for each unique category.
True
engine
(pandas, polars, auto)
Execution engine. "pandas" (default) performs the computation using pandas. "polars" converts the result to a polars DataFrame on return. "auto" infers the engine from the input data.
"pandas"
Returns
Name
Type
Description
The function binarize returns the transformed data after applying various data preprocessing
steps such as converting non-numeric columns to categorical, replacing boolean columns with integers, fixing low cardinality numeric data, fixing high skew numeric data, and creating a recipe for binarization. The concrete DataFrame type matches the engine used to process the data.
See Also
correlate() : Calculates the correlation between a target variable and all other variables in a pandas DataFrame.
Examples
# NON-TIMESERIES EXAMPLE ----import pandas as pdimport numpy as npimport pytimetk as tk# Set a random seed for reproducibilitynp.random.seed(0)# Define the number of rows for your DataFramenum_rows =200# Create fake data for the columnsdata = {'Age': np.random.randint(18, 65, size=num_rows),'Gender': np.random.choice(['Male', 'Female'], size=num_rows),'Marital_Status': np.random.choice(['Single', 'Married', 'Divorced'], size=num_rows),'City': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami'], size=num_rows),'Years_Playing': np.random.randint(0, 30, size=num_rows),'Average_Income': np.random.randint(20000, 100000, size=num_rows),'Member_Status': np.random.choice(['Bronze', 'Silver', 'Gold', 'Platinum'], size=num_rows),'Number_Children': np.random.randint(0, 5, size=num_rows),'Own_House_Flag': np.random.choice([True, False], size=num_rows),'Own_Car_Count': np.random.randint(0, 3, size=num_rows),'PersonId': range(1, num_rows +1), # Add a PersonId column as a row count'Client': np.random.choice(['A', 'B'], size=num_rows) # Add a Client column with random values 'A' or 'B'}# Create a DataFramedf = pd.DataFrame(data)# Binarize the datadf_binarized = df.binarize(n_bins=4, thresh_infreq=0.01, name_infreq="-OTHER", one_hot=True)df_binarized.glimpse()