The data parameter is the input data that you want to binarize. It can be either a pandas DataFrame or a DataFrameGroupBy object.
required
n_bins
int
The n_bins parameter specifies the number of bins to use when binarizing numeric data. It is used in the create_recipe function to set the number of bins for each numeric column; the binning itself is done with pd.qcut().
4
thresh_infreq
float
The thresh_infreq parameter is the threshold for infrequent categories. Categories whose frequency falls below this threshold are grouped together and labeled with the name given in the name_infreq parameter. By default, the threshold is 0.01.
0.01
name_infreq
str
The name_infreq parameter specifies the name assigned to the category that represents infrequent values in a column. It applies when binarizing non-numeric columns. By default, the name assigned is '-OTHER'.
'-OTHER'
one_hot
bool
The one_hot parameter is a boolean flag that determines whether or not to perform one-hot encoding on the categorical variables after binarization. If one_hot is set to True, the categorical variables will be one-hot encoded, creating binary columns for each unique category.
True
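The parameters above can be illustrated with plain pandas. The following is a minimal sketch, not pytimetk's internal implementation: it assumes that thresh_infreq is compared against each category's relative frequency, and the variable names (age_binned, city_lumped, encoded) and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

# Small deterministic toy data (hypothetical, for illustration only)
df = pd.DataFrame({
    "Age": 18 + (np.arange(200) % 47),
    "City": ["New York"] * 120 + ["Chicago"] * 79 + ["Miami"] * 1,
})

n_bins = 4
thresh_infreq = 0.01
name_infreq = "-OTHER"

# n_bins: quantile-based binning of a numeric column with pd.qcut()
age_binned = pd.qcut(df["Age"], q=n_bins, duplicates="drop")

# thresh_infreq / name_infreq: lump rare categories under one label.
# Assumption: the threshold is a proportion of rows.
freq = df["City"].value_counts(normalize=True)
infrequent = freq[freq < thresh_infreq].index
city_lumped = df["City"].where(~df["City"].isin(infrequent), name_infreq)

# one_hot: one indicator column per bin or category
binned = pd.DataFrame({"Age": age_binned.astype(str), "City": city_lumped})
encoded = pd.get_dummies(binned, columns=["Age", "City"])

print(encoded.columns.tolist())
```

Here "Miami" appears in 1 of 200 rows (0.005), so it falls below the 0.01 threshold and is relabeled as '-OTHER' before encoding.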
Returns
Name
Type
Description
The function binarize returns the transformed data after applying various data preprocessing steps such as converting non-numeric columns to categorical, replacing boolean columns with integers, fixing low cardinality numeric data, fixing high skew numeric data, and creating a recipe for binarization.
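Two of the preprocessing steps named above, replacing boolean columns with integers and converting non-numeric columns to categorical, can be sketched in plain pandas. This is an illustrative approximation under assumed behavior, not the library's actual code; the toy DataFrame and the name prepared are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one boolean, one text, and one numeric column
df = pd.DataFrame({
    "Own_House_Flag": [True, False, True],
    "Gender": ["Male", "Female", "Female"],
    "Age": [25, 40, 31],
})

prepared = df.copy()

# Replace boolean columns with integers (True -> 1, False -> 0)
for col in prepared.select_dtypes(include="bool").columns:
    prepared[col] = prepared[col].astype(int)

# Convert the remaining non-numeric columns to the categorical dtype
for col in prepared.select_dtypes(exclude="number").columns:
    prepared[col] = prepared[col].astype("category")

print(prepared.dtypes)
```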
See Also
correlate() : Calculates the correlation between a target variable and all other variables in a pandas DataFrame.
Examples
# NON-TIMESERIES EXAMPLE ----
import pandas as pd
import numpy as np
import pytimetk as tk

# Set a random seed for reproducibility
np.random.seed(0)

# Define the number of rows for your DataFrame
num_rows = 200

# Create fake data for the columns
data = {
    'Age': np.random.randint(18, 65, size=num_rows),
    'Gender': np.random.choice(['Male', 'Female'], size=num_rows),
    'Marital_Status': np.random.choice(['Single', 'Married', 'Divorced'], size=num_rows),
    'City': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami'], size=num_rows),
    'Years_Playing': np.random.randint(0, 30, size=num_rows),
    'Average_Income': np.random.randint(20000, 100000, size=num_rows),
    'Member_Status': np.random.choice(['Bronze', 'Silver', 'Gold', 'Platinum'], size=num_rows),
    'Number_Children': np.random.randint(0, 5, size=num_rows),
    'Own_House_Flag': np.random.choice([True, False], size=num_rows),
    'Own_Car_Count': np.random.randint(0, 3, size=num_rows),
    'PersonId': range(1, num_rows + 1),  # Add a PersonId column as a row count
    'Client': np.random.choice(['A', 'B'], size=num_rows)  # Add a Client column with random values 'A' or 'B'
}

# Create a DataFrame
df = pd.DataFrame(data)

# Binarize the data
df_binarized = df.binarize(n_bins=4, thresh_infreq=0.01, name_infreq="-OTHER", one_hot=True)

df_binarized.glimpse()