The correlate function calculates the correlation between a target variable and every other variable in a pandas or polars DataFrame, and returns the results sorted by absolute correlation in descending order.
Parameters
| Name | Type | Description | Default |
|------|------|-------------|---------|
| data | DataFrame or GroupBy (pandas or polars) | The input data to calculate correlations for. It can be either a pandas/polars DataFrame or a grouped DataFrame obtained from a groupby operation. | required |
| target | str | The name of the column in the DataFrame whose correlation with the other columns is calculated. | required |
| method | str | The method used to calculate the correlation coefficient. Available options: `pearson` (standard correlation coefficient), `kendall` (Kendall Tau correlation coefficient), `spearman` (Spearman rank correlation). | 'pearson' |
| engine | "pandas", "polars", or "auto" | Execution engine. "pandas" (default) performs the computation using pandas. "polars" converts the result to a polars DataFrame on return. "auto" infers the engine from the input data. | "pandas" |
Returns
| Name | Type | Description |
|------|------|-------------|
|  |  | The function correlate returns a DataFrame with two columns: 'feature' and 'correlation'. The 'feature' column contains the names of the features in the input data, and the 'correlation' column contains the correlation coefficients between each feature and the target variable. The DataFrame is sorted in descending order based on the absolute correlation values. The concrete type matches the engine used to process the data. |
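To make the return shape concrete, the following is a minimal sketch written with plain pandas. The helper `correlate_sketch` and the toy column names are hypothetical and only imitate the documented output (a 'feature' column and a 'correlation' column, ordered by absolute value); it is not the library's implementation, and keeping the target's own row in the output is an assumption of this sketch.

```python
import pandas as pd


def correlate_sketch(data: pd.DataFrame, target: str, method: str = "pearson") -> pd.DataFrame:
    """Hypothetical stand-in that mimics the documented output shape."""
    # Correlation of every column with the target column
    corr = data.corr(method=method)[target]
    # Turn the Series into a two-column frame: 'feature' and 'correlation'
    out = corr.rename("correlation").rename_axis("feature").reset_index()
    # Order rows by absolute correlation, largest first
    order = out["correlation"].abs().sort_values(ascending=False).index
    return out.loc[order].reset_index(drop=True)


# Small, fully numeric toy data (assumed for illustration only)
toy = pd.DataFrame(
    {
        "weak_pos": [3, 1, 4, 1, 5, 9, 2, 6],
        "target": [1, 2, 3, 4, 5, 6, 7, 8],
        "strong_neg": [-1.0, -2.1, -2.9, -4.2, -5.0, -6.1, -6.8, -8.0],
    }
)

result = correlate_sketch(toy, target="target")
print(result)
```

Note that the strongly negative feature ranks above the weakly positive one, because the ordering uses the absolute value of the coefficient rather than its sign.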
See Also
binarize() : Prepares data for correlate, which is used for analyzing correlation funnel plots.
Examples
# NON-TIMESERIES EXAMPLE ----
import pandas as pd
import numpy as np
import pytimetk as tk

# Set a random seed for reproducibility
np.random.seed(0)

# Define the number of rows for your DataFrame
num_rows = 200

# Create fake data for the columns
data = {
    'Age': np.random.randint(18, 65, size=num_rows),
    'Gender': np.random.choice(['Male', 'Female'], size=num_rows),
    'Marital_Status': np.random.choice(['Single', 'Married', 'Divorced'], size=num_rows),
    'City': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami'], size=num_rows),
    'Years_Playing': np.random.randint(0, 30, size=num_rows),
    'Average_Income': np.random.randint(20000, 100000, size=num_rows),
    'Member_Status': np.random.choice(['Bronze', 'Silver', 'Gold', 'Platinum'], size=num_rows),
    'Number_Children': np.random.randint(0, 5, size=num_rows),
    'Own_House_Flag': np.random.choice([True, False], size=num_rows),
    'Own_Car_Count': np.random.randint(0, 3, size=num_rows),
    'PersonId': range(1, num_rows + 1),  # Add a PersonId column as a row count
    'Client': np.random.choice(['A', 'B'], size=num_rows)  # Add a Client column with random values 'A' or 'B'
}

# Create a DataFrame
df = pd.DataFrame(data)

# Binarize the data
df_binarized = df.binarize(n_bins=4, thresh_infreq=0.01, name_infreq="-OTHER", one_hot=True)

df_binarized.glimpse()
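The choice of `method` matters when a relationship is monotonic but not linear. The sketch below illustrates this with plain pandas `DataFrame.corr`, not with correlate itself; the toy columns `x` and `y` are assumptions made for illustration. Only `pearson` and `spearman` are shown.

```python
import pandas as pd

# Toy data: y grows monotonically with x, but not linearly
df_toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
df_toy["y"] = df_toy["x"] ** 3

# Pearson measures linear association on the raw values
pearson = df_toy.corr(method="pearson").loc["x", "y"]

# Spearman measures association between the ranks of the values
spearman = df_toy.corr(method="spearman").loc["x", "y"]

print(f"pearson:  {pearson:.3f}")
print(f"spearman: {spearman:.3f}")
```

Because the ranks of `x` and `y` agree exactly, the Spearman coefficient is 1, while the Pearson coefficient is below 1 since the relationship is curved. A rank-based method may therefore be the better fit when features relate to the target monotonically but not linearly.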