correlate

correlate(data, target, method='pearson')

The correlate function calculates the correlation between a target variable and all other variables in a pandas DataFrame, and returns the results sorted by absolute correlation in descending order.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | Union[pd.DataFrame, pd.core.groupby.generic.DataFrameGroupBy] | Input data to compute correlations on: either a pandas DataFrame or a grouped DataFrame produced by a groupby operation. | required |
| target | str | Name of the column whose correlation with the other columns is computed. | required |
| method | str | Method used to compute the correlation coefficient. Options: pearson (standard correlation coefficient), kendall (Kendall Tau correlation coefficient), spearman (Spearman rank correlation). | 'pearson' |
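
The three method options can give different answers on the same data. The following is a minimal, dependency-free sketch of that difference, not pytimetk's implementation: the helper names `pearson`, `spearman`, and `kendall` are hypothetical, and the two rank-based helpers assume there are no tied values.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Standard (linear) correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """Rank values from 1..n (assumption: no ties)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall(x, y):
    """Kendall Tau (tau-a form, assumption: no ties)."""
    pairs = list(combinations(range(len(x)), 2))
    score = 0
    for i, j in pairs:
        d = (x[i] - x[j]) * (y[i] - y[j])
        score += (d > 0) - (d < 0)  # +1 concordant pair, -1 discordant pair
    return score / len(pairs)


# A monotonic but nonlinear relationship: y = x ** 3
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

print("pearson :", pearson(x, y))
print("spearman:", spearman(x, y))
print("kendall :", kendall(x, y))
```

Because y increases whenever x does, the two rank-based coefficients come out at 1 (up to floating-point rounding), while the Pearson coefficient is lower because the relationship is not linear.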

Returns

| Type | Description |
| --- | --- |
| pd.DataFrame | A DataFrame with three columns: 'feature', 'bin', and 'correlation'. The 'feature' column holds the feature name, the 'bin' column holds the bin or category label of the binarized column, and the 'correlation' column holds the correlation coefficient between that column and the target. Rows are sorted by absolute correlation in descending order. |
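
The ranking step can be pictured with a small, dependency-free sketch. It is an illustration under stated assumptions rather than the library's code: `pearson` and `rank_by_abs_correlation` are hypothetical helpers, the toy columns are made up, and splitting a binarized column name at the first double underscore into 'feature' and 'bin' is inferred from the column names in the example output on this page.

```python
from math import sqrt


def pearson(x, y):
    """Standard (linear) correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def rank_by_abs_correlation(columns, target):
    """Correlate every column with the target, strongest relationship first."""
    rows = []
    for name, values in columns.items():
        # Assumption: binarized names look like '<feature>__<bin>'
        feature, _, bin_label = name.partition("__")
        rows.append({
            "feature": feature,
            "bin": bin_label,
            "correlation": pearson(values, columns[target]),
        })
    # Sort on magnitude so strong negative correlations rank high too
    return sorted(rows, key=lambda row: abs(row["correlation"]), reverse=True)


# Toy binarized data (made up for illustration)
columns = {
    "Age__18_30":     [1, 1, 0, 1, 0, 0],
    "City__Miami":    [0, 0, 0, 0, 0, 1],
    "Status__Silver": [0, 1, 0, 0, 1, 0],
    "Status__Gold":   [1, 0, 0, 1, 0, 1],
}

ranked = rank_by_abs_correlation(columns, target="Status__Gold")
for row in ranked:
    print(row["feature"], row["bin"], round(row["correlation"], 3))
```

Sorting on the absolute value is what places a strongly negative feature directly below the target itself, ahead of weaker positive ones.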

See Also

  • binarize() : Prepares data for correlate, which is used for analyzing correlation funnel plots.

Examples

# NON-TIMESERIES EXAMPLE ----

import pandas as pd
import numpy as np
import pytimetk as tk

# Set a random seed for reproducibility
np.random.seed(0)

# Define the number of rows for your DataFrame
num_rows = 200

# Create fake data for the columns
data = {
    'Age': np.random.randint(18, 65, size=num_rows),
    'Gender': np.random.choice(['Male', 'Female'], size=num_rows),
    'Marital_Status': np.random.choice(['Single', 'Married', 'Divorced'], size=num_rows),
    'City': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami'], size=num_rows),
    'Years_Playing': np.random.randint(0, 30, size=num_rows),
    'Average_Income': np.random.randint(20000, 100000, size=num_rows),
    'Member_Status': np.random.choice(['Bronze', 'Silver', 'Gold', 'Platinum'], size=num_rows),
    'Number_Children': np.random.randint(0, 5, size=num_rows),
    'Own_House_Flag': np.random.choice([True, False], size=num_rows),
    'Own_Car_Count': np.random.randint(0, 3, size=num_rows),
    'PersonId': range(1, num_rows + 1),  # Add a PersonId column as a row count
    'Client': np.random.choice(['A', 'B'], size=num_rows)  # Add a Client column with random values 'A' or 'B'
}

# Create a DataFrame
df = pd.DataFrame(data)

# Binarize the data
df_binarized = df.binarize(n_bins=4, thresh_infreq=0.01, name_infreq="-OTHER", one_hot=True)

df_binarized.glimpse()    
<class 'pandas.core.frame.DataFrame'>: 200 rows of 42 columns
Age__18.0_29.0:                   uint8             [0, 1, 1, 1, 0, 1, 0 ...
Age__29.0_39.0:                   uint8             [0, 0, 0, 0, 0, 0, 1 ...
Age__39.0_53.0:                   uint8             [0, 0, 0, 0, 0, 0, 0 ...
Age__53.0_64.0:                   uint8             [1, 0, 0, 0, 1, 0, 0 ...
Years_Playing__0.0_7.0:           uint8             [0, 1, 0, 0, 0, 0, 0 ...
Years_Playing__7.0_15.0:          uint8             [0, 0, 1, 0, 1, 0, 1 ...
Years_Playing__15.0_22.0:         uint8             [1, 0, 0, 0, 0, 1, 0 ...
Years_Playing__22.0_29.0:         uint8             [0, 0, 0, 1, 0, 0, 0 ...
Average_Income__20131.0_40110.2:  uint8             [0, 0, 1, 0, 0, 0, 0 ...
Average_Income__40110.2_60649.5:  uint8             [0, 0, 0, 1, 1, 0, 1 ...
Average_Income__60649.5_79904.8:  uint8             [0, 1, 0, 0, 0, 0, 0 ...
Average_Income__79904.8_99131.0:  uint8             [1, 0, 0, 0, 0, 1, 0 ...
PersonId__1.0_50.8:               uint8             [1, 1, 1, 1, 1, 1, 1 ...
PersonId__50.8_100.5:             uint8             [0, 0, 0, 0, 0, 0, 0 ...
PersonId__100.5_150.2:            uint8             [0, 0, 0, 0, 0, 0, 0 ...
PersonId__150.2_200.0:            uint8             [0, 0, 0, 0, 0, 0, 0 ...
Gender__Female:                   uint8             [1, 0, 0, 0, 1, 0, 1 ...
Gender__Male:                     uint8             [0, 1, 1, 1, 0, 1, 0 ...
Marital_Status__Divorced:         uint8             [0, 0, 0, 0, 0, 0, 0 ...
Marital_Status__Married:          uint8             [1, 1, 0, 0, 1, 0, 0 ...
Marital_Status__Single:           uint8             [0, 0, 1, 1, 0, 1, 1 ...
City__Chicago:                    uint8             [0, 0, 1, 0, 0, 1, 0 ...
City__Houston:                    uint8             [0, 0, 0, 0, 0, 0, 1 ...
City__Los Angeles:                uint8             [0, 0, 0, 0, 0, 0, 0 ...
City__Miami:                      uint8             [0, 1, 0, 0, 0, 0, 0 ...
City__New York:                   uint8             [1, 0, 0, 1, 1, 0, 0 ...
Member_Status__Bronze:            uint8             [1, 0, 1, 0, 0, 0, 0 ...
Member_Status__Gold:              uint8             [0, 0, 0, 0, 0, 1, 1 ...
Member_Status__Platinum:          uint8             [0, 0, 0, 1, 0, 0, 0 ...
Member_Status__Silver:            uint8             [0, 1, 0, 0, 1, 0, 0 ...
Number_Children__0:               uint8             [0, 0, 1, 0, 0, 0, 0 ...
Number_Children__1:               uint8             [0, 0, 0, 0, 0, 0, 1 ...
Number_Children__2:               uint8             [0, 0, 0, 1, 0, 0, 0 ...
Number_Children__3:               uint8             [0, 1, 0, 0, 0, 1, 0 ...
Number_Children__4:               uint8             [1, 0, 0, 0, 1, 0, 0 ...
Own_House_Flag__0:                uint8             [1, 1, 0, 0, 1, 0, 1 ...
Own_House_Flag__1:                uint8             [0, 0, 1, 1, 0, 1, 0 ...
Own_Car_Count__0:                 uint8             [0, 1, 0, 0, 1, 0, 0 ...
Own_Car_Count__1:                 uint8             [0, 0, 0, 1, 0, 1, 1 ...
Own_Car_Count__2:                 uint8             [1, 0, 1, 0, 0, 0, 0 ...
Client__A:                        uint8             [1, 1, 1, 1, 1, 1, 1 ...
Client__B:                        uint8             [0, 0, 0, 0, 0, 0, 0 ...
df_correlated = df_binarized.correlate(target='Member_Status__Platinum')
df_correlated
feature bin correlation
28 Member_Status Platinum 1.000000
26 Member_Status Bronze -0.341351
29 Member_Status Silver -0.332799
27 Member_Status Gold -0.298637
30 Number_Children 0 0.205230
8 Average_Income 20131.0_40110.2 -0.156593
0 Age 18.0_29.0 -0.135522
11 Average_Income 79904.8_99131.0 0.115743
33 Number_Children 3 -0.112216
7 Years_Playing 22.0_29.0 -0.106763
19 Marital_Status Married -0.104562
41 Client B 0.103842
40 Client A -0.103842
9 Average_Income 40110.2_60649.5 0.088509
12 PersonId 1.0_50.8 0.088509
38 Own_Car_Count 1 0.087769
22 City Houston 0.086124
13 PersonId 50.8_100.5 -0.074892
2 Age 39.0_53.0 0.074739
39 Own_Car_Count 2 -0.071738
31 Number_Children 1 -0.069054
25 City New York -0.055757
18 Marital_Status Divorced 0.055724
1 Age 29.0_39.0 0.054374
20 Marital_Status Single 0.050286
34 Number_Children 4 -0.047760
15 PersonId 150.2_200.0 -0.047659
10 Average_Income 60649.5_79904.8 -0.047659
5 Years_Playing 7.0_15.0 0.040717
14 PersonId 100.5_150.2 0.034042
6 Years_Playing 15.0_22.0 0.034042
21 City Chicago -0.032799
4 Years_Playing 0.0_7.0 0.028391
16 Gender Female 0.020215
17 Gender Male -0.020215
35 Own_House_Flag 0 0.017336
36 Own_House_Flag 1 -0.017336
37 Own_Car_Count 0 -0.016373
3 Age 53.0_64.0 0.012002
24 City Miami 0.010662
23 City Los Angeles -0.004911
32 Number_Children 2 0.002104
# Interactive
df_correlated.plot_correlation_funnel(
    engine='plotly', 
    height=400
)
# Static
fig = df_correlated.plot_correlation_funnel(
    engine='plotnine',
    height=600
)
fig

<Figure Size: (700 x 600)>