correlate

correlate(data, target, method='pearson')

`correlate` calculates the correlation between a target column and every other column in a pandas DataFrame, and returns the results sorted by absolute correlation in descending order.
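As a rough illustration of the behavior described above, the sketch below reproduces the idea with plain pandas: correlate every numeric column with the target, then sort by absolute value. The helper name `correlate_sketch` and the toy columns `y`, `a`, and `b` are made up for this example. It is not pytimetk's implementation and may differ from it in details such as the handling of non-numeric columns or grouped data.

```python
import numpy as np
import pandas as pd


def correlate_sketch(data: pd.DataFrame, target: str, method: str = "pearson") -> pd.DataFrame:
    """Illustrative stand-in for correlate (hypothetical, not the library code)."""
    # Keep numeric columns only; correlation is undefined for raw strings.
    numeric = data.select_dtypes(include="number")
    # Take the target's column of the full correlation matrix.
    corr = numeric.corr(method=method)[target]
    out = corr.rename("correlation").rename_axis("feature").reset_index()
    # Sort by absolute value so strong negative correlations rank high too.
    return out.sort_values("correlation", key=lambda s: s.abs(), ascending=False)


# Toy data: `a` is strongly negatively related to `y`, `b` is pure noise.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
toy = pd.DataFrame({
    "b": rng.normal(size=200),
    "a": -y + 0.5 * rng.normal(size=200),
    "y": y,
})

result = correlate_sketch(toy, target="y")
print(result)
```

Sorting on the absolute value keeps a strong negative relationship (such as `a` here) near the top of the table instead of sending it to the bottom. The target itself appears first with a correlation of 1.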

Parameters

Name	Type	Description	Default
data	Union[pd.DataFrame, pd.core.groupby.generic.DataFrameGroupBy]	The input data: either a pandas DataFrame or a grouped DataFrame produced by a groupby operation.	required
target	str	The name of the column whose correlation with every other column is calculated.	required
method	str	The method used to calculate the correlation coefficient. One of: `pearson` (standard correlation coefficient), `kendall` (Kendall Tau correlation coefficient), `spearman` (Spearman rank correlation).	`'pearson'`
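To make the difference between the methods concrete, here is a small sketch in plain pandas (not pytimetk code) on made-up data where `y` is a strictly increasing but nonlinear function of `x`. Pearson measures linear association, while Spearman is Pearson applied to ranks, so it reaches 1 for any strictly increasing relationship. Kendall is left out to keep the sketch short.

```python
import pandas as pd

# Made-up data: y grows monotonically with x, but not linearly.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
toy["y"] = toy["x"] ** 3

# Pearson: penalized by the curvature, so it stays below 1.
pearson = toy["x"].corr(toy["y"])

# Spearman computed by hand: Pearson correlation of the ranks.
spearman_by_ranks = toy["x"].rank().corr(toy["y"].rank())

# Spearman through the pandas `method` option for comparison.
spearman_builtin = toy.corr(method="spearman").loc["x", "y"]

print(f"pearson:  {pearson:.4f}")
print(f"spearman: {spearman_by_ranks:.4f} (ranks), {spearman_builtin:.4f} (built-in)")
```

A rank-based method is often the safer choice when the relationship is expected to be monotonic but not linear, or when outliers are present.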

Returns

Name	Type	Description
	pd.DataFrame	A DataFrame of correlation results. The 'feature' column contains the names of the features in the input data, and the 'correlation' column contains the correlation coefficient between each feature and the target variable. For binarized input, as in the example on this page, a 'bin' column identifies the bin within each feature. The DataFrame is sorted in descending order by absolute correlation.
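The example output on this page lists a `bin` column next to `feature`, and the binarized column names follow a `Feature__bin` pattern. The sketch below shows one way such names could be separated into the two parts with plain pandas. Treating the double underscore as the separator is an assumption made for illustration, not a statement about how pytimetk does it internally.

```python
import pandas as pd

# Column names copied from the binarized example on this page.
names = pd.Series([
    "Age__18.0_29.0",
    "City__Los Angeles",
    "Member_Status__Platinum",
])

# Split once on the double underscore; single underscores inside the
# feature name or the bin label are left untouched.
parts = names.str.split("__", n=1, expand=True)
parts.columns = ["feature", "bin"]
print(parts)
```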

Examples

# NON-TIMESERIES EXAMPLE ----

import pandas as pd
import numpy as np
import pytimetk as tk

# Set a random seed for reproducibility
np.random.seed(0)

# Define the number of rows for your DataFrame
num_rows = 200

# Create fake data for the columns
data = {
    'Age': np.random.randint(18, 65, size=num_rows),
    'Gender': np.random.choice(['Male', 'Female'], size=num_rows),
    'Marital_Status': np.random.choice(['Single', 'Married', 'Divorced'], size=num_rows),
    'City': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami'], size=num_rows),
    'Years_Playing': np.random.randint(0, 30, size=num_rows),
    'Average_Income': np.random.randint(20000, 100000, size=num_rows),
    'Member_Status': np.random.choice(['Bronze', 'Silver', 'Gold', 'Platinum'], size=num_rows),
    'Number_Children': np.random.randint(0, 5, size=num_rows),
    'Own_House_Flag': np.random.choice([True, False], size=num_rows),
    'Own_Car_Count': np.random.randint(0, 3, size=num_rows),
    'PersonId': range(1, num_rows + 1),  # Add a PersonId column as a row count
    'Client': np.random.choice(['A', 'B'], size=num_rows)  # Add a Client column with random values 'A' or 'B'
}

# Create a DataFrame
df = pd.DataFrame(data)

# Binarize the data
df_binarized = df.binarize(n_bins=4, thresh_infreq=0.01, name_infreq="-OTHER", one_hot=True)

df_binarized.glimpse()

<class 'pandas.core.frame.DataFrame'>: 200 rows of 42 columns
Age__18.0_29.0:                   int64             [0, 1, 1, 1, 0, 1, 0 ...
Age__29.0_39.0:                   int64             [0, 0, 0, 0, 0, 0, 1 ...
Age__39.0_53.0:                   int64             [0, 0, 0, 0, 0, 0, 0 ...
Age__53.0_64.0:                   int64             [1, 0, 0, 0, 1, 0, 0 ...
Years_Playing__0.0_7.0:           int64             [0, 1, 0, 0, 0, 0, 0 ...
Years_Playing__7.0_15.0:          int64             [0, 0, 1, 0, 1, 0, 1 ...
Years_Playing__15.0_22.0:         int64             [1, 0, 0, 0, 0, 1, 0 ...
Years_Playing__22.0_29.0:         int64             [0, 0, 0, 1, 0, 0, 0 ...
Average_Income__20131.0_40110.2:  int64             [0, 0, 1, 0, 0, 0, 0 ...
Average_Income__40110.2_60649.5:  int64             [0, 0, 0, 1, 1, 0, 1 ...
Average_Income__60649.5_79904.8:  int64             [0, 1, 0, 0, 0, 0, 0 ...
Average_Income__79904.8_99131.0:  int64             [1, 0, 0, 0, 0, 1, 0 ...
PersonId__1.0_50.8:               int64             [1, 1, 1, 1, 1, 1, 1 ...
PersonId__50.8_100.5:             int64             [0, 0, 0, 0, 0, 0, 0 ...
PersonId__100.5_150.2:            int64             [0, 0, 0, 0, 0, 0, 0 ...
PersonId__150.2_200.0:            int64             [0, 0, 0, 0, 0, 0, 0 ...
Gender__Female:                   int64             [1, 0, 0, 0, 1, 0, 1 ...
Gender__Male:                     int64             [0, 1, 1, 1, 0, 1, 0 ...
Marital_Status__Divorced:         int64             [0, 0, 0, 0, 0, 0, 0 ...
Marital_Status__Married:          int64             [1, 1, 0, 0, 1, 0, 0 ...
Marital_Status__Single:           int64             [0, 0, 1, 1, 0, 1, 1 ...
City__Chicago:                    int64             [0, 0, 1, 0, 0, 1, 0 ...
City__Houston:                    int64             [0, 0, 0, 0, 0, 0, 1 ...
City__Los Angeles:                int64             [0, 0, 0, 0, 0, 0, 0 ...
City__Miami:                      int64             [0, 1, 0, 0, 0, 0, 0 ...
City__New York:                   int64             [1, 0, 0, 1, 1, 0, 0 ...
Member_Status__Bronze:            int64             [1, 0, 1, 0, 0, 0, 0 ...
Member_Status__Gold:              int64             [0, 0, 0, 0, 0, 1, 1 ...
Member_Status__Platinum:          int64             [0, 0, 0, 1, 0, 0, 0 ...
Member_Status__Silver:            int64             [0, 1, 0, 0, 1, 0, 0 ...
Number_Children__0:               int64             [0, 0, 1, 0, 0, 0, 0 ...
Number_Children__1:               int64             [0, 0, 0, 0, 0, 0, 1 ...
Number_Children__2:               int64             [0, 0, 0, 1, 0, 0, 0 ...
Number_Children__3:               int64             [0, 1, 0, 0, 0, 1, 0 ...
Number_Children__4:               int64             [1, 0, 0, 0, 1, 0, 0 ...
Own_House_Flag__0:                int64             [1, 1, 0, 0, 1, 0, 1 ...
Own_House_Flag__1:                int64             [0, 0, 1, 1, 0, 1, 0 ...
Own_Car_Count__0:                 int64             [0, 1, 0, 0, 1, 0, 0 ...
Own_Car_Count__1:                 int64             [0, 0, 0, 1, 0, 1, 1 ...
Own_Car_Count__2:                 int64             [1, 0, 1, 0, 0, 0, 0 ...
Client__A:                        int64             [1, 1, 1, 1, 1, 1, 1 ...
Client__B:                        int64             [0, 0, 0, 0, 0, 0, 0 ...

df_correlated = df_binarized.correlate(target='Member_Status__Platinum')
df_correlated

	feature	bin	correlation
28	Member_Status	Platinum	1.000000
26	Member_Status	Bronze	-0.341351
29	Member_Status	Silver	-0.332799
27	Member_Status	Gold	-0.298637
30	Number_Children	0	0.205230
8	Average_Income	20131.0_40110.2	-0.156593
0	Age	18.0_29.0	-0.135522
11	Average_Income	79904.8_99131.0	0.115743
33	Number_Children	3	-0.112216
7	Years_Playing	22.0_29.0	-0.106763
19	Marital_Status	Married	-0.104562
41	Client	B	0.103842
40	Client	A	-0.103842
9	Average_Income	40110.2_60649.5	0.088509
12	PersonId	1.0_50.8	0.088509
38	Own_Car_Count	1	0.087769
22	City	Houston	0.086124
13	PersonId	50.8_100.5	-0.074892
2	Age	39.0_53.0	0.074739
39	Own_Car_Count	2	-0.071738
31	Number_Children	1	-0.069054
25	City	New York	-0.055757
18	Marital_Status	Divorced	0.055724
1	Age	29.0_39.0	0.054374
20	Marital_Status	Single	0.050286
34	Number_Children	4	-0.047760
10	Average_Income	60649.5_79904.8	-0.047659
15	PersonId	150.2_200.0	-0.047659
5	Years_Playing	7.0_15.0	0.040717
14	PersonId	100.5_150.2	0.034042
6	Years_Playing	15.0_22.0	0.034042
21	City	Chicago	-0.032799
4	Years_Playing	0.0_7.0	0.028391
16	Gender	Female	0.020215
17	Gender	Male	-0.020215
35	Own_House_Flag	0	0.017336
36	Own_House_Flag	1	-0.017336
37	Own_Car_Count	0	-0.016373
3	Age	53.0_64.0	0.012002
24	City	Miami	0.010662
23	City	Los Angeles	-0.004911
32	Number_Children	2	0.002104
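A result table like the one above can be narrowed down with ordinary pandas filtering. The sketch below rebuilds a few rows from that output by hand (so it runs without pytimetk) and keeps only the stronger relationships. The 0.1 cutoff and the variable names are arbitrary choices for illustration.

```python
import pandas as pd

# A handful of rows copied from the output above.
df_subset = pd.DataFrame({
    "feature": ["Member_Status", "Member_Status", "Number_Children",
                "Average_Income", "City", "Number_Children"],
    "bin": ["Platinum", "Bronze", "0", "20131.0_40110.2", "Miami", "2"],
    "correlation": [1.000000, -0.341351, 0.205230, -0.156593, 0.010662, 0.002104],
})

# Drop the target's own feature: one-hot levels of the same variable are
# mutually exclusive, so they correlate negatively by construction.
not_target = df_subset["feature"] != "Member_Status"

# Keep rows whose absolute correlation clears the (arbitrary) cutoff.
strong = df_subset[not_target & (df_subset["correlation"].abs() >= 0.1)]
print(strong)
```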

# Interactive
df_correlated.plot_correlation_funnel(
    engine='plotly',
    height=400
)

# Static
fig = df_correlated.plot_correlation_funnel(
    engine='plotnine',
    height=600
)
fig

<Figure Size: (700 x 600)>
