Correlation Funnel

1 Here’s what you make in this Guide:

We will demonstrate how Correlation Funnel to analyze Expedia Hotel Bookings and which features correlate to a customer making a booking through their website:

Correlation Funnel

2 About Correlation Funnel

Correlationfunnel is designed for getting quick insights during the exploratory data analysis phase of data science projects. It’s not specific to time series, although it can help in time series analysis. We’ll see how we can use the correlation funnel workflow to identify the following quick business decisions:

  • We can begin to make business decisions like focusing less on packages and increasing advertising on continent ID 2 with trips to high-conversion search destinations.

  • These actions will help increase Return on Advertising Spend (ROAS) and Return on Investment (ROI). All from a simple Correlation Funnel Visualization.

3 Correlation Funnel Workflow in Python

Pytimetk uses the following methods for applying the Correlation Funnel:

  1. binarize(): Used to convert categorical, string and numeric data to binary 0/1 data.
  2. correlate(): Used to correlate all binary features to one of the “target” binary features.
  3. plot_correlation_funnel(): Used to visualize the correlations as a funnel, where the top features typically form a funnel shape.

3.1 Setup

To set up, import the following packages and the expedia_df dataset, Expedia Hotel Time Series Dataset.

# Libraries
import pandas as pd 
import pytimetk as tk

# Data
expedia_df = tk.load_dataset("expedia", parse_dates = ['date_time'])
expedia_df.glimpse()
<class 'pandas.core.frame.DataFrame'>: 100000 rows of 24 columns
date_time:                  datetime64[ns]    [Timestamp('2013-07-25 17: ...
site_name:                  int64             [2, 2, 2, 2, 2, 37, 2, 2,  ...
posa_continent:             int64             [3, 3, 3, 3, 3, 1, 3, 3, 3 ...
user_location_country:      int64             [66, 66, 66, 66, 66, 69, 6 ...
user_location_region:       int64             [174, 174, 174, 220, 351,  ...
user_location_city:         int64             [35675, 31320, 16292, 1760 ...
orig_destination_distance:  float64           [0.1203, 108.2251, 763.142 ...
user_id:                    int64             [44735, 794319, 761732, 69 ...
is_mobile:                  int64             [0, 0, 1, 0, 0, 0, 0, 0, 0 ...
is_package:                 int64             [0, 0, 0, 0, 0, 0, 0, 0, 0 ...
channel:                    int64             [9, 3, 1, 9, 1, 9, 9, 9, 9 ...
srch_ci:                    object            ['2013-07-26', '2014-11-27 ...
srch_co:                    object            ['2013-07-27', '2014-11-29 ...
srch_adults_cnt:            int64             [1, 2, 2, 2, 2, 2, 2, 2, 2 ...
srch_children_cnt:          int64             [0, 0, 0, 0, 0, 0, 0, 0, 0 ...
srch_rm_cnt:                int64             [1, 1, 1, 1, 1, 1, 1, 1, 1 ...
srch_destination_id:        int64             [5465, 11620, 23808, 40658 ...
srch_destination_type_id:   int64             [3, 1, 6, 5, 1, 6, 1, 5, 6 ...
is_booking:                 int64             [1, 0, 0, 0, 0, 0, 0, 0, 0 ...
cnt:                        int64             [1, 2, 3, 1, 2, 7, 1, 1, 1 ...
hotel_continent:            int64             [2, 2, 2, 2, 2, 6, 4, 2, 4 ...
hotel_country:              int64             [50, 50, 50, 50, 50, 204,  ...
hotel_market:               int64             [1230, 369, 1144, 930, 637 ...
hotel_cluster:              int64             [47, 83, 93, 48, 33, 15, 9 ...

3.2 Data Preparation

To prepare the dataset, we will first perform data preparation:

  1. Add time series features based on the date_time timestamp column.
  2. We will drop any zero variance features
  3. Drop additional columns that are not an acceptable data type (i.e. not numeric, categorical, or string) or contain missing values
  4. Convert numeric columns that start with “hotel_” that are actually categorical “ID” columns to string
expedia_ts_features_df = expedia_df \
    .augment_timeseries_signature('date_time') \
    .drop_zero_variance() \
    .drop(columns=['date_time', 'orig_destination_distance', 'srch_ci', 'srch_co']) \
    .transform_columns(
        columns = [r"hotel_.*"],
        transform_func = lambda x: x.astype(str)
    )
    
expedia_ts_features_df.glimpse()
<class 'pandas.core.frame.DataFrame'>: 100000 rows of 46 columns
site_name:                 int64             [2, 2, 2, 2, 2, 37, 2, 2, 2 ...
posa_continent:            int64             [3, 3, 3, 3, 3, 1, 3, 3, 3, ...
user_location_country:     int64             [66, 66, 66, 66, 66, 69, 66 ...
user_location_region:      int64             [174, 174, 174, 220, 351, 7 ...
user_location_city:        int64             [35675, 31320, 16292, 17605 ...
user_id:                   int64             [44735, 794319, 761732, 696 ...
is_mobile:                 int64             [0, 0, 1, 0, 0, 0, 0, 0, 0, ...
is_package:                int64             [0, 0, 0, 0, 0, 0, 0, 0, 0, ...
channel:                   int64             [9, 3, 1, 9, 1, 9, 9, 9, 9, ...
srch_adults_cnt:           int64             [1, 2, 2, 2, 2, 2, 2, 2, 2, ...
srch_children_cnt:         int64             [0, 0, 0, 0, 0, 0, 0, 0, 0, ...
srch_rm_cnt:               int64             [1, 1, 1, 1, 1, 1, 1, 1, 1, ...
srch_destination_id:       int64             [5465, 11620, 23808, 40658, ...
srch_destination_type_id:  int64             [3, 1, 6, 5, 1, 6, 1, 5, 6, ...
is_booking:                int64             [1, 0, 0, 0, 0, 0, 0, 0, 0, ...
cnt:                       int64             [1, 2, 3, 1, 2, 7, 1, 1, 1, ...
hotel_continent:           object            ['2', '2', '2', '2', '2', ' ...
hotel_country:             object            ['50', '50', '50', '50', '5 ...
hotel_market:              object            ['1230', '369', '1144', '93 ...
hotel_cluster:             object            ['47', '83', '93', '48', '3 ...
date_time_index_num:       int64             [1374773055, 1414939784, 14 ...
date_time_year:            int64             [2013, 2014, 2014, 2014, 20 ...
date_time_year_iso:        UInt32            [2013, 2014, 2014, 2014, 20 ...
date_time_yearstart:       uint8             [0, 0, 0, 0, 0, 0, 0, 0, 0, ...
date_time_yearend:         uint8             [0, 0, 0, 0, 0, 0, 0, 0, 0, ...
date_time_half:            int64             [2, 2, 1, 1, 2, 2, 1, 2, 1, ...
date_time_quarter:         int64             [3, 4, 2, 1, 3, 4, 1, 3, 2, ...
date_time_quarteryear:     object            ['2013Q3', '2014Q4', '2014Q ...
date_time_quarterstart:    uint8             [0, 0, 0, 0, 0, 0, 0, 0, 0, ...
date_time_quarterend:      uint8             [0, 0, 0, 0, 0, 0, 0, 0, 0, ...
date_time_month:           int64             [7, 11, 5, 2, 8, 12, 3, 9,  ...
date_time_month_lbl:       object            ['July', 'November', 'May', ...
date_time_monthstart:      uint8             [0, 0, 0, 0, 0, 0, 0, 0, 0, ...
date_time_monthend:        uint8             [0, 0, 0, 0, 0, 0, 0, 0, 0, ...
date_time_yweek:           UInt32            [30, 44, 21, 9, 33, 50, 12, ...
date_time_mweek:           int64             [4, 1, 4, 4, 2, 2, 3, 3, 2, ...
date_time_wday:            int64             [4, 7, 4, 3, 3, 2, 2, 1, 4, ...
date_time_wday_lbl:        object            ['Thursday', 'Sunday', 'Thu ...
date_time_mday:            int64             [25, 2, 22, 26, 13, 9, 18,  ...
date_time_qday:            int64             [25, 33, 52, 57, 44, 70, 77 ...
date_time_yday:            int64             [206, 306, 142, 57, 225, 34 ...
date_time_weekend:         int64             [0, 1, 0, 0, 0, 0, 0, 0, 0, ...
date_time_hour:            int64             [17, 14, 12, 14, 11, 7, 21, ...
date_time_minute:          int64             [24, 49, 50, 1, 15, 21, 40, ...
date_time_second:          int64             [15, 44, 53, 2, 40, 31, 29, ...
date_time_am_pm:           object            ['pm', 'pm', 'am', 'pm', 'a ...

3.3 3-Step Correlation Funnel Workflow

Next, we will perform the Correlation Funnel workflow to explore the Expedia Hotel Time Series dataset. There are 3 steps:

  1. Binarize: Convert the data to binary 0/1
  2. Correlate: Detect relationships between the binary features and one of the columns (called the target)
  3. Visualize the Correlation Funnel: Plotting allows us to assess the top features and their relationship to the target.

Step 1: Binarize

Use binarize() to convert the raw data to binary 0/1. Binarization happens as follows:

  1. Numeric Data: Numeric data is Quantile Binned using the pd.qcut() function. The default is 4 bins, which bins numeric data into a maximum of 4 discrete bins. Fewer bins can be returned if there is insufficient data for 4 bins. The number of bins is controlled with the n_bins parameter.

  2. Categorical / String Data: Categorical data is first processed to determine the most frequent categories. Categories that are sparse are lumped into an “OTHER” category. The lumping can be controlled with the thresh_infreq.

expedia_ts_binarized_df = expedia_ts_features_df.binarize(thresh_infreq = 0.05)

expedia_ts_binarized_df.glimpse()
<class 'pandas.core.frame.DataFrame'>: 100000 rows of 155 columns
site_name__2.0_15.0:                             uint8             [1, 1 ...
site_name__15.0_53.0:                            uint8             [0, 0 ...
user_location_country__0.0_66.0:                 uint8             [1, 1 ...
user_location_country__66.0_71.0:                uint8             [0, 0 ...
user_location_country__71.0_239.0:               uint8             [0, 0 ...
user_location_region__0.0_174.0:                 uint8             [1, 1 ...
user_location_region__174.0_314.0:               uint8             [0, 0 ...
user_location_region__314.0_385.0:               uint8             [0, 0 ...
user_location_region__385.0_1021.0:              uint8             [0, 0 ...
user_location_city__0.0_13087.0:                 uint8             [0, 0 ...
user_location_city__13087.0_27655.0:             uint8             [0, 0 ...
user_location_city__27655.0_42563.0:             uint8             [1, 1 ...
user_location_city__42563.0_56507.0:             uint8             [0, 0 ...
user_id__13.0_299759.8:                          uint8             [1, 0 ...
user_id__299759.8_605161.5:                      uint8             [0, 0 ...
user_id__605161.5_911811.5:                      uint8             [0, 1 ...
user_id__911811.5_1198780.0:                     uint8             [0, 0 ...
channel__0.0_2.0:                                uint8             [0, 0 ...
channel__2.0_9.0:                                uint8             [1, 1 ...
channel__9.0_10.0:                               uint8             [0, 0 ...
srch_adults_cnt__0.0_2.0:                        uint8             [1, 1 ...
srch_adults_cnt__2.0_9.0:                        uint8             [0, 0 ...
srch_children_cnt__0.0_9.0:                      uint8             [1, 1 ...
srch_rm_cnt__0.0_1.0:                            uint8             [1, 1 ...
srch_rm_cnt__1.0_8.0:                            uint8             [0, 0 ...
srch_destination_id__1.0_8267.0:                 uint8             [1, 0 ...
srch_destination_id__8267.0_9147.0:              uint8             [0, 0 ...
srch_destination_id__9147.0_18998.0:             uint8             [0, 1 ...
srch_destination_id__18998.0_65104.0:            uint8             [0, 0 ...
srch_destination_type_id__1.0_5.0:               uint8             [1, 1 ...
srch_destination_type_id__5.0_9.0:               uint8             [0, 0 ...
cnt__1.0_2.0:                                    uint8             [1, 1 ...
cnt__2.0_72.0:                                   uint8             [0, 0 ...
date_time_index_num__1357516842.0_1382867237.5:  uint8             [1, 0 ...
date_time_index_num__1382867237.5_1401387689.0:  uint8             [0, 0 ...
date_time_index_num__1401387689.0_1410981206.0:  uint8             [0, 0 ...
date_time_index_num__1410981206.0_1420070302.0:  uint8             [0, 1 ...
date_time_month__1.0_5.0:                        uint8             [0, 0 ...
date_time_month__5.0_7.0:                        uint8             [1, 0 ...
date_time_month__7.0_10.0:                       uint8             [0, 0 ...
date_time_month__10.0_12.0:                      uint8             [0, 1 ...
date_time_yweek__1.0_17.0:                       uint8             [0, 0 ...
date_time_yweek__17.0_30.0:                      uint8             [1, 0 ...
date_time_yweek__30.0_41.0:                      uint8             [0, 0 ...
date_time_yweek__41.0_52.0:                      uint8             [0, 1 ...
date_time_mday__1.0_8.0:                         uint8             [0, 1 ...
date_time_mday__8.0_16.0:                        uint8             [0, 0 ...
date_time_mday__16.0_23.0:                       uint8             [0, 0 ...
date_time_mday__23.0_31.0:                       uint8             [1, 0 ...
date_time_qday__1.0_24.0:                        uint8             [0, 0 ...
date_time_qday__24.0_48.0:                       uint8             [1, 1 ...
date_time_qday__48.0_70.0:                       uint8             [0, 0 ...
date_time_qday__70.0_92.0:                       uint8             [0, 0 ...
date_time_yday__1.0_121.0:                       uint8             [0, 0 ...
date_time_yday__121.0_209.0:                     uint8             [1, 0 ...
date_time_yday__209.0_286.0:                     uint8             [0, 0 ...
date_time_yday__286.0_365.0:                     uint8             [0, 1 ...
date_time_hour__0.0_10.0:                        uint8             [0, 0 ...
date_time_hour__10.0_14.0:                       uint8             [0, 1 ...
date_time_hour__14.0_18.0:                       uint8             [1, 0 ...
date_time_hour__18.0_23.0:                       uint8             [0, 0 ...
date_time_minute__0.0_15.0:                      uint8             [0, 0 ...
date_time_minute__15.0_30.0:                     uint8             [1, 0 ...
date_time_minute__30.0_45.0:                     uint8             [0, 0 ...
date_time_minute__45.0_59.0:                     uint8             [0, 1 ...
date_time_second__0.0_15.0:                      uint8             [1, 0 ...
date_time_second__15.0_30.0:                     uint8             [0, 0 ...
date_time_second__30.0_45.0:                     uint8             [0, 1 ...
date_time_second__45.0_59.0:                     uint8             [0, 0 ...
posa_continent__1:                               uint8             [0, 0 ...
posa_continent__2:                               uint8             [0, 0 ...
posa_continent__3:                               uint8             [1, 1 ...
posa_continent__-OTHER:                          uint8             [0, 0 ...
is_mobile__0:                                    uint8             [1, 1 ...
is_mobile__1:                                    uint8             [0, 0 ...
is_package__0:                                   uint8             [1, 1 ...
is_package__1:                                   uint8             [0, 0 ...
is_booking__0:                                   uint8             [0, 1 ...
is_booking__1:                                   uint8             [1, 0 ...
hotel_continent__-OTHER:                         uint8             [0, 0 ...
hotel_continent__2:                              uint8             [1, 1 ...
hotel_continent__3:                              uint8             [0, 0 ...
hotel_continent__4:                              uint8             [0, 0 ...
hotel_continent__6:                              uint8             [0, 0 ...
hotel_country__-OTHER:                           uint8             [0, 0 ...
hotel_country__50:                               uint8             [1, 1 ...
hotel_country__8:                                uint8             [0, 0 ...
hotel_market__-OTHER:                            uint8             [1, 1 ...
hotel_cluster__-OTHER:                           uint8             [1, 1 ...
date_time_year__2013:                            uint8             [1, 0 ...
date_time_year__2014:                            uint8             [0, 1 ...
date_time_year_iso__2013:                        uint8             [1, 0 ...
date_time_year_iso__2014:                        uint8             [0, 1 ...
date_time_year_iso__-OTHER:                      uint8             [0, 0 ...
date_time_yearstart__0:                          uint8             [1, 1 ...
date_time_yearstart__-OTHER:                     uint8             [0, 0 ...
date_time_yearend__0:                            uint8             [1, 1 ...
date_time_yearend__-OTHER:                       uint8             [0, 0 ...
date_time_half__1:                               uint8             [0, 0 ...
date_time_half__2:                               uint8             [1, 1 ...
date_time_quarter__1:                            uint8             [0, 0 ...
date_time_quarter__2:                            uint8             [0, 0 ...
date_time_quarter__3:                            uint8             [1, 0 ...
date_time_quarter__4:                            uint8             [0, 1 ...
date_time_quarteryear__2013Q1:                   uint8             [0, 0 ...
date_time_quarteryear__2013Q2:                   uint8             [0, 0 ...
date_time_quarteryear__2013Q3:                   uint8             [1, 0 ...
date_time_quarteryear__2013Q4:                   uint8             [0, 0 ...
date_time_quarteryear__2014Q1:                   uint8             [0, 0 ...
date_time_quarteryear__2014Q2:                   uint8             [0, 0 ...
date_time_quarteryear__2014Q3:                   uint8             [0, 0 ...
date_time_quarteryear__2014Q4:                   uint8             [0, 1 ...
date_time_quarterstart__0:                       uint8             [1, 1 ...
date_time_quarterstart__-OTHER:                  uint8             [0, 0 ...
date_time_quarterend__0:                         uint8             [1, 1 ...
date_time_quarterend__-OTHER:                    uint8             [0, 0 ...
date_time_month_lbl__April:                      uint8             [0, 0 ...
date_time_month_lbl__August:                     uint8             [0, 0 ...
date_time_month_lbl__December:                   uint8             [0, 0 ...
date_time_month_lbl__February:                   uint8             [0, 0 ...
date_time_month_lbl__January:                    uint8             [0, 0 ...
date_time_month_lbl__July:                       uint8             [1, 0 ...
date_time_month_lbl__June:                       uint8             [0, 0 ...
date_time_month_lbl__March:                      uint8             [0, 0 ...
date_time_month_lbl__May:                        uint8             [0, 0 ...
date_time_month_lbl__November:                   uint8             [0, 1 ...
date_time_month_lbl__October:                    uint8             [0, 0 ...
date_time_month_lbl__September:                  uint8             [0, 0 ...
date_time_monthstart__0:                         uint8             [1, 1 ...
date_time_monthstart__-OTHER:                    uint8             [0, 0 ...
date_time_monthend__0:                           uint8             [1, 1 ...
date_time_monthend__-OTHER:                      uint8             [0, 0 ...
date_time_mweek__1:                              uint8             [0, 1 ...
date_time_mweek__2:                              uint8             [0, 0 ...
date_time_mweek__3:                              uint8             [0, 0 ...
date_time_mweek__4:                              uint8             [1, 0 ...
date_time_mweek__5:                              uint8             [0, 0 ...
date_time_wday__1:                               uint8             [0, 0 ...
date_time_wday__2:                               uint8             [0, 0 ...
date_time_wday__3:                               uint8             [0, 0 ...
date_time_wday__4:                               uint8             [1, 0 ...
date_time_wday__5:                               uint8             [0, 0 ...
date_time_wday__6:                               uint8             [0, 0 ...
date_time_wday__7:                               uint8             [0, 1 ...
date_time_wday_lbl__Friday:                      uint8             [0, 0 ...
date_time_wday_lbl__Monday:                      uint8             [0, 0 ...
date_time_wday_lbl__Saturday:                    uint8             [0, 0 ...
date_time_wday_lbl__Sunday:                      uint8             [0, 1 ...
date_time_wday_lbl__Thursday:                    uint8             [1, 0 ...
date_time_wday_lbl__Tuesday:                     uint8             [0, 0 ...
date_time_wday_lbl__Wednesday:                   uint8             [0, 0 ...
date_time_weekend__0:                            uint8             [1, 0 ...
date_time_weekend__1:                            uint8             [0, 1 ...
date_time_am_pm__am:                             uint8             [0, 0 ...
date_time_am_pm__pm:                             uint8             [1, 1 ...

Step 2: Correlate the data

Next, we use correlate() to calculate strength of the relationship. The main parameter is target, which should be selected based on the business goal.

In this case, we can create a business goal to understand what relates to a website visit count greater than 2. We will select the column: is_booking__1 as the target. This is because we want to know what relates to a hotel room booking via the website search data.

This returns a 3 column data frame containing:

  1. feature: The name of the features
  2. bin: The bin that corresponds to a bin inside the features
  3. correlation: The strength of the relationship (0 to 1) and the direction of the relationship (+/-)
expedia_ts_correlate_df = expedia_ts_binarized_df.correlate('is_booking__1')

expedia_ts_correlate_df
feature bin correlation
77 is_booking 0 -1.000000
78 is_booking 1 1.000000
32 cnt 2.0_72.0 -0.099372
31 cnt 1.0_2.0 0.099372
75 is_package 0 0.075930
... ... ... ...
131 date_time_monthend -OTHER 0.000182
108 date_time_quarteryear 2014Q1 -0.000041
22 srch_children_cnt 0.0_9.0 NaN
87 hotel_market -OTHER NaN
88 hotel_cluster -OTHER NaN

155 rows × 3 columns

Step 3: Plot the Correlation funnel

It’s in this step where we can visualize review the correlations and determine which features relate to the target, the strength of the relationship (magnitude between 0 and 1), and the direction of the relationship (+/-).

expedia_ts_correlate_df.plot_correlation_funnel(
    engine = 'plotly',
    height = 800
)

4 Business insights from the Correlation Funnel

Within just a few lines of code, we can quickly assess the business insights. We see:

  1. Count of website visits seems counterintuitive: Lower counts are more likely to convert to a booking.

  2. Is package is also counterintuitive. Bundled hotel packages seem to convert less than non-bundled packages.

  3. Hotel continent ID = 2 has a higher conversion rate than other hotel contients

  4. Search destination and destination types appear to have higher booking conversions.

5 Business Value (Better Decisions)

From these insights we can begin to make business decisions like focusing less on packages and increasing advertising on continent ID 2 with trips to high-conversion search destinations. These actions should help increase Return on advertising spend (ROAS) and Return on Investment (ROI). All from a simple Correlation Funnel Visualization.

6 More Coming Soon…

We are in the early stages of development. But it’s obvious the potential for pytimetk now in Python. 🐍