Speed Up Exploratory Data Analysis (EDA) with correlationfunnel

The goal of correlationfunnel is to help data scientists speed up Exploratory Data Analysis (EDA). EDA can be an incredibly time-consuming process.

Problem

Traditional approaches to EDA are labor-intensive: the data scientist reviews each of the features (predictors) in the data set for its relationship to the target (i.e. the goal or response). Manually building many visualizations and searching for relationships can take hours.

Solution

Correlation Analysis on data that has been preprocessed (more on this shortly) can drastically speed up EDA by identifying key features that relate to the target. The key is getting the features into the “right format”. This is where correlationfunnel helps.

The correlationfunnel package includes a streamlined 3-step process for preparing data and performing visual Correlation Analysis. The visualization produced uncovers insights by elevating high-correlation features and lowering low-correlation features. The shape looks like a funnel (hence the name “Correlation Funnel”), making it easy to see which features are most likely to provide business insights and lend themselves well to a machine learning model.

Main Benefits

  1. Speeds Up Exploratory Data Analysis - You can drastically increase the speed at which you perform Exploratory Data Analysis (EDA) by using Correlation Analysis to focus on key features (rather than investigating all features).

  2. Improves Feature Selection - You can use correlation to check whether you have good features before spending significant time developing Machine Learning models.

  3. Gets You To Business Insights Faster - Understanding how features are related to a target variable can help you develop the story in the data (aka business insights).

Correlation Funnel Process

The Correlation Funnel process uses 3 functions:

  1. Transform the data into a binary format with binarize() - This step converts semi-processed data into a binary format, which is well suited to correlation analysis

  2. Perform correlation analysis using correlate() - This step correlates the “binarized” data (binary features) with the target

  3. Visualize the feature-target relationships using plot_correlation_funnel() - This step produces the visualization from which we can get business insights
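To see why binary features are convenient for step 2, the following base R sketch computes a Pearson correlation between each 0/1 feature and a 0/1 target, then ranks the features by strength. It is an illustration on made-up data, not the package's implementation: the data frame toy_binary and its column names are hypothetical.

```r
# Toy binary (0/1) data; all names and values are made up for illustration.
toy_binary <- data.frame(
  contract_monthly = c(1, 1, 1, 0, 0, 0, 1, 0),
  has_tech_support = c(0, 0, 1, 1, 1, 0, 0, 1),
  churn_yes        = c(1, 1, 1, 0, 0, 0, 1, 0)
)

# Pearson correlation of each feature column with the target column.
feature_names <- setdiff(names(toy_binary), "churn_yes")
correlations <- sapply(
  feature_names,
  function(col) cor(toy_binary[[col]], toy_binary$churn_yes)
)

# Sort by absolute correlation so the strongest relationships come first.
toy_corr <- data.frame(
  feature     = feature_names,
  correlation = unname(correlations),
  stringsAsFactors = FALSE
)
toy_corr <- toy_corr[order(-abs(toy_corr$correlation)), ]
print(toy_corr)
```

In this toy data, contract_monthly matches the target exactly (correlation 1), while has_tech_support is negatively related to it (correlation -0.5). The sign indicates whether a feature tends to go with the target or against it.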

Example - Customer Churn

We’ll step through an example of understanding what features are related to Customer Churn.

Load the necessary libraries.

library(correlationfunnel)
library(dplyr)

Get the customer_churn_tbl dataset. The dataset contains a number of features related to a telecommunications company’s customer base and whether or not each customer has churned. The target is “Churn”.

data("customer_churn_tbl")

customer_churn_tbl %>% glimpse()
#> Observations: 7,043
#> Variables: 21
#> $ customerID       <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795…
#> $ gender           <chr> "Female", "Male", "Male", "Male", "Female", "Fe…
#> $ SeniorCitizen    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No"…
#> $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No"…
#> $ tenure           <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58,…
#> $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", …
#> $ MultipleLines    <chr> "No phone service", "No", "No", "No phone servi…
#> $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fib…
#> $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Y…
#> $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "N…
#> $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "N…
#> $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "No"…
#> $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No…
#> $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "No"…
#> $ Contract         <chr> "Month-to-month", "One year", "Month-to-month",…
#> $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", …
#> $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed che…
#> $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10…
#> $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50…
#> $ Churn            <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "N…

Step 1 - Prepare Data as Binary Features

We use the binarize() function to produce a feature set of binary (0/1) variables. Numeric data are binned (using n_bins) into categorical data, then all categorical data are one-hot encoded to produce binary features. To prevent low-frequency categories (common in high-cardinality features) from increasing the dimensionality (the width of the resulting data frame), we use thresh_infreq = 0.01 and name_infreq = "OTHER" to group the excess categories.

customer_churn_binarized_tbl <- customer_churn_tbl %>%
  select(-customerID) %>%
  mutate(TotalCharges = ifelse(is.na(TotalCharges), MonthlyCharges, TotalCharges)) %>%
  binarize(n_bins = 5, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE)

customer_churn_binarized_tbl
#> # A tibble: 7,043 x 60
#>    gender__Female gender__Male SeniorCitizen__0 SeniorCitizen__1
#>             <dbl>        <dbl>            <dbl>            <dbl>
#>  1              1            0                1                0
#>  2              0            1                1                0
#>  3              0            1                1                0
#>  4              0            1                1                0
#>  5              1            0                1                0
#>  6              1            0                1                0
#>  7              0            1                1                0
#>  8              1            0                1                0
#>  9              1            0                1                0
#> 10              0            1                1                0
#> # … with 7,033 more rows, and 56 more variables: Partner__No <dbl>,
#> #   Partner__Yes <dbl>, Dependents__No <dbl>, Dependents__Yes <dbl>,
#> #   `tenure__-Inf_6` <dbl>, tenure__6_20 <dbl>, tenure__20_40 <dbl>,
#> #   tenure__40_60 <dbl>, tenure__60_Inf <dbl>, PhoneService__No <dbl>,
#> #   PhoneService__Yes <dbl>, MultipleLines__No <dbl>,
#> #   MultipleLines__No_phone_service <dbl>, MultipleLines__Yes <dbl>,
#> #   InternetService__DSL <dbl>, InternetService__Fiber_optic <dbl>,
#> #   InternetService__No <dbl>, OnlineSecurity__No <dbl>,
#> #   OnlineSecurity__No_internet_service <dbl>, OnlineSecurity__Yes <dbl>,
#> #   OnlineBackup__No <dbl>, OnlineBackup__No_internet_service <dbl>,
#> #   OnlineBackup__Yes <dbl>, DeviceProtection__No <dbl>,
#> #   DeviceProtection__No_internet_service <dbl>,
#> #   DeviceProtection__Yes <dbl>, TechSupport__No <dbl>,
#> #   TechSupport__No_internet_service <dbl>, TechSupport__Yes <dbl>,
#> #   StreamingTV__No <dbl>, StreamingTV__No_internet_service <dbl>,
#> #   StreamingTV__Yes <dbl>, StreamingMovies__No <dbl>,
#> #   StreamingMovies__No_internet_service <dbl>,
#> #   StreamingMovies__Yes <dbl>, `Contract__Month-to-month` <dbl>,
#> #   Contract__One_year <dbl>, Contract__Two_year <dbl>,
#> #   PaperlessBilling__No <dbl>, PaperlessBilling__Yes <dbl>,
#> #   `PaymentMethod__Bank_transfer_(automatic)` <dbl>,
#> #   `PaymentMethod__Credit_card_(automatic)` <dbl>,
#> #   PaymentMethod__Electronic_check <dbl>,
#> #   PaymentMethod__Mailed_check <dbl>, `MonthlyCharges__-Inf_25.05` <dbl>,
#> #   MonthlyCharges__25.05_58.83 <dbl>, MonthlyCharges__58.83_79.1 <dbl>,
#> #   MonthlyCharges__79.1_94.25 <dbl>, MonthlyCharges__94.25_Inf <dbl>,
#> #   `TotalCharges__-Inf_265.32` <dbl>, TotalCharges__265.32_939.78 <dbl>,
#> #   TotalCharges__939.78_2043.71 <dbl>,
#> #   TotalCharges__2043.71_4471.44 <dbl>, TotalCharges__4471.44_Inf <dbl>,
#> #   Churn__No <dbl>, Churn__Yes <dbl>
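To make the binarization idea concrete, here is a base R sketch of the same three moves on toy data: bin a numeric column, lump rare categories together, and one-hot encode the result. It is an illustration under stated assumptions, not the internals of binarize(): the quantile-based bin edges, the helper one_hot(), the toy data and the threshold value are all hypothetical.

```r
# Toy data; names and values are made up for illustration.
toy <- data.frame(
  tenure  = c(1, 3, 8, 15, 22, 30, 45, 58, 62, 70),
  payment = c("check", "check", "card", "card", "check",
              "card", "check", "card", "check", "wire"),
  stringsAsFactors = FALSE
)

# 1. Bin the numeric column at its quantiles, with open-ended outer bins.
#    (Quantile-based edges are an assumption of this sketch.)
n_bins <- 2
edges <- quantile(toy$tenure, probs = seq(0, 1, length.out = n_bins + 1))
inner_edges <- unname(edges[-c(1, length(edges))])
toy$tenure <- cut(toy$tenure, breaks = c(-Inf, inner_edges, Inf))

# 2. Lump categories rarer than the threshold into a single "OTHER" level.
thresh_infreq <- 0.15
shares <- prop.table(table(toy$payment))
rare <- names(shares)[shares < thresh_infreq]
toy$payment[toy$payment %in% rare] <- "OTHER"
toy$payment <- factor(toy$payment)

# 3. One-hot encode: one 0/1 column per level of a factor.
one_hot <- function(x, prefix) {
  m <- sapply(levels(x), function(lv) as.numeric(x == lv))
  colnames(m) <- paste(prefix, levels(x), sep = "__")
  m
}
toy_binarized <- as.data.frame(
  cbind(one_hot(toy$tenure, "tenure"), one_hot(toy$payment, "payment"))
)
print(toy_binarized)
```

Each row ends up with exactly one 1 per original column, and the single "wire" customer falls into payment__OTHER rather than getting a column of its own. That is how grouping infrequent categories keeps the binarized table from growing too wide.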

Step 3 - Plot the Correlation Funnel

Finally, we visualize the correlation using the plot_correlation_funnel() function.
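The calls for steps 2 and 3 are not shown in this walkthrough, so the sketch below fills them in using the package's correlate() and plot_correlation_funnel() functions. It repeats Step 1 so that it runs on its own. Two things are assumptions: that Churn__Yes is the target column of interest, and the intermediate name customer_churn_corr_tbl, which is chosen here for illustration.

```r
library(correlationfunnel)
library(dplyr)

data("customer_churn_tbl")

# Step 1 (as above): drop the ID column, fill missing TotalCharges, binarize.
customer_churn_binarized_tbl <- customer_churn_tbl %>%
  select(-customerID) %>%
  mutate(TotalCharges = ifelse(is.na(TotalCharges), MonthlyCharges, TotalCharges)) %>%
  binarize(n_bins = 5, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE)

# Step 2: correlate every binary feature with the target column.
# Assumption: Churn__Yes is the target of interest.
customer_churn_corr_tbl <- customer_churn_binarized_tbl %>%
  correlate(Churn__Yes)

# Step 3: draw the funnel plot from the correlation table.
customer_churn_corr_tbl %>%
  plot_correlation_funnel()
```

Choosing Churn__Yes rather than Churn__No only affects the sign convention: with this choice, features with positive correlations are the ones associated with churning.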

Business Insights

We can see that the following features are correlated with Churn:

  • “Month to Month” Contract Type
  • No Online Security
  • No Tech Support
  • Customer tenure less than 6 months
  • Fiber Optic internet service
  • Pays with electronic check

We can also see that the following features are correlated with Staying (No Churn):

  • “Two Year” Contract Type
  • Customer Purchases Online Security
  • Customer Purchases Tech Support
  • Customer tenure greater than 60 months (5 years)
  • DSL internet service
  • Pays with automatic credit card

We can then develop a strategy to retain high risk customers:

  • Promotions for 2 Year Contract, Online Security, and Tech Support
  • Loyalty Bonuses to incentivize tenure
  • Incentives for setting up an automatic credit card payment

Conclusion

The correlationfunnel package provides a 3-step workflow that streamlines the EDA process, helps with feature selection, and improves the ease of obtaining Business Insights.

More Information

To learn about the inner workings of correlationfunnel and key considerations for its use, please read the Key Considerations and FAQs.