binarize() returns binary data converted from data in normal (numeric and categorical) format.

binarize(data, n_bins = 4, thresh_infreq = 0.01,
  name_infreq = "-OTHER", one_hot = TRUE)

Arguments

data

A tibble or data.frame

n_bins

The number of bins used to convert continuous (numeric) features into discrete features (bins). Set to 4 by default.

thresh_infreq

The frequency threshold below which levels of a categorical feature (character or factor) are lumped into an "Other" category. Set to 0.01 by default.

name_infreq

The name for infrequently appearing categories to be lumped into. Set to "-OTHER" by default.

one_hot

If TRUE, binarization returns one new column per level (one-hot encoding). If FALSE, binarization returns one fewer column than the number of levels (dummy encoding). Set to TRUE by default.
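
The difference in column counts can be illustrated with a minimal base R sketch. This is not the binarize() implementation: the `marital` vector is made-up data, and `model.matrix()` is used here only as a stand-in to show the two encodings.

```r
# Illustration only: base R, not the correlationfunnel implementation.
# Hypothetical categorical feature with three levels.
marital <- factor(c("married", "single", "divorced", "single", "married"))

# One-hot style: remove the intercept so every level gets its own indicator.
one_hot <- model.matrix(~ marital - 1)

# Dummy style: the first level becomes the implicit reference, and the
# intercept column is dropped, leaving (number of levels - 1) indicators.
dummy <- model.matrix(~ marital)[, -1, drop = FALSE]

ncol(one_hot)  # one column per level
ncol(dummy)    # one fewer column than the number of levels
```

With one-hot encoding every row has exactly one indicator set to 1. With dummy encoding, rows belonging to the reference level are all zeros.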

Value

A tbl

Details

The Goal

The binned format helps correlation analysis to identify non-linear trends between a predictor (binned values) and a response (the target).

What Binarize Does

The binarize() function takes data in a "normal" format and converts it to a binary format, which is a useful preparation step before using correlate():

Numeric Features: The "Normal Data" format has numeric features that are continuous values in numeric format (double or integer). The binarize() function converts these to bins (categories) and then converts the bins to binary features using a one-hot encoding process.

Categorical Features: The "Normal Data" format has categorical features in character or factor format. The binarize() function converts these to binary features using a one-hot encoding process.
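
The two conversions described above can be sketched in base R. This is a hedged illustration, not the package's code: the `age` and `job` vectors are invented, quantile-based break points are an assumed binning strategy, the threshold is treated as a proportion of rows (also an assumption), and the value 0.10 is chosen only so that the small example has an infrequent level.

```r
# Illustration only: base R, not the correlationfunnel implementation.

# --- Numeric feature: bin, then one-hot encode -------------------------
age    <- c(22, 25, 31, 35, 38, 41, 47, 52, 58, 63, 70, 29)  # made-up data
n_bins <- 4

# Assumed strategy: inner break points at evenly spaced quantiles, with
# open-ended outer bins so that every value falls into exactly one bin.
inner   <- quantile(age, probs = seq(0, 1, length.out = n_bins + 1))[2:n_bins]
breaks  <- c(-Inf, unname(inner), Inf)
age_bin <- cut(age, breaks = breaks)

# One indicator column per bin.
age_onehot <- model.matrix(~ age_bin - 1)

# --- Categorical feature: lump infrequent levels, then one-hot encode --
job           <- c(rep("admin", 6), rep("technician", 5), "student")
thresh_infreq <- 0.10      # assumed to mean a proportion of rows
name_infreq   <- "-OTHER"

# Levels whose share of rows is below the threshold are renamed.
freq       <- prop.table(table(job))
rare       <- names(freq)[freq < thresh_infreq]
job_lumped <- factor(ifelse(job %in% rare, name_infreq, job))

# One indicator column per remaining level, including the lumped one.
job_onehot <- model.matrix(~ job_lumped - 1)

table(age_bin)
table(job_lumped)
```

In this sketch each numeric value lands in one of `n_bins` bins, and the single "student" row is folded into the "-OTHER" level before encoding, which keeps rare categories from producing nearly empty indicator columns.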

Examples

library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following object is masked from ‘package:testthat’:
#> 
#>     matches
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union
library(correlationfunnel)

marketing_campaign_tbl %>%
    select(-ID) %>%
    binarize()
#> # A tibble: 45,211 x 74
#>    `AGE__-Inf_33` AGE__33_39 AGE__39_48 AGE__48_Inf JOB__admin. `JOB__blue-coll…
#>             <dbl>      <dbl>      <dbl>       <dbl>       <dbl>            <dbl>
#>  1              0          0          0           1           0                0
#>  2              0          0          1           0           0                0
#>  3              1          0          0           0           0                0
#>  4              0          0          1           0           0                1
#>  5              1          0          0           0           0                0
#>  6              0          1          0           0           0                0
#>  7              1          0          0           0           0                0
#>  8              0          0          1           0           0                0
#>  9              0          0          0           1           0                0
#> 10              0          0          1           0           0                0
#> # … with 45,201 more rows, and 68 more variables: JOB__entrepreneur <dbl>,
#> #   JOB__housemaid <dbl>, JOB__management <dbl>, JOB__retired <dbl>,
#> #   `JOB__self-employed` <dbl>, JOB__services <dbl>, JOB__student <dbl>,
#> #   JOB__technician <dbl>, JOB__unemployed <dbl>, `JOB__-OTHER` <dbl>,
#> #   MARITAL__divorced <dbl>, MARITAL__married <dbl>, MARITAL__single <dbl>,
#> #   EDUCATION__primary <dbl>, EDUCATION__secondary <dbl>,
#> #   EDUCATION__tertiary <dbl>, EDUCATION__unknown <dbl>, DEFAULT__no <dbl>,
#> #   DEFAULT__yes <dbl>, `BALANCE__-Inf_72` <dbl>, BALANCE__72_448 <dbl>,
#> #   BALANCE__448_1428 <dbl>, BALANCE__1428_Inf <dbl>, HOUSING__no <dbl>,
#> #   HOUSING__yes <dbl>, LOAN__no <dbl>, LOAN__yes <dbl>,
#> #   CONTACT__cellular <dbl>, CONTACT__telephone <dbl>, CONTACT__unknown <dbl>,
#> #   `DAY__-Inf_8` <dbl>, DAY__8_16 <dbl>, DAY__16_21 <dbl>, DAY__21_Inf <dbl>,
#> #   MONTH__apr <dbl>, MONTH__aug <dbl>, MONTH__feb <dbl>, MONTH__jan <dbl>,
#> #   MONTH__jul <dbl>, MONTH__jun <dbl>, MONTH__mar <dbl>, MONTH__may <dbl>,
#> #   MONTH__nov <dbl>, MONTH__oct <dbl>, MONTH__sep <dbl>,
#> #   `MONTH__-OTHER` <dbl>, `DURATION__-Inf_103` <dbl>, DURATION__103_180 <dbl>,
#> #   DURATION__180_319 <dbl>, DURATION__319_Inf <dbl>, `CAMPAIGN__-Inf_2` <dbl>,
#> #   CAMPAIGN__2_3 <dbl>, CAMPAIGN__3_Inf <dbl>, `PDAYS__-1` <dbl>,
#> #   `PDAYS__-OTHER` <dbl>, PREVIOUS__0 <dbl>, PREVIOUS__1 <dbl>,
#> #   PREVIOUS__2 <dbl>, PREVIOUS__3 <dbl>, PREVIOUS__4 <dbl>, PREVIOUS__5 <dbl>,
#> #   `PREVIOUS__-OTHER` <dbl>, POUTCOME__failure <dbl>, POUTCOME__other <dbl>,
#> #   POUTCOME__success <dbl>, POUTCOME__unknown <dbl>, TERM_DEPOSIT__no <dbl>,
#> #   TERM_DEPOSIT__yes <dbl>