vignettes/TT-04-use-with-dplyr.Rmd
TT-04-use-with-dplyr.Rmd
tibbletime
on its own has useful functions for
manipulating time-based tibbles, but one of its most useful features is
how nicely it plays with dplyr
. Traditionally, performing
grouped analysis over a time period with dplyr
(like
quarterly / monthly summaries) is doable, but it could be easier and
typically requires use of the lubridate
package along with
the creation of multiple columns to group on. Below is an example of a
monthly summary from a daily dataset.
library(tibbletime)
library(dplyr)
library(lubridate)
series <- create_series('2013' ~ '2017', 'day', class = "Date") %>%
mutate(var = rnorm(n()))
series
## # A time tibble: 1,826 × 2
## # Index: date
## date var
## <date> <dbl>
## 1 2013-01-01 -2.27
## 2 2013-01-02 -0.242
## 3 2013-01-03 0.0600
## 4 2013-01-04 -0.633
## 5 2013-01-05 -1.06
## 6 2013-01-06 -0.776
## 7 2013-01-07 0.0703
## 8 2013-01-08 0.771
## 9 2013-01-09 -0.854
## 10 2013-01-10 -1.09
## # … with 1,816 more rows
series %>%
mutate(year = year(date), month = month(date)) %>%
group_by(year, month) %>%
summarise(mean_var = mean(var))
## # A tibble: 60 × 3
## year month mean_var
## <dbl> <dbl> <dbl>
## 1 2013 1 0.0399
## 2 2013 2 -0.202
## 3 2013 3 0.127
## 4 2013 4 0.188
## 5 2013 5 0.133
## 6 2013 6 0.263
## 7 2013 7 -0.0726
## 8 2013 8 0.126
## 9 2013 9 0.0959
## 10 2013 10 -0.0349
## # … with 50 more rows
This gets more difficult the more granular you go. Getting 5-minute summaries from minute or second data requires grouping on year, month, day, hour and minute columns.
With tibbletime
, rather than creating new columns to
group on, you manipulate your original date column into something that
corresponds to the period you are summarising at. The
tibbletime
way to do this is with
collapse_by()
.
## # A time tibble: 60 × 2
## # Index: date
## date mean_var
## <date> <dbl>
## 1 2013-01-31 0.0399
## 2 2013-02-28 -0.202
## 3 2013-03-31 0.127
## 4 2013-04-30 0.188
## 5 2013-05-31 0.133
## 6 2013-06-30 0.263
## 7 2013-07-31 -0.0726
## 8 2013-08-31 0.126
## 9 2013-09-30 0.0959
## 10 2013-10-31 -0.0349
## # … with 50 more rows
While collapse_by()
directly manipulates the index
column, the lower level collapse_index()
function can be
used inside of a call to mutate()
to modify the index
column and then save it as a new column. This can be useful if you don’t
want to lose the original index column.
This works for more granular series too. Below we aggregate 5 second level data up to hourly. This is working with a faily sizable ~19 million row data set.
second_series <- create_series('2013' ~ '2015', '5 second')
second_series %>%
mutate(var = rnorm(n())) %>%
collapse_by("hour") %>%
group_by(date) %>%
summarise(mean_var = mean(var))
## # A time tibble: 26,280 × 2
## # Index: date
## date mean_var
## <dttm> <dbl>
## 1 2013-01-01 00:59:55 -0.0132
## 2 2013-01-01 01:59:55 0.0442
## 3 2013-01-01 02:59:55 -0.0729
## 4 2013-01-01 03:59:55 0.0492
## 5 2013-01-01 04:59:55 -0.00102
## 6 2013-01-01 05:59:55 0.0994
## 7 2013-01-01 06:59:55 -0.00455
## 8 2013-01-01 07:59:55 0.0373
## 9 2013-01-01 08:59:55 0.0458
## 10 2013-01-01 09:59:55 -0.0191
## # … with 26,270 more rows
One really powerful benefit of working in the tidyverse
is being able to manipulate multiple series at once. Essentially we can
create multiple layers of groupings, one for the stocks we are working
with (like Facebook and Apple), and one for the period you want to
summarise your data at (daily, yearly, etc).
Below we use create_series()
to create two dummy hourly
price series, combine them, and calculate the OHLC (Open, High, Low,
Close) prices per day by first collapsing to daily with
collapse_by()
to have something to group on.
set.seed(123)
# Create price series of hourly movements for apple and facebook stock.
apple <- create_series('2014' ~ '2016', period = '1 hour') %>%
mutate(price = 100 + cumsum(rnorm(n(), mean = 0, sd = .5)))
facebook <- create_series('2014' ~ '2016', period = '1 hour') %>%
mutate(price = 150 + cumsum(rnorm(n(), mean = 0, sd = .5)))
# Bind them together and create a symbol column to group on
price_series <- bind_rows(list(apple = apple, facebook = facebook), .id = "symbol") %>%
as_tbl_time(date) %>%
group_by(symbol)
# Collapse to daily and transform to OHLC (Open, High, Low, Close), a
# common financial transformation
price_series %>%
collapse_by("day") %>%
group_by(symbol, date) %>%
summarise(
open = first(price),
high = max(price),
low = min(price),
close = last(price)
) %>%
slice(1:5)
## `summarise()` has grouped output by 'symbol'. You can override using the
## `.groups` argument.
## # A time tibble: 10 × 6
## # Index: date
## # Groups: symbol [2]
## symbol date open high low close
## <chr> <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 apple 2014-01-01 23:00:00 99.7 102. 99.6 99.9
## 2 apple 2014-01-02 23:00:00 99.6 102. 98.7 101.
## 3 apple 2014-01-03 23:00:00 101. 103. 101. 101.
## 4 apple 2014-01-04 23:00:00 102. 104. 101. 103.
## 5 apple 2014-01-05 23:00:00 104. 105. 101. 101.
## 6 facebook 2014-01-01 23:00:00 150. 152. 149. 149.
## 7 facebook 2014-01-02 23:00:00 149. 151. 148. 149.
## 8 facebook 2014-01-03 23:00:00 150. 150. 147. 148.
## 9 facebook 2014-01-04 23:00:00 148. 150. 144. 145.
## 10 facebook 2014-01-05 23:00:00 145. 146. 144. 146.