Packages for Getting Started with Time Series Analysis in R

A. Motivation

During the recent RStudio Conference, an attendee asked the panel about the lack of support provided by the tidyverse in relation to time series data. As someone who has spent the majority of my career on time series problems, I found this somewhat surprising, because R already has a great suite of tools for visualizing, manipulating, and modeling time series data. I can understand the desire for a ‘tidyverse approved’ tool for time series analysis, but it seemed like perhaps the real issue was a lack of familiarity with the available tooling. Therefore, I wanted to put together a list of the packages and tools that I use most frequently in my work. For those unfamiliar with time series analysis, this could be a good place to start investigating R’s current capabilities.

B. Background

Time series data refers to a sequence of measurements made over time, at regular or irregular intervals, with each observation being a single dimension. An example of a low dimensional time series is daily temperature readings from 01/01/2001 through 12/31/2005. A high dimensional time series is characterized by a much larger number of observations, so an example could be daily temperature readings from 01/01/1980 through 12/31/2010. In either case, the goal of the analysis could lead one to perform regression, clustering, forecasting, or even classification.

C. Data For Examples

To run the code in this post, you will need to download the following data using the Unix terminal. The command below fetches a csv file from the City of Chicago website that contains information on reported incidents of crime that occurred in the city of Chicago from 2001 to present.

$ wget --no-check-certificate --progress=dot -O chicago_crime_data.csv "https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD"

Import the data into R and get the aggregate number of reported incidents of theft by day.

library(data.table)
dat = fread("chicago_crime_data.csv")
# standardize the column names
colnames(dat) = gsub(" ", "_", tolower(colnames(dat)))
# parse the date and count reported thefts by day
dat[, date2 := as.Date(date, format="%m/%d/%Y")]
mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
mydat[1:6]

D. Data Representation

The first set of packages that one should be aware of relates to data storage. One could use data frames, tibbles, or data tables, but there are already a number of data structures that are optimized for representing time series data. The fundamental time series object is “ts”. However, the “ts” class has a number of limitations, so it is usually best to work with the extensible time series (“xts”) object.
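
To make that limitation concrete, here is a minimal sketch of a “ts” object (the values are made up); note that it can only be indexed by a start period and a fixed frequency:

# a "ts" object is indexed by a start period and a fixed frequency
x = ts(c(50, 55, 48, 60), start = c(2015, 1), frequency = 12)  # monthly, from Jan 2015
x
# irregular or date-stamped daily data does not fit this model well,
# which is one reason the xts/zoo classes are generally preferred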

D1. xts

The xts package offers a number of great tools for data manipulation and aggregation. At its core is the xts object, which is essentially a matrix that can represent time series data at different time increments. xts is a subclass of the zoo object, which provides it with much of its functionality.

Here are some functions in xts that are worth investigating:

library(xts)
# create an xts object
mydat2 = as.xts(mydat)
mydat2
plot.xts(mydat2)
# filter by date 
mydat2["2015"]  ## 2015
mydat2["201501"]  ## Jan 2015
mydat2["20150101/20150105"]  ## Jan 01 to Jan 05 2015 
# replace all values from Aug 25 onwards with 0
mydat2["20170825/"] <- 0
mydat2["20170821/"]
# get the last one month 
last(mydat2, "1 month")
# get stats by time frame
apply.monthly(mydat2, sum)
apply.monthly(mydat2, quantile)
period.apply(mydat2, endpoints(mydat2,on='months'), sum)
period.apply(mydat2, endpoints(mydat2,on='months'), quantile)
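
A couple of other xts helpers are worth a quick look as well. Here is a short sketch using the mydat2 object created above:

# detect the time increment of the series
periodicity(mydat2)
# roll the daily series up to monthly OHLC-style bars
to.monthly(mydat2)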

E. Dates

R has a maddening array of date and time classes. Be it yearmon, POSIXct, POSIXlt, chron, or something else, each has specific strengths and weaknesses. In general, I find myself using the lubridate package as it simplifies many of the complexities associated with date-times in R.
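
As a quick illustration of that simplification, compare parsing the same (arbitrary) timestamp in base R and in lubridate:

# base R needs an explicit format string and, often, a timezone
as.POSIXct("2010-01-01 10:02:30", format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
# lubridate infers the format from the function name
lubridate::ymd_hms("2010-01-01 10:02:30")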

E1. lubridate

The lubridate package provides a lot of functionality for parsing and formatting dates, comparing different times, extracting the components of a date-time, and so forth.

library(lubridate)
ymd("2010-01-01")
mdy("01-01-2010")
ymd_h("2010-01-01 10")
ymd_hm("2010-01-01 10:02")
ymd_hms("2010-01-01 10:02:30")

F. Time Series Regression

Distributed lag models (and the closely related error correction models) are a core component of time series analysis. There are many instances where we want to regress an outcome variable at the current time against values of various regressors at current and previous times. dynlm and ardl (a wrapper for dynlm) are solid for this type of analysis. Another common task when working with distributed lag models involves using dynamic simulations to understand estimated outcomes under different scenarios. dynsim provides a coherent solution for simulating and visualizing those estimated values of the target variable.

F1. dynlm / ardl

Here is a brief example of how dynlm can be utilized. In what follows, I have created a new weather variable and lagged it by one day, so the model attempts to regress daily incidents of reported theft on the weather from the previous day.

library(dynlm)
library(zoo)
mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
mydat[, weather := sample(20:90, nrow(mydat), replace=TRUE)]
mydat[, weather_lag := shift(weather, 1, type = 'lag')]
# dynlm's L() lag operator requires a "ts" or "zoo" object, so convert first
mydat_zoo = zoo(as.matrix(mydat[, .(N, weather)]), order.by = mydat$date2)
mod = dynlm(N ~ L(weather, 1), data = mydat_zoo)
summary(mod)

F2. dynsim

Here is a brief example of how dynsim can be utilized. Building on the lagged weather variable created above, I’ve used dynsim to produce two dynamic simulations and plotted them.

library(dynsim)
mydat3 = mydat[1:10000]
mod = lm(N ~ weather_lag, data = mydat3)
Scen1 <- data.frame(weather_lag = min(mydat3$weather_lag, na.rm=TRUE))
Scen2 <- data.frame(weather_lag = max(mydat3$weather_lag, na.rm=TRUE))
ScenComb <- list(Scen1, Scen2)
Sim1 <- dynsim(obj = mod, ldv = 'weather_lag', scen = ScenComb, n = 20)
dynsimGG(Sim1)

G. Forecasting

G1. forecast

The forecast package is the most widely used package in R for time series forecasting. It contains functions for performing decomposition and forecasting with exponential smoothing, ARIMA, moving average models, and so forth. For aggregated data that is fairly low dimensional, one of the techniques in this package should provide an adequate forecasting model, provided that the model assumptions hold.

Here is a quick example of how to use the auto.arima function in R. In general, automatic forecasting tools should be used with caution, but they are a good place to start when exploring time series data.

library(forecast)
mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
# fit an ARIMA model to the daily counts and forecast 200 steps ahead
fit = auto.arima(mydat[, N])
pred = forecast(fit, h = 200)
plot(pred)
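
The package is not limited to ARIMA. As a sketch, assuming the monthly xts aggregate from section D1 is still available, an exponential smoothing model can be fit with the ets function:

# exponential smoothing on monthly totals (assumes mydat2 from section D1)
mymonth = apply.monthly(mydat2, sum)
fit2 = ets(ts(as.numeric(mymonth), frequency = 12))
plot(forecast(fit2, h = 12))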

G2. smooth

The smooth package provides functions to perform even more variations of exponential smoothing, moving average models, and various seasonal ARIMA techniques. The smooth and forecast packages are usually more than adequate for most forecasting problems that pertain to low dimensional data.

Here is a basic example that uses the automatic complex exponential smoothing function:

library(smooth)
mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
fit = auto.ces(mydat[, N])
pred = forecast(fit, h = 200)
plot(pred)
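
For the moving average side of the package, the sma function fits a simple moving average model in the same style. A minimal sketch:

# fit a simple moving average model and forecast 30 days ahead
fit2 = sma(mydat[, N])
pred2 = forecast(fit2, h = 30)
plot(pred2)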

So for those of you getting introduced to the R programming language, this is a list of extremely useful packages for time series analysis that you will want to get some exposure to.

If you have questions, comments, interesting consulting projects, or work that needs to be done, feel free to contact me at mathewanalytics@gmail.com.

Data.Table by Example – Part 3

For this final post, I will cover some advanced topics and discuss how to use data tables within user-defined functions. Once again, let’s use the Chicago crime data.

library(data.table)
dat = fread("rows.csv")
names(dat) <- gsub(" ", "_", names(dat))
dat[, c("value1", "value2", "value3") := sample(1:50, nrow(dat), replace=TRUE)]
dat[1:3]

Let’s start by subsetting the data. The following code takes the first 50000 rows of the dat dataset, selects four columns, creates three new columns pertaining to the date, and then removes the original date column. The output is saved to a new variable, and we can view the first few rows of the new data table using brackets or the head function.

library(lubridate)  # provides mdy_hms() for parsing the date strings

ddat = dat[1:50000, .(Date, value1, value2, value3)][,
               c("year", "month", "day") :=
                        .(year(mdy_hms(Date)),
                          month(mdy_hms(Date)),
                          day(mdy_hms(Date)))][,-c("Date")]

ddat[1:3]  # same as head(ddat, 3)

We can now do some intermediate calculations inside braces and suppress their output; only the final expression is returned.

unique(ddat$month)
ddat[, { avg_val1 = mean(value1)
         new_val1 = mean(abs(value2-avg_val1))
         new_val2 = new_val1^2 }, by=month][order(month)]

In this simple case, we have taken the mean of value1 for each month, subtracted it from value2, and then squared that result. The output shows only the final calculation from the braces, which is the squared result. Note that I also ordered the results by month through chaining.

That’s all very nice and convenient, but what if we want to create user-defined functions that essentially automate these tasks for future work? Let us start with a function for extracting each component from the date column.

ddat = dat[1:50000, .(Date, value1, value2, value3)]

add_engineered_dates <- function(mydt, date_col="Date"){

   # copy so that we do not modify the original data table by reference
   new_dt = copy(mydt)

   # extract the year, month, and day into distinctly named columns
   new_dt[, paste0(date_col, "_year") := year(mdy_hms(get(date_col)))]
   new_dt[, paste0(date_col, "_month") := month(mdy_hms(get(date_col)))]
   new_dt[, paste0(date_col, "_day") := day(mdy_hms(get(date_col)))]

   # drop the original date column and return the result
   new_dt[, c(date_col) := NULL]

   new_dt[]
}

We’ve created a new function that takes two arguments: the data table and the name of the column containing the date values. The first step is to copy the data table so that we are not directly making changes to the original data. Because the goal is to add three new columns that extract the year, month, and day from the date column, we’ve used the lubridate package to parse the date column and then extract the desired values. Furthermore, each new column has been labeled so that it distinctly reflects the original date column name and the component that it contains. The next step is to remove the original date column, so we’ve set it to NULL. The final line returns the data table, with the trailing [] ensuring that the result prints to the screen. Notice that the get function is used to look up a column by the character name stored in date_col.

result = add_engineered_dates(ddat)
result

Let’s now apply the previous code with braces to find the squared difference between value2 and the mean value1 metric.

result[, { avg_val1 = mean(value1)
           new_val1 = mean(abs(value2-avg_val1))
           new_val2 = new_val1^2 }, by=.(Date_year,Date_month)][
                  order(Date_year,Date_month)][1:12]

Perfect!

I’m hoping that these three posts have convinced beginners to the R language that the data.table package is beautiful and worth checking out. The syntax may be scary at first, but it offers an array of powerful tools that everyone should become familiar with. So roll your sleeves up and have fun.

If you have any comments or would like to see other specific topics covered, feel free to comment below. You can also contact me at mathewanalytics@gmail.com or reach me through LinkedIn.

Data.Table by Example – Part 2

In part one, I provided an initial walk through of some nice features that are available within the data.table package. In particular, we saw how to filter data and get a count of rows by the date.

library(data.table)
dat = fread("rows.csv")
names(dat) <- gsub(" ", "_", names(dat))
dat[1:3]

Let us now add a few columns to our dataset on reported crimes in the city of Chicago. There are many ways to do this, but they all involve the use of the := operator. Since data.table updates values by reference, we do not need to save the results as another variable. This is a very desirable feature. Note that the first form below recycles a single sampled vector into all three columns, so they end up identical; the second form draws an independent sample for each column.

dat[, c("value1", "value2", "value3") := sample(1:50, nrow(dat), replace=TRUE)]

dat[, `:=`(value1 = sample(1:50, nrow(dat), replace=TRUE),
           value2 = sample(1:50, nrow(dat), replace=TRUE),
           value3 = sample(1:50, nrow(dat), replace=TRUE))]

You can also just use the traditional base R approach for adding new columns, as data.tables are also data frames.

dat$value1 <- sample(1:50, nrow(dat), replace=TRUE)
dat$value2 <- sample(1:50, nrow(dat), replace=TRUE)
dat$value3 <- sample(1:50, nrow(dat), replace=TRUE)

In any case, we now have three new columns with randomly selected values between 1 and 50. We can now look to summarize these values and see how they differ across the primary crime type and other categorical variables.

dat[, .(mean = mean(value1, na.rm = TRUE),
        median = median(value1, na.rm = TRUE),
        min = min(value1, na.rm = TRUE),
        max = max(value1, na.rm = TRUE))]

dat[Primary_Type=="PROSTITUTION",
              .(mean = mean(value1, na.rm = TRUE),
                median = median(value1, na.rm = TRUE),
                min = min(value1, na.rm = TRUE),
                max = max(value1, na.rm = TRUE))]

The above code allows us to get the mean, median, min, and max values from the value1 column, both for all rows and for those where the primary type is prostitution. To get these summaries across multiple columns, we can use lapply, .SD, and .SDcols. Notice that in the second line below, we passed Primary_Type to the by operator so that we get the mean value for each of the crime type categories.

dat[, lapply(.SD, mean, na.rm=TRUE), .SDcols=c("value1", "value2", "value3")]

dat[, lapply(.SD, mean, na.rm=TRUE), by=Primary_Type, .SDcols=c("value1", "value2", "value3")][1:4]

But wait, that only applies a single function to each of the value columns. If you want to apply multiple functions, you need to write a function that can then be used within lapply.

my.summary <- function(x){
                   c(mean = mean(x),
                     median = median(x),
                     min = min(x),
                     max = max(x))
}

dat[, lapply(.SD, my.summary), .SDcols=c("value1", "value2", "value3")]

dat[, lapply(.SD, my.summary), by=.(Primary_Type), .SDcols=c("value1", "value2", "value3")]
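
One rough edge is that the rows of this output are unlabeled. A small variant (an illustrative tweak, not the only option) carries the statistic names along as a column:

# keep the statistic names as an explicit column
dat[, c(list(stat = c("mean", "median", "min", "max")),
        lapply(.SD, my.summary)),
    .SDcols = c("value1", "value2", "value3")]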

Perfect! As you can see, the syntax is concise and is very easy to work with.

In the next part of this series, I’ll cover a few advanced features like get() and {}.

If you have any questions or comments, feel free to comment below. You can also contact me at mathewanalytics@gmail.com or reach me through LinkedIn 

 

Data.Table by Example – Part 1

For many years, I actively avoided the data.table package and preferred to utilize the tools available in either base R or dplyr for data aggregation and exploration. However, over the past year, I have come to realize that this was a mistake. Data tables are incredible and provide R users with a syntactically concise and efficient data structure for working with small, medium, or large datasets. While the package is well documented, I wanted to put together a series of posts that could be useful for those who want to get introduced to the data.table package in a more task-oriented format.

For this series of posts, I will be working with data that comes from the Chicago Police Department’s Citizen Law Enforcement Analysis and Reporting system. This dataset contains information on reported incidents of crime that occurred in the city of Chicago from 2001 to present. You can use the wget command in the terminal to download it as a csv file.

$ wget --no-check-certificate --progress=dot -O rows.csv "https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD"

The file will be saved to your working directory as “rows.csv”. We will import the data into R with the fread function and look at the first few rows and the structure of the data.

library(data.table)
dat = fread("rows.csv")

dat[1:3]
str(dat)

Notice that each of the string variables in the dataset was imported as a character and not a factor. With base R functions like read.csv, we would have to set the stringsAsFactors argument to FALSE to get this result.

names(dat) <- gsub(" ", "_", names(dat))

Let’s say that we want to see the frequency distribution of several of these variables. This can be done by using .N in conjunction with the by operator.

dat[, .N, by=.(Arrest)]

In the code below, you can also see how to chain operations together. We start by finding the count of each response in the variable, order the counts in descending order, and then select only those which occurred at least 200,000 times.

dat[, .N, by=.(Primary_Type)][order(-N)][N>=200000]

dat[, .N, by=.(Location_Description)][order(-N)][N>=200000]

Let’s say that we want to get a count of prostitution incidents by month. To get the desired results, we will need to modify the date values, filter instances in which the primary type was “prostitution”, and then get a count by each date.

dat[, date2 := paste(substr(as.Date(Date, format="%m/%d/%Y"),1,7), "01", sep="-")][
         Primary_Type=="PROSTITUTION", .N, by=.(date2)][, date2 := as.Date(date2)][order(date2)][]

If you want to plot the results as a line graph, just add another chain which executes the visualization, or use the magrittr %>% operator.

library(magrittr)  # provides the %>% operator
library(ggplot2)

dat[, date2 := paste(substr(as.Date(Date, format="%m/%d/%Y"),1,7), "01", sep="-")][
             Primary_Type=="PROSTITUTION", .N, by=.(date2)][, date2 := as.Date(date2)][order(date2)][,
          plot(N, type="l")]

dat[, date2 := paste(substr(as.Date(Date, format="%m/%d/%Y"),1,7), "01", sep="-")][
            Primary_Type=="PROSTITUTION", .N, by=.(date2)][, date2 := as.Date(date2)][order(date2)] %>%
          ggplot(., aes(date2, N)) + geom_line(group=1)

I’ve obviously skipped over a lot, and some of the code presented here is more verbose than needed. Even so, beginners to R will hopefully find this useful, and it will pique your interest in the beauty of the data.table package. Future posts will cover more of the goodies available in data.table, such as get(), set(), {}, and so forth.

If you have any questions or comments, feel free to comment below. You can also contact me at mathewanalytics@gmail.com or reach me through LinkedIn