Batch Forecasting in R

Given a data frame with multiple columns which contain time series data, let’s say that we are interested in executing an automatic forecasting algorithm on a number of columns. Furthermore, we want to train the model on a particular number of observations and assess how well they forecast future values. Based upon those testing procedures, we will estimate the full model. This is a fairly simple undertaking, but let’s walk through this task. My preference for such procedures is to loop through each column and append the results into a nested list.

First, let’s create some data.

ddat <- data.frame(date = c(seq(as.Date("2010/01/01"), as.Date("2010/03/02"), by=1)),
                      value1 = abs(round(rnorm(61), 2)),
                      value2 = abs(round(rnorm(61), 2)),
                      value3 = abs(round(rnorm(61), 2)))

We want to forecast future values of the three columns. Because we want to save the results of these models into a list, lets begin by creating a list that contains the same number of elements as our data frame.

lst.names <- c(colnames(data))
lst <- vector("list", length(lst.names))
names(lst) <- lst.names

I’ve gone ahead and written a user defined function that handles the batch forecasting process. It takes two arguments, a data frame and default argument which specifies the number of observations that will be used in the training set. The model estimates, forecasts, and diagnostic measures will be saved as a nested list and categorized under the appropriate variable name.

batch <- function(data, n_train=55){
  lst.names <- c(colnames(data))
  lst <- vector("list", length(lst.names))
  names(lst) <- lst.names    
  for( i in 2:ncol(data) ){  
    lst[[1]][["train_dates"]] <- data[1:(n_train),1]
    lst[[1]][["test_dates"]] <- data[(n_train+1):nrow(data),1]
    est <- auto.arima(data[1:n_train,i])
    fcas <- forecast(est, h=6)$mean
    acc <- accuracy(fcas, data[(n_train+1):nrow(data),i])
    fcas_upd <- data.frame(date=data[(n_train+1):nrow(data),1], forecast=fcas,                           actual=data[(n_train+1):nrow(data),i])
    lst[[i]][["estimates"]] <- est
    lst[[i]][["forecast"]] <- fcas
    lst[[i]][["forecast_f"]] <- fcas_upd
    lst[[i]][["accuracy"]] <- acc
    cond1 = diff(range(fcas[1], fcas[length(fcas)])) == 0
    cond2 = acc[,3] >= 0.025
      mfcas = forecast(ma(data[,i], order=3), h=5)        
      lst[[i]][["moving_average"]] <- mfcas
    } else {
      est2 <- auto.arima(data[,i])
      fcas2 <- forecast(est, h=5)$mean
      lst[[i]][["estimates_full"]] <- est2
      lst[[i]][["forecast_full"]] <- fcas2

This isn’t the prettiest code, but it gets the job done. Note that lst was populated within a function and won’t be available in the global environment. Instead, I chose to simply print out the contents of the list after the function is evaluated.

Statistical Reading Rainbow

For those of us who received statistical training outside of statistics departments, it often emphasized procedures over principles. This entailed that we learned about various statistical techniques and how to perform analysis in a particular statistical software, but glossed over the mechanisms and mathematical statistics underlying these practices. While that training methodology (hereby referred to as the ‘heuristic method’) has value, it has many drawbacks when the ultimate goal is to perform sound statistical analysis that is valid and thorough. Even in my current role as a data scientist at a technology company in the San Francisco Bay Area, I have had to go back and understand various procedures and metrics instead of just “doing data analysis”.

Given this realization, I have dedicated hours of time outside of work over the last couple years to “re-training” myself on many of the important concepts in both descriptive and inferential statistics. This post will give brief mention to the books that have been most useful is helping me develop a fuller understanding of the statistical sciences. These books have also helped me fill in the gaps and deficiencies from my statistical training in university and graduate school. Furthermore, these are the texts that I often revisit when I need a reference on some statistical topic of interest. This is at a minimum a a six year journey, so I have a long way to go until I am able to stand solidly in my understanding of statistics. While I am sacrificing a lot of my free time to this undertaking, it will certainly improve my knowledge and help prepare me for graduate school (PhD) in biostatistics, which I hope to attend in around five to six years.[1]

Please note that I am not taking issue with the ‘heuristic method’ of statistical training. It certainly has its place and provides students with the immediate knowledge required to satisfactorily prepare for work in private industry. In fact, I prefer the ‘heuristic method’ and still rely on straight forward rules in my day to day work as that ensures that best practices are followed and satisfactory analysis is performed. Furthermore, I certainly believe that it is superior to the hack-ey nature of data mining and data science education, but that is a different story.


Statistics in Plain English – Urdan
Clear, concise, and covers all the fundamental items that one would need to know. Everything from descriptive statistics to linear regression are covered, with many good examples. Even if you never use ANOVA or factor analysis, this is a good book to review and one that I strongly recommend to people who are interested in data science.

Principles of Statistics – Balmer
This is a classic text that offers a good treatment of probability theory, distributions, and statistical inference. The text contains a bit more math than ‘Statistics in Plain English’, so I think it should be read after completing the previous book.

Fundamentals of Modern Statistical Methods – Wilcox
This book reviews ‘traditional’ parametric statistics and provides a good overview of robust statistical methods. There is a fair amount on the historical evolution of various techniques, and I found that a bit unnecessary. But overall, this is still a solid introductory text to learn about statistical inference using robust techniques.

Regression Analysis:

Mostly Harmless Econometrics – Angrist
While I don’t regularly work with instrumental variables, generalized methods of moments, or regression discontinuity, this book is a great high level introduction to econometrics. The chapters on regression theory and quantile regression are phenomenal.

Regression Modeling Strategies – Harrell
This is my most referenced book and the one that really helped in my overall development as an applied statistician. All the important topics are covered, from imputation, regression splines, and so forth. This book includes R code for performing analysis using the RMS package. I end up citing this book quite a lot. For example, in a recent work email, I mentioned that Harrell “also says on page 61 that “narrowly distributed predictor variables will require higher sample sizes.”” Essential reading in my opinion.

Data Analysis Using Regression and Multilevel/Hierarchical Models – Gelman and Hill
The first half of this book cover statistical inference using single level models and the second half is dedicated to multilevel methods. Given that I am rarely work with panel data, I use the first half of this book a reference for things that I may need a quick refresher on. It is very accessible and has plenty of examples with R code.

Semiparametric Regression for the Social Sciences – Keele
This is one of my favorite statistical books. Well written and easy to comprehend, but still rigorous. Covers local regression, splines, and generalized additive models. There is also a solid chapter on the use of bootstrapping with semiparametric and nonparametric models.

Statistical Learning:

Statistical Learning from a Regression Perspective – Berk
As a skeptic who is wary of every hype machine, I really enjoyed Berks preface in which he discusses the “dizzying array of new statistical procedures” that have been introduced over the past several decades with “the hype of a big-budget movie.” I got this text for its treatment of topics such as boosting, bagging, random forest, and support vector machines. I will probably need to reread this book several more times before I fully comprehend everything.

Time Series:

Time Series: a Biostatistical Introduction – Diggle
The lack of quality time series books is really infuriating. Don’t get me wrong, there are some good texts on forecasting, such as the free online book from Hyndaman. However, I’ve yet to find a really good intermediate level treatment of time series analysis besides this one. Contains good coverage of repeated measurements, ARIMA modeling, and forecasting.

Bayesian Methods

Statistical Rethinking – McElreath
While I was introduced to robust techniques and nonparametric statistics in graduate school, there was nothing on Bayesian methods. Due to a fear of the topic, I avoided learning about it until this past year. This book by McElreath has been great as it is very accessible and provides code for understanding various principles. Over the next year, I am hoping to dive deeper into Bayesian techniques and this was a good first step.


Weekly R-Tips: Visualizing Predictions

Lets say that we estimated a linear regression model on time series data with lagged predictors. The goal is to estimate sales as a function of inventory, search volume, and media spend from two months ago. After using the lm function to perform linear regression, we predict sales using values from two month ago.

frmla <- sales ~ inventory + search_volume + media_spend
mod <- lm(frmla, data=dat)
pred = predict(mod, values, interval="predict") 

If this model is estimated weekly or monthly, we will eventually want to understand how well our model did in predicting actual sales from month to month. To perform this task, we must regularly maintain a spreadsheet or data structure (RDS object) with actual predicted sales figures for each time period. That data can be used to create line graphs that visualize both the actual versus predicted values.

Here is what the original spreadsheet looked like.

Screenshot from 2016-02-05 16:41:22

Transform that data into long format using whatever package you prefer.

mydat = melt(d1)

This will provide a data frame with three columns.

Screenshot from 2016-02-04 15:04:36

We can utilize the ggplot2 package to create visualizations.

ggplot(mydat, aes(Month, value, group=variable, colour=variable)) +
  geom_line(lwd=1.05) + geom_point(size=2.5) + 
  ggtitle("Sales (01/2010 to 05/2015)") +
  xlab("Date") + ylab("Sales") + ylim(0,30000) + xlab(" ") + ylab(" ") +  
  theme(legend.title=element_blank()) + xlab(" ") + 
  theme(axis.text.x=element_text(colour="black")) +
  theme(axis.text.y=element_text(colour="black")) +
  theme(legend.position=c(.4, .85))


Above is an example of what the final product could look like. Visualizing predicted against actual values is an important component of evaluating the quality of a model. Furthermore, having such visualization will be of value when interacting with business audiences and “selling” your analysis.

Applied Statistical Theory: Quantile Regression

This is part two of the ‘applied statistical theory’ series that will cover the bare essentials of various statistical techniques. As analysts, we need to know enough about what we’re doing to be dangerous and explain approaches to others. It’s not enough to say “I used X because the misclassification rate was low.”

Standard linear regression summarizes the average relationship between a set of predictors and the response variable. \beta_1 represents the change in the mean value of Y given a one unit change in X_1 . A single slope is used to describe the relationship. Therefore, linear regression only provides a partial view of the link between the response variable and predictors. This is often inadaquete when there is heterogenous variance between X and Y . In such cases, we need to examine how the relationship between X and Y changes depending on the value of Y . For example, the impact of education on income may be more pronounced for those at higher income levels than those at lower income levels. Likewise, the the affect of parental care on the mean infant birth weight can be compared to it’s effect on other quantiles of infant birth weight. Quantile regression solves for these problems by looking at changes in the different quantiles of the response. The parameter estimates for this technique represent the change in a specified quantile of the response variable produced by a one unit change in the predictor variable. One major benefit of quantile regression is that it makes no assumptions about the error distribution.



frmla <- mpg ~ .

mm = rq(frmla, data=mtcars, tau=u) # for a series of quantiles
mm = rq(frmla, data=mtcars, tau=0.50) # for the median

summ <- summary(mm, se = "boot")


Applied Statistical Theory: Belief Networks

Applied statistical theory is a new series that will cover the basic methodology and framework behind various statistical procedures. As analysts, we need to know enough about what we’re doing to be dangerous and explain approaches to others. It’s not enough to say “I used X because the misclassification rate was low.” At the same time, we don’t need to have doctoral level understanding of approach X. I’m hoping that these posts will provide a simple, succinct middle ground for understanding various statistical techniques.

Probabilistic grphical models represent the conditional dependencies between random variables through a graph structure. Nodes correspond to random variables and edges represent statistical dependencies between the variables. Two variables are said to be conditionally dependent if they have a direct impact on each others’ values. Therefore, a graph with directed edges from parent A_p and child B_c denotes a causal relationship. Two variables are conditionally independent if the link between those variables are conditional on another. For a graph with directed edges from A to B and from B to C , it would suggest that A and C are conditionally independent given variable B . Each node fits a probability distribution function that depends only on the value(s) of the variables with edges leading into the variable. For example, the probability distribution for variable C in the following graphic depends only on the value of variable B.


Let’s consider a graphical model with K = (k_1, k_2, ... , k_n) variables and a set of dependencies between the variables, A = (a_1, a_2, ... , a_n) . For each K and A , we denote a set of conditional probability distributions for each K given the parent variable. In the following directed acyclic graph, we see that P(A|B,C) = P(A|B) . This means that the probability of A is conditionally dependent only on B and the value of C does not explain the other random variables. For belief networks, inference involves computing the probability of each value of a node in a network.

There you go; the absolute basics. And below is a presentation on belief networks that I made last year.


Basic Forecasting

Forecasting refers to the process of using statistical procedures to predict future values of a time series based on historical trends. For businesses, being able gauge expected outcomes for a given time period is essential for managing marketing, planning, and finances. For example, an advertising agency may want to utilizes sales forecasts to identify which future months may require increased marketing expenditures. Companies may also use forecasts to identify which sales persons met their expected targets for a fiscal quarter.

There are a number of techniques that can be utilized to generate quantitative forecasts. Some methods are fairly simple while others are more robust and incorporate exogenous factors. Regardless of what is utilized, the first step should always be to visualize the data using a line graph. You want to consider how the metric changes over time, whether there is a distinct trend, or if there are distinct patterns that are noteworthy.

data <- structure(c(12, 20.5, 21, 15.5, 15.3, 23.5, 24.5, 21.3, 23.5,
                    28, 24, 15.5, 17.3, 25.3, 25, 36.5, 36.5, 29.6, 30.5, 28, 26,
                    21.5, 19.7, 19, 16, 20.7, 26.5, 30.6, 32.3, 29.5, 28.3, 31.3,
                    32.2, 26.4, 23.4, 16.4, 15, 16, 18, 27, 21, 49, 21, 22, 28, 36,
                    40, 3, 21, 29, 62, 65, 46, 44, 33, 62, 22, 12, 24, 3, 5, 14,
                    36, 40, 49, 7, 52, 65, 17, 5, 17, 1),
                  .Dim = c(36L, 2L), .Dimnames = list(NULL, c("Advertising", "Sales")),
                  .Tsp = c(2006, 2008.91666666667, 12), class = c("mts", "ts", "matrix"))
head(data); nrow(data)

There are several key concepts that we should be cognizant of when describing time series data. These characteristics will inform how we pre-process the data and select the appropriate modeling technique and parameters. Ultimately, the goal is to simplify the patterns in the historical data by removing known sources of variatiion and making the patterns more consistent across the entire data set. Simpler patterns will generally lead to more accurate forecasts.

Trend: A trend exists when there is a long-term increase or decrease in the data.

Seasonality: A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week.

Autocorrelation: Refers to the pheneomena whereby values of Y at time t are impacted by previous values of Y at t-i. To find the proper lag structure and the nature of auto correlated values in your data, use the autocorrelation function plot.

Stationary: A time series is said to be stationary if there is no systematic trend, no systematic change in variance, and if strictly periodic variations or seasonality do not exist

Quantitative forecasting techniques are usually based on regression analysis or time series techniques. Regression approaches examine the relationship between the forecasted variable and other explanatory variables using cross-sectional data. Time series models use hitorical data that’s been collected at regular intervals over time for the target variablle to forecast its future values. There isn’t time to cover the theory behind each of these approaches in this post, so I’ve chosen to cover high level concepts and provide code for performing time series forecasting in R. I strongly suggest understandig the statistical theory behind a technique before running the code.

First, we can use the ma function in the forecast package to perform forecasting using the moving average method. This technique estimates future values at time t by averaging values of the time series within k periods of t. When the time series is stationary, the moving average can be very effective as the observations are nearby across time.

moving_average = forecast(ma(data[1:31,1], order=3), h=5)
moving_average_accuracy = accuracy(moving_average, data[32:36])
moving_average; moving_average_accuracy
plot(moving_average, ylim=c(0,60))

The simple exponential smooting is also good when the data has no trend or seasonal patterns. Unlike a moving average, this technique gives greater weight to the most recent observations of the time series.

exp <- ses(data[1:31,1], 5, initial="simple")
exp_accuracy = accuracy(exp, data[32:36])
exp; exp_accuracy
plot(exp, ylim=c(0,60))

In the forecast package, there is an automatic forecasting function that will run through possible models and select the most appropriate model give the data. This could be an auto regressive model of the first oder (AR(1)), an ARIMA model with the right values for p, d, and q, or something else that is more appropriate.

train = data[1:31,1]
test = data[32:36,1]
arma_fit <- auto.arima(train)
arma_forecast <- forecast(arma_fit, h = 5)
arma_fit_accuracy <- accuracy(arma_forecast, test)
arma_fit; arma_forecast; arma_fit_accuracy
plot(arma_forecast, ylim=c(0,60))

There you go, a basic non-technical introduction to forecasting. This should get one familiar with the key concepts and how to perform some basic forecasting in R

Examining Website Pathing Data Using Markov Chains

A markov model can be used to examine a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Let’s define a stochastic process (X_{n}, n = 0, 1, 2, ...)  that takes on a finite number of possible values which are nonnegative integers. Each state, X_{n} , represents it’s value in time period n . If the probability of being in X_{n+1} is dependent on X_{n} , it’s refered to as the first-order Markov property. We are interested in estimating P_{ij} , which is the fixed probability that X at time i  will be followed by state j . These n step transition probabilities are calculated through the Chapman-Kolmogorov equations, which relates the joint probability distributions of different sets of coordinates on a stochastic process. Markov chains are generally represented as a state diagram or transition matrix where every row of the matrix, P , is a conditional probability mass function.

Let’s consider an example using website pathing data from an ecommerce website. The set of possible outcomes, or sample space, is defined below. For this example, S_{i} takes the values of each page on the site. This likely violates the Markov property, given that pages on an ecommerce website aren’t generally dependent on the previous page visited, but let’s proceed anyways.

S = (Home, About, Shoes, Denim, Cart)

Given a series of clickstreams, a markov chain can be fit in order to predict the next page visited. Below are the state diagram and transition matrix for this data. It suggests that from the home state, there is a 83% probability that a visit to the shoes state will be next.

Markov_Network Screenshot from 2015-09-08 20:24:28

To be honest, I’m not certain whether is the best technique to model how consumers utilize a website. Markov chains seem to be promising, but I’m a little uneasy about whether our assumptions are met in such scenarios. If you have ideas for such problems, please comment below.