Semiparametric Regression in R

A. INTRODUCTION

When building statistical models, the goal is to define a compact and parsimonious mathematical representation of some data generating process. Many of these techniques require that one make assumptions about the data or how the analysis is specified. For example, Auto Regressive Integrated Moving Average (ARIMA) models require that the time series is weakly stationary or can be made so. Furthermore, ARIMA assumes that the data has no deterministic time trends, the variance of the error term is constant, and so forth. Assumptions are generally a good thing, but there are definitely situations in which one wants to free themselves from such “constraints.”

In the context of evaluating relationships between one or more target variables and a set of explanatory variables, semiparametric regression is one such technique that provides the user with some flexibility in modeling complex data without maintaining stringent assumptions. With semiparametric regression, the goal is to develop a properly specified model that integrates the simplicity of parametric estimation with the flexibility provided by nonparametric splines.

B. LINEAR REGRESSION

A core requirement of both linear and generalized linear regression is that the user must define a set of explanatory variables and functional form that describes the relationship between the predictors and response variable. So when the goal is to build a linear regression model that explains which social and political characteristics within a country account for different levels of macroeconomic performance, we must first specify a number of variables that may explain our target variable. Furthermore, we must specify the functional form of the relationship. In both linear and generalized linear regression, this refers to how we characterize and “encode” each of the explanatory variables. So if we include the log of yearly precipitation rate as
a predictor such that $Y = B0 + B1*log(precipitation)$, the model would nonlinear in the variables (log(precipitation)) and linear in the parameters ($B1$). Linearity in the variables is not an assumption of linear or generalized linear regression. However, ignoring it could result in a mispecified model with incorrect estimates.

C. REGRESSION SPLINES

One way to ensure that the model is properly specified is through the use of nonparametric splines. Instead of assuming that we know the functional form for a regression model, the user would essentially estimate the appropriate functional form from the data. Compared to a traditional parametric model whereby we estimate one global value that represents the relationship between $X$ and $Y$, we would instead have a series of local estimates that characterize the relationship across different values of $X$. There are many different variations of splines that can utilized, including b-splines, natural splines, cubic splines, and so forth. However, numerous studies have shown that restricted cubic splines with three to five knots tend to perform best across a wide array of domains.

In our previous model where the goal was to identify how macroeconomic performance is affected by weather/precipitation, the linear regression model would look like $economy = B0 + B1*precipitation$. After incorporating a spline for the explanatory variable, we can now characterize the functional form of the model as $economy = B0 + f(precipitation)$ whereby $f$ characterized the relationship between $X$ and $Y$ conditional on $X$. An important thing to note when developing semiparametric models is that although it allow users flexibility by incorporating a smoothing term, the estimation process still requires some of the core assumptions in OLS and GLM, namely that the data keeps the requirements of independence, normally distributed errors, and constant error variance.

Before we look at how to implement semiparametric regression, it is worth noting that these types of models are often referred to as either general or generalized additive models. Furthermore, semiparametric variations of other regression models are available such as semiparametric quantile regression and even semiparametric nonlinear regression.

D. R EXAMPLE

For this post, I’m going to stick with the gam function in the mgcv package because it is usually a good starting point. Taking the previous use case, let’s create some data and construct a linear regression model that regresses economy as a function of weather. Given that the data is from a random distribution, we obviously find a weak relationship between the two variables.

mydat = data.frame(economy = rnorm(100),
weather = rnorm(100))
mod = lm(economy ~ weather, data=mydat)
summary(mod)
plot(economy ~ weather, data = mydat, cex = .8)
abline(mod, col="red")


Let us say that we know based on experience, theory, or other sources that linearity in the variables is a poor specification in this domain case. One thing that could be done is to incorporate a spline to create local estimates for our explanatory variable within the linear regression.

mod <- lm(economy ~ bs(weather, knots = 3), data=mydat)
summary(mod)
plot(economy ~ weather, data = mydat, cex = .8)
lines(mydat$weather, predict(mod), lwd=2, col='red')  The same model could also be examined using gam. Take note of the edf value, which represents how much the explanatory variable is smoothed. library(mgcv) mod <- gam(economy ~ s(weather, bs = "cr", k = 3), data = mydat, family = gaussian) summary(mod) plot(mod)  This example data is pretty silly, but this should give you some things to investigate in terms of learning how semiparametric models can help improve the inferences that you are examining. Have questions, comments, interesting work, etc? Feel free to contact me at mathewanalytics@gmail.com Up next will be a series of posts on causal inference, a topic that I’ve been trying to get a better understanding of over the past month. Advertisements Packages for Getting Started with Time Series Analysis in R A. Motivation During the recent RStudio Conference, an attendee asked the panel about the lack of support provided by the tidyverse in relation to time series data. As someone who has spent the majority of their career on time series problems, this was somewhat surprising because R already has a great suite of tools for visualizing, manipulating, and modeling time series data. I can understand the desire for a ‘tidyverse approved’ tool for time series analysis, but it seemed like perhaps the issue was a lack of familiarity with the available toolage. Therefore, I wanted to put together a list of the packages and tools that I use most frequently in my work. For those unfamiliar with time series analysis, this could a good place to start investigating R’s current capabilities. B. Background Time series data refers to a sequence of measurements that are made over time at regular or irregular intervals with each observation being a single dimension. An example of low dimensional time series is daily wind temperature from 01/01/2001 through 12/31/2005. High dimensional time series is characterized by a larger number of observations, so an example could be the daily wind temperature from 01/01/1980 through 12/31/2010. In either case, the goal of the analysis could lead one to perform regression, clustering, forecasting, or even classification. C. Data For Examples To run the code in this post, you will need to access the following data through the unix terminal. It will download a csv file from the City of Chicago website that contains information on reported incidents of crime that occurred in the city of Chicago from 2001 to present. $ wget –no-check-certificate –progress=dot https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD > chicago_crime_data.csv


Import the data into R and get the aggregate number of reported incidents of theft by day.

library(data.table)
dat = fread("chicago_crime_data.csv")
colnames(dat) = gsub(" ", "_", tolower(colnames(dat)))
dat[, date2 := as.Date(date, format="%m/%d/%Y")]
mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
mydat[1:6]


D. Data Representation

The first set of packages that one should be aware of is related to data storage. One could use data frames, tibbles, or data tables, but there are already a number of data structures that are optimized for representing time series data. The fundamental time series object is “ts”. However, the “ts” class has a number of limitations, and so it is usually best to work with the extensible time series (“xts”) object.

D1. xts

The xts package offers a number of great tools for data manipulation and aggregation. At it’s core is the xts object, which is essentially a matrix object that can represent time series data at different time increments. Xts is a subclass of the zoo object, and that provides it with a lot of functionality.

Here are some functions in xts that are worth investigating:

library(xts)
# create a xts object
mydat2 = as.xts(mydat)
mydat2
plot.xts(mydat2)
# filter by date
mydat2["2015"]  ## 2015
mydat2["201501"]  ## Jan 2015
mydat2["20150101/20150105"]  ## Jan 01 to Jan 05 2015
# replace all valuues from Aug 25 onwards with 0
mydat2["20170825/"] <- 0
mydat2["20170821/"]
# get the last one month
last(mydat2, "1 month")
# get stats by time frame
apply.monthly(mydat2, sum)
apply.monthly(mydat2, quantile)
period.apply(mydat2, endpoints(mydat2,on='months'), sum)
period.apply(mydat2, endpoints(mydat2,on='months'), quantile)


E. Dates

R has a maddening array of date and time classes. Be it yearmon, POSIXct, POSIXlt, chron, or something else, each has specific strengths and weaknesses. In general, I find myself using the lubridate package as it simplifies many of the complexities associated with date-times in R.

E1. lubridate

The lubridate package provides a lot of functionality for parsing and formatting dates, comparing different times, extracting the components of a date-time, and so forth.

library(lubridate)
ymd("2010-01-01")
mdy("01-01-2010")
ymd_h("2010-01-01 10")
ymd_hm("2010-01-01 10:02")
ymd_hms("2010-01-01 10:02:30")


F. Time Series Regression

Distributed lag models (error correction models) are a core component of doing time series analysis. They are many instances where we want to regress an outcome variable at the current time against values of various regressors at current and previous times. dynlm and ardl (wrapper for dynlm) are solid for this type of analysis. Another common task when working with distributed lag models involves using dynamic simulations to understand estimated outcomes in different scenarios. dynsim provides a coherent solution for simulation and visualization of those estimated values of the target variable.

F1. dynlm / ardl

Here is a brief example of how dynlm can be utilized. In what follows, I have created a new variable and lagged it by one day. So the model attempts to regress incidents or reported theft based on the weather from the previous day.

library(dynlm)
mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
mydat[, weather := sample(c(20:90), dim(mydat), replace=TRUE)]
mydat[, weather_lag := shift(weather, 1, type = 'lag')]
mod = dynlm(N ~ L(weather), data = mydat)
summary(mod)


F2. dynsim

Here is a brief example of how dynlm can be utilized. In what follows, I have created a new variable and lagged it by one day. I’ve used the dynsim to product two dynamic simulations and plotted them.

library(dynsim)
mydat3 = mydat[1:10000]
mod = lm(N ~ weather_lag, data = mydat3)
Scen1 <- data.frame(weather_lag = min(mydat2$weather_lag, na.rm=T)) Scen2 <- data.frame(weather_lag = max(mydat2$weather_lag, na.rm=T))
ScenComb <- list(Scen1, Scen2)
Sim1 <- dynsim(obj = mod, ldv = 'weather_lag', scen = ScenComb, n = 20)
dynsimGG(Sim1)


G. Forecasting

G1. forecast

The forecast package is the most used package in R for time series forecasting. It contains functions for performing decomposition and forecasting with exponential smoothing, arima, moving average models, and so forth. For aggregated data that is fairly low dimensional, one of the techniques present in this package should provide an adequate forecasting model given that the assumptions hold.

Here is a quick example of how to use the auto.arima function in R. In general, automatic forecasting tools should be used with caution, but it is a good place to explore time series data.

library(forecast)
mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
fit = auto.arima(mydat[,.(N)])
pred = forecast(fit, 200)
plot(pred)


G2. smooth

The smooth package provides functions to perform even more variations of exponential smoothing, moving average models, and various seasonal arima techniques. The smooth and forecast package are usually more than adequate for most forecasting problems that pertain to low dimensional data.

Here is a basic example that uses the automatic complex exponential smoothing function:

library(smooth)
mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
fit = auto.ces(mydat[,N])
pred = forecast(fit, 200)
plot(pred)


So for those of you getting introduced to the R programming language, these are a list extremely useful packages for time series analysis that you will want to get some exposure to.

Have questions, comments, interesting consulting projects, or work that needs done, feel free to contact me at mathewanalytics@gmail.com

Examining Website Pathing Data Using Markov Chains

A markov model can be used to examine a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Let’s define a stochastic process $(X_{n}, n = 0, 1, 2, ...)$ that takes on a finite number of possible values which are nonnegative integers. Each state, $X_{n}$, represents it’s value in time period $n$. If the probability of being in $X_{n+1}$ is dependent on $X_{n}$, it’s refered to as the first-order Markov property. We are interested in estimating $P_{ij}$, which is the fixed probability that $X$ at time $i$ will be followed by state $j$. These $n$ step transition probabilities are calculated through the Chapman-Kolmogorov equations, which relates the joint probability distributions of different sets of coordinates on a stochastic process. Markov chains are generally represented as a state diagram or transition matrix where every row of the matrix, $P$, is a conditional probability mass function.

Let’s consider an example using website pathing data from an ecommerce website. The set of possible outcomes, or sample space, is defined below. For this example, $S_{i}$ takes the values of each page on the site. This likely violates the Markov property, given that pages on an ecommerce website aren’t generally dependent on the previous page visited, but let’s proceed anyways.

$S = (Home, About, Shoes, Denim, Cart)$

Given a series of clickstreams, a markov chain can be fit in order to predict the next page visited. Below are the state diagram and transition matrix for this data. It suggests that from the home state, there is a 83% probability that a visit to the shoes state will be next.

To be honest, I’m not certain whether is the best technique to model how consumers utilize a website. Markov chains seem to be promising, but I’m a little uneasy about whether our assumptions are met in such scenarios. If you have ideas for such problems, please comment below.

Statistics Refresher

Let’s face it, a good statistics refresher is always worthwhile. There are times we all forget basic concepts and calculations. Therefore, I put together a document that could act as a statistics refresher and thought that I’d share it with the world. This is part one of a two part document that is still being completed. This refresher is based on Principles of Statistics by Balmer and Statistics in Plain English by Brightman.

The Two Concepts of Probability

Statistical Probability

• Statistical probability pertains to the relative frequency with which an event occurs in the long run.
• Example:
Let’s say we flip a coin twice. What is the probability of getting two heads?
If we flip a coin twice, there are four possible outcomes, $[(H,H), (H,T), (T,H), (T,T)]$.
Therefore, the probability of flipping two heads is $\frac{(H,H)}{N} = \frac{1}{2}*\frac{1}{2} = \frac{1}{4}$

Inductive Probability

• Inductive probability pertains to the degree of belief which is reasonable to place on a proposition given evidence.
• Example:
I’m $95\%$ certain that the answer to $1 + 1$ is between $1.5$ and $2.5$.

The Two Laws of Probability

Law of Addition

• If $A$ and $B$ are mutually exclusive events, the probability that either $A$ or $B$ will occur is equal to the sum of their separate probabilities.

$\displaystyle P(A \space or \space B) = P(A) + P(B)$

Law of Multiplication

• If $A$ and $B$ are two events, the probability that both $A$ and $B$ will occur is equal to the probability that $A$ will occur multiplied by the conditional probability that $B$ will occur given that $A$ has occured.

$P(A \space and \space B) = P(A) * P(B|A)$

Conditional Probability

• The probability of $B$ given $A$, or $P(B|A)$, is the probability that $B$ will occur if we consider only those occasionson which $A$ also occurs. This is defined as $\frac{n(A \space and \space B)}{n(A)}$.

Random Variables and Probability Distributions

Discrete Variables

• Variables which arise from counting and can only take integral values $(0, 1, 2, \ldots)$.
• A frequency distribution represents the amount of occurences for all the possible values of a variable. This can be represented in a table or graphically as a probability distribution.
• Associated with any discrete random variable, $X$, is a corresponding probability function which tells us the probability with which $X$ takes any value. The particular value that $X$ can take is characterized by $x$. Based on $x$, the probability that $X$ will take can be calculated. This measure is the probability function and is defined by $P(x)$.
• The cumulative probability function specifies the probability that $X$ is less than or equal to some particular value, $x$. This is denoted by $F(x)$. The cumulative probability function can be calculated by summing the probabilities of all values less than or equal to $x$.

$F(x) = Prob[X \leq x]$

$F(x) = P(0) + P(1) + \ldots + P(x) = \sum_{u \leq x} p(u)$

Continuous Variables

• Variables which arise from measuring and can take any value within a given range.
• Continuous variables are best graphically represented by a histogram, where the area of each rectangle represents the proportion of observations falling in that interval.
• The probability density function, $f(x)$, refers to the smooth continuous curve that is used to describe the relative likelihood a random variable to take on a given value. $f(x)$ can also be used to show the probability that the random variable will lie between $x_1$ and $x_2$.
• A continuous probability distribution can also be represented by its cumulative probability function, $f(x)$. which specified the probability that $X$ is less than or equal to $x$.
• A continuous random variable is said to be uniformly distributed between $0$ and $1$ if it is equally likely to lie anywhere in this interval but cannot lie outside it.

Multivariate Distributions

• The joint frequency distribution of two random variables is called a bivariate distribution. $P(x,y)$ denotes the probability that simultaneously $X$ will be $x$ and $Y$ will be $y$. This is expressed through a bivariate distribution table.

$P(x,y) = Prob[X == x \space and \space Y == y]$

• In a bivariate distribution table, the right hand margin sums the probabilities in different rows. It expresses the overall probability distribution of $x$, regardless of the value of $y$.

$p(x) = Prob[X == x] = \sum_{y} p(x,y)$

• In a bivariate distribution table, the bottom margin sums the probabilities in different columns. It expresses the overall probability distribution of $y$, regardless of the value of $x$.

$p(y) = Prob[Y == y] = \sum_{x} p(x,y)$

Properties of Distributions

Measures of Central Tendancy

• The mean is measured by taking the sum divided by the number of observations.

$\bar{x} = \frac{x_1 + x_2 + \ldots + x_n}{n} = \sum_{i=1}^n \frac{x_i}{n}$

• The median is the middle observation in a series of numbers. If the number of observations are even, then the two middle observations would be divided by two.
• The mode refers to the most frequent observation.
• The main question of interest is whether the sample mean, median, or mode provides the most accurate estimate of central tendancy within the population.

Measures of Dispersion

• The standard deviation of a set of observations is the square root of the average of the squared deviations from the mean. The squared deviations from the mean is called the variance.

The Shape of Distributions

• Unimodal distributions have only one peak while multimodal distributions have several peaks.
• An observation that is skewed to the right contains a few large values which results in a long tail towards the right hand side of the chart.
• An observation that is skewed to the left contains a few small values which results in a long tail towards the left hand side of the chart.
• The kurtosis of a distribution refers to the degree of peakedness of a distribution.

The Binomial, Poisson, and Exponential Distributions

Binomial Distribution

• Think of a repeated process with two possible outcome, failure ($F$) and success ($S$). After repeating the experiment $n$ times, we will have a sequence of outcomes that include both failures and successes, $SFFFSF$. The primary metric of interest is the total number of successes.
• What is the probability of obtaining $x$ successes and $n-x$ failures in $n$ repetitions of the experiment?

Poisson Distribution

• The poisson distribution is the limiting form of the binomial distribution when there are a large number of trials but only a small probability of success at each of them.

Exponential Distribution

• A continuous, positive random variable is said to follow an exponential distribution if its probability density function decreases as the values of $x$ go from $0$ to $\infty$. The probability declines from its highest levels at the initial values of $x$.

The Normal Distribution

Properties of the Normal Distribution

• The real reason for the importance of the normal distribution lies in the central limit theorem, which states that the sum of a large number of independent random variables will be approximately normally distributed regardless of their individual distributions.
• A normal distribution is defined by its mean, $\mu$, and standard deviation, $\sigma$. A change in the mean shifts the distribution along the x-axis. A change in the standard deviation flattens it or compresses it while leaving its centre in the same position. The totral area under the curve is one and the mean is at the middle and divides the area into halves.
• One standard deviation above and below the mean of a normal distribution will include 68% of the observations for that variable. For two standard deviates, that value will be 95%, and for three standard deviations, that value will be 99%.

There you have it, a quick review of basic concepts in statistics and probability. Please leave comments or suggestions below. If you’re looking to hire a marketing scientist, please contact me at mathewanalytics@gmail.com

Batch Forecasting in R

Given a data frame with multiple columns which contain time series data, let’s say that we are interested in executing an automatic forecasting algorithm on a number of columns. Furthermore, we want to train the model on a particular number of observations and assess how well they forecast future values. Based upon those testing procedures, we will estimate the full model. This is a fairly simple undertaking, but let’s walk through this task. My preference for such procedures is to loop through each column and append the results into a nested list.

First, let’s create some data.

ddat <- data.frame(date = c(seq(as.Date("2010/01/01"), as.Date("2010/03/02"), by=1)),
value1 = abs(round(rnorm(61), 2)),
value2 = abs(round(rnorm(61), 2)),
value3 = abs(round(rnorm(61), 2)))
head(ddat)
tail(ddat)

We want to forecast future values of the three columns. Because we want to save the results of these models into a list, lets begin by creating a list that contains the same number of elements as our data frame.

lst.names <- c(colnames(data))
lst <- vector("list", length(lst.names))
names(lst) <- lst.names
lst

I’ve gone ahead and written a user defined function that handles the batch forecasting process. It takes two arguments, a data frame and default argument which specifies the number of observations that will be used in the training set. The model estimates, forecasts, and diagnostic measures will be saved as a nested list and categorized under the appropriate variable name.

batch <- function(data, n_train=55){

lst.names <- c(colnames(data))
lst <- vector("list", length(lst.names))
names(lst) <- lst.names

for( i in 2:ncol(data) ){

lst[[1]][["train_dates"]] <- data[1:(n_train),1]
lst[[1]][["test_dates"]] <- data[(n_train+1):nrow(data),1]

est <- auto.arima(data[1:n_train,i])
fcas <- forecast(est, h=6)$mean acc <- accuracy(fcas, data[(n_train+1):nrow(data),i]) fcas_upd <- data.frame(date=data[(n_train+1):nrow(data),1], forecast=fcas, actual=data[(n_train+1):nrow(data),i]) lst[[i]][["estimates"]] <- est lst[[i]][["forecast"]] <- fcas lst[[i]][["forecast_f"]] <- fcas_upd lst[[i]][["accuracy"]] <- acc cond1 = diff(range(fcas[1], fcas[length(fcas)])) == 0 cond2 = acc[,3] >= 0.025 if(cond1|cond2){ mfcas = forecast(ma(data[,i], order=3), h=5) lst[[i]][["moving_average"]] <- mfcas } else { est2 <- auto.arima(data[,i]) fcas2 <- forecast(est, h=5)$mean

lst[[i]][["estimates_full"]] <- est2
lst[[i]][["forecast_full"]] <- fcas2

}
}
return(lst)
}

batch(ddat)

This isn’t the prettiest code, but it gets the job done. Note that lst was populated within a function and won’t be available in the global environment. Instead, I chose to simply print out the contents of the list after the function is evaluated.

Statistical Reading Rainbow

For those of us who received statistical training outside of statistics departments, it often emphasized procedures over principles. This entailed that we learned about various statistical techniques and how to perform analysis in a particular statistical software, but glossed over the mechanisms and mathematical statistics underlying these practices. While that training methodology (hereby referred to as the ‘heuristic method’) has value, it has many drawbacks when the ultimate goal is to perform sound statistical analysis that is valid and thorough. Even in my current role as a data scientist at a technology company in the San Francisco Bay Area, I have had to go back and understand various procedures and metrics instead of just “doing data analysis”.

Given this realization, I have dedicated hours of time outside of work over the last couple years to “re-training” myself on many of the important concepts in both descriptive and inferential statistics. This post will give brief mention to the books that have been most useful is helping me develop a fuller understanding of the statistical sciences. These books have also helped me fill in the gaps and deficiencies from my statistical training in university and graduate school. Furthermore, these are the texts that I often revisit when I need a reference on some statistical topic of interest. This is at a minimum a a six year journey, so I have a long way to go until I am able to stand solidly in my understanding of statistics. While I am sacrificing a lot of my free time to this undertaking, it will certainly improve my knowledge and help prepare me for graduate school (PhD) in biostatistics, which I hope to attend in around five to six years.[1]

Please note that I am not taking issue with the ‘heuristic method’ of statistical training. It certainly has its place and provides students with the immediate knowledge required to satisfactorily prepare for work in private industry. In fact, I prefer the ‘heuristic method’ and still rely on straight forward rules in my day to day work as that ensures that best practices are followed and satisfactory analysis is performed. Furthermore, I certainly believe that it is superior to the hack-ey nature of data mining and data science education, but that is a different story.

Fundamentals:

Statistics in Plain English – Urdan
Clear, concise, and covers all the fundamental items that one would need to know. Everything from descriptive statistics to linear regression are covered, with many good examples. Even if you never use ANOVA or factor analysis, this is a good book to review and one that I strongly recommend to people who are interested in data science.

Principles of Statistics – Balmer
This is a classic text that offers a good treatment of probability theory, distributions, and statistical inference. The text contains a bit more math than ‘Statistics in Plain English’, so I think it should be read after completing the previous book.

Fundamentals of Modern Statistical Methods – Wilcox
This book reviews ‘traditional’ parametric statistics and provides a good overview of robust statistical methods. There is a fair amount on the historical evolution of various techniques, and I found that a bit unnecessary. But overall, this is still a solid introductory text to learn about statistical inference using robust techniques.

Regression Analysis:

Mostly Harmless Econometrics – Angrist
While I don’t regularly work with instrumental variables, generalized methods of moments, or regression discontinuity, this book is a great high level introduction to econometrics. The chapters on regression theory and quantile regression are phenomenal.

Regression Modeling Strategies – Harrell
This is my most referenced book and the one that really helped in my overall development as an applied statistician. All the important topics are covered, from imputation, regression splines, and so forth. This book includes R code for performing analysis using the RMS package. I end up citing this book quite a lot. For example, in a recent work email, I mentioned that Harrell “also says on page 61 that “narrowly distributed predictor variables will require higher sample sizes.”” Essential reading in my opinion.

Data Analysis Using Regression and Multilevel/Hierarchical Models – Gelman and Hill
The first half of this book cover statistical inference using single level models and the second half is dedicated to multilevel methods. Given that I am rarely work with panel data, I use the first half of this book a reference for things that I may need a quick refresher on. It is very accessible and has plenty of examples with R code.

Semiparametric Regression for the Social Sciences – Keele
This is one of my favorite statistical books. Well written and easy to comprehend, but still rigorous. Covers local regression, splines, and generalized additive models. There is also a solid chapter on the use of bootstrapping with semiparametric and nonparametric models.

Statistical Learning:

Statistical Learning from a Regression Perspective – Berk
As a skeptic who is wary of every hype machine, I really enjoyed Berks preface in which he discusses the “dizzying array of new statistical procedures” that have been introduced over the past several decades with “the hype of a big-budget movie.” I got this text for its treatment of topics such as boosting, bagging, random forest, and support vector machines. I will probably need to reread this book several more times before I fully comprehend everything.

Time Series:

Time Series: a Biostatistical Introduction – Diggle
The lack of quality time series books is really infuriating. Don’t get me wrong, there are some good texts on forecasting, such as the free online book from Hyndaman. However, I’ve yet to find a really good intermediate level treatment of time series analysis besides this one. Contains good coverage of repeated measurements, ARIMA modeling, and forecasting.

Bayesian Methods

Statistical Rethinking – McElreath
While I was introduced to robust techniques and nonparametric statistics in graduate school, there was nothing on Bayesian methods. Due to a fear of the topic, I avoided learning about it until this past year. This book by McElreath has been great as it is very accessible and provides code for understanding various principles. Over the next year, I am hoping to dive deeper into Bayesian techniques and this was a good first step.

Weekly R-Tips: Visualizing Predictions

Lets say that we estimated a linear regression model on time series data with lagged predictors. The goal is to estimate sales as a function of inventory, search volume, and media spend from two months ago. After using the lm function to perform linear regression, we predict sales using values from two month ago.


frmla <- sales ~ inventory + search_volume + media_spend
mod <- lm(frmla, data=dat)
pred = predict(mod, values, interval="predict")



If this model is estimated weekly or monthly, we will eventually want to understand how well our model did in predicting actual sales from month to month. To perform this task, we must regularly maintain a spreadsheet or data structure (RDS object) with actual predicted sales figures for each time period. That data can be used to create line graphs that visualize both the actual versus predicted values.

Here is what the original spreadsheet looked like.

Transform that data into long format using whatever package you prefer.


library(reshape)
mydat = melt(d1)



This will provide a data frame with three columns.

We can utilize the ggplot2 package to create visualizations.


ggplot(mydat, aes(Month, value, group=variable, colour=variable)) +
geom_line(lwd=1.05) + geom_point(size=2.5) +
ggtitle("Sales (01/2010 to 05/2015)") +
xlab("Date") + ylab("Sales") + ylim(0,30000) + xlab(" ") + ylab(" ") +
theme(legend.title=element_blank()) + xlab(" ") +
theme(axis.text.x=element_text(colour="black")) +
theme(axis.text.y=element_text(colour="black")) +
theme(legend.position=c(.4, .85))



Above is an example of what the final product could look like. Visualizing predicted against actual values is an important component of evaluating the quality of a model. Furthermore, having such visualization will be of value when interacting with business audiences and “selling” your analysis.