R Programming Notes – Part 2

In an older post, I discussed a number of functions that are useful for programming in R. I wanted to expand on that topic by covering other functions, packages, and tools. Over the past year, I have been working as an R programmer, and these notes cover some of the things I have learned that have become fundamental in my work.

IS TRUE and IS FALSE

isTRUE is a logical operator that can be very useful for checking whether a condition or variable has been set to true. Let’s say that we are writing a script whereby we will run a generalized linear regression when the parameter run_mod is set to true. The conditional portion of the script can be written as either if(isTRUE(run_mod)) or if(run_mod). I am partial to isTRUE, but this is entirely a matter of personal preference. Users should also be aware of the isFALSE function, which is part of the BBmisc package.


run_mod = TRUE 

if(isTRUE(run_mod)){
	tryCatch(
         GLM_Model(full_data=full.df, train_data=train.df, 
                      test_data=test.df), 
    error = function(e) {
         print("error occured")
         print(e)
    })
}

if(BBmisc::isFALSE(check_one) & BBmisc::isFALSE(check_two)){
	data_output.tmp$score = 0.8
}

INVISIBLE

The invisible function can be used to return an object that is not printed out, which can be useful in a number of circumstances. For example, it’s useful when you have helper functions that will be utilized within other functions to do calculations. In those cases, it’s often not desirable to print those results. I generally use invisible when I’m checking function arguments. For example, consider a function that takes two arguments and you need to check whether the input is a vector.


if(!check_input(response, type='character', length=1)) {
    stop('something is wrong')
}

The check_input function is something I created and has a few lines which contain invisible. The idea is for check_input to return true or false based on the inputs so that execution will stop when needed.


if(is.null(response) || length(response) == 0) {
    return(FALSE)
} else if (!is.null(response)) {
    return(invisible(TRUE))
}
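
For reference, here is a minimal sketch of what a check_input helper along those lines might look like. The body below is my own assumption; only the return logic shown above comes from the actual function.


check_input <- function(input, type='character', length=1) {
    # Hypothetical implementation of a check_input-style helper
    # return FALSE when the argument is NULL or empty
    if(is.null(input) || length(input) == 0) {
        return(FALSE)
    }
    # return FALSE when the class or length does not match expectations
    # (the length parameter shadows base::length, but R still finds the function)
    if(!inherits(input, type) || length(input) != length) {
        return(FALSE)
    }
    # return TRUE invisibly so nothing is printed when the check passes
    return(invisible(TRUE))
}

check_input("model_a", type='character', length=1)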

DEBUG

When I’m putting together new classes or have multiple functions that interact with one another, I ensure that the code includes a comprehensive debugging process. This means that I’m checking my code at various stages so that I can identify when issues arise. Consider that I’m putting together a function that will go through a number of columns in a data frame, summarize those variables, and save the results as a nested list. To put together code without issues, I ensure that the function takes a debug argument and that the relevant checks run when it’s set to true. In the code below, it will print out values at different stages of the code. Furthermore, the final lines of the code will check the resulting data structure.


DSummary_Variable <- function(data_obj, var, debug=TRUE){
             ......
}

if(debug) message('|==========>>>  Processing of the variable. \n')  

if(debug){ 
  if(!missing(var_summary)){ 
      message('|==========>>>  var_summary has been created and 
              has a length of ', length(var_summary), ' and the 
              nested list has a length of ', 
              length(var_summary[['var_properties']]), ' \n')  
  } else {
      stop("var_summary is missing. Please investigate")
  }
}

If you have multiple functions that interact with one another, it’s a good idea to preface the printed message with the name of the function.


add_func <- function(a,b) a + b

mult_func <- function(a,b) a * b

main_func <- function(data, cola, colb, debug=TRUE){

	if(debug){
		message("mult_func: checking inputs to be used")
	}

	mult_func(data[,cola], data[,colb])

	if(debug){
		message("add_func: checking inputs to be used")
	}

	add_func(data[,cola], data[,colb])
}

Stay tuned for part three, where I’ll talk about the testthat and assertive packages.

R Programming Notes

I’ve been on a note-taking binge recently. This post covers a variety of topics related to programming in R. The contents were gathered from many sources and structured in such a way that they provide a useful reference guide for understanding a number of R functions.

DO.CALL

The do.call function executes a function call on a list of arguments.

do.call("R_Function", "List_of_Arguments")

This is equivalent to calling the function directly on the elements of that list.

R_Function( "List_of_Arguments" ){
  ...
}

Consider the following list with four elements. We can use this function to find the total sum across all list elements or bind them together row-wise.

x1 <- c(1,2,5)
x2 <- c(1,3,6)
x3 <- c(1,4,7)
x4 <- c(1,5,8)
 
do.call(sum, list(x1,x2,x3,x4))  # sum all list elements
do.call(rbind, list(x1,x2,x3,x4))  # rbind the list elements

Let’s consider a scenario where we have a small data frame and want to run a linear model on different combinations of attributes. One solution to this problem is to create the formula object as a list within a function, and then utilize do.call to run a linear model on the desired attribute.

dat <- data.frame(x1 = rnorm(100, m=50), x2 = rnorm(100, m=50), x3 = rnorm(100, m=50), y = rnorm(100))
 
new_mod <- function(form){
  lstt = list(formula=as.formula(paste("y ~ ", form, sep="")), data=as.name("dat"))
  summary(do.call("lm", lstt))
}
 
new_mod("x1")
new_mod("x2")
new_mod("x3")

EVAL

The eval function evaluates an expression, an object that represents an action that can be performed in R. An expression is different from an operation; an operation refers to the actual execution of an expression. In the following example, we assign a value to a variable and perform an operation using that variable.

x <- 4
y <- x * 10
y

The expression and quote functions take an expression as an argument and return it without evaluation.

ee = expression(y~x1)
ee
 
z <- quote(y <- x * 10)
z
 
is.expression(ee)
is.call(z)

In the above example, z is referred to as a call object. Call objects are usually created with the call function. For example, the following code pieces together commands into a call object and evaluates them.

mycall <- call("lm", quote(y ~ x1), data=quote(dat))
mod <- eval(mycall)
summary(mod)

SUBSTITUTE

Another common procedure is to replace certain variables within an expression. This can be achieved with the substitute function. The code below replaces x1 and x2 in the expression with the values from the list.

replace <- list(x1 = as.name("alphabet"), x2 = as.name("zoology"))
substitute(expression(x1 + x2 + log(x1) + x3), replace)
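
substitute is also handy inside a function for capturing the unevaluated expression that a caller supplied. The helper below is purely illustrative:


describe_arg <- function(x){
    expr <- substitute(x)            # capture the unevaluated argument
    message('argument supplied: ', deparse(expr))
    eval(expr, parent.frame())       # evaluate it in the caller's environment
}

describe_arg(log(100) + 2)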

SETNAMES

There are a number of commands for working with column and row names in R. It’s generally suggested that the setNames function be used when modifying data structure names within a function.

names_new <- c("one","two","three","four")
new_dat <- expression(data.frame(x1 = rnorm(100, m=50), x2 = rnorm(100, m=50), x3 = rnorm(100, m=50), y = rnorm(100)))
head(setNames(eval(new_dat), names_new))
 
my_lst <- list(lname = as.name("xls"))
setNames(my_lst, "x1")

When working with data within a function, it will often be useful to write code that creates the names and data structure, and then evaluates the pieces together.
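
As a rough sketch of that pattern, the following illustrative function builds the names and the data frame expression separately and then evaluates them together (the function name and arguments are my own):


build_data <- function(n=100){
    names_new <- c("one","two","three","four")
    new_dat <- expression(data.frame(x1 = rnorm(n, m=50), x2 = rnorm(n, m=50),
                                     x3 = rnorm(n, m=50), y = rnorm(n)))
    # evaluate the stored expression and attach the new names in one step
    setNames(eval(new_dat), names_new)
}

head(build_data())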

MISSING

Use the missing function to check whether an argument was supplied by the user.

func <- function(x){
    if(missing(x)){
        x = 5
    }
    y = x*x; y
}
 
func(2)
func()

STOP

The stop function is used to halt a function, and is usually used when a particular condition is met.

guess <- function(n){
  if( n >= 6 ){
      stop("WRONG!")
}}
 
guess(5)
guess(10)

WARNING

Use the warning function to issue a warning message when a condition is met. It does not halt the function.

guess <- function(n){
  if( n >= 6 ){
      warning("BE CAREFUL!")
}}
 
guess(5)
guess(10)

ELLIPSES (…)

When a function is evaluated, R matches each supplied argument to the function’s formal arguments. When an ellipsis is added as a function argument, it allows any number of additional arguments to be passed into the function. This technique is frequently used when plotting and should be used when the function is designed to take any number of named or unnamed arguments.

plotter <- function(x, y, ...){
  plot(x, y, ...)
}

To make use of the ellipsis, it’s suggested that the user capture the … arguments and turn them into a list. This is because some of the arguments in the dots may be intended for several different functions.
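
A minimal sketch of that pattern might look like the following, where the function name and the arguments being routed are illustrative:


plotter2 <- function(x, y, ...){
    dots <- list(...)                               # capture the dots as a named list
    plot_args  <- dots[names(dots) %in% c("main", "xlab", "ylab")]
    point_args <- dots[names(dots) %in% c("pch", "col")]
    # route each set of arguments to the function it was intended for
    do.call(plot,   c(list(x=x, y=y, type="n"), plot_args))
    do.call(points, c(list(x=x, y=y), point_args))
}

plotter2(1:10, rnorm(10), main="Example", pch=19, col="blue")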

 

Working With SEM Keywords in R

The following post was republished from two previous posts that were on an older blog of mine that is no longer available. These are from several years ago and relate to two critical questions that I encountered. One, how can I automatically generate hundreds of thousands of keywords for a search engine marketing campaign? Two, how can I develop an effective system for examining keywords based on different characteristics?

Generating PPC Keywords in R

Paid search marketing refers to the process of driving traffic to a website by purchasing ads on search engines. Advertisers bid on certain keywords that users might search for, and that determines when and where their ads appear. For example, an individual who owns an auto dealership would want to bid on keywords relating to automobiles that a reasonable person would search for on a search engine. In both Google and Bing, advertisers are able to specify which keywords they would like to bid on and at what amount. If the user decides to bid on just a small number of keywords, they can type that information into the dashboard and specify a bid. However, what if you want to bid on a significant number of keywords? Instead of typing each and every keyword into the Google or Bing dashboard, you could programmatically generate the keywords in R.

Let’s say that I run an online retail establishment that sells men’s and women’s streetwear and I want to drive more traffic to my online store by placing ads on both Google and Bing. I want to bid on a number of keywords related to fashion and have created a number of ‘root’ words that will comprise the majority of these keywords. To generate my desired keywords, I have written a function which will produce every combination of the root words.

root1 = c("fashion", "streetwear")
root2 = c("karmaloop", "crooks and castles", "swag")
root3 = c("urban clothing", "fitted hats", "snapbacks")
root4 = c("best", "authentic", "low cost") 

myfunc <- function(){
      lst <- list(root1=c(root1), root2=c(root2), root3=c(root3),
            root4=c(root4))
      myone <- function(x, y){
            m1 <- do.call(paste, expand.grid(lst[[x]], lst[[y]]))
            mydf <- data.frame(keyword=c(m1))
      }
      mydf <- rbind(myone("root4","root1"), myone("root2","root1"))
      }

mydat <- myfunc()
mydat

write.table(mydat, "adppc.txt", quote=FALSE, row.names=FALSE)

This isn’t the prettiest code in the world, but it gets the job done. In fact, similar results could have been achieved using the following code, which is much more efficient.

root5 = c("%s fashion")
root6 = c("%s streetwear")
adcam1 = sprintf(root5, root2)
adcam2 = sprintf(root6, root2)
df = data.frame(keywords=c(adcam1, adcam2))

write.table(df, "adppc.txt", quote=FALSE, row.names=FALSE)

If you have any suggestions for improving my R code, please mention it in the comment section below.

Creating Tags For PPC Keywords

When performing search engine marketing, it is usually beneficial to construct a system for making sense of keywords and their performance. While one could construct Bayesian belief networks to model the process of consumers clicking on ads, I have found that using ‘tags’ to categorize keywords is just as useful for conducting post-hoc analysis on the effectiveness of marketing campaigns. By ‘tags,’ I mean identifiers which categorize keywords according to their characteristics. For example, in the following data frame, we have six keywords, our average bids, clicks, conversions, and tags for state, model, car, auto, save, and cheap. What we want to do now is set the flag for each tag to 1 if and only if that tag is mentioned in the keyword.

# CREATE SOME DATA
df = data.frame(keyword=c("best car insurance",
                          "honda auto insurance",
                          "florida car insurance",
                          "cheap insurance online",
                          "free insurance quotes",
                          "iowa drivers save money"),
                average_bid=c(3.12, 2.55, 2.38, 5.99, 4.75, 4.59),
                clicks=c(15, 20, 30, 50, 10, 25),
                conversions=c(5, 2, 10, 15, 3, 5),
                state=0, model=0, car=0, auto=0, save=0, cheap=0)
df

# FUNCTION WHICH SETS EACH TAG TO 1 IF THE SPECIFIED TAG IS PRESENT IN THE KEYWORD
main <- function(df) {
  state <- c("michigan", "missouri", "florida", "iowa", "kansas")
  model <- c("honda", "toyota", "ford", "acura", "audi")
  car <- c("car")
  auto <- c("auto")
  save <- c("save")
  cheap <- c("cheap")
  for (i in 1:nrow(df)) {
    Words = strsplit(as.character(df[i, 'keyword']), " ")[[1]]
    if(any(Words %in% state)) df[i, 'state'] <- 1
    if(any(Words %in% model)) df[i, 'model'] <- 1 
    if(any(Words %in% car)) df[i, 'car'] <- 1
    if(any(Words %in% auto)) df[i, 'auto'] <- 1     
    if(any(Words %in% save)) df[i, 'save'] <- 1
    if(any(Words %in% cheap)) df[i, 'cheap'] <- 1
  }
  return(df)
}

one = main(df)

subset(one, state==TRUE | model==TRUE | auto==TRUE)

# AN ALTERNATE METHOD USING THE STRINGR PACKAGE

df

library(stringr)

# CREATE EACH TAG
state <- c("michigan", "missouri", "florida", "iowa", "kansas")
model <- c("honda", "toyota", "ford", "acura", "audi")
car <- c("car")
auto <- c("auto")
save <- c("save")
cheap <- c("cheap")

state_match <- str_c(state, collapse = "|")
model_match <- str_c(model, collapse = "|")
car_match <- str_c(car, collapse = "|")
auto_match <- str_c(auto, collapse = "|")
save_match <- str_c(save, collapse = "|")
cheap_match <- str_c(cheap, collapse = "|")

#FUNCTION TO SET TAG IF PRESENT IN THE KEYWORD
main <- function(df) {
  df$state <- str_detect(df$keyword, state_match)
  df$model <- str_detect(df$keyword, model_match)
  df$car <- str_detect(df$keyword, car_match)
  df$auto <- str_detect(df$keyword, auto_match)
  df$save <- str_detect(df$keyword, save_match)
  df$cheap <- str_detect(df$keyword, cheap_match)
  df
}

two = main(df)

subset(two, state==TRUE | model==TRUE | auto==TRUE)

By now, some of you are probably wondering why we don’t just select the keywords directly from the original data frame based on the desired characteristic. Well, that works too, although I’ve found that the marketing professionals I’ve worked with have preferred the ‘tagging’ method.

## Alternate approach - SELECT DIRECTLY

df

main <- function(df) {
  model <- c("honda", "toyota", "ford", "acura", "audi")
  for (i in 1:nrow(df)) {
    Words = strsplit(as.character(df[i, 'keyword']), " ")[[1]]
    if(any(Words %in% model)) return(df[i, c(1:4)])  # returns the first matching keyword only
  }}

three = main(df)

So there you have it, a method of ‘tagging’ strings according to a certain set of specified characteristics. The benefit of using ‘tags’ is that it provides you with a systematic way to document how the presence of certain words or phrases impacts performance.

 

 

Introduction to the RMS Package

The rms package offers a variety of tools to build and evaluate regression models in R. Originally named ‘Design’, the package accompanies the book “Regression Modeling Strategies” by Frank Harrell, which is essential reading for anyone who works in the ‘data science’ space. Over the past year or so, I have transitioned my personal modeling scripts to rms as it makes things such as bootstrapping, model validation, and plotting predicted probabilities easier to do. While the package is fairly well documented, I wanted to put together a ‘simpler’ and more accessible introduction that would explain to R-beginners how they could start using the rms package. For those with limited statistics training, I strongly suggest reading “Clinical Prediction Models” and working your way up to “Regression Modeling Strategies”.

We start this introduction to the rms package with the datadist function, which computes statistical summaries of predictors to automate estimation and plotting of effects. The user will generally supply the final data frame to the datadist function and set the data distribution using the options function. Note that if you modify the data in your data frame, then you will need to reset the distribution summaries using datadist.


library(rms)

my.df = data.frame(one=c(rnorm(100)), two=c(rnorm(100)), y=rnorm(100))
dd = datadist(my.df)
options(datadist="dd")

my.df_new = subset(my.df, two >= 0.05)
ddd <- datadist(my.df_new)
options( datadist = "ddd" )
ddd

The main functions to estimate models in rms are ols for linear models and lrm for logistic regression or ordinal logistic regression. There are also a few other functions for performing survival analysis, but they will not be covered in this post.


getHdata(prostate)
head(prostate)

ddd <- datadist(prostate)
options( datadist = "ddd" )

Using the prostate data, we build a linear model using ordinary least squares estimation. The arguments x and y must be set to TRUE if we plan on evaluating the model in later stages using the validate and calibrate functions. Because we haven’t altered the default contrasts or incorporated any smoothing splines, the coefficients and standard errors should be identical to the results of lm. Use the model variable name to see the estimates from the model and use the summary function to get an overview of the effects of each predictor on the response variable. One important thing to note is that the effect point estimates in the summary.rms output relate to the estimated effect of an inter-quartile-range increase in the predictor variable.


lmod = ols(wt ~ age + sbp + rx, data=prostate, x=TRUE, y=TRUE)
lmod
summary(lmod)
summary(lmod, age=c(50,70))

This may not seem like anything to write home about. But what makes the rms package special is that it makes the modeling process significantly easier. For the above linear regression model, let’s plot the predicted values and perform internal bootstrapped validation of the model. In the following code, the validate function is used to assess model fit and calibrate is used to assess if the observed responses agree with predicted responses.


plot(anova(lmod), what='proportion chisq') # relative importance
plot(Predict(lmod)) # predicted values
rms::validate(lmod, method="boot", B=500) # bootstrapped validation
my.calib <- rms::calibrate(lmod, method="boot", B=500) # model calibration
plot(my.calib, las=1)
vif(lmod) # test for multicollinearity
Predict(lmod)

Let us now build a logistic regression model using the lrm function, plot the expected probabilities, and evaluate the model. We also use the pentrace function to perform logistic regression with penalized maximum likelihood estimation.


mod1 = lrm(as.factor(bm) ~ age + sbp + rx, data=prostate, x=TRUE, y=TRUE)
mod1
summary(mod1)

plot(anova(mod1), what='proportion chisq') # relative importance
plot(Predict(mod1, fun=plogis)) # predicted values
rms::validate(mod1, method="boot", B=500) # bootstrapped validation
my.calib <- rms::calibrate(mod1, method="boot", B=500) # model calibration
plot(my.calib, las=1)

penalty <- pentrace(mod1, penalty=c(0.5,1,2,3,4,6,8,12,16,24), maxit=25)
mod1_pen <- update(mod1, penalty=penalty$penalty)
effective.df(mod1_pen)
mod1_pen

There you have it, a very basic introduction to the rms package for beginners to the R programming language. Once again, I strongly suggest that readers who are not trained statisticians read and fully comprehend “Clinical Prediction Models” and “Regression Modeling Strategies” by Frank Harrell. You can also access a number of his handouts and lecture notes online.

Batch Forecasting in R

Given a data frame with multiple columns which contain time series data, let’s say that we are interested in executing an automatic forecasting algorithm on a number of those columns. Furthermore, we want to train the model on a particular number of observations and assess how well it forecasts future values. Based upon those testing procedures, we will estimate the full model. This is a fairly simple undertaking, but let’s walk through this task. My preference for such procedures is to loop through each column and append the results into a nested list.

First, let’s create some data.

ddat <- data.frame(date = c(seq(as.Date("2010/01/01"), as.Date("2010/03/02"), by=1)),
                      value1 = abs(round(rnorm(61), 2)),
                      value2 = abs(round(rnorm(61), 2)),
                      value3 = abs(round(rnorm(61), 2)))
head(ddat)
tail(ddat)

We want to forecast future values of the three columns. Because we want to save the results of these models into a list, let’s begin by creating a list that contains the same number of elements as our data frame.

lst.names <- c(colnames(ddat))
lst <- vector("list", length(lst.names))
names(lst) <- lst.names
lst

I’ve gone ahead and written a user-defined function that handles the batch forecasting process. It takes two arguments: a data frame and a default argument which specifies the number of observations that will be used in the training set. The model estimates, forecasts, and diagnostic measures will be saved as a nested list and categorized under the appropriate variable name.

library(forecast)

batch <- function(data, n_train=55){
 
  lst.names <- c(colnames(data))
  lst <- vector("list", length(lst.names))
  names(lst) <- lst.names    
 
  for( i in 2:ncol(data) ){  
 
    lst[[1]][["train_dates"]] <- data[1:(n_train),1]
    lst[[1]][["test_dates"]] <- data[(n_train+1):nrow(data),1]
 
    est <- auto.arima(data[1:n_train,i])
    fcas <- forecast(est, h=6)$mean
    acc <- accuracy(fcas, data[(n_train+1):nrow(data),i])
    fcas_upd <- data.frame(date=data[(n_train+1):nrow(data),1], forecast=fcas,
                           actual=data[(n_train+1):nrow(data),i])
 
    lst[[i]][["estimates"]] <- est
    lst[[i]][["forecast"]] <- fcas
    lst[[i]][["forecast_f"]] <- fcas_upd
    lst[[i]][["accuracy"]] <- acc
 
    cond1 = diff(range(fcas[1], fcas[length(fcas)])) == 0
    cond2 = acc[,3] >= 0.025
 
    if(cond1|cond2){
 
      mfcas = forecast(ma(data[,i], order=3), h=5)        
      lst[[i]][["moving_average"]] <- mfcas
 
    } else {
 
      est2 <- auto.arima(data[,i])
      fcas2 <- forecast(est2, h=5)$mean
 
      lst[[i]][["estimates_full"]] <- est2
      lst[[i]][["forecast_full"]] <- fcas2
 
    }  
  }  
  return(lst)
}
 
batch(ddat)

This isn’t the prettiest code, but it gets the job done. Note that lst was populated within a function and won’t be available in the global environment. Instead, I chose to simply print out the contents of the list after the function is evaluated.
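
If you do want to work with the output, assign the returned list to an object and index into the nested elements (the names follow the function above):


results <- batch(ddat)

results$date$test_dates     # hold-out dates stored under the first element
results$value1$estimates    # fitted model for the value1 column
results$value1$accuracy     # test-set accuracy measures for value1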

Statistical Reading Rainbow

For those of us who received statistical training outside of statistics departments, that training often emphasized procedures over principles. This meant that we learned about various statistical techniques and how to perform analyses in a particular statistical software package, but glossed over the mechanisms and mathematical statistics underlying these practices. While that training methodology (hereby referred to as the ‘heuristic method’) has value, it has many drawbacks when the ultimate goal is to perform sound statistical analysis that is valid and thorough. Even in my current role as a data scientist at a technology company in the San Francisco Bay Area, I have had to go back and understand various procedures and metrics instead of just “doing data analysis”.

Given this realization, I have dedicated hours of time outside of work over the last couple of years to “re-training” myself on many of the important concepts in both descriptive and inferential statistics. This post will give brief mention to the books that have been most useful in helping me develop a fuller understanding of the statistical sciences. These books have also helped me fill in the gaps and deficiencies from my statistical training in university and graduate school. Furthermore, these are the texts that I often revisit when I need a reference on some statistical topic of interest. This is, at a minimum, a six-year journey, so I have a long way to go until I am able to stand solidly in my understanding of statistics. While I am sacrificing a lot of my free time to this undertaking, it will certainly improve my knowledge and help prepare me for graduate school (PhD) in biostatistics, which I hope to attend in around five to six years.[1]

Please note that I am not taking issue with the ‘heuristic method’ of statistical training. It certainly has its place and provides students with the immediate knowledge required to satisfactorily prepare for work in private industry. In fact, I prefer the ‘heuristic method’ and still rely on straightforward rules in my day-to-day work, as that ensures that best practices are followed and satisfactory analysis is performed. Furthermore, I certainly believe that it is superior to the hack-ey nature of data mining and data science education, but that is a different story.

Fundamentals:

Statistics in Plain English – Urdan
Clear, concise, and covers all the fundamental items that one would need to know. Everything from descriptive statistics to linear regression are covered, with many good examples. Even if you never use ANOVA or factor analysis, this is a good book to review and one that I strongly recommend to people who are interested in data science.

Principles of Statistics – Balmer
This is a classic text that offers a good treatment of probability theory, distributions, and statistical inference. The text contains a bit more math than ‘Statistics in Plain English’, so I think it should be read after completing the previous book.

Fundamentals of Modern Statistical Methods – Wilcox
This book reviews ‘traditional’ parametric statistics and provides a good overview of robust statistical methods. There is a fair amount on the historical evolution of various techniques, and I found that a bit unnecessary. But overall, this is still a solid introductory text to learn about statistical inference using robust techniques.

Regression Analysis:

Mostly Harmless Econometrics – Angrist
While I don’t regularly work with instrumental variables, generalized methods of moments, or regression discontinuity, this book is a great high level introduction to econometrics. The chapters on regression theory and quantile regression are phenomenal.

Regression Modeling Strategies – Harrell
This is my most referenced book and the one that really helped in my overall development as an applied statistician. All the important topics are covered, from imputation to regression splines and so forth. This book includes R code for performing analysis using the rms package. I end up citing this book quite a lot. For example, in a recent work email, I mentioned that Harrell also says on page 61 that “narrowly distributed predictor variables will require higher sample sizes.” Essential reading in my opinion.

Data Analysis Using Regression and Multilevel/Hierarchical Models – Gelman and Hill
The first half of this book covers statistical inference using single-level models and the second half is dedicated to multilevel methods. Given that I rarely work with panel data, I use the first half of this book as a reference for things that I may need a quick refresher on. It is very accessible and has plenty of examples with R code.

Semiparametric Regression for the Social Sciences – Keele
This is one of my favorite statistical books. Well written and easy to comprehend, but still rigorous. Covers local regression, splines, and generalized additive models. There is also a solid chapter on the use of bootstrapping with semiparametric and nonparametric models.

Statistical Learning:

Statistical Learning from a Regression Perspective – Berk
As a skeptic who is wary of every hype machine, I really enjoyed Berk’s preface, in which he discusses the “dizzying array of new statistical procedures” that have been introduced over the past several decades with “the hype of a big-budget movie.” I got this text for its treatment of topics such as boosting, bagging, random forests, and support vector machines. I will probably need to reread this book several more times before I fully comprehend everything.

Time Series:

Time Series: a Biostatistical Introduction – Diggle
The lack of quality time series books is really infuriating. Don’t get me wrong, there are some good texts on forecasting, such as the free online book from Hyndman. However, I’ve yet to find a really good intermediate-level treatment of time series analysis besides this one. It contains good coverage of repeated measurements, ARIMA modeling, and forecasting.

Bayesian Methods:

Statistical Rethinking – McElreath
While I was introduced to robust techniques and nonparametric statistics in graduate school, there was nothing on Bayesian methods. Due to a fear of the topic, I avoided learning about it until this past year. This book by McElreath has been great as it is very accessible and provides code for understanding various principles. Over the next year, I am hoping to dive deeper into Bayesian techniques and this was a good first step.

 

Weekly R-Tips: Visualizing Predictions

Let’s say that we estimated a linear regression model on time series data with lagged predictors. The goal is to estimate sales as a function of inventory, search volume, and media spend from two months ago. After using the lm function to perform linear regression, we predict sales using values from two months ago.


frmla <- sales ~ inventory + search_volume + media_spend
mod <- lm(frmla, data=dat)
pred = predict(mod, newdata=values, interval="prediction")

If this model is estimated weekly or monthly, we will eventually want to understand how well our model did in predicting actual sales from month to month. To perform this task, we must regularly maintain a spreadsheet or data structure (RDS object) with actual and predicted sales figures for each time period. That data can be used to create line graphs that visualize actual versus predicted values.
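
One lightweight way to maintain that structure is to append each period’s actual and predicted figures to an RDS file. The file name and the values below are purely illustrative; pred comes from the predict call above.


track_file <- "sales_tracking.rds"

# placeholder values for illustration only
new_row <- data.frame(month = "2015-05", actual = 25000,
                      predicted = round(pred[1, "fit"]))

# read the existing history if it exists, append the new row, and save it back
tracking <- if(file.exists(track_file)) readRDS(track_file) else NULL
tracking <- rbind(tracking, new_row)
saveRDS(tracking, track_file)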

Here is what the original spreadsheet looked like.

[Screenshot: the original spreadsheet of actual and predicted sales]

Transform that data into long format using whatever package you prefer.


library(reshape)
mydat = melt(d1)  # d1 contains the spreadsheet data shown above

This will provide a data frame with three columns.

[Screenshot: the melted data frame with three columns]

We can utilize the ggplot2 package to create visualizations.


library(ggplot2)

ggplot(mydat, aes(Month, value, group=variable, colour=variable)) +
  geom_line(lwd=1.05) + geom_point(size=2.5) +
  ggtitle("Sales (01/2010 to 05/2015)") +
  xlab(" ") + ylab(" ") + ylim(0,30000) +
  theme(legend.title=element_blank()) +
  theme(axis.text.x=element_text(colour="black")) +
  theme(axis.text.y=element_text(colour="black")) +
  theme(legend.position=c(.4, .85))

[Line graph of actual versus predicted sales]

Above is an example of what the final product could look like. Visualizing predicted against actual values is an important component of evaluating the quality of a model. Furthermore, having such visualizations will be valuable when interacting with business audiences and “selling” your analysis.