Examining the Tweeting Patterns of Prominent Crossfit Gyms

A. Introduction

The growth of Crossfit has been one of the biggest developments in the fitness industry over the past decade. Promoted as both a physical exercise philosophy and a competitive fitness sport, Crossfit is a high-intensity fitness program incorporating elements from several sports and exercise protocols such as high-intensity interval training, Olympic weightlifting, plyometrics, powerlifting, gymnastics, and strongman. Now with over 10,000 Crossfit-affiliated gyms (boxes) throughout the United States, the market has certainly become more saturated, and gyms must adopt more distinctive marketing strategies to attract new members. In this post, I will investigate how some prominent Crossfit boxes are utilizing Twitter to engage with consumers. While Twitter is a great platform for news and entertainment, it is usually not the place for customer acquisition, given the lack of targeted messaging. Furthermore, unlike platforms like Instagram, Twitter is simply not an image/video-centric tool where followers can view accomplishments from their favorite fitness heroes, witness people achieving their goals, and so forth. Given these shortcomings, I wanted to understand how some prominent Crossfit boxes are actually using their Twitter accounts. In what follows, I examine tweeting patterns and content characteristics, perform sentiment analysis, and build a model to predict retweets.

B. Extract Data From Twitter

We begin by extracting the desired data from Twitter using the rtweet package in R. We will use the entire Twitter timelines of six prominent Crossfit gyms. To get this data, I looped through a vector containing each of their Twitter handles and used the get_timeline function to pull the desired data. Notice that there is a user-defined function called add_engineered_dates that is used to add a number of extra date columns. That function is available on my GitHub page here.


# load required packages and set working directory
library(rtweet)
library(data.table)

final_dat.tmp <- list()

cf_gyms <- c("reebokcrossfit5", "crossfitmayhem", "crossfitsanitas", "sfcrossfit", "cfbelltown", "WindyCityCF")

for(each_box in cf_gyms){

   message("Getting data for: ", each_box)

   each_cf <- get_timeline(each_box, n = 3200)
   each_cf$crossfit_name <- each_box
   each_cf <- as.data.table(each_cf)
   suppressWarnings( add_engineered_dates(each_cf, date_col = "created_at") )

   final_dat.tmp[[each_box]] <- each_cf
}

final_dat <- rbindlist(final_dat.tmp)

final_dat$contains_hashtags <- ifelse(!is.na(final_dat$hashtags), 1, 0)

final_dat$hashtags_count <- sapply(final_dat$hashtags, function(x) length(x[!is.na(x)]))

C. Exploratory Data Analysis

Let us start by investigating this data set to get a better understanding of trends and patterns across these various Crossfit boxes. One important thing to note is that not all of these Twitter accounts are currently active. We can see that crossfitmayhem, sfcrossfit, and WindyCityCF are the only ones that remain active.

final_dat[, .(min_max = range(as.Date(created_at))), by=crossfit_name][,label := rep(c("first_tweet","last_tweet"))][]

final_dat %>% janitor::tabyl(crossfit_name)


C1. Total Number of Tweets

Sfcrossfit, which is the oldest of these gyms, has the highest number of tweets. However, when looking at the total number of tweets per active month, they were less active than two other gyms.

## total number of tweets 

p1 = final_dat[, .N, by=crossfit_name] %>%
   ggplot(., aes(x=reorder(crossfit_name, N), y=N)) + geom_bar(stat='identity', fill="steelblue") +
      coord_flip() + labs(x="", y="") + ylim(0,3000) + ggtitle("Total Number of Tweets") +
         theme(plot.title = element_text(hjust = 0.5),
               axis.ticks.x = element_line(colour = "black"),
               axis.ticks.y = element_line(colour = "black"))  

## number of tweets per active month 

p2 = final_dat[, .(.N, start=lubridate::ymd_hms(min(created_at)), months_active=lubridate::interval(lubridate::ymd_hms(min(created_at)), Sys.Date()) %/% months(1)), by=crossfit_name][,
     .(tweets_per_month = N/months_active), by=crossfit_name] %>%
   ggplot(., aes(x=reorder(crossfit_name, tweets_per_month), y=tweets_per_month)) +
      geom_bar(stat='identity', fill="steelblue") + coord_flip() + labs(x="", y="") + ylim(0,32) +
         theme(plot.title = element_text(hjust = 0.5),
               axis.ticks.x = element_line(colour = "black"),
               axis.ticks.y = element_line(colour = "black")) +
            ggtitle("Total Number of Tweets per Active Month")

## add both plots to a single pane

grid.arrange(p1, p2, nrow=1)

C2. Total Number of Tweets Over Time

The time series of the total number of tweets by month shows that each gym had one or two peaks from 2012 through 2016 during which they aggressively shared content with their followers. However, over the past two years, each gym has reduced its Twitter usage significantly.

## total number of tweets by month 

final_dat[, .N, by = .(crossfit_name, created_at_YearMonth)][order(crossfit_name, created_at_YearMonth)][,
      created_at_YearMonth := lubridate::ymd(paste(created_at_YearMonth, "-01"))] %>%
   ggplot(., aes(created_at_YearMonth, N, colour=crossfit_name)) + geom_line(group=1, lwd=0.6) +
      facet_wrap(~crossfit_name) + labs(x="", y="") + theme(legend.position="none") +
         theme(plot.title = element_text(hjust = 0.5),
               axis.ticks.x = element_line(colour = "black"),
               axis.ticks.y = element_line(colour = "black"),
               strip.text.x = element_text(size = 10)) +
            ggtitle("Total Number of Tweets")

## total number of tweets by year

ggplot(data = final_dat,
       aes(lubridate::month(created_at, label=TRUE, abbr=TRUE),
           group=factor(lubridate::year(created_at)), color=factor(lubridate::year(created_at))))+
   geom_line(stat="count") + geom_point(stat="count") +
   facet_wrap(~crossfit_name) + labs(x="", colour="Year") + xlab("") + ylab("") +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black"),
         strip.text.x = element_text(size = 10)) +
   ggtitle("Total Number of Tweets by Year")

C3. Tweeting Volume by Year, Month, and Day

For each Crossfit gym, I plotted the volume of tweets by year, month, and day. Oddly enough, there really are not any noticeable patterns in these charts.

## years with the highest number of tweets 

ggplot(final_dat, aes(created_at_Year)) + geom_bar(fill="steelblue") +
   facet_wrap(~crossfit_name) + labs(x="", y="") +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black"),
         strip.text.x = element_text(size = 10)) + ylim(0,800) +
   ggtitle("Total Number of Tweets by Year")

## months with the highest number of tweets 

final_dat[, created_at_YearMonth2 := lubridate::ymd(paste(created_at_YearMonth, "-01"))][] %>%
   ggplot(., aes(lubridate::month(created_at_YearMonth2, label=TRUE, abbr=TRUE))) + geom_bar(fill="steelblue") +
   facet_wrap(~crossfit_name) + labs(x="", y="") +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black"),
         strip.text.x = element_text(size = 10)) + ylim(0,500) +
   ggtitle("Total Number of Tweets by Month")

## days with the highest number of tweets 

## use each tweet's own timestamp for the weekday, rather than the first day of its month
final_dat[, created_at_Wday := lubridate::wday(created_at, label=T)][] %>%
   ggplot(., aes(created_at_Wday)) + geom_bar(fill="steelblue") +
   facet_wrap(~crossfit_name) + labs(x="", y="") +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black"),
         strip.text.x = element_text(size = 10)) + ylim(0,600) +
   ggtitle("Total Number of Tweets by Day")

C4. Tweeting Volume by Time of Day

For most of these gyms, the majority of tweets were posted either early in the morning or during the second half of the day. Given that most Crossfit boxes hold their first classes at around 5am or 6am and are usually busiest in the evening, this suggests that these businesses tweet during the hours when their facilities are busiest.

## tweet density over the day 

ggplot(data = final_dat) +
   geom_density(aes(x = created_at_Time, y = ..scaled..),
                fill="steelblue", alpha=0.3) +
   xlab("Time") + ylab("Density") +
   scale_x_datetime(breaks = date_breaks("2 hours"),
                    labels = date_format("%H:%M")) +
   facet_wrap(~crossfit_name) + labs(x="", y="") +
   theme(plot.title = element_text(hjust = 0.5),
         axis.text.x=element_text(angle=90,hjust=1)) +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black"),
         strip.text.x = element_text(size = 10)) +
   ggtitle("Tweet Density by Time of Day")

C5. Source of Tweets

While a couple of the gyms use marketing management platforms like Hootsuite and IFTTT, the majority of tweets come from a Twitter application or another social media tool such as Facebook or Instagram.

## source of tweets

final_dat[, .N, by = .(crossfit_name, source)][order(crossfit_name,-N)][,
        head(.SD, 3),by=crossfit_name] %>%
   ggplot(., aes(x=source, y=N)) +
   geom_bar(stat="identity", fill="steelblue") + coord_flip() +
   facet_wrap(~crossfit_name) + labs(x="", y="") +
   labs(x="", y="", colour="") + ylim(0,2500) +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black")) +
   ggtitle("Source of Tweets") 

D. Content Characteristics

Let us investigate the characteristics of the content that is being tweeted. From looking at average tweet length to the amount of retweets, this information will further expand our understanding of how these prominent Crossfit boxes are utilizing their Twitter accounts.

D1. Frequency of Retweets and Original Tweets

Only one gym, cfbelltown, had a majority of tweets that were retweets. Furthermore, a plurality of tweets from crossfitmayhem were not original. In the plots showing retweets over time, we can also see that crossfitmayhem did a lot of retweeting from 2014 through 2016, but shifted toward original tweets from then on.

## retweets and original tweets 

final_dat[, .N, by=.(crossfit_name,is_retweet)][, percen := N/sum(N), by=.(crossfit_name)][is_retweet=="TRUE"] %>%
   ggplot(., aes(x=reorder(crossfit_name, N), y=percen)) +
      geom_bar(stat="identity", position ="dodge", fill="steelblue") + coord_flip() +
         labs(x="", y="", colour="") + ylim(0,0.8) +
            theme(plot.title = element_text(hjust = 0.5),
                  axis.ticks.x = element_line(colour = "black"),
                  axis.ticks.y = element_line(colour = "black")) +
            ggtitle("Percentage of Tweets That Were Not Original")

# total number of retweets over time 

ggplot(data = final_dat, aes(x = created_at, fill = is_retweet)) +
   geom_histogram(bins=48) +
   labs(x="", y="", colour="") +
   scale_fill_manual(values = c("steelblue2","steelblue"), name = "Retweet") +
   facet_wrap(~crossfit_name) + labs(x="", y="") +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black"),
         strip.text.x = element_text(size = 10)) +
   ggtitle("Total Number of Retweets Over Time")

D2. Frequency of Favourited Tweets

Crossfitmayhem was the only box where a majority of tweets were favorited at least once. Furthermore, a moderately large share of its tweets were favorited three or more times.

## amount of favorited tweets 

final_dat[, .N, by=.(crossfit_name,favorite_count)][, percen := N/sum(N), by=.(crossfit_name)][favorite_count==0,] %>%
   ggplot(., aes(x=reorder(crossfit_name,percen), y=percen)) +
   geom_bar(stat="identity", fill="steelblue") + coord_flip() +
   labs(x="", y="", colour="") + ylim(0,1) +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black")) +
   ggtitle("Percentage of Tweets That Were Not Favorited")

## amount of favorited tweets by count

final_dat[, .N, by=.(crossfit_name,favorite_count)][, percen := N/sum(N), by=.(crossfit_name)][favorite_count %in% 0:5] %>%
   ggplot(., aes(x=reorder(favorite_count, -favorite_count), y=percen)) +
   geom_bar(stat="identity", fill="steelblue") + coord_flip() +
   labs(x="", y="", colour="") +
   facet_wrap(~crossfit_name) + ylim(0,1) +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black"),
         strip.text.x = element_text(size = 10)) +
   ggtitle("Percentage of Tweets Based on Their Favorite Count")

D3. Use of Hashtags

Reebokcrossfit5 had the largest percentage of tweets that contained hashtags while crossfitsanitas and sfcrossfit rarely used hashtags in their tweets. And when it comes to the hashtags that are being used, it seems that these boxes are using the crossfit hashtag and the name of their own gym.

## contains hashtags 

final_dat[, .N, by=.(crossfit_name,contains_hashtags)][, percen := N/sum(N), by=.(crossfit_name)][contains_hashtags==1] %>%
   ggplot(., aes(x=reorder(crossfit_name,percen), y=percen)) +
   geom_bar(stat="identity", fill="steelblue") + coord_flip() +
   labs(x="", y="", colour="") + ylim(0,0.8) +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black")) +
   ggtitle("Percentage of Tweets That Contain Hashtags") 

## frequently used hashtags 

final_dat[hashtags != "", ][,.N, by=.(crossfit_name, hashtags)][order(crossfit_name, -N)][,
    head(.SD, 3),by=crossfit_name] %>%
   ggplot(., aes(x = reorder(hashtags, -N), y=N)) +
   geom_bar(stat="identity", fill="steelblue") +
   coord_flip() + ylim(0,150) +
   facet_wrap(~crossfit_name) + labs(x="", y="", colour="") +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black"),
         strip.text.x = element_text(size = 10)) +
   ggtitle("Most Commonly Used Hashtags") 

D4. Tweet Length

The following chart shows summary statistics for the length of tweets from each Crossfit box. crossfitsanitas and WindyCityCF seem to have the longest tweets on average, while sfcrossfit has the shortest.

## tweet length 

melt(final_dat[, .(avg_length = mean(display_text_width,na.rm=T),
                   med_length = median(display_text_width,na.rm=T),
                   min_length = min(display_text_width,na.rm=T),
                   max_length = max(display_text_width,na.rm=T)), by=.(crossfit_name)], id="crossfit_name") %>%
   ggplot(., aes(x=reorder(crossfit_name, value), y=value)) +
      geom_bar(stat="identity", fill="steelblue") + coord_flip() +
         labs(x="Date", y="Number of Tweets", colour="Crossfit Box") +
         theme(plot.title = element_text(hjust = 0.5)) +
      facet_wrap(~variable) + labs(x="Time", y="Number of Tweets", colour="Crossfit Box") +
      theme(axis.text.x=element_text(angle=90,hjust=1)) +
      theme(plot.title = element_text(hjust = 0.5),
            strip.text.x = element_text(size = 10)) + ylim(0,200) +
      ggtitle("Tweet Length Summary Statistics") 

E. Sentiment Analysis

In order to evaluate the emotion associated with the tweets of each Crossfit box, I used the syuzhet package. This package is based on the NRC emotion lexicon, which maps individual words to various emotions (joy, fear, etc.) and to sentiment polarity (positive/negative). We calculate emotion scores based on the words present in the tweets and then plot the results.

We can see that the majority of tweets for each Crossfit box had a largely positive sentiment; trust and anticipation were other common emotions. For cfbelltown and crossfitsanitas, the third most common emotion was negative.

## sentiment analysis 


sentiment_scores = list()

for(each_gym in unique(final_dat[,crossfit_name])){

   tweet_text = final_dat[crossfit_name==each_gym, text]

   # removing retweet markers
   tweet_text <- gsub("(RT|via)((?:\\b\\w*@\\w+)+)","",tweet_text)

   # removing mentions
   tweet_text <- gsub("@\\w+","",tweet_text)

   all_sentiments <- get_nrc_sentiment(tweet_text)

   final_sentimentscores <- data.table(colSums(all_sentiments))

   names(final_sentimentscores) <- "score"
   final_sentimentscores <- cbind("sentiment"=colnames(all_sentiments), final_sentimentscores)
   final_sentimentscores$gym = each_gym

   sentiment_scores[[each_gym]] <- final_sentimentscores
}

sentiment_scores <- rbindlist(sentiment_scores)

ggplot(sentiment_scores, aes(x=sentiment, y=score))+
   geom_bar(aes(fill=sentiment), stat = "identity")+
   xlab("Sentiments")+ylab("Scores")+ coord_flip() +
   facet_wrap(~gym) + labs(x="", y="", colour="") +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black"),
         strip.text.x = element_text(size = 10)) +
   ggtitle("Total Sentiment Across All Original Tweets") 

F. Predictive Model

Out of curiosity, I wanted to see whether a predictive model could be built to predict whether a tweet would be retweeted. For this task, I looked at just the tweets from sfcrossfit. The target variable was 0 or 1, with zero representing a tweet that was not retweeted and one representing any tweet that was retweeted one or more times. Using the caret package, I trained a random forest on the training data and predicted on held-out data. We can see from the confusion matrix that the model did a very poor job of predicting whether a tweet would be retweeted. Perhaps with better features a stronger model could have been produced, but I chose not to invest more time in this task.



final_dat[, .N, by=crossfit_name]
final_dat_sub = final_dat[crossfit_name=="sfcrossfit",]
final_dat_sub[, favorite_count_new := ifelse(favorite_count==0, 0, 1)]
final_dat_sub[, retweet_count_new := ifelse(retweet_count==0, 0, 1)]

library(caret)

splt <- createDataPartition(final_dat_sub$retweet_count_new, p = 0.70, list = F)
train <- final_dat_sub[splt, .(created_at,favorite_count,is_retweet, display_text_width, contains_hashtags, retweet_count_new)]
test <- final_dat_sub[-splt, .(created_at,favorite_count,is_retweet, display_text_width, contains_hashtags, retweet_count_new)]

congress_rf <- train(x = as.matrix(train[,.(year(created_at), month(created_at),favorite_count,is_retweet, display_text_width, contains_hashtags)]),
                     y = factor(train$retweet_count_new),
                     method = "rf",
                     ntree = 200,
                     trControl = trainControl(method = "oob"))

pred = predict(congress_rf, as.matrix(final_dat_sub[-splt,.(year(created_at), month(created_at),favorite_count,is_retweet, display_text_width, contains_hashtags)]))
confusionMatrix(factor(final_dat_sub[-splt,retweet_count_new]), pred, positive="1")

G. Conclusion

So there you have it: an investigation of how several prominent Crossfit gyms use their Twitter accounts to engage with consumers. At the end of the day, I would suggest that any business in the health and wellness space should invest more time on Instagram or YouTube than Twitter when it comes to brand marketing or customer acquisition. Twitter is great for news and entertainment, but it isn't the ideal platform to share fitness content or inspire new members.

R Programming Notes – Part 2

In an older post, I discussed a number of functions that are useful for programming in R. I wanted to expand on that topic by covering other functions, packages, and tools that are useful. Over the past year, I have been working as an R programmer and these are some of the new learnings that have become fundamental in my work.


isTRUE is a logical operator that is very useful for checking whether a condition or variable has been set to true. Let's say that we are writing a script that runs a generalized linear regression when the parameter run_mod is set to true. The conditional portion of the script can be written as either if(isTRUE(run_mod)) or if(run_mod). I am partial to isTRUE, but this is entirely a matter of personal preference. Users should also be aware of the isFALSE function, which is part of the BBmisc package.

run_mod = TRUE

if(isTRUE(run_mod)){
    tryCatch(
        GLM_Model(full_data=full.df, train_data=train.df),
        error = function(e) {
            print("error occurred")
        })
}

if(BBmisc::isFALSE(check_one) & BBmisc::isFALSE(check_two)){
	data_output.tmp$score = 0.8
}

The invisible function can be used to return an object that is not printed out, which is useful in a number of circumstances. For example, it's useful when you have helper functions that are utilized within other functions to perform calculations. In those cases, it's often not desirable to print those results. I generally use invisible when I'm checking function arguments. For example, consider a function that takes two arguments and you need to check whether the input is a vector.

if(!check_input(response, type='character', length=1)) {
    stop('something is wrong')
}

The check_input function is something I created; it has a few lines which contain invisible. The idea is for check_input to return true or false based on the inputs so that execution stops when needed.

check_input <- function(response, type, length){
    if(is.null(response) & !length(response)==0) {
        invisible(FALSE)
    } else if (!is.null(response)) {
        invisible(TRUE)
    }
}
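As a standalone sketch of what invisible does (the function name here is hypothetical, base R only):

```r
# invisible() returns a value without auto-printing it at the console
square_quietly <- function(x){
   invisible(x^2)
}

square_quietly(4)        # nothing is printed at the console
y <- square_quietly(4)   # but the value is still returned
y                        # 16
```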


When I'm putting together new classes or have multiple functions that interact with one another, I ensure that the code includes a comprehensive debugging process. This means that I'm checking my code at various stages so that I can identify when issues arise. Consider that I'm putting together a function that will go through a number of columns in a data frame, summarize those variables, and save the results as a nested list. To effectively put together code without issues, I ensure that the function takes a debug argument that triggers extra checks when it's set to true. In the code below, it will print out values at different stages of the code. Furthermore, the final line of the code will check the resulting data structure.

DSummary_Variable <- function(data_obj, var, debug=TRUE){

   if(debug) message('|==========>>>  Processing of the variable. \n')

   # ... summarize the variable and build the nested var_summary list ...

   if(exists("var_summary")){
      if(debug){
          message('|==========>>>  var_summary has been created and 
                  has a length of ', length(var_summary), ' and the 
                  nested list has a length of ', 
                  length(var_summary[['var_properties']]), ' \n')
      }
   } else {
      stop("var_summary is missing. Please investigate")
   }
}

If you have multiple functions that interact with one another, it's a good idea to preface the printed message with the name of the function.

add_func <- function(a,b) a + b

mult_func <- function(a,b) a * b

main_func <- function(data, cola, colb, debug=TRUE){

	if(debug) message("mult_func: checking inputs to be used")

	res_mult <- mult_func(data[,cola], data[,colb])

	if(debug) message("add_func: checking inputs to be used")

	res_add <- add_func(data[,cola], data[,colb])

	list(multiplied = res_mult, added = res_add)
}
Stay tuned for part three, where I'll talk about the testthat and assertive packages.

R Programming Notes – Part 1

I've been on a note taking binge recently. This post covers a variety of topics related to programming in R. The contents were gathered from many sources and structured so as to provide a useful reference guide for understanding a number of handy R functions.


The do.call function executes a function call on a list of arguments.

do.call("R_Function", List_of_Arguments)

This is equivalent to calling the function directly on the elements of that list.

R_Function(List_of_Arguments[[1]], List_of_Arguments[[2]], ...)

Consider the following list with four elements. We can use this function to find the total sum across all list elements or bind the rows into a data frame.

x1 <- c(1,2,5)
x2 <- c(1,3,6)
x3 <- c(1,4,7)
x4 <- c(1,5,8)
do.call(sum, list(x1,x2,x3,x4))  # sum all list elements
do.call(rbind, list(x1,x2,x3,x4))  # rbind the list elements

Let's consider a scenario where we have a small data frame and want to run a general linear model on different combinations of attributes. One solution to this problem is to create the formula object as a list within a function, and then utilize do.call to run a linear model on the desired attributes.

dat <- data.frame(x1 = rnorm(100, m=50), x2 = rnorm(100, m=50), x3 = rnorm(100, m=50), y = rnorm(100))
new_mod <- function(form){
  lstt = list(formula=as.formula(paste("y ~ ", form, sep="")), data=as.name("dat"))
  summary(do.call("lm", lstt))
}

new_mod("x1")


The eval function evaluates an expression: an object that represents an action that can be performed in R. An expression is different from its evaluation, which is the actual execution of that action. In the following example, we assign a value to a variable and perform an operation using that variable.

x <- 4
y <- x * 10

The expression and quote functions take an expression as an argument and return it without evaluation.

ee = expression(y~x1)
z <- quote(y <- x * 10)
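To make the distinction concrete, here is a small base-R sketch showing that quote captures the action while eval executes it:

```r
x <- 4
z <- quote(y <- x * 10)   # the assignment is captured, not run: y does not exist yet
eval(z)                   # now the assignment is executed
y                         # 40
```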

In the above example, z is referred to as a call object. They are usually created with the call function. For example, the following code pieces together commands into a call object and evaluates them.

mycall <- call("lm", quote(y ~ x1), data=quote(dat))
mod <- eval(mycall)


Another common procedure is to replace certain variables within a user defined function. This can be achieved with the substitute function. The below code replaces x1 and x2 in the expression with the values from the list.

replace <- list(x1 = as.name("alphabet"), x2 = as.name("zoology"))
substitute(expression(x1 + x2 + log(x1) + x3), replace)
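substitute can also swap in values rather than names; here is a minimal base-R illustration (the variable names are arbitrary):

```r
# substitute() returns the modified, unevaluated expression;
# eval() then executes it
e <- substitute(x1 + x2, list(x1 = 10, x2 = 32))
e         # 10 + 32
eval(e)   # 42
```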


There are a number of commands for working with column and row names in R. It's generally suggested that the setNames function be used when modifying data structure names within a function.

names_new <- c("one","two","three","four")
new_dat <- expression(data.frame(x1 = rnorm(100, m=50), x2 = rnorm(100, m=50), x3 = rnorm(100, m=50), y = rnorm(100)))
head(setNames(eval(new_dat), names_new))
my_lst <- list(lname = as.name("xls"))
setNames(my_lst, "x1")

When working with data within a function, it will often be useful to write code that creates the names and data structure, and then evaluates the pieces together.


Use the missing function to check whether an argument was supplied by the user.

func <- function(x){
    if(missing(x)){
        x = 5
    }
    y = x*x; y
}


The stop function halts a function's execution and is usually invoked when a particular condition is met.

guess <- function(n){
  if( n >= 6 ){
      stop("n cannot be 6 or greater")
  }
}


Use the warning function to issue a warning message when a condition is met. Unlike stop, it does not halt the function.

guess <- function(n){
  if( n >= 6 ){
      warning("BE CAREFUL!")
  }
}

When functions are evaluated, R scans each function argument it can understand. When an ellipsis (...) is added as a function argument, it allows other arguments to be passed into the function. This technique is frequently used in plotting and should be used when the function is designed to take any number of named or unnamed arguments.

plotter <- function(x, y, ...){
  plot(x, y, ...)
}

To make use of the ellipsis, it's suggested that the user capture the ... and turn them into a list, since some of the arguments in the dots may be intended for several different functions.
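As a standalone sketch of that pattern (the function and argument names are hypothetical):

```r
# capture ... into a list so individual arguments can be inspected
# or routed to different downstream functions
dots_demo <- function(...){
   dots <- list(...)
   names(dots)
}

dots_demo(col = "red", lwd = 2)   # "col" "lwd"
```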


Turning Data Into Awesome With sqldf and pandasql

Both R and Python possess libraries for using SQL statements to interact with data frames. While both languages have native facilities for manipulating data, the sqldf and pandasql libraries provide a simple and elegant interface for conducting tasks using an intuitive framework that's widely used by analysts.

R and sqldf

sqldf("SELECT COUNT(*) FROM df2 WHERE state = 'CA'")
1        4

sqldf("SELECT df1.firstname, df1.lastname, df1.var1, df2.state
      FROM df1
      INNER JOIN df2 ON df1.personid = df2.id 
      WHERE df2.state = 'TX'")
  firstname lastname  var1 state
1     David    Spade -2.09    TX
2       Joe  Montana  1.16    TX

sqldf("SELECT df2.state, COUNT(df1.var1)
      FROM df1
      INNER JOIN df2 ON df1.personid = df2.id WHERE df1.var1 > 0
      GROUP BY df2.state")
   state COUNT(df1.var1)
1     AZ               1
2     CA               1
3     GA               1
4     IL               1
5     NC               1
6     NY               1
7     OK               1
8     SC               1
9     TX               1
10    VT               1

Python and pandasql

import pandasql as ps

q1 = """SELECT COUNT(*) FROM df2 WHERE state = 'CA'"""
print ps.sqldf(q1, locals())
0         4

q2 = """SELECT df1.firstname, df1.lastname, df1.var1, df2.state
    FROM df1 INNER JOIN df2 ON df1.personid = df2.id 
    WHERE df2.state = "TX";"""
print ps.sqldf(q2, locals())
  firstname lastname  var1 state
0     David    Spade -2.09    TX
1       Joe  Montana  1.16    TX

q3 = """SELECT df2.state, COUNT(df1.var1)
      FROM df1
      INNER JOIN df2 ON df1.personid = df2.id WHERE df1.var1 > 0
      GROUP BY df2.state"""
print ps.sqldf(q3, locals())
  state  COUNT(df1.var1)
0    AZ                1
1    CA                1
2    GA                1
3    IL                1
4    NC                1
5    NY                1
6    OK                1
7    SC                1
8    TX                1
9    VT                1


Semiparametric Regression in R


When building statistical models, the goal is to define a compact and parsimonious mathematical representation of some data generating process. Many of these techniques require that one make assumptions about the data or how the analysis is specified. For example, Auto Regressive Integrated Moving Average (ARIMA) models require that the time series is weakly stationary or can be made so. Furthermore, ARIMA assumes that the data has no deterministic time trends, the variance of the error term is constant, and so forth. Assumptions are generally a good thing, but there are definitely situations in which one wants to free themselves from such “constraints.”

In the context of evaluating relationships between one or more target variables and a set of explanatory variables, semiparametric regression is one such technique that provides the user with some flexibility in modeling complex data without maintaining stringent assumptions. With semiparametric regression, the goal is to develop a properly specified model that integrates the simplicity of parametric estimation with the flexibility provided by nonparametric splines.


A core requirement of both linear and generalized linear regression is that the user must define a set of explanatory variables and a functional form that describes the relationship between the predictors and the response variable. So when the goal is to build a linear regression model that explains which social and political characteristics within a country account for different levels of macroeconomic performance, we must first specify a number of variables that may explain our target variable. Furthermore, we must specify the functional form of the relationship. In both linear and generalized linear regression, this refers to how we characterize and “encode” each of the explanatory variables. So if we include the log of the yearly precipitation rate as a predictor such that Y = B0 + B1*log(precipitation), the model would be nonlinear in the variables (log(precipitation)) and linear in the parameters (B1). Linearity in the variables is not an assumption of linear or generalized linear regression. However, ignoring it could result in a misspecified model with incorrect estimates.


One way to ensure that the model is properly specified is through the use of nonparametric splines. Instead of assuming that we know the functional form for a regression model, the user would essentially estimate the appropriate functional form from the data. Compared to a traditional parametric model whereby we estimate one global value that represents the relationship between X and Y , we would instead have a series of local estimates that characterize the relationship across different values of X . There are many different variations of splines that can be utilized, including b-splines, natural splines, cubic splines, and so forth. However, numerous studies have shown that restricted cubic splines with three to five knots tend to perform best across a wide array of domains.
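As a quick sketch of the idea, the ns function in the splines package (which ships with R) fits natural cubic splines, a form of restricted cubic spline; the data below is simulated purely for illustration:

```r
library(splines)  # ships with base R

set.seed(42)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

# Natural (restricted) cubic spline with 4 degrees of freedom;
# interior knots are placed at quantiles of x.
mod <- lm(y ~ ns(x, df = 4))

# The fitted curve tracks the local structure in the data rather
# than imposing a single global slope.
ord <- order(x)
plot(x, y, cex = 0.8)
lines(x[ord], fitted(mod)[ord], col = "red", lwd = 2)
```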

In our previous model where the goal was to identify how macroeconomic performance is affected by weather/precipitation, the linear regression model would look like economy = B0 + B1*precipitation . After incorporating a spline for the explanatory variable, we can characterize the functional form of the model as economy = B0 + f(precipitation) , whereby f characterizes the relationship between precipitation and economy across the range of precipitation values. An important thing to note when developing semiparametric models is that although they give users flexibility by incorporating a smoothing term, the estimation process still requires some of the core assumptions of OLS and GLM, namely independence of observations, normally distributed errors, and constant error variance.

Before we look at how to implement semiparametric regression, it is worth noting that these types of models are often referred to as generalized additive models (GAMs). Furthermore, semiparametric variations of other regression models are available, such as semiparametric quantile regression and even semiparametric nonlinear regression.


For this post, I’m going to stick with the gam function in the mgcv package because it is usually a good starting point. Taking the previous use case, let’s create some data and construct a linear regression model that regresses economy as a function of weather. Given that the data is from a random distribution, we obviously find a weak relationship between the two variables.

set.seed(123)  # for reproducibility
mydat = data.frame(economy = rnorm(100),
                   weather = rnorm(100))
mod = lm(economy ~ weather, data = mydat)
plot(economy ~ weather, data = mydat, cex = .8)
abline(mod, col = "red")

Let us say that we know based on experience, theory, or other sources that linearity in the variables is a poor specification in this domain case. One thing that could be done is to incorporate a spline to create local estimates for our explanatory variable within the linear regression.

library(splines)
# Note: df = 5 places interior knots at quantiles of weather;
# knots = 3 would instead place a single knot at the value 3.
mod <- lm(economy ~ bs(weather, df = 5), data = mydat)
plot(economy ~ weather, data = mydat, cex = .8)
ord <- order(mydat$weather)
lines(mydat$weather[ord], predict(mod)[ord], lwd = 2, col = 'red')

The same model could also be examined using gam. Take note of the edf value, which represents how much the explanatory variable is smoothed.

library(mgcv)
mod <- gam(economy ~ s(weather, bs = "cr", k = 3),
           data = mydat,
           family = gaussian)
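To actually inspect the edf, look at the summary of the fitted object; the sketch below re-simulates the toy data so it runs on its own:

```r
library(mgcv)

set.seed(1)
mydat <- data.frame(economy = rnorm(100), weather = rnorm(100))
mod <- gam(economy ~ s(weather, bs = "cr", k = 3),
           data = mydat, family = gaussian)

# Effective degrees of freedom for each smooth term:
summary(mod)$edf
# An edf near 1 indicates the smooth has been shrunk toward a
# simple linear fit; larger values indicate more wiggliness.
```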

This example data is pretty silly, but it should give you a starting point for learning how semiparametric models can improve the inferences you draw.


Have questions, comments, interesting work, etc? Feel free to contact me at mathewanalytics@gmail.com

Up next will be a series of posts on causal inference, a topic that I’ve been trying to get a better understanding of over the past month.

Packages for Getting Started with Time Series Analysis in R

A. Motivation

During the recent RStudio Conference, an attendee asked the panel about the lack of support provided by the tidyverse in relation to time series data. As someone who has spent the majority of their career on time series problems, I found this somewhat surprising because R already has a great suite of tools for visualizing, manipulating, and modeling time series data. I can understand the desire for a ‘tidyverse approved’ tool for time series analysis, but it seemed like the issue was perhaps a lack of familiarity with the available tooling. Therefore, I wanted to put together a list of the packages and tools that I use most frequently in my work. For those unfamiliar with time series analysis, this could be a good place to start investigating R’s current capabilities.

B. Background

Time series data refers to a sequence of measurements made over time, at regular or irregular intervals, with each observation being a single dimension. An example of a low-dimensional time series is daily temperature readings from 01/01/2001 through 12/31/2005. A higher-dimensional time series is characterized by a larger number of observations, such as daily temperature readings from 01/01/1980 through 12/31/2010. In either case, the goal of the analysis could lead one to perform regression, clustering, forecasting, or even classification.

C. Data For Examples

To run the code in this post, you will need to download the following data through the unix terminal. The command below retrieves a csv file from the City of Chicago website that contains information on reported incidents of crime that occurred in the city of Chicago from 2001 to present.

$ wget --no-check-certificate --progress=dot -O chicago_crime_data.csv "https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD"

Import the data into R and get the aggregate number of reported incidents of theft by day.

library(data.table)

dat = fread("chicago_crime_data.csv")
colnames(dat) = gsub(" ", "_", tolower(colnames(dat)))
dat[, date2 := as.Date(date, format="%m/%d/%Y")]
mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]

D. Data Representation

The first set of packages that one should be aware of is related to data storage. One could use data frames, tibbles, or data tables, but there are already a number of data structures that are optimized for representing time series data. The fundamental time series object is “ts”. However, the “ts” class has a number of limitations, and so it is usually best to work with the extensible time series (“xts”) object.
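For context, here is a quick look at what the base "ts" class can and cannot do (the monthly values below are simulated for illustration):

```r
# Base R's "ts" class: good for regularly spaced data with a
# fixed frequency, such as monthly observations.
z <- ts(rnorm(24), start = c(2015, 1), frequency = 12)  # two years, monthly

start(z)      # period of the first observation
frequency(z)  # observations per cycle (12 = monthly)
window(z, start = c(2015, 6), end = c(2015, 12))  # subset by period

# Limitation: "ts" cannot represent irregular timestamps or
# date-based indexing, which is where xts/zoo objects come in.
```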

D1. xts

The xts package offers a number of great tools for data manipulation and aggregation. At its core is the xts object, which is essentially a matrix object that can represent time series data at different time increments. xts is a subclass of the zoo object, which provides it with a lot of functionality.

Here are some functions in xts that are worth investigating:

library(xts)

# create an xts object
mydat2 = as.xts(mydat)
# filter by date
mydat2["2015"]  ## 2015
mydat2["201501"]  ## Jan 2015
mydat2["20150101/20150105"]  ## Jan 01 to Jan 05 2015
# replace all values from Aug 25 onwards with 0
mydat2["20170825/"] <- 0
# get the last one month
last(mydat2, "1 month")
# get stats by time frame
apply.monthly(mydat2, sum)
apply.monthly(mydat2, quantile)
period.apply(mydat2, endpoints(mydat2, on='months'), sum)
period.apply(mydat2, endpoints(mydat2, on='months'), quantile)

E. Dates

R has a maddening array of date and time classes. Be it yearmon, POSIXct, POSIXlt, chron, or something else, each has specific strengths and weaknesses. In general, I find myself using the lubridate package as it simplifies many of the complexities associated with date-times in R.

E1. lubridate

The lubridate package provides a lot of functionality for parsing and formatting dates, comparing different times, extracting the components of a date-time, and so forth.

ymd_h("2010-01-01 10")
ymd_hm("2010-01-01 10:02")
ymd_hms("2010-01-01 10:02:30")
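Beyond parsing, lubridate also makes component extraction and date arithmetic straightforward; a short sketch:

```r
library(lubridate)

x <- ymd_hms("2010-01-01 10:02:30")

# Extract the components of a date-time:
year(x)
month(x)
day(x)
hour(x)
wday(x, label = TRUE)  # day of the week

# Arithmetic and comparison also become simple:
x + days(30)
x < ymd("2010-06-01")
```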

F. Time Series Regression

Distributed lag models (and the related error correction models) are a core component of doing time series analysis. There are many instances where we want to regress an outcome variable at the current time against values of various regressors at current and previous times. dynlm and ardl (a wrapper for dynlm) are solid for this type of analysis. Another common task when working with distributed lag models involves using dynamic simulations to understand estimated outcomes in different scenarios. dynsim provides a coherent solution for simulating and visualizing the estimated values of the target variable under those scenarios.

F1. dynlm / ardl

Here is a brief example of how dynlm can be utilized. In what follows, I have created a new weather variable and lagged it by one day, so the model regresses incidents of reported theft on the weather from the previous day.

library(dynlm)

mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
mydat[, weather := sample(20:90, nrow(mydat), replace=TRUE)]
mydat[, weather_lag := shift(weather, 1, type = 'lag')]
# dynlm needs a ts/zoo object so that L() can construct the lag
mod = dynlm(N ~ L(weather, 1), data = ts(mydat[, .(N, weather)]))

F2. dynsim

Here is a brief example of how dynsim can be utilized. Building on the lagged weather variable created above, I've used dynsim to produce two dynamic simulations and plotted them.

library(dynsim)

mydat3 = mydat[1:10000]
mod = lm(N ~ weather_lag, data = mydat3)
Scen1 <- data.frame(weather_lag = min(mydat3$weather_lag, na.rm=TRUE))
Scen2 <- data.frame(weather_lag = max(mydat3$weather_lag, na.rm=TRUE))
ScenComb <- list(Scen1, Scen2)
Sim1 <- dynsim(obj = mod, ldv = 'weather_lag', scen = ScenComb, n = 20)
dynsimGG(Sim1)

G. Forecasting

G1. forecast

The forecast package is the most widely used R package for time series forecasting. It contains functions for performing decomposition and forecasting with exponential smoothing, ARIMA, moving average models, and so forth. For aggregated data that is fairly low dimensional, one of the techniques present in this package should provide an adequate forecasting model, given that the assumptions hold.

Here is a quick example of how to use the auto.arima function. In general, automatic forecasting tools should be used with caution, but they are a good place to start exploring time series data.

library(forecast)

mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
fit = auto.arima(mydat[,.(N)])
pred = forecast(fit, 200)

G2. smooth

The smooth package provides functions to perform even more variations of exponential smoothing, moving average models, and various seasonal ARIMA techniques. The smooth and forecast packages are usually more than adequate for most forecasting problems that pertain to low dimensional data.

Here is a basic example that uses the automatic complex exponential smoothing function:

library(smooth)

mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
fit = auto.ces(mydat[,N])
pred = forecast(fit, 200)

So for those of you getting introduced to the R programming language, this is a list of extremely useful packages for time series analysis that you will want to get some exposure to.

Have questions, comments, interesting consulting projects, or work that needs to be done? Feel free to contact me at mathewanalytics@gmail.com

Data.Table by Example – Part 3

For this final post, I will cover some advanced topics and discuss how to use data tables within user-defined functions. Once again, let's use the Chicago crime data.

library(data.table)

dat = fread("rows.csv")
names(dat) <- gsub(" ", "_", names(dat))
# Note: the RHS must be a list, otherwise the same sampled vector
# would be recycled into all three columns.
dat[, c("value1", "value2", "value3") := .(sample(1:50, nrow(dat), replace=TRUE),
                                           sample(1:50, nrow(dat), replace=TRUE),
                                           sample(1:50, nrow(dat), replace=TRUE))]

Let’s start by subsetting the data. The following code takes the first 50000 rows of the dat dataset, selects four columns, creates three new columns pertaining to the date, and then removes the original date column. The output is saved to a new variable, and the first few rows of the new data table can be viewed using brackets or the head function.

library(lubridate)

ddat = dat[1:50000, .(Date, value1, value2, value3)][,
               c("year", "month", "day") :=
                 .(year(mdy_hms(Date)), month(mdy_hms(Date)), day(mdy_hms(Date)))][,
               Date := NULL]

ddat[1:3]  # same as head(ddat, 3)

We can now do some intermediate calculations and suppress their output by using braces.

ddat[, { avg_val1 = mean(value1)
         new_val1 = mean(abs(value2-avg_val1))
         new_val2 = new_val1^2 }, by=month][order(month)]

In this simple example, we have taken the mean of value1 for each month, computed the mean absolute difference between value2 and that mean, and then squared the result. The output shows the final calculation in the braces, which is the squared value. Note that I also ordered the results by month through chaining.

That’s all very nice and convenient, but what if we want to create user-defined functions that essentially automate these tasks for future work? Let us start with a function for extracting each component from the date column.

ddat = dat[1:50000, .(Date, value1, value2, value3)]

add_engineered_dates <- function(mydt, date_col="Date"){

   new_dt = copy(mydt)

   new_dt[, paste0(date_col, "_year") := year(mdy_hms(get(date_col)))]
   new_dt[, paste0(date_col, "_month") := month(mdy_hms(get(date_col)))]
   new_dt[, paste0(date_col, "_day") := day(mdy_hms(get(date_col)))]

   new_dt[, c(date_col) := NULL]

   new_dt[]
}
We’ve created a new function that takes two arguments: the data table variable name and the name of the column containing the date values. The first step is to copy the data table so that we are not directly making changes to the original data. Because the goal is to add three new columns that extract the year, month, and day from the date column, we’ve used the lubridate package to parse the date column and then extract the desired values. Furthermore, each new column has been labeled so that it distinctly represents the original date column name and the component that it contains. The next step removes the original date column by setting it to NULL. The final line, new_dt[], returns the result and prints it to the screen. Notice that the get function is used to take a string holding a column name and resolve it to the actual column within the function.

result = add_engineered_dates(ddat)
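To see the get mechanism in isolation, here is a toy sketch (the two-row table and the substr-based year extraction are simplifications made up for illustration):

```r
library(data.table)

# Toy illustration of get(): resolving a column name held in a string.
dt <- data.table(Date = c("01/15/2016 08:00:00", "03/02/2017 12:30:00"))
date_col <- "Date"

# get(date_col) evaluates to the Date column inside the data.table
# scope, which is what lets add_engineered_dates() work with any
# column name. Here the year is pulled out with a simple substr.
dt[, paste0(date_col, "_year") := as.integer(substr(get(date_col), 7, 10))]
dt
```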

Let’s now apply the previous code with braces to find the squared difference between value2 and the mean value1 metric.

result[, { avg_val1 = mean(value1)
           new_val1 = mean(abs(value2-avg_val1))
           new_val2 = new_val1^2 }, by=.(Date_year,Date_month)][
             order(Date_year, Date_month)]


I’m hoping that these three posts have convinced beginners to the R language that the data.table package is beautiful and worth checking out. The syntax may be scary at first, but it offers an array of powerful tools that everyone should become familiar with. So roll your sleeves up and have fun.

If you have any comments or would like to cover other specific topics, feel free to comment below. You can also contact me at mathewanalytics@gmail.com or reach me through LinkedIn.