Packages for Getting Started with Time Series Analysis in R

A. Motivation

During the recent RStudio Conference, an attendee asked the panel about the lack of support provided by the tidyverse in relation to time series data. As someone who has spent the majority of my career on time series problems, this was somewhat surprising because R already has a great suite of tools for visualizing, manipulating, and modeling time series data. I can understand the desire for a ‘tidyverse approved’ tool for time series analysis, but it seemed like perhaps the issue was a lack of familiarity with the available tooling. Therefore, I wanted to put together a list of the packages and tools that I use most frequently in my work. For those unfamiliar with time series analysis, this could be a good place to start investigating R’s current capabilities.

B. Background

Time series data refers to a sequence of measurements that are made over time at regular or irregular intervals, with each observation being a single dimension. An example of a low dimensional time series is the daily temperature from 01/01/2001 through 12/31/2005. A high dimensional time series is characterized by a larger number of observations, so an example could be the daily temperature from 01/01/1980 through 12/31/2010. In either case, the goal of the analysis could lead one to perform regression, clustering, forecasting, or even classification.

C. Data For Examples

To run the code in this post, you will need to download the data from the Unix terminal. The following command downloads a csv file from the City of Chicago website that contains information on reported incidents of crime that occurred in the city of Chicago from 2001 to the present.

$ wget --no-check-certificate --progress=dot -O chicago_crime_data.csv "https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD"

Import the data into R and get the aggregate number of reported incidents of theft by day.

library(data.table)
# import the data and normalize the column names
dat = fread("chicago_crime_data.csv")
colnames(dat) = gsub(" ", "_", tolower(colnames(dat)))
# parse the date and count reported thefts by day
dat[, date2 := as.Date(date, format="%m/%d/%Y")]
mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
mydat[1:6]

D. Data Representation

The first set of packages that one should be aware of is related to data storage. One could use data frames, tibbles, or data tables, but there are already a number of data structures that are optimized for representing time series data. The fundamental time series object is “ts”. However, the “ts” class has a number of limitations, and so it is usually best to work with the extensible time series (“xts”) object.
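
For reference, here is a minimal sketch of the base “ts” class using made-up monthly counts; note that it assumes regularly spaced observations and has no real notion of daily dates:

# a made-up monthly series; ts() assumes regularly spaced observations
my_ts = ts(c(120, 135, 128, 160, 152, 170), start = c(2015, 1), frequency = 12)
my_ts
plot(my_ts)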

D1. xts

The xts package offers a number of great tools for data manipulation and aggregation. At its core is the xts object, which is essentially a matrix object that can represent time series data at different time increments. xts is a subclass of the zoo object, which provides it with a lot of functionality.

Here are some functions in xts that are worth investigating:

library(xts)
# create an xts object from the counts, indexed by date
mydat2 = xts(mydat$N, order.by = mydat$date2)
mydat2
plot.xts(mydat2)
# filter by date 
mydat2["2015"]  ## 2015
mydat2["201501"]  ## Jan 2015
mydat2["20150101/20150105"]  ## Jan 01 to Jan 05 2015 
# replace all values from Aug 25 onwards with 0
mydat2["20170825/"] <- 0
mydat2["20170821/"]
# get the last one month 
last(mydat2, "1 month")
# get stats by time frame
apply.monthly(mydat2, sum)
apply.monthly(mydat2, quantile)
period.apply(mydat2, endpoints(mydat2,on='months'), sum)
period.apply(mydat2, endpoints(mydat2,on='months'), quantile)

E. Dates

R has a maddening array of date and time classes. Be it yearmon, POSIXct, POSIXlt, chron, or something else, each has specific strengths and weaknesses. In general, I find myself using the lubridate package as it simplifies many of the complexities associated with date-times in R.

E1. lubridate

The lubridate package provides a lot of functionality for parsing and formatting dates, comparing different times, extracting the components of a date-time, and so forth.

library(lubridate)
ymd("2010-01-01")
mdy("01-01-2010")
ymd_h("2010-01-01 10")
ymd_hm("2010-01-01 10:02")
ymd_hms("2010-01-01 10:02:30")

F. Time Series Regression

Distributed lag models, and the closely related error correction models, are a core component of time series analysis. There are many instances where we want to regress an outcome variable at the current time against values of various regressors at current and previous times. dynlm and ardl (a wrapper for dynlm) are solid for this type of analysis. Another common task when working with distributed lag models involves using dynamic simulations to understand estimated outcomes in different scenarios. dynsim provides a coherent solution for simulating and visualizing those estimated values of the target variable.

F1. dynlm / ardl

Here is a brief example of how dynlm can be utilized. In what follows, I have created a synthetic weather variable and lagged it by one day. The model then attempts to regress the number of reported theft incidents on the previous day's weather.

library(dynlm)
library(zoo)
mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
# simulate a weather variable, since the crime data does not include one
mydat[, weather := sample(20:90, nrow(mydat), replace=TRUE)]
mydat[, weather_lag := shift(weather, 1, type = 'lag')]
# dynlm's L() operator needs a time series object, so build a zoo series
mydatz = zoo(as.matrix(mydat[, .(N, weather)]), order.by = mydat$date2)
mod = dynlm(N ~ L(weather, 1), data = mydatz)
summary(mod)

F2. dynsim

Here is a brief example of how dynsim can be utilized. In what follows, I fit a simple model with the lagged weather variable, then use dynsim to produce two dynamic simulations and plot them.

library(dynsim)
mydat3 = mydat[1:10000]
mod = lm(N ~ weather_lag, data = mydat3)
Scen1 <- data.frame(weather_lag = min(mydat3$weather_lag, na.rm=T))
Scen2 <- data.frame(weather_lag = max(mydat3$weather_lag, na.rm=T))
ScenComb <- list(Scen1, Scen2)
Sim1 <- dynsim(obj = mod, ldv = 'weather_lag', scen = ScenComb, n = 20)
dynsimGG(Sim1)

G. Forecasting

G1. forecast

The forecast package is the most used package in R for time series forecasting. It contains functions for performing decomposition and forecasting with exponential smoothing, ARIMA, moving average models, and so forth. For aggregated data that is fairly low dimensional, one of the techniques in this package should provide an adequate forecasting model, provided that the model assumptions hold.

Here is a quick example of how to use the auto.arima function in R. In general, automatic forecasting tools should be used with caution, but they are a good place to start when exploring time series data.

library(forecast)
mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
fit = auto.arima(mydat[, N])
pred = forecast(fit, h = 200)
plot(pred)
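
The package also covers decomposition and exponential smoothing. Here is a minimal sketch along those lines; treating the daily counts as a weekly-seasonal series (frequency = 7) is my assumption, not something dictated by the data:

# treat the daily counts as a weekly-seasonal series (an assumption)
myts = ts(mydat[, N], frequency = 7)
plot(stl(myts, s.window = "periodic"))  # seasonal decomposition
fit2 = ets(myts)                        # exponential smoothing model
plot(forecast(fit2, h = 50))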

G2. smooth

The smooth package provides functions to perform even more variations of exponential smoothing, moving average models, and various seasonal ARIMA techniques. The smooth and forecast packages are usually more than adequate for most forecasting problems that pertain to low dimensional data.

Here is a basic example that uses the automatic complex exponential smoothing function:

library(smooth)
mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
fit = auto.ces(mydat[, N])
pred = forecast(fit, h = 200)
plot(pred)

So for those of you getting introduced to the R programming language, this is a list of extremely useful packages for time series analysis that you will want to get some exposure to.

Have questions, comments, interesting consulting projects, or work that needs to be done? Feel free to contact me at mathewanalytics@gmail.com


Data.Table by Example – Part 3

For this final post, I will cover some advanced topics and discuss how to use data tables within user-defined functions. Once again, let's use the Chicago crime data.

dat = fread("rows.csv")
names(dat) <- gsub(" ", "_", names(dat))
dat[, c("value1", "value2", "value3") := sample(1:50, nrow(dat), replace=TRUE)]
dat[1:3]

Let’s start by subsetting the data. The following code takes the first 50,000 rows of the dat dataset, selects four columns, creates three new columns from the date, and then removes the original date column. The output is saved to a new variable, and the user can see the first few rows of the new data table using brackets or the head function.

library(lubridate)  # for mdy_hms()

ddat = dat[1:50000, .(Date, value1, value2, value3)][,
               c("year", "month", "day") :=
                        .(year(mdy_hms(Date)),
                          month(mdy_hms(Date)),
                          day(mdy_hms(Date)))][,-c("Date")]

ddat[1:3]  # same as head(ddat, 3)

We can now do some intermediate calculations and suppress their output by using braces.

unique(ddat$month)
ddat[, { avg_val1 = mean(value1)
         new_val1 = mean(abs(value2-avg_val1))
         new_val2 = new_val1^2 }, by=month][order(month)]

In this simple example, we have taken the mean of value1 for each month, subtracted it from value2, and then squared that result. The output shows only the final calculation inside the braces, which is the result of the squaring. Note that I also ordered the results by month through chaining.

That’s all very nice and convenient, but what if we want to create user-defined functions that essentially automate these tasks for future work? Let us start with a function for extracting each component from the date column.

ddat = dat[1:50000, .(Date, value1, value2, value3)]

add_engineered_dates <- function(mydt, date_col="Date"){

   new_dt = copy(mydt)

   new_dt[, paste0(date_col, "_year") := year(mdy_hms(get(date_col)))]
   new_dt[, paste0(date_col, "_month") := month(mdy_hms(get(date_col)))]
   new_dt[, paste0(date_col, "_day") := day(mdy_hms(get(date_col)))] 

   new_dt[, c(date_col) := NULL]

   new_dt[]
}

We’ve created a new function that takes two arguments: the data table and the name of the column containing the date values. The first step is to copy the data table so that we are not directly making changes to the original data. Because the goal is to add three new columns that extract the year, month, and day from the date column, we’ve used the lubridate package to parse the date column and extract the desired values. Furthermore, each new column is labeled so that it distinctly represents the original date column name and the component that it contains. The final step is to remove the original date column, so we’ve set the date column to NULL. The final line, new_dt[], prints the result back to the screen. Notice that the get function is used within the function to turn the string containing the column name into the column itself.

result = add_engineered_dates(ddat)
result

Let’s now apply the previous code with braces to find the squared difference between value2 and the mean value1 metric.

result[, { avg_val1 = mean(value1)
           new_val1 = mean(abs(value2-avg_val1))
           new_val2 = new_val1^2 }, by=.(Date_year,Date_month)][
                  order(Date_year,Date_month)][1:12]

Perfect!

I’m hoping that these three posts have convinced beginners to the R language that the data.table package is beautiful and worth checking out. The syntax may be scary at first, but it offers an array of powerful tools that everyone should become familiar with. So roll your sleeves up and have fun.

If you have any comments or would like to cover other specific topics, feel free to comment below. You can also contact me at mathewanalytics@gmail.com or reach me through LinkedIn

Data.Table by Example – Part 2

In part one, I provided an initial walk through of some nice features that are available within the data.table package. In particular, we saw how to filter data and get a count of rows by the date.

dat = fread("rows.csv")
names(dat) <- gsub(" ", "_", names(dat))
dat[1:3]

Let us now add a few columns to our dataset on reported crimes in the city of Chicago. There are many ways to do this, but they all involve the use of the := operator. Since data.table updates values by reference, we do not need to save the results as another variable. This is a very desirable feature.

dat[, c("value1", "value2", "value3") := sample(1:50, nrow(dat), replace=TRUE)]

dat[, `:=`(value1 = sample(1:50, nrow(dat), replace=TRUE),
           value2 = sample(1:50, nrow(dat), replace=TRUE),
           value3 = sample(1:50, nrow(dat), replace=TRUE))]

You can also just use the traditional base R solution for adding new columns, as data.tables are also data frames.

dat$value1 <- sample(1:50, nrow(dat), replace=TRUE)
dat$value2 <- sample(1:50, nrow(dat), replace=TRUE)
dat$value3 <- sample(1:50, nrow(dat), replace=TRUE)

In any case, we now have three new columns with randomly selected values between 1 and 50. We can now summarize these values and see how they differ across the primary crime type and other categorical variables.

dat[, .(mean = mean(value1, na.rm = TRUE),
        median = median(value1, na.rm = TRUE),
        min = min(value1, na.rm = TRUE),
        max = max(value1, na.rm = TRUE))]

dat[Primary_Type=="PROSTITUTION",
              .(mean = mean(value1, na.rm = TRUE),
                median = median(value1, na.rm = TRUE),
                min = min(value1, na.rm = TRUE),
                max = max(value1, na.rm = TRUE))]

The above code gives us the mean, median, min, and max of the value1 column, first for all rows and then for rows where the primary type is prostitution. To get these summaries across multiple columns, we can use lapply, .SD, and .SDcols. Notice that in the second line below, we pass Primary_Type to the by operator so that we get the mean value for each of the crime type categories.

dat[, lapply(.SD, mean, na.rm=TRUE), .SDcols=c("value1", "value2", "value3")]

dat[, lapply(.SD, mean, na.rm=TRUE), by=Primary_Type, .SDcols=c("value1", "value2", "value3")][1:4]

But wait, that only applies a single function (the mean) to each value column. If you want to apply multiple functions to the columns, you will need to write a function that can then be used within lapply.

my.summary <- function(x){
                   c(mean = mean(x),
                     median = median(x),
                     min = min(x),
                     max = max(x))
}

dat[, lapply(.SD, my.summary), .SDcols=c("value1", "value2", "value3")]

dat[, lapply(.SD, my.summary), by=.(Primary_Type), .SDcols=c("value1", "value2", "value3")]

Perfect! As you can see, the syntax is concise and is very easy to work with.

In the next part of this series, I’ll cover a few advanced features like get() and {}.

If you have any questions or comments, feel free to comment below. You can also contact me at mathewanalytics@gmail.com or reach me through LinkedIn 

 

Data.Table by Example – Part 1

For many years, I actively avoided the data.table package and preferred to utilize the tools available in either base R or dplyr for data aggregation and exploration. However, over the past year, I have come to realize that this was a mistake. Data tables are incredible and provide R users with a syntactically concise and efficient data structure for working with small, medium, or large datasets. While the package is well documented, I wanted to put together a series of posts that could be useful for those who want to get introduced to the data.table package in a more task-oriented format.

For this series of posts, I will be working with data that comes from the Chicago Police Department’s Citizen Law Enforcement Analysis and Reporting system. This dataset contains information on reported incidents of crime that occurred in the city of Chicago from 2001 to the present. You can use the wget command in the terminal to download it as a csv file.

$ wget --no-check-certificate --progress=dot -O rows.csv "https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD"

The file will be saved as "rows.csv" in your working directory. We will import the data into R with the fread function and look at the first few rows and the structure of the data.

dat = fread("rows.csv")

dat[1:3]
str(dat)

Notice that each of the string variables in the data set was imported as a character and not a factor. With base R functions like read.csv, we would have to set the stringsAsFactors argument to FALSE to get the same result.
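
For comparison, a base R import of the same file might look like this:

dat_base <- read.csv("rows.csv", stringsAsFactors = FALSE)
str(dat_base)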

names(dat) <- gsub(" ", "_", names(dat))

Let’s say that we want to see the frequency distribution of several of these variables. This can be done by using .N in conjunction with the by operator.

dat[, .N, by=.(Arrest)]

In the code below, you can also see how to chain operations together. We start by finding the count of each response in the variable, order the counts in descending order, and then select only those which occurred at least 200,000 times.

dat[, .N, by=.(Primary_Type)][order(-N)][N>=200000]

dat[, .N, by=.(Location_Description)][order(-N)][N>=200000]

Let’s say that we want to get a count of prostitution incidents by month. To get the desired results, we will need to modify the date values, filter for instances in which the primary type was "prostitution", and then get a count for each date.

dat[, date2 := paste(substr(as.Date(Date, format="%m/%d/%Y"),1,7), "01", sep="-")][
         Primary_Type=="PROSTITUTION", .N, by=.(date2)][, date2 := as.Date(date2)][order(date2)][]

If you want to plot the results as a line graph, just add another chain which executes the visualization, or use the magrittr %>% operator.

dat[, date2 := paste(substr(as.Date(Date, format="%m/%d/%Y"),1,7), "01", sep="-")][
             Primary_Type=="PROSTITUTION", .N, by=.(date2)][, date2 := as.Date(date2)][order(date2)][,
          plot(N, type="l")]

library(ggplot2)
library(magrittr)

dat[, date2 := paste(substr(as.Date(Date, format="%m/%d/%Y"),1,7), "01", sep="-")][
            Primary_Type=="PROSTITUTION", .N, by=.(date2)][, date2 := as.Date(date2)][order(date2)] %>%
          ggplot(., aes(date2, N)) + geom_line(group=1)

I’ve obviously skipped over a lot, and some of the code presented here is more verbose than needed. Even so, beginners to R will hopefully find this useful, and it may pique your interest in the beauty of the data.table package. Future posts will cover more of the goodies available in the data.table package, such as get(), set(), {}, and so forth.

If you have any questions or comments, feel free to comment below. You can also contact me at mathewanalytics@gmail.com or reach me through LinkedIn 

R Programming Notes – Part 2

In an older post, I discussed a number of functions that are useful for programming in R. I wanted to expand on that topic by covering other functions, packages, and tools. Over the past year, I have been working as an R programmer, and these are some of the lessons that have become fundamental in my work.

ISTRUE AND ISFALSE

isTRUE is a logical operator that can be very useful for checking whether a condition or variable has been set to true. Let's say that we are writing a script that will run a generalized linear regression when the parameter run_mod is set to true. The conditional portion of the script can be written as either if(isTRUE(run_mod)) or if(run_mod). I am partial to isTRUE, but this is entirely a matter of personal preference. Users should also be aware of the isFALSE function, which is part of the BBmisc package.


run_mod = TRUE 

if(isTRUE(run_mod)){
	tryCatch(
         GLM_Model(full_data=full.df, train_data=train.df, 
                      test_data=test.df), 
    error = function(e) {
         print("an error occurred")
         print(e)
    })
}

if(BBmisc::isFALSE(check_one) & BBmisc::isFALSE(check_two)){
	data_output.tmp$score = 0.8
}

INVISIBLE

The invisible function can be used to return an object without printing it, which is useful in a number of circumstances. For example, it's useful when you have helper functions that will be utilized within other functions to do calculations. In those cases, it's often not desirable to print those results. I generally use invisible when I'm checking function arguments. For example, consider a function that takes two arguments and you need to check whether the input is valid.


if(!check_input(response, type='character', length=1)) {
    stop('something is wrong')
}

The check_input function is something I created, and it has a few lines which contain invisible. The idea is for check_input to return true or false based on the inputs so that execution stops when needed.


if(is.null(response) || length(response) == 0) {
    return(FALSE)
} else if (!is.null(response)) {
    return(invisible(TRUE))
}
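
For context, here is a self-contained sketch of what a check_input helper along these lines might look like; the function and its checks are my own illustration, not a packaged API:

# a hypothetical check_input: returns FALSE on bad input and an invisible
# TRUE on success, so nothing prints when the check passes
check_input <- function(response, type = 'character', length = 1) {
    if (is.null(response) || base::length(response) == 0) {
        return(FALSE)
    }
    if (!inherits(response, type) || base::length(response) != length) {
        return(FALSE)
    }
    invisible(TRUE)
}

check_input("abc")   # passes silently
check_input(NULL)    # FALSE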

DEBUG

When I’m putting together new classes or have multiple functions that interact with one another, I ensure that the code includes a comprehensive debugging process. This means that I’m checking my code at various stages so that I can identify when issues arise. Consider that I’m putting together a function that will go through a number of columns in a data frame, summarize those variables, and save the results as a nested list. To put together code without issues, I ensure that the function takes a debug argument that triggers extra output when set to true. In the code below, it prints out values at different stages, and the final block checks the resulting data structure.


DSummary_Variable <- function(data_obj, var, debug=TRUE){
             ......
}

if(debug) message('|==========>>>  Processing of the variable. \n')  

if(debug){ 
  if(!missing(var_summary)){ 
      message('|==========>>>  var_summary has been created and 
              has a length of ', length(var_summary), ' and the 
              nested list has a length of ', 
              length(var_summary[['var_properties']]), ' \n')  
  } else {
      stop("var_summary is missing. Please investigate")
  }
}

If you have multiple functions that interact with one another, it’s a good idea to preface the printed message with the name of the function.


add_func <- function(a,b) a + b

mult_func <- function(a,b) a * b

main_func <- function(data, cola, colb, debug=TRUE){

	if(debug){
		message("mult_func: checking inputs to be used")
	}

	mult_func(data[,cola], data[,colb])

	if(debug){
		message("add_func: checking inputs to be used")
	}

	add_func(data[,cola], data[,colb])
}

Stay tuned for part three, where I’ll talk about the testthat and assertive packages.

R Programming Notes – Part 1

I’ve been on a note-taking binge recently. This post covers a variety of topics related to programming in R. The contents were gathered from many sources and structured in such a way that they provide a useful reference guide for understanding a number of useful R functions.

DO.CALL

The do.call function executes a function call on a list of arguments.

do.call("R_Function", "List_of_Arguments")

This is equivalent to telling R which arguments the function should operate on.

R_Function( List_of_Arguments ){
  ...
}

Consider the following four vectors. We can use do.call to find the total sum across all of them or to bind them together row by row.

x1 <- c(1,2,5)
x2 <- c(1,3,6)
x3 <- c(1,4,7)
x4 <- c(1,5,8)
 
do.call(sum, list(x1,x2,x3,x4))  # sum all list elements
do.call(rbind, list(x1,x2,x3,x4))  # rbind the list elements

Let’s consider a scenario where we have a small data frame and want to run a general linear model on different combinations of attributes. One solution to this problem is to create the formula object as a list within a function, and then utilize do.call to run a linear model on the desired attribute.

dat <- data.frame(x1 = rnorm(100, m=50), x2 = rnorm(100, m=50), x3 = rnorm(100, m=50), y = rnorm(100))
 
new_mod <- function(form){
  lstt = list(formula=as.formula(paste("y ~ ", form, sep="")), data=as.name("dat"))
  summary(do.call("lm", lstt))
}
 
new_mod("x1")
new_mod("x2")
new_mod("x3")

EVAL

The eval function evaluates an expression, which is an object that represents an action that can be performed in R. Creating an expression is different from evaluating it: building the object does not execute anything. In the following example, we assign a value to a variable and perform an operation using that variable.

x <- 4
y <- x * 10
y

The expression and quote functions take an expression as an argument and return it without evaluating it.

ee = expression(y~x1)
ee
 
z <- quote(y <- x * 10)
z
 
is.expression(ee)
is.call(z)

In the above example, z is referred to as a call object. Call objects are usually created with the call function. For example, the following code pieces together commands into a call object and evaluates them.

mycall <- call("lm", quote(y ~ x1), data=quote(dat))
mod <- eval(mycall)
summary(mod)

SUBSTITUTE

Another common procedure is to replace certain variables within a user-defined function. This can be achieved with the substitute function. The code below replaces x1 and x2 in the expression with the values from the list.

replace <- list(x1 = as.name("alphabet"), x2 = as.name("zoology"))
substitute(expression(x1 + x2 + log(x1) + x3), replace)

SETNAMES

There are a number of commands for working with column and row names in R. It’s generally suggested that the setNames function be used when modifying data structure names within a function.

names_new <- c("one","two","three","four")
new_dat <- expression(data.frame(x1 = rnorm(100, m=50), x2 = rnorm(100, m=50), x3 = rnorm(100, m=50), y = rnorm(100)))
head(setNames(eval(new_dat), names_new))
 
my_lst <- list(lname = as.name("xls"))
setNames(my_lst, "x1")

When working with data within a function, it will often be useful to write code that creates the names and data structure, and then evaluates the pieces together.
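
Here is a minimal sketch of that pattern; the function and column names are made up for illustration:

# build the names and the data-creating expression separately, then
# evaluate and label the result inside the function
make_data <- function(col_names, n = 100) {
    dat_expr <- quote(data.frame(rnorm(n, m = 50), rnorm(n, m = 50)))
    setNames(eval(dat_expr), col_names)
}
head(make_data(c("alpha", "beta")))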

MISSING

Use the missing function to check whether an argument was supplied by the user.

func <- function(x){
    if(missing(x)){
        x = 5
    }
    y = x*x; y
}
 
func(2)
func()

STOP

The stop function is used to halt a function, and is usually used when a particular condition is met.

guess <- function(n){
  if( n >= 6 ){
      stop("WRONG!")
}}
 
guess(5)
guess(10)

WARNING

Use the warning function to signal a warning message when a condition is met. It does not halt the function.

guess <- function(n){
  if( n >= 6 ){
      warning("BE CAREFUL!")
}}
 
guess(5)
guess(10)

ELLIPSIS (...)

When functions are evaluated, R scans each function argument it can understand. When an ellipsis is added as a function argument, it allows other arguments to be passed into the function. This technique is frequently used in plotting and should be used when the function is designed to take any number of named or unnamed arguments.

plotter <- function(x, y, ...){
  plot(x, y, ...)
}

To make use of the ellipsis, it’s suggested that the user capture the contents of ... in a list, since some of the arguments in the dots may be intended for several different functions.
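
Here is a small sketch of that idea; the function and argument names are illustrative:

# capture the dots as a list so individual arguments can be inspected
# before being passed along
plotter2 <- function(x, y, ...) {
  dots <- list(...)
  if (!is.null(dots$main)) message("plot title: ", dots$main)
  plot(x, y, ...)
}

plotter2(1:10, 1:10, main = "Example", col = "blue")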

 

Working With SEM Keywords in R

The following post was republished from two previous posts that appeared on an older blog of mine that is no longer available. They are from several years ago and relate to two critical questions that I encountered. One, how can I automatically generate hundreds of thousands of keywords for a search engine marketing campaign? Two, how can I develop an effective system for examining keywords based on different characteristics?

Generating PPC Keywords in R

Paid search marketing refers to the process of driving traffic to a website by purchasing ads on search engines. Advertisers bid on certain keywords that users might search for, and that determines when and where their ads appear. For example, an individual who owns an auto dealership would want to bid on automobile-related keywords that a reasonable person would search for. In both Google and Bing, advertisers are able to specify which keywords they would like to bid on and at what amount. If the user decides to bid on just a small number of keywords, they can type that information in and specify a bid. However, what if you want to bid on a significant number of keywords? Instead of typing each and every keyword into the Google or Bing dashboard, you could programmatically generate the keywords in R.

Let’s say that I run an online retail establishment that sells men’s and women’s streetwear, and I want to drive more traffic to my online store by placing ads on both Google and Bing. I want to bid on a number of keywords related to fashion and have created a number of ‘root’ words that will comprise the majority of these keywords. To generate my desired keywords, I have written a function which pastes together combinations of the root words.

root1 = c("fashion", "streetwear")
root2 = c("karmaloop", "crooks and castles", "swag")
root3 = c("urban clothing", "fitted hats", "snapbacks")
root4 = c("best", "authentic", "low cost") 

myfunc <- function(){
      lst <- list(root1=c(root1), root2=c(root2), root3=c(root3),
            root4=c(root4))
      myone <- function(x, y){
            m1 <- do.call(paste, expand.grid(lst[[x]], lst[[y]]))
            mydf <- data.frame(keyword=c(m1))
      }
      mydf <- rbind(myone("root4","root1"), myone("root2","root1"))
      }

mydat <- myfunc()
mydat

write.table(mydat, "adppc.txt", quote=FALSE, row.names=FALSE)

This isn’t the prettiest code in the world, but it gets the job done. In fact, similar results could have been achieved using the following code, which is much more efficient.

root5 = c("%s fashion")
root6 = c("%s streetwear")
adcam1 = sprintf(root5, root2)
adcam2 = sprintf(root6, root2)
df = data.frame(keywords=c(adcam1, adcam2))

write.table(df, "adppc.txt", quote=FALSE, row.names=FALSE)

If you have any suggestions for improving my R code, please mention it in the comment section below.

Creating Tags For PPC Keywords

When performing search engine marketing, it is usually beneficial to construct a system for making sense of keywords and their performance. While one could construct Bayesian belief networks to model the process of consumers clicking on ads, I have found that using ‘tags’ to categorize keywords is just as useful for conducting post-hoc analysis on the effectiveness of marketing campaigns. By ‘tags,’ I mean identifiers which categorize keywords according to their characteristics. For example, in the following data frame, we have six keywords, average bids, clicks, conversions, and tags for state, model, car, auto, save, and cheap. What we want to do now is set the boolean for each tag to 1 if and only if that tag is mentioned in the keyword.

# CREATE SOME DATA
df = data.frame(keyword=c("best car insurance",
                          "honda auto insurance",
                          "florida car insurance",
                          "cheap insurance online",
                          "free insurance quotes",
                          "iowa drivers save money"),
                average_bid=c(3.12, 2.55, 2.38, 5.99, 4.75, 4.59),
                clicks=c(15, 20, 30, 50, 10, 25),
                conversions=c(5, 2, 10, 15, 3, 5),
                state=0, model=0, car=0, auto=0, save=0, cheap=0)
df

# FUNCTION WHICH SETS EACH TAG TO 1 IF THE SPECIFIED TAG IS PRESENT IN THE KEYWORD
main <- function(df) {
  state <- c("michigan", "missouri", "florida", "iowa", "kansas")
  model <- c("honda", "toyota", "ford", "acura", "audi")
  car <- c("car")
  auto <- c("auto")
  save <- c("save")
  cheap <- c("cheap")
  for (i in 1:nrow(df)) {
    Words = strsplit(as.character(df[i, 'keyword']), " ")[[1]]
    if(any(Words %in% state)) df[i, 'state'] <- 1
    if(any(Words %in% model)) df[i, 'model'] <- 1 
    if(any(Words %in% car)) df[i, 'car'] <- 1
    if(any(Words %in% auto)) df[i, 'auto'] <- 1     
    if(any(Words %in% save)) df[i, 'save'] <- 1
    if(any(Words %in% cheap)) df[i, 'cheap'] <- 1
  }
  return(df)
}

one = main(df)

subset(one, state==TRUE | model==TRUE | auto==TRUE)

# AN ALTERNATE METHOD USING THE STRINGR PACKAGE

df

library(stringr)

# CREATE EACH TAG
state <- c("michigan", "missouri", "florida", "iowa", "kansas")
model <- c("honda", "toyota", "ford", "acura", "audi")
car <- c("car")
auto <- c("auto")
save <- c("save")
cheap <- c("cheap")

state_match <- str_c(state, collapse = "|")
model_match <- str_c(model, collapse = "|")
car_match <- str_c(car, collapse = "|")
auto_match <- str_c(auto, collapse = "|")
save_match <- str_c(save, collapse = "|")
cheap_match <- str_c(cheap, collapse = "|")

#FUNCTION TO SET TAG IF PRESENT IN THE KEYWORD
main <- function(df) {
  df$state <- str_detect(df$keyword, state_match)
  df$model <- str_detect(df$keyword, model_match)
  df$car <- str_detect(df$keyword, car_match)
  df$auto <- str_detect(df$keyword, auto_match)
  df$save <- str_detect(df$keyword, save_match)
  df$cheap <- str_detect(df$keyword, cheap_match)
  df
}

two = main(df)

subset(two, state==TRUE | model==TRUE | auto==TRUE)

By now, some of you are probably wondering why we don’t just select the keywords directly from the original data frame based on the desired characteristic. Well, that works too, although I’ve found that the marketing professionals I’ve worked with have preferred the ‘tagging’ method.

## Alternate approach - SELECT DIRECTLY

df

main <- function(df) {
  model <- c("honda", "toyota", "ford", "acura", "audi")
  # flag every row whose keyword contains one of the model words,
  # rather than returning only the first match
  matches <- sapply(strsplit(as.character(df$keyword), " "),
                    function(w) any(w %in% model))
  df[matches, 1:4]
}

three = main(df)

So there you have it, a method of ‘tagging’ strings according to a certain set of specified characteristics. The benefit of using ‘tags’ is that it provides you with a systematic way to document how the presence of certain words or phrases impacts performance.