Introduction to the RMS Package

The rms package offers a variety of tools to build and evaluate regression models in R. Originally named ‘Design’, the package accompanies the book “Regression Modeling Strategies” by Frank Harrell, which is essential reading for anyone who works in the ‘data science’ space. Over the past year or so, I have transitioned my personal modeling scripts to rms as it makes things such as bootstrapping, model validation, and plotting predicted probabilities easier to do. While the package is fairly well documented, I wanted to put together a ‘simpler’ and more accessible introduction that would explain to R-beginners how they could start using the rms package. For those with limited statistics training, I strongly suggest reading “Clinical Prediction Models” and working your way up to “Regression Modeling Strategies”.

We start this introduction to the rms package with the datadist function, which computes statistical summaries of predictors to automate estimation and plotting of effects. The user will generally supply the final data frame to the datadist function and set the data distribution using the options function. Note that if you modify the data in your data frame, then you will need to reset the distribution summaries using datadist.


my.df = data.frame(one=c(rnorm(100)), two=c(rnorm(100)), y=rnorm(100))
dd = datadist(my.df)
options(datadist="dd")

my.df_new = subset(my.df, two >= 0.05)
ddd <- datadist(my.df_new)
options( datadist = "ddd" )
ddd

The main functions to estimate models in rms are ols for linear models and lrm for logistic regression or ordinal logistic regression. There are also a few other functions for performing survival analysis,but they will not be covered in this post.


getHdata(prostate)
head(prostate)

ddd <- datadist(prostate)
options( datadist = "ddd" )

Using the prostate data, we built a linear model using ordinary least squares estimation. The argument x and y must be set to true if we plan on evaluate the model in later stages using the validate and calibrate functions. Because we haven’t altered the default contrasts or incorporated any smoothing splines, the coefficients and standard errors should be identical to the results of lm. Use the model variable name to see the estimates from the model and use the summary function to get an overview of the effects of each predictor on the response variable. One important thing to note that the effect point estimates in the summary.rms output relate to the estimated effect of an inter-quartile range increase in the predictor variable.


lmod = ols(wt ~ age + sbp + rx, data=prostate, x=TRUE, y=TRUE)
lmod
summary(lmod)
summary(lmod, age=c(50,70))

This may not seem like anything to write home about. But what makes the rms package special is that it makes the modeling process significantly easier. For the above linear regression model, let’s plot the predicted values and perform internal bootstrapped validation of the model. In the following code, the validate function is used to assess model fit and calibrate is used to assess if the observed responses agree with predicted responses.


plot(anova(lmod), what='proportion chisq') # relative importance
plot(Predict(lmod)) # predicted values
rms::validate(lmod, method="boot", B=500) # bootstrapped validation
my.calib <- rms::calibrate(lmod, method="boot", B=500) # model calibration
plot(my.calib, las=1)
vif(lmod) # test for multicolinearity
Predict(lmod)

Let us now build a logistic regression model using the lrm function, plot the expected probabilities, and evaluate the model. We also use the pentrace function to perform logistic regression with penalized maximum likelihood estimation.


mod1 = lrm(as.factor(bm) ~ age + sbp + rx, data=prostate, x=TRUE, y=TRUE)
mod1
summary(mod1)

plot(anova(mod1), what='proportion chisq') # relative importance
plot(Predict(mod1, fun=plogis)) # predicted values
rms::validate(mod1, method="boot", B=500) # bootstrapped validation
my.calib <- rms::calibrate(mod1, method="boot", B=500) # model calibration
plot(my.calib, las=1)

penalty <- pentrace(mod1, penalty=c(0.5,1,2,3,4,6,8,12,16,24), maxit=25)
mod1_pen <- update(mod1, penalty=penalty$penalty)
effective.df(mod1_pen)
mod1_pen

There you have it, a very basic introduction to the rms package for beginners to the R programming language. Once again, I strongly suggest that readers who are not trained statisticians should read and fully comprehend “Clinical Prediction Models” and “Regression Modeling Strategies” by Frank Harrell. You can also access a number of handouts and lecture notes at here.

Batch Forecasting in R

Given a data frame with multiple columns which contain time series data, let’s say that we are interested in executing an automatic forecasting algorithm on a number of columns. Furthermore, we want to train the model on a particular number of observations and assess how well they forecast future values. Based upon those testing procedures, we will estimate the full model. This is a fairly simple undertaking, but let’s walk through this task. My preference for such procedures is to loop through each column and append the results into a nested list.

First, let’s create some data.

ddat <- data.frame(date = c(seq(as.Date("2010/01/01"), as.Date("2010/03/02"), by=1)),
                      value1 = abs(round(rnorm(61), 2)),
                      value2 = abs(round(rnorm(61), 2)),
                      value3 = abs(round(rnorm(61), 2)))
head(ddat)
tail(ddat)

We want to forecast future values of the three columns. Because we want to save the results of these models into a list, lets begin by creating a list that contains the same number of elements as our data frame.

lst.names <- c(colnames(data))
lst <- vector("list", length(lst.names))
names(lst) <- lst.names
lst

I’ve gone ahead and written a user defined function that handles the batch forecasting process. It takes two arguments, a data frame and default argument which specifies the number of observations that will be used in the training set. The model estimates, forecasts, and diagnostic measures will be saved as a nested list and categorized under the appropriate variable name.

batch <- function(data, n_train=55){
 
  lst.names <- c(colnames(data))
  lst <- vector("list", length(lst.names))
  names(lst) <- lst.names    
 
  for( i in 2:ncol(data) ){  
 
    lst[[1]][["train_dates"]] <- data[1:(n_train),1]
    lst[[1]][["test_dates"]] <- data[(n_train+1):nrow(data),1]
 
    est <- auto.arima(data[1:n_train,i])
    fcas <- forecast(est, h=6)$mean
    acc <- accuracy(fcas, data[(n_train+1):nrow(data),i])
    fcas_upd <- data.frame(date=data[(n_train+1):nrow(data),1], forecast=fcas,                           actual=data[(n_train+1):nrow(data),i])
 
    lst[[i]][["estimates"]] <- est
    lst[[i]][["forecast"]] <- fcas
    lst[[i]][["forecast_f"]] <- fcas_upd
    lst[[i]][["accuracy"]] <- acc
 
    cond1 = diff(range(fcas[1], fcas[length(fcas)])) == 0
    cond2 = acc[,3] >= 0.025
 
    if(cond1|cond2){
 
      mfcas = forecast(ma(data[,i], order=3), h=5)        
      lst[[i]][["moving_average"]] <- mfcas
 
    } else {
 
      est2 <- auto.arima(data[,i])
      fcas2 <- forecast(est, h=5)$mean
 
      lst[[i]][["estimates_full"]] <- est2
      lst[[i]][["forecast_full"]] <- fcas2
 
    }  
  }  
  return(lst)
}
 
batch(ddat)

This isn’t the prettiest code, but it gets the job done. Note that lst was populated within a function and won’t be available in the global environment. Instead, I chose to simply print out the contents of the list after the function is evaluated.

R Programming Notes

I’ve been on a note taking binge recently. This post covers a variety of topics related to programming in R. The contents were gathered from many sources and structured in such a way that it provided the author with a useful reference guide for understanding a number of useful R functions.

DO.CALL

The do.call function executes a function call on a list of arguments.

do.call("R_Function", "List_of_Arguments")

This is equivilant to telling R which arguments the function should operate on.

R_Function( "List_of_Arguments" ){
  ...
}

Consider the following list with four elements. We can use this function to find the total sum across all list elements or bind the rows into a data frame.

x1 <- c(1,2,5)
x2 <- c(1,3,6)
x3 <- c(1,4,7)
x4 <- c(1,5,8)
 
do.call(sum, list(x1,x2,x3,x4))  # sum all list elements
do.call(rbind, list(x1,x2,x3,x4))  # rbind the list elements

Let’s consider a scenario where we have a small data frame and want to run a general linear model on different combintions of attributes. One solution to this problem is to create the formula object as a list within a function, and then utilize do.call to run a linear model on the desired attribute.

dat <- data.frame(x1 = rnorm(100, m=50), x2 = rnorm(100, m=50), x3 = rnorm(100, m=50), y = rnorm(100))
 
new_mod <- function(form){
  lstt = list(formula=as.formula(paste("y ~ ", form, sep="")), data=as.name("dat"))
  summary(do.call("lm", lstt))
}
 
new_mod("x1")
new_mod("x2")
new_mod("x3")

EVAL

The eval function evaluates an expression, an object that represents an action that can be performed in R. An expression is different from an operation, which refers to the execution of an operation. In the following example, we assigned a value to a variable and performed an operation using that variable.

x <- 4
y <- x * 10
y

The expression and quote functions are used to takes an expression as an argument and returns an expression without evaluation.

ee = expression(y~x1)
ee
 
z <- quote(y <- x * 10)
z
 
is.expression(ee)
is.call(z)

In the above example, z is referred to as a call object. They are usually created with the call function. For example, the following code pieces together commands into a call object and evaluates them.

mycall <- call("lm", quote(y ~ x1), data=quote(dat))
mod <- eval(mycall)
summary(mod)

SUBSTITUTE

Another common procedure is to replace certain variables within a user defined function. This can be achieved with the substitute function. The below code replaces x1 and x2 in the expression with the values from the list.

replace <- list(x1 = as.name("alphabet"), x2 = as.name("zoology"))
substitute(expression(x1 + x2 + log(x1) + x3), replace)

SETNAMES

There are a number of commands for working within column and row names in R. It’s generally suggested that the setNames function be used when modifying data structure names within a function.

names_new <- c("one","two","three","four")
new_dat <- expression(data.frame(x1 = rnorm(100, m=50), x2 = rnorm(100, m=50), x3 = rnorm(100, m=50), y = rnorm(100)))
head(setNames(eval(new_dat), names_new))
 
my_lst <- list(lname = as.name("xls"))
setNames(my_lst, "x1")

When working with data within a function, it will often be useful to write code that creates the names and data structure, and then evaluates the pieces together.

MISSING

Use the missing function to check whether an argument was supplied by the user.

func <- function(x){
    if(missing(x)){
        x = 5
    }
    y = x*x; y
}
 
func(2)
func()

STOP

The stop function is used to halt a function, and is usually used when a particular condition is met.

guess <- function(n){
  if( n >= 6 ){
      stop("WRONG!")
}}
 
guess(5)
guess(10)

WARNING

Use the warning function to throw a comment when a condition is met. It does not halt the function.

guess <- function(n){
  if( n >= 6 ){
      warning("BE CAREFUL!")
}}
 
guess(5)
guess(10)

ELIPSES (…)

When functions are evaluated, R scans each function argument it can understand. When ellipsis are added as a function argument, it allows for other arguments to be passed into the function. This technique is frequently used when plotting and should be used when the function is designed to take any number of named or unnamed arguments.

plotter <- function(x, y, ...){
  plot(x, y, ...)
}

To make use of ellipsis, it’s suggested that the user scan through the … and turn them into a list. This is because some of the arguments in dots may be intended for several different functions.

 

SQL Cheat Sheet

I’ve been putting together a basic SQL cheat sheet that could be used as a reference guide. Here are a series of common procedures that should be of use for anyone who uses SQL to extract data. No explanations are provided as they should largely be known to the end user.


-- COUNT OF DISTINCT VALUES 

COUNT(DISTINCT year) AS years_count
COUNT(DISTINCT month) AS months_count

-- SELECT DUPLICATED ROWS

SELECT name, email, COUNT(*)
FROM users
GROUP BY name, email
HAVING COUNT(*) &gt; 1

SELECT name, email
    FROM users
    WHERE email in
    (
    SELECT email
    FROM users
    GROUP BY email
    HAVING COUNT(*)&gt;1
    )

SELECT firstname, lastname, list.address FROM list
INNER JOIN (SELECT address FROM list
GROUP BY address HAVING count(id) &gt; 1) dup ON list.address = dup.address

SELECT
	td.user_id,
	td.brand,
	td.order_dt,
	COUNT(*)
FROM training_details AS td
INNER JOIN users as u on u.user_id = td.user_id
GROUP BY 1, 2, 3
HAVING COUNT(*) &gt;= 2
ORDER BY td.order_date DESC;

-- USING CASE WHEN - ONE

SELECT rr.id,
       rr.created_at,
       rr.report_id,
       rr.account_id,
       rr.executed_by_id,
       rr.data_source_id,
       rr.state,
       MAX(CASE WHEN rs.id IS NOT NULL OR lsr.id IS NOT NULL THEN 1 ELSE 0 END) AS scheduled_run
 FROM report_runs rr
 LEFT JOIN report_schedule_runs rs ON rs.report_run_id = rr.id
 LEFT JOIN list_run_report_runs lrrr ON lrrr.report_run_id = rr.id
 LEFT JOIN list_schedule_runs lsr ON lsr.list_run_id = lrrr.list_run_id
 GROUP BY 1,2,3,4,5,6,7

-- SEE THE FIRST FEW ROWS OF A TABLE

SELECT count(1) FROM table;

SELECT * FROM table LIMIT 5;

-- USING THE LIKE OPERATOR

SELECT * FROM student WHERE name LIKE ‘d%n’;
(returns dan or den)

-- CHECKING QUERY PERFORMANCE

EXPLAIN QUERY PLAN SELECT * FROM student;

-- SUBQUERY - ONE

SELECT SUM (Sales) FROM Store_Information
WHERE Store_Name IN
(SELECT Store_Name FROM Geography
WHERE Region_Name = 'West');

-- SUBQUERY - TWO

SELECT SUM (a1.Sales) FROM Store_Information a1
WHERE a1.Store_Name IN
(SELECT Store_Name FROM Geography a2
WHERE a2.Store_Name = a1.Store_Name);

-- SUBQUERY - TWO

SELECT sub.*
  FROM (
        SELECT *
          FROM table
         WHERE day_of_week = 'Friday'
       ) sub
 WHERE sub.resolution = 'NONE'

-- SUBQUERY - THREE

SELECT *
  FROM table
 WHERE Date IN (SELECT date
                 FROM table
                ORDER BY date
                LIMIT 5
              )

-- SUBQUERY - FOUR

SELECT incidents.*,
       sub.incidents AS incidents_that_day
  FROM tutorial.sf_crime_incidents_2014_01 incidents
  JOIN ( SELECT date,
          COUNT(incidnt_num) AS incidents
           FROM tutorial.sf_crime_incidents_2014_01
          GROUP BY 1
       ) sub
    ON incidents.date = sub.date
 ORDER BY sub.incidents DESC, time

-- SELECT RECORDS FROM A TIME FRAME

SELECT * FROM users WHERE TO_DAYS(last_login) = ( TO_DAYS(NOW()) - 1 )

SELECT* FROM users DATE_SUB(NOW(),INTERVAL 90 MINUTE);

-- UNION OPERATOR

SELECT users.name
FROM users WHERE (users.name BETWEEN 'A%' AND 'M%')
UNION
SELECT banned_users.name FROM banned_users
WHERE (banned_users.name BETWEEN 'A%' AND 'M%');

-- CONCATENATE DATA INTO ONE COLUMN

SELECT CONCAT(emp.firstname, '-', emp.lastname) AS emp_full_name FROM emp;

-- INDEXING STRINGS

SELECT LEFT(date, 10) AS cleaned_date,
       RIGHT(date, 17) AS cleaned_time
FROM table

SELECT SUBSTR(date, 4, 2) AS day
FROM table

-- OTHER

Select database: use [database];

Show all tables: show tables;

Show table structure: describe [table];

Counting and selecting grouped records:
SELECT *, (SELECT COUNT([column]) FROM [table]) AS count
FROM [table]
GROUP BY [column];

Select records containing [value]:
SELECT * FROM [table]
WHERE [column] LIKE '%[value]%';

Select records starting with [value]:
SELECT * FROM [table]
WHERE [column] LIKE '[value]%';

Select records starting with val and ending with ue:
SELECT * FROM [table]
WHERE [column] LIKE '[val_ue]';

Select a range:
SELECT * FROM [table]
WHERE [column] BETWEEN [value1] and [value2];

Select with custom order and only limit:
SELECT * FROM [table]
WHERE [column]
ORDER BY [column] ASC
LIMIT [value]; 

-- INNER, LEFT, RIGHT, AND OUTER JOIN

INNER JOIN: returns rows when there is a match in both tables.

LEFT JOIN: returns all rows from the left table, even if there are no matches in the right table.

RIGHT JOIN: returns all rows from the right table, even if there are no matches in the left table.

FULL JOIN: returns rows when there is a match in one of the tables.

SELF JOIN: is used to join a table to itself as if the table were two tables, temporarily renaming at least one table in the SQL statement.

CARTESIAN JOIN: returns the Cartesian product of the sets of records from the two or more joined tables.

Weekly R-Tips: Visualizing Predictions

Lets say that we estimated a linear regression model on time series data with lagged predictors. The goal is to estimate sales as a function of inventory, search volume, and media spend from two months ago. After using the lm function to perform linear regression, we predict sales using values from two month ago.


frmla <- sales ~ inventory + search_volume + media_spend
mod <- lm(frmla, data=dat)
pred = predict(mod, values, interval="predict") 

If this model is estimated weekly or monthly, we will eventually want to understand how well our model did in predicting actual sales from month to month. To perform this task, we must regularly maintain a spreadsheet or data structure (RDS object) with actual predicted sales figures for each time period. That data can be used to create line graphs that visualize both the actual versus predicted values.

Here is what the original spreadsheet looked like.

Screenshot from 2016-02-05 16:41:22

Transform that data into long format using whatever package you prefer.


library(reshape)
mydat = melt(d1)

This will provide a data frame with three columns.

Screenshot from 2016-02-04 15:04:36

We can utilize the ggplot2 package to create visualizations.


ggplot(mydat, aes(Month, value, group=variable, colour=variable)) +
  geom_line(lwd=1.05) + geom_point(size=2.5) + 
  ggtitle("Sales (01/2010 to 05/2015)") +
  xlab("Date") + ylab("Sales") + ylim(0,30000) + xlab(" ") + ylab(" ") +  
  theme(legend.title=element_blank()) + xlab(" ") + 
  theme(axis.text.x=element_text(colour="black")) +
  theme(axis.text.y=element_text(colour="black")) +
  theme(legend.position=c(.4, .85))

Predictions

Above is an example of what the final product could look like. Visualizing predicted against actual values is an important component of evaluating the quality of a model. Furthermore, having such visualization will be of value when interacting with business audiences and “selling” your analysis.

Extract Google Trends Data with Python

Anyone who has regularly worked with Google Trends data has had to deal with the slightly tedious task of grabbing keyword level data and reformatting the spreadsheet provided by Google. After looking for a seamless way to pull the data, I came upon the PyTrends library on GitHub, and sought to put together some quick user defined functions to manage the task of pulling daily and weekly trends data.


# set working directory (location where code is)
import os
os.chdir("path")

import re
import csv
import time
import pandas as pd
from random import randint
from GT_Automation_Code import pyGTrends

# set gmail credentials and path to extract data
google_username = "*****@gmail.com"
google_password = "*****"

Daily_Data = [ ]
        
# define daily pull code
def GT_Daily_Run(keys):
    
    path = 'path'

    # connect to Google
    connector = pyGTrends(google_username, google_password)
    # make request
    connector.request_report(keys, date="today 90-d", geo="US")
    # wait a random amount of time between requests to avoid bot detection
    time.sleep(randint(5, 10))
    # download file
    connector.save_csv(path, '_' + "GT_Daily" + '_' + keys.replace(' ', '_'))

    name = path + '_' + "GT_Daily" + '_' + keys.replace(' ', '_')
      
    with open(name + '.csv', 'rt') as csvfile:    
        csvReader = csv.reader(csvfile)
        data = []

        for row in csvReader:
            if any('2015' in s for s in row): 
                data.append(row)

        day_df = pd.DataFrame(data)
        cols = ["Day", keys]    
        day_df.columns = [cols]  
        Daily_Data.append(day_df) 

keywords = ['soccer', 'football', 'baseball']    

map(lambda x: GT_Daily_Run(x), keywords)  

rge = [Daily_Data[0], Daily_Data[1], Daily_Data[2]]    

df_final_daily = reduce(lambda left,right: pd.merge(left,right, on='Day'), rge)
df_final_daily = df_final_daily.loc[:, (df_final_daily != "0").any(axis=0)]
df_final_daily.to_csv("Daily_Trends_Data.csv", index=False)

Weekly_Data = [ ]     
        
# define weekly pull code
def GT_Weekly_Run(keys):

    path = 'path'
    
    # connect to Google
    connector = pyGTrends(google_username, google_password)
    # make request
    connector.request_report(keys, geo="US")
    # wait a random amount of time between requests to avoid bot detection
    time.sleep(randint(5, 10))
    # download file
    connector.save_csv(path, '_' + "GT_Weekly" + '_' + keys.replace(' ', '_'))

    name = path + '_' + "GT_Weekly" + '_' + keys.replace(' ', '_') 

    with open(name + '.csv', 'rt') as csvfile:    
        csvReader = csv.reader(csvfile)
        data = []
        datex = re.compile('(19|20)dd-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])')

        for row in csvReader:
            if datex.search(str(row)):
                data.append(row)
       
        week_df = pd.DataFrame(data)
        cols = ["Week", keys]    
        week_df.columns = [cols]
        Weekly_Data.append(week_df)              
             
map(lambda x: GT_Weekly_Run(x), keywords)  

rge = [Weekly_Data[0], Weekly_Data[1], Weekly_Data[2]]    

df_final_weekly = reduce(lambda left,right: pd.merge(left,right, on='Week'), rge)
df_final_weekly = df_final_weekly.loc[:, (df_final_weekly != "0").any(axis=0)] 
df_final_weekly.to_csv("Weekly_Trends_Data.csv", index=False)


Weekly R-Tips: Importing Packages and User Inputs

Number 1: Importing Multiple Packages

Anyone who has used R for some time has written code that required the use of multiple packages. In most cases, this will be done by using the library or require function to bring in the appropriate extensions.


library(forecast)
library(ggplot2)
library(stringr)
library(lubridateee)
library(rockchalk)

That’s nice and gets the desired result, but can’t we just import all the packages we need in one or two lines. Yes we can, and here is the one line of code to do that.


libs <- c("forecast", "ggplot2", "stringr", "lubridateee", "rockchalk")
sapply(libs, library, character.only=TRUE, logical.return=TRUE)

libs <- c("forecast", "ggplot2", "stringr", "lubridateee", "rockchalk")
lapply(libs, require, character.only=TRUE)

Number 2: User Input

One side project that I hope to start on is a process whereby I can interact with R and select options that will result in particular outcomes. For example, let’s say you’re trying to put together a script that manages a weekly list. A good first step would be a list of options that the user would see and be prompted to select an option. Here is how R can be used to get user input in such circumstances.


lopts <- cat("
             1. Add an item
             2. Delete an item
             3. Print the list
             4. Quit 
             ")

action <- readline("Choose an option: ")