Turning Data Into Awesome With sqldf and pandasql

Both R and Python possess libraries for using SQL statements to interact with data frames. While both languages have native facilities for manipulating data, the sqldf and pandasql provide a simple and elegant interface for conducting tasks using an intuitive framework that’s widely used by analysts.

Screenshot from 2015-04-29 11:23:02

Screenshot from 2015-04-29 11:23:44







R and sqldf

sqldf("SELECT COUNT(*) FROM df2 WHERE state = 'CA'")
1        4
      FROM df1
      INNER JOIN df2 ON df1.personid = df2.id 
      WHERE df2.state = 'TX'")
  firstname lastname  var1 state
1     David    Spade -2.09    TX
2       Joe  Montana  1.16    TX
      FROM df1
      INNER JOIN df2 ON df1.personid = df2.id WHERE df1.var1 > 0
      GROUP BY df2.state")
   state COUNT(df1.var1)
1     AZ               1
2     CA               1
3     GA               1
4     IL               1
5     NC               1
6     NY               1
7     OK               1
8     SC               1
9     TX               1
10    VT               1

Python and pandasql

import pandasql as ps
q1 = """SELECT COUNT(*) FROM df2 WHERE state = 'CA'"""
print ps.sqldf(q1, locals())
0         4
q2 = """
    FROM df1 INNER JOIN df2 ON df1.personid = df2.id 
    WHERE df2.state = "TX";
print ps.sqldf(q2, locals())
  firstname lastname  var1 state
0     David    Spade -2.09    TX
1       Joe  Montana  1.16    TX
q3 = """SELECT 
      FROM df1
      INNER JOIN df2 ON df1.personid = df2.id WHERE df1.var1 > 0
      GROUP BY df2.state"""
print ps.sqldf(q3, locals())
  state  COUNT(df1.var1)
0    AZ                1
1    CA                1
2    GA                1
3    IL                1
4    NC                1
5    NY                1
6    OK                1
7    SC                1
8    TX                1
9    VT                1



R Programming Notes – Part 2

In an older post, I discussed a number of functions that are useful for programming in R. I wanted to expand on that topic by covering other functions, packages, and tools that are useful. Over the past year, I have been working as an R programmer and these are some of the new learnings that have become fundamental in my work.


isTRUE is a logical operator that can be very useful in checking whether a condition or variable has been set to true. Lets say that we are writing a script whereby we will take run a generalized linear regression when the parameter run_mod is set to true. The conditional portion of the script can be written as either if(isTRUE(run_mod)) or if(run_mod). I am partial to isTRUE, but this is entirely a matter of personal preference. Users should also be aware of the isFALSE function, which is part of the BBmisc package.

run_mod = TRUE 

         GLM_Model(full_data=full.df, train_data=train.df, 
    error = function(e) {
         print("error occured")

if(BBmisc::isFALSE(check_one) & BBmisc::isFALSE(check_two)){
	data_output.tmp$score = 0.8


The invisible function can be used to return an object that is not printed out and can be useful in a number of circumstances. For example, it’s useful when you have helper functions that will be utilized within other functions to do calculations. In those cases, it’s often not desireable to print those results. I generally use invisible when I’m checking function arguments. For example, consider a function that takes two arguments and you need to check whether the input is a vector.

if(!check_input(response, type='character', length=1)) {
    stop('something is wrong')

The check_input function is something I created and has a few lines which contain invisible. The idea is for check_input to return true or false based on the inputs so that it’ll stop stop the execution when needed.

if(is.null(response) & !length(response)==0) {
} else if (!is.null(response)) {


When I’m putting together new classes or have multiple functions that interact with one another, I ensure that the code includes an comprehensive debugging process. This means that I’m checking my code at various stages so that I can identify when issues arise. Consider that I’m putting together a function that will go through a number of columns in a data frame, summarize those variables, and save the results as a nested list. To effectively put together code without issues, I ensure that the functions takes a debug argument that will run when it’s set to true. In the code below, it will print out values at different stages of the code. Furthermore, the final line of the code will check the resulting data structure.

DSummary_Variable(data_obj, var, debug=TRUE){

if(debug) message('|==========>>>  Processing of the variable. \n')  

      message('|==========>>>  var_summary has been created and 
              has a length of ', length(var_summary), ' and the 
              nested list has a length of ', 
              length(var_summary[['var_properties']]), ' \n')  
  } else {
      stop("var_summary is missing. Please investigate")

If you have multiple functions that interact with one another, it’s a good idea to preface the printed message with the name of the function name.

add_func <- function(a,b) a + b

mult_func <- function(a,b) a * b

main_func <- function(data, cola, colb, debug=TRUE){

		message("mult_func: checking inputs to be used")

	mult_func(data[,cola], data[,colb])

		message("mult_add: checking inputs to be used")

Stay tuned for part three, where I’ll talk about the testthat and assertive package.

R Programming Notes – Part 1

I’ve been on a note taking binge recently. This post covers a variety of topics related to programming in R. The contents were gathered from many sources and structured in such a way that it provided the author with a useful reference guide for understanding a number of useful R functions.


The do.call function executes a function call on a list of arguments.

do.call("R_Function", "List_of_Arguments")

This is equivilant to telling R which arguments the function should operate on.

R_Function( "List_of_Arguments" ){

Consider the following list with four elements. We can use this function to find the total sum across all list elements or bind the rows into a data frame.

x1 <- c(1,2,5)
x2 <- c(1,3,6)
x3 <- c(1,4,7)
x4 <- c(1,5,8)
do.call(sum, list(x1,x2,x3,x4))  # sum all list elements
do.call(rbind, list(x1,x2,x3,x4))  # rbind the list elements

Let’s consider a scenario where we have a small data frame and want to run a general linear model on different combintions of attributes. One solution to this problem is to create the formula object as a list within a function, and then utilize do.call to run a linear model on the desired attribute.

dat <- data.frame(x1 = rnorm(100, m=50), x2 = rnorm(100, m=50), x3 = rnorm(100, m=50), y = rnorm(100))
new_mod <- function(form){
  lstt = list(formula=as.formula(paste("y ~ ", form, sep="")), data=as.name("dat"))
  summary(do.call("lm", lstt))


The eval function evaluates an expression, an object that represents an action that can be performed in R. An expression is different from an operation, which refers to the execution of an operation. In the following example, we assigned a value to a variable and performed an operation using that variable.

x <- 4
y <- x * 10

The expression and quote functions are used to takes an expression as an argument and returns an expression without evaluation.

ee = expression(y~x1)
z <- quote(y <- x * 10)

In the above example, z is referred to as a call object. They are usually created with the call function. For example, the following code pieces together commands into a call object and evaluates them.

mycall <- call("lm", quote(y ~ x1), data=quote(dat))
mod <- eval(mycall)


Another common procedure is to replace certain variables within a user defined function. This can be achieved with the substitute function. The below code replaces x1 and x2 in the expression with the values from the list.

replace <- list(x1 = as.name("alphabet"), x2 = as.name("zoology"))
substitute(expression(x1 + x2 + log(x1) + x3), replace)


There are a number of commands for working within column and row names in R. It’s generally suggested that the setNames function be used when modifying data structure names within a function.

names_new <- c("one","two","three","four")
new_dat <- expression(data.frame(x1 = rnorm(100, m=50), x2 = rnorm(100, m=50), x3 = rnorm(100, m=50), y = rnorm(100)))
head(setNames(eval(new_dat), names_new))
my_lst <- list(lname = as.name("xls"))
setNames(my_lst, "x1")

When working with data within a function, it will often be useful to write code that creates the names and data structure, and then evaluates the pieces together.


Use the missing function to check whether an argument was supplied by the user.

func <- function(x){
        x = 5
    y = x*x; y


The stop function is used to halt a function, and is usually used when a particular condition is met.

guess <- function(n){
  if( n >= 6 ){


Use the warning function to throw a comment when a condition is met. It does not halt the function.

guess <- function(n){
  if( n >= 6 ){
      warning("BE CAREFUL!")


When functions are evaluated, R scans each function argument it can understand. When ellipsis are added as a function argument, it allows for other arguments to be passed into the function. This technique is frequently used when plotting and should be used when the function is designed to take any number of named or unnamed arguments.

plotter <- function(x, y, ...){
  plot(x, y, ...)

To make use of ellipsis, it’s suggested that the user scan through the … and turn them into a list. This is because some of the arguments in dots may be intended for several different functions.


Semiparametric Regression in R


When building statistical models, the goal is to define a compact and parsimonious mathematical representation of some data generating process. Many of these techniques require that one make assumptions about the data or how the analysis is specified. For example, Auto Regressive Integrated Moving Average (ARIMA) models require that the time series is weakly stationary or can be made so. Furthermore, ARIMA assumes that the data has no deterministic time trends, the variance of the error term is constant, and so forth. Assumptions are generally a good thing, but there are definitely situations in which one wants to free themselves from such “constraints.”

In the context of evaluating relationships between one or more target variables and a set of explanatory variables, semiparametric regression is one such technique that provides the user with some flexibility in modeling complex data without maintaining stringent assumptions. With semiparametric regression, the goal is to develop a properly specified model that integrates the simplicity of parametric estimation with the flexibility provided by nonparametric splines.


A core requirement of both linear and generalized linear regression is that the user must define a set of explanatory variables and functional form that describes the relationship between the predictors and response variable. So when the goal is to build a linear regression model that explains which social and political characteristics within a country account for different levels of macroeconomic performance, we must first specify a number of variables that may explain our target variable. Furthermore, we must specify the functional form of the relationship. In both linear and generalized linear regression, this refers to how we characterize and “encode” each of the explanatory variables. So if we include the log of yearly precipitation rate as
a predictor such that Y = B0 + B1*log(precipitation) , the model would nonlinear in the variables (log(precipitation)) and linear in the parameters (B1 ). Linearity in the variables is not an assumption of linear or generalized linear regression. However, ignoring it could result in a mispecified model with incorrect estimates.


One way to ensure that the model is properly specified is through the use of nonparametric splines. Instead of assuming that we know the functional form for a regression model, the user would essentially estimate the appropriate functional form from the data. Compared to a traditional parametric model whereby we estimate one global value that represents the relationship between X and Y , we would instead have a series of local estimates that characterize the relationship across different values of X . There are many different variations of splines that can utilized, including b-splines, natural splines, cubic splines, and so forth. However, numerous studies have shown that restricted cubic splines with three to five knots tend to perform best across a wide array of domains.

In our previous model where the goal was to identify how macroeconomic performance is affected by weather/precipitation, the linear regression model would look like economy = B0 + B1*precipitation . After incorporating a spline for the explanatory variable, we can now characterize the functional form of the model as economy = B0 + f(precipitation) whereby f characterized the relationship between X and Y conditional on X . An important thing to note when developing semiparametric models is that although it allow users flexibility by incorporating a smoothing term, the estimation process still requires some of the core assumptions in OLS and GLM, namely that the data keeps the requirements of independence, normally distributed errors, and constant error variance.

Before we look at how to implement semiparametric regression, it is worth noting that these types of models are often referred to as either general or generalized additive models. Furthermore, semiparametric variations of other regression models are available such as semiparametric quantile regression and even semiparametric nonlinear regression.


For this post, I’m going to stick with the gam function in the mgcv package because it is usually a good starting point. Taking the previous use case, let’s create some data and construct a linear regression model that regresses economy as a function of weather. Given that the data is from a random distribution, we obviously find a weak relationship between the two variables.

mydat = data.frame(economy = rnorm(100),
                   weather = rnorm(100))
mod = lm(economy ~ weather, data=mydat)
plot(economy ~ weather, data = mydat, cex = .8)
abline(mod, col="red")

Let us say that we know based on experience, theory, or other sources that linearity in the variables is a poor specification in this domain case. One thing that could be done is to incorporate a spline to create local estimates for our explanatory variable within the linear regression.

mod <- lm(economy ~ bs(weather, knots = 3), data=mydat)
plot(economy ~ weather, data = mydat, cex = .8)
lines(mydat$weather, predict(mod), lwd=2, col='red')

The same model could also be examined using gam. Take note of the edf value, which represents how much the explanatory variable is smoothed.

mod <- gam(economy ~ s(weather, bs = "cr", k = 3),
             data = mydat,
             family = gaussian)

This example data is pretty silly, but this should give you some things to investigate in terms of learning how semiparametric models can help improve the inferences that you are examining.


Have questions, comments, interesting work, etc? Feel free to contact me at mathewanalytics@gmail.com

Up next will be a series of posts on causal inference, a topic that I’ve been trying to get a better understanding of over the past month.

Packages for Getting Started with Time Series Analysis in R

A. Motivation

During the recent RStudio Conference, an attendee asked the panel about the lack of support provided by the tidyverse in relation to time series data. As someone who has spent the majority of their career on time series problems, this was somewhat surprising because R already has a great suite of tools for visualizing, manipulating, and modeling time series data. I can understand the desire for a ‘tidyverse approved’ tool for time series analysis, but it seemed like perhaps the issue was a lack of familiarity with the available toolage. Therefore, I wanted to put together a list of the packages and tools that I use most frequently in my work. For those unfamiliar with time series analysis, this could a good place to start investigating R’s current capabilities.

B. Background

Time series data refers to a sequence of measurements that are made over time at regular or irregular intervals with each observation being a single dimension. An example of low dimensional time series is daily wind temperature from 01/01/2001 through 12/31/2005. High dimensional time series is characterized by a larger number of observations, so an example could be the daily wind temperature from 01/01/1980 through 12/31/2010. In either case, the goal of the analysis could lead one to perform regression, clustering, forecasting, or even classification.

C. Data For Examples

To run the code in this post, you will need to access the following data through the unix terminal. It will download a csv file from the City of Chicago website that contains information on reported incidents of crime that occurred in the city of Chicago from 2001 to present.

$ wget –no-check-certificate –progress=dot https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD > chicago_crime_data.csv

Import the data into R and get the aggregate number of reported incidents of theft by day.

dat = fread("chicago_crime_data.csv")
colnames(dat) = gsub(" ", "_", tolower(colnames(dat)))
dat[, date2 := as.Date(date, format="%m/%d/%Y")]
mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]

D. Data Representation

The first set of packages that one should be aware of is related to data storage. One could use data frames, tibbles, or data tables, but there are already a number of data structures that are optimized for representing time series data. The fundamental time series object is “ts”. However, the “ts” class has a number of limitations, and so it is usually best to work with the extensible time series (“xts”) object.

D1. xts

The xts package offers a number of great tools for data manipulation and aggregation. At it’s core is the xts object, which is essentially a matrix object that can represent time series data at different time increments. Xts is a subclass of the zoo object, and that provides it with a lot of functionality.

Here are some functions in xts that are worth investigating:

# create a xts object 
mydat2 = as.xts(mydat)
# filter by date 
mydat2["2015"]  ## 2015
mydat2["201501"]  ## Jan 2015
mydat2["20150101/20150105"]  ## Jan 01 to Jan 05 2015 
# replace all valuues from Aug 25 onwards with 0
mydat2["20170825/"] <- 0
# get the last one month 
last(mydat2, "1 month")
# get stats by time frame
apply.monthly(mydat2, sum)
apply.monthly(mydat2, quantile)
period.apply(mydat2, endpoints(mydat2,on='months'), sum)
period.apply(mydat2, endpoints(mydat2,on='months'), quantile)

E. Dates

R has a maddening array of date and time classes. Be it yearmon, POSIXct, POSIXlt, chron, or something else, each has specific strengths and weaknesses. In general, I find myself using the lubridate package as it simplifies many of the complexities associated with date-times in R.

E1. lubridate

The lubridate package provides a lot of functionality for parsing and formatting dates, comparing different times, extracting the components of a date-time, and so forth.

ymd_h("2010-01-01 10")
ymd_hm("2010-01-01 10:02")
ymd_hms("2010-01-01 10:02:30")

F. Time Series Regression

Distributed lag models (error correction models) are a core component of doing time series analysis. They are many instances where we want to regress an outcome variable at the current time against values of various regressors at current and previous times. dynlm and ardl (wrapper for dynlm) are solid for this type of analysis. Another common task when working with distributed lag models involves using dynamic simulations to understand estimated outcomes in different scenarios. dynsim provides a coherent solution for simulation and visualization of those estimated values of the target variable.

F1. dynlm / ardl

Here is a brief example of how dynlm can be utilized. In what follows, I have created a new variable and lagged it by one day. So the model attempts to regress incidents or reported theft based on the weather from the previous day.

mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
mydat[, weather := sample(c(20:90), dim(mydat), replace=TRUE)]
mydat[, weather_lag := shift(weather, 1, type = 'lag')]
mod = dynlm(N ~ L(weather), data = mydat)

F2. dynsim

Here is a brief example of how dynlm can be utilized. In what follows, I have created a new variable and lagged it by one day. I’ve used the dynsim to product two dynamic simulations and plotted them.

mydat3 = mydat[1:10000]
mod = lm(N ~ weather_lag, data = mydat3)
Scen1 <- data.frame(weather_lag = min(mydat2$weather_lag, na.rm=T))
Scen2 <- data.frame(weather_lag = max(mydat2$weather_lag, na.rm=T))
ScenComb <- list(Scen1, Scen2)
Sim1 <- dynsim(obj = mod, ldv = 'weather_lag', scen = ScenComb, n = 20)

G. Forecasting

G1. forecast

The forecast package is the most used package in R for time series forecasting. It contains functions for performing decomposition and forecasting with exponential smoothing, arima, moving average models, and so forth. For aggregated data that is fairly low dimensional, one of the techniques present in this package should provide an adequate forecasting model given that the assumptions hold.

Here is a quick example of how to use the auto.arima function in R. In general, automatic forecasting tools should be used with caution, but it is a good place to explore time series data.

mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
fit = auto.arima(mydat[,.(N)])
pred = forecast(fit, 200)

G2. smooth

The smooth package provides functions to perform even more variations of exponential smoothing, moving average models, and various seasonal arima techniques. The smooth and forecast package are usually more than adequate for most forecasting problems that pertain to low dimensional data.

Here is a basic example that uses the automatic complex exponential smoothing function:

mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
fit = auto.ces(mydat[,N])
pred = forecast(fit, 200)

So for those of you getting introduced to the R programming language, these are a list extremely useful packages for time series analysis that you will want to get some exposure to.

Have questions, comments, interesting consulting projects, or work that needs done, feel free to contact me at mathewanalytics@gmail.com

Data.Table by Example – Part 3

For this final post, I will cover some advanced topics and discuss how to use data tables within user generated functions. Once again, let’s use the Chicago crime data.

dat = fread("rows.csv")
names(dat) <- gsub(" ", "_", names(dat))
dat[, c("value1", "value2", "value3") := sample(1:50, nrow(dat), replace=TRUE)]

Let’s start by subseting the data. The following code takes the first 50000 rows within the dat dataset, selects four columns, creates three new columns pertaining to the data, and then removes the original date column. The output was saved as to new variable and the user can see the first few columns of the new data table using brackets or head function.

ddat = dat[1:50000, .(Date, value1, value2, value3)][,
               c("year", "month", "day") :=

ddat[1:3]  # same as head(ddat, 3)

We can now do some intermediate calculations and suppress their output by using braces.

ddat[, { avg_val1 = mean(value1)
         new_val1 = mean(abs(value2-avg_val1))
         new_val2 = new_val1^2 }, by=month][order(month)]

In this simple sample case, we have taken the mean value1 for each month, subtracted it from value2, and then squared that result. The output show the final calculation in the brackets, which is the result from squaring. Note that I also ordered the results by month with the chaining process.

That’s all very nice and convenient, but what if we want to create user defined functions that essentially automate these tasks for future work. Let us start with a function for extracting each component from the date column.

ddat = dat[1:50000, .(Date, value1, value2, value3)]

add_engineered_dates <- function(mydt, date_col="Date"){

   new_dt = copy(mydt)

   new_dt[, paste0(date_col, "_year") := year(mdy_hms(get(date_col)))]
   new_dt[, paste0(date_col, "_month") := month(mdy_hms(get(date_col)))]
   new_dt[, paste0(date_col, "_day") := day(mdy_hms(get(date_col)))] 

   new_dt[, c(date_col) := NULL]


We’ve created a new functions that takes two arguments, the data table variable name and the name of the column containing the date values. The first step is to copy the data table so that we are not directly making changes to the original data. Because the goal is to add three new columns that extract the year, month, and day, from the date column, we’ve used the lubridate package to define the date column and then extract the desired values. Furthermore, each new column has been labeled so that it distinctly represents the original date column name and the component that it contains. The final step was to remove the original date column, so we’ve set the date column to NULL. The final line prints the results back to the screen. You can recognize from the code above that the get function is used to take a variable name that represents column names and extract those columns within the function.

result = add_engineered_dates(ddat)

Let’s now apply the previous code with braces to find the squared difference between value2 and the mean value1 metric.

result[, { avg_val1 = mean(value1)
           new_val1 = mean(abs(value2-avg_val1))
           new_val2 = new_val1^2 }, by=.(Date_year,Date_month)][


I’m hoping that these three posts have convinced beginners to the R language that the data.table package is beautiful and worth checking out. The syntax may be scary at first, but it offers an array of powerful tools that everyone should become familiar with. So pull your sleeves up and have fun.

If you have any comments or would like to cover other specific topics, feel free to comment below. You can also contact me at mathewanalytics@gmail.com or reach me through LinkedIn

Data.Table by Example – Part 2

In part one, I provided an initial walk through of some nice features that are available within the data.table package. In particular, we saw how to filter data and get a count of rows by the date.

dat = fread("rows.csv")
names(dat) <- gsub(" ", "_", names(dat))

Let us now add a few columns to our dataset on reported crimes in the city of Chicago. There are many ways to do do this but they involve the use of the := operator. Since data.table updates values by reference, we do not need to save the results as another variable. This is a very desirable feature.

dat[, c("value1", "value2", "value3") := sample(1:50, nrow(dat), replace=TRUE)]

dat[, `:=`(value1 = sample(1:50, nrow(dat), replace=TRUE),
           value2 = sample(1:50, nrow(dat), replace=TRUE),
           value3 = sample(1:50, nrow(dat), replace=TRUE))]

You can also just use the traditional base R solution for adding new columns as data.tables are also data frames.

dat$value1 <- sample(1:50, nrow(dat), replace=TRUE)
dat$value2 <- sample(1:50, nrow(dat), replace=TRUE)
dat$value3 <- sample(1:50, nrow(dat), replace=TRUE)

In any case, we now have three new columns with randomly selected values between 1 and 50. We can now look to summarize these values and see how they differ across the primary arrest type and other categorical variables.

dat[, .(mean = mean(value1, na.rm = TRUE),
        median = median(value1, na.rm = TRUE),
        min = min(value1, na.rm = TRUE),
        max = max(value1, na.rm = TRUE))]

              .(mean = mean(value1, na.rm = TRUE),
                median = median(value1, na.rm = TRUE),
                min = min(value1, na.rm = TRUE),
                max = max(value1, na.rm = TRUE))]

The above code allows us to get the mean, median, min, and max values from the value1 column for all rows and for when the primary type is prostitution. To get these summaries across multiple columns, we can use lapply, .SD, and .SDcols. Notice that in the second line below, we passed primary type to the by operator so that we get mean value for each of the arrest type categories.

dat[, lapply(.SD, mean, na.rm=TRUE), .SDcols=c("value1", "value2", "value3")]

dat[, lapply(.SD, mean, na.rm=TRUE), by=Primary_Type, .SDcols=c("value1", "value2", "value3")][1:4]

But wait, that only provides me with the mean of a single value column. If you want to apply multiple functions to columns, the user is required to write a function that can then be used within lapply.

my.summary <- function(x){
                   c(mean = mean(x),
                     median = median(x),
                     min = min(x),
                     max = max(x))

dat[, lapply(.SD, my.summary), .SDcols=c("value1", "value2", "value3")]

dat[, lapply(.SD, my.summary), by=.(Primary_Type), .SDcols=c("value1", "value2", "value3")]

Perfect! As you can see, the syntax is concise and is very easy to work with.

In the next part of this series, I’ll cover a few advanced features like get() and {}.

If you have any questions or comments, feel free to comment below. You can also contact me at mathewanalytics@gmail.com or reach me through LinkedIn