SQL Cheat Sheet

I’ve been putting together a basic SQL cheat sheet to use as a reference guide. Below is a series of common procedures that should be useful to anyone who uses SQL to extract data. No explanations are provided, as they should largely be familiar to the end user.


SELECT COUNT(DISTINCT year) AS years_count,
       COUNT(DISTINCT month) AS months_count
FROM table;


SELECT name, email, COUNT(*)
FROM users
GROUP BY name, email
HAVING COUNT(*) > 1;

SELECT name, email
FROM users
WHERE email IN
    (SELECT email
     FROM users
     GROUP BY email
     HAVING COUNT(*) > 1);

SELECT firstname, lastname, list.address
FROM list
INNER JOIN (SELECT address
            FROM list
            GROUP BY address
            HAVING COUNT(id) > 1) dup
    ON list.address = dup.address;

SELECT td.user_id, u.name, td.order_date  -- example column list
FROM training_details AS td
INNER JOIN users AS u ON u.user_id = td.user_id
GROUP BY 1, 2, 3
ORDER BY td.order_date DESC;


SELECT rr.id,
       MAX(CASE WHEN rs.id IS NOT NULL OR lsr.id IS NOT NULL THEN 1 ELSE 0 END) AS scheduled_run
FROM report_runs rr
LEFT JOIN report_schedule_runs rs ON rs.report_run_id = rr.id
LEFT JOIN list_run_report_runs lrrr ON lrrr.report_run_id = rr.id
LEFT JOIN list_schedule_runs lsr ON lsr.list_run_id = lrrr.list_run_id
GROUP BY 1;


SELECT count(1) FROM table;



SELECT * FROM student WHERE name LIKE 'd%n';
(returns names starting with d and ending with n, e.g. dan or den)




SELECT SUM(Sales) FROM Store_Information
WHERE Store_Name IN
(SELECT Store_Name FROM Geography
WHERE Region_Name = 'West');


SELECT SUM(a1.Sales) FROM Store_Information a1
WHERE a1.Store_Name IN
(SELECT Store_Name FROM Geography a2
WHERE a2.Store_Name = a1.Store_Name);


SELECT sub.*
  FROM (
        SELECT *
          FROM table
         WHERE day_of_week = 'Friday'
       ) sub
 WHERE sub.resolution = 'NONE'


SELECT *
  FROM (
        SELECT *
          FROM table
         ORDER BY date
         LIMIT 5
       ) sub


SELECT incidents.*,
       sub.incidents AS incidents_that_day
  FROM tutorial.sf_crime_incidents_2014_01 incidents
  JOIN ( SELECT date,
          COUNT(incidnt_num) AS incidents
           FROM tutorial.sf_crime_incidents_2014_01
          GROUP BY 1
       ) sub
    ON incidents.date = sub.date
 ORDER BY sub.incidents DESC, time


SELECT * FROM users WHERE TO_DAYS(last_login) = ( TO_DAYS(NOW()) - 1 )



SELECT users.name
FROM users WHERE (users.name BETWEEN 'A' AND 'M');
SELECT banned_users.name FROM banned_users
WHERE (banned_users.name BETWEEN 'A' AND 'M');


SELECT CONCAT(emp.firstname, '-', emp.lastname) AS emp_full_name FROM emp;


SELECT LEFT(date, 10) AS cleaned_date,
       RIGHT(date, 17) AS cleaned_time
FROM table

SELECT SUBSTR(date, 4, 2) AS day
FROM table


Select database: use [database];

Show all tables: show tables;

Show table structure: describe [table];

Counting and selecting grouped records:
SELECT [column], COUNT(*) AS count
FROM [table]
GROUP BY [column];

Select records containing [value]:
SELECT * FROM [table]
WHERE [column] LIKE '%[value]%';

Select records starting with [value]:
SELECT * FROM [table]
WHERE [column] LIKE '[value]%';

Select records starting with val, followed by exactly one character, and ending with ue:
SELECT * FROM [table]
WHERE [column] LIKE 'val_ue';

Select a range:
SELECT * FROM [table]
WHERE [column] BETWEEN [value1] and [value2];

Select with a custom order and a limit:
SELECT * FROM [table]
WHERE [condition]
ORDER BY [column] ASC
LIMIT [value];


INNER JOIN: returns rows when there is a match in both tables.

LEFT JOIN: returns all rows from the left table, even if there are no matches in the right table.

RIGHT JOIN: returns all rows from the right table, even if there are no matches in the left table.

FULL JOIN: returns all rows from both tables, with matches combined where they exist and NULLs where they do not.

SELF JOIN: is used to join a table to itself as if the table were two tables, temporarily renaming at least one table in the SQL statement.

CARTESIAN JOIN: returns the Cartesian product of the sets of records from two or more joined tables.

Weekly R-Tips: Visualizing Predictions

Let’s say that we estimated a linear regression model on time series data with lagged predictors. The goal is to estimate sales as a function of inventory, search volume, and media spend from two months ago. After using the lm function to perform the regression, we predict sales using values from two months ago.

frmla <- sales ~ inventory + search_volume + media_spend
mod <- lm(frmla, data=dat)
pred <- predict(mod, newdata=values, interval="prediction")  # values holds the lagged predictors
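
Since the predictors enter the model at a two-month lag, the input columns need to be shifted before fitting. Here is a minimal sketch of building those lags; the lag_k helper is a hypothetical convenience function:

# shift a series down by k periods, padding the start with NA
lag_k <- function(x, k) c(rep(NA, k), head(x, -k))

dat$inventory     <- lag_k(dat$inventory, 2)
dat$search_volume <- lag_k(dat$search_volume, 2)
dat$media_spend   <- lag_k(dat$media_spend, 2)
dat <- na.omit(dat)  # drop the first two rows, which have no lagged values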

If this model is estimated weekly or monthly, we will eventually want to understand how well it predicted actual sales from month to month. To perform this task, we must regularly maintain a spreadsheet or data structure (RDS object) with the actual and predicted sales figures for each time period. That data can then be used to create line graphs that visualize actual versus predicted values.
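
Here is a minimal sketch of that bookkeeping with an RDS object; the file name and the current_month and actual_sales variables are hypothetical:

# append this period's actual and predicted figures to a running history
hist_file <- "sales_pred_history.rds"
history <- if (file.exists(hist_file)) readRDS(hist_file) else NULL
history <- rbind(history,
                 data.frame(Month     = current_month,
                            Actual    = actual_sales,
                            Predicted = pred[1, "fit"]))  # fitted value from predict()
saveRDS(history, hist_file)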

Here is what the original spreadsheet looked like.

[Screenshot: spreadsheet with a Month column and Actual and Predicted sales columns]

Transform that data into long format using whatever package you prefer.

library(reshape2)
mydat <- melt(d1, id.vars="Month")  # d1 holds the Month, actual, and predicted columns

This will provide a data frame with three columns.

[Screenshot: data frame with Month, variable, and value columns]

We can utilize the ggplot2 package to create visualizations.

ggplot(mydat, aes(Month, value, group=variable, colour=variable)) +
  geom_line(lwd=1.05) + geom_point(size=2.5) +
  ggtitle("Sales (01/2010 to 05/2015)") +
  xlab(" ") + ylab(" ") + ylim(0,30000) +
  theme(legend.title=element_blank(),
        axis.text.x=element_text(colour="black"),
        axis.text.y=element_text(colour="black"),
        legend.position=c(.4, .85))


Above is an example of what the final product could look like. Visualizing predicted against actual values is an important component of evaluating the quality of a model. Furthermore, having such a visualization will be valuable when interacting with business audiences and “selling” your analysis.

Weekly R-Tips: Importing Packages and User Inputs

Number 1: Importing Multiple Packages

Anyone who has used R for some time has written code that required the use of multiple packages. In most cases, this will be done by using the library or require function to bring in the appropriate extensions.
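
Typically that looks something like this, one call per package (using the same packages as in the vector below):

library(forecast)
library(ggplot2)
library(stringr)
library(lubridate)
library(rockchalk)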


That’s nice and gets the desired result, but can’t we just import all the packages we need in one or two lines? Yes we can, and here are two short ways to do it.

libs <- c("forecast", "ggplot2", "stringr", "lubridate", "rockchalk")
sapply(libs, library, character.only=TRUE, logical.return=TRUE)

libs <- c("forecast", "ggplot2", "stringr", "lubridate", "rockchalk")
lapply(libs, require, character.only=TRUE)

Number 2: User Input

One side project that I hope to start on is a process whereby I can interact with R and select options that will result in particular outcomes. For example, let’s say you’re trying to put together a script that manages a weekly list. A good first step would be presenting a list of options and prompting the user to select one. Here is how R can be used to get user input in such circumstances.

cat("
     1. Add an item
     2. Delete an item
     3. Print the list
     4. Quit
")

action <- readline("Choose an option: ")
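
From there, a simple dispatch on the response might look like the following sketch; the handlers are hypothetical placeholders:

switch(action,
       "1" = cat("Adding an item...\n"),
       "2" = cat("Deleting an item...\n"),
       "3" = cat("Printing the list...\n"),
       "4" = cat("Quitting.\n"),
       cat("Invalid option.\n"))  # default branch when nothing matches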

Automate the Boring Stuff: GGPlot2

The majority of my interaction with the ggplot2 package involves the interactive execution of code to visualize data within the context of exploratory data analysis. This is often a manual and quite laborious process. I recently sought to improve these tasks by creating a series of user-defined functions that contained my most commonly used ggplot calls. These functions could then be sourced in and the appropriate arguments specified to generate the desired visualization. While this is a fairly simple task, attempting to call ggplot2 functions within a user-defined function requires some understanding of R’s evaluation procedures. The key thing to remember is that the aes mapping argument uses non-standard evaluation to specify variable names within ggplot. When programming, it is suggested that we utilize standard evaluation by using aes_string to map the properties of a geom. Here are some examples of how aes_string can be utilized within a function to create graphics.


mydat <- data.frame(date = c(seq(as.Date("2010/01/01"), as.Date("2010/01/31"), by=1)),
                    value1 = abs(round(rnorm(31), 2)),
                    value2 = abs(round(rnorm(31), 2)),
                    value3 = abs(round(rnorm(31), 2)))


viz_func <- function(data, x, y){
  ggplot(data, aes_string(x=x, y=y)) +
    geom_line(lwd=1.05) + geom_point(size=2.5) +
    ggtitle("Insert Title Here") +
    xlab("Date") + ylab("Value") + ylim(0,5) +
    theme(axis.text.x=element_text(colour="black"))
}

viz_func(mydat, 'date', 'value1')

viz_func(mydat, 'date', 'value3') + 
  ggtitle("Insert Different Title Here") +
  xlab("Different Date") + ylab("Different Value")

viz_func <- function(data, x){
  ggplot(data, aes_string(x=x)) +
    geom_histogram() +
    ggtitle("Insert Title Here") +
    xlab("Value") + ylab("Count") + ylim(0,5) +
    theme(axis.text.x=element_text(colour="black"))
}

viz_func(mydat, 'value1')

viz_func(mydat, 'value3') + 
  ggtitle("Insert Different Title Here") +
  xlab("Different Date") + ylab("Different Value")

Applied Statistical Theory: Quantile Regression

This is part two of the ‘applied statistical theory’ series that will cover the bare essentials of various statistical techniques. As analysts, we need to know enough about what we’re doing to be dangerous and explain approaches to others. It’s not enough to say “I used X because the misclassification rate was low.”

Standard linear regression summarizes the average relationship between a set of predictors and the response variable. \beta_1 represents the change in the mean value of Y given a one unit change in X_1 . A single slope is used to describe the relationship. Therefore, linear regression only provides a partial view of the link between the response variable and predictors. This is often inadequate when there is heterogeneous variance between X and Y . In such cases, we need to examine how the relationship between X and Y changes depending on the value of Y . For example, the impact of education on income may be more pronounced for those at higher income levels than for those at lower income levels. Likewise, the effect of parental care on mean infant birth weight can be compared to its effect on other quantiles of infant birth weight. Quantile regression addresses these problems by looking at changes in the different quantiles of the response. The parameter estimates for this technique represent the change in a specified quantile of the response variable produced by a one unit change in the predictor variable. One major benefit of quantile regression is that it makes no assumptions about the error distribution.



library(quantreg)

frmla <- mpg ~ .

u <- seq(0.1, 0.9, by=0.1)              # a series of quantiles
mm <- rq(frmla, data=mtcars, tau=u)     # one fit per quantile
mm <- rq(frmla, data=mtcars, tau=0.50)  # for the median

summ <- summary(mm, se = "boot")
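
When tau is a vector of quantiles, quantreg can also plot how each coefficient estimate varies across the distribution of the response. A quick sketch:

mm_series <- rq(frmla, data=mtcars, tau=seq(0.1, 0.9, by=0.1))
plot(summary(mm_series, se="boot"))  # one panel per coefficient, plotted against the quantile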


Applied Statistical Theory: Belief Networks

Applied statistical theory is a new series that will cover the basic methodology and framework behind various statistical procedures. As analysts, we need to know enough about what we’re doing to be dangerous and explain approaches to others. It’s not enough to say “I used X because the misclassification rate was low.” At the same time, we don’t need to have doctoral level understanding of approach X. I’m hoping that these posts will provide a simple, succinct middle ground for understanding various statistical techniques.

Probabilistic graphical models represent the conditional dependencies between random variables through a graph structure. Nodes correspond to random variables and edges represent statistical dependencies between the variables. Two variables are said to be conditionally dependent if they have a direct impact on each other’s values. Therefore, a graph with directed edges from parent A_p to child B_c denotes a causal relationship. Two variables are conditionally independent if the link between them is conditional on another variable. For a graph with directed edges from A to B and from B to C , this would suggest that A and C are conditionally independent given variable B . Each node fits a probability distribution function that depends only on the value(s) of the variables with edges leading into it. For example, in such a graph the probability distribution for variable C depends only on the value of variable B.


Let’s consider a graphical model with K = (k_1, k_2, ... , k_n) variables and a set of dependencies between the variables, A = (a_1, a_2, ... , a_n) . For each K and A , we denote a set of conditional probability distributions for each K given its parent variables. In a directed acyclic graph where B is the only parent of A, we see that P(A|B,C) = P(A|B) . This means that the probability of A depends only on B; once B is known, the value of C provides no additional information about A. For belief networks, inference involves computing the probability of each value of a node in the network.
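
A tiny sketch of these ideas in R, using the bnlearn package (my choice here; any graphical-model package would do):

library(bnlearn)

# a DAG with directed edges A -> B and B -> C
dag <- model2network("[A][B|A][C|B]")

# d-separation confirms that A and C are conditionally independent given B
dsep(dag, x="A", y="C", z="B")  # TRUE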

There you go; the absolute basics. And below is a presentation on belief networks that I made last year.


Basic Forecasting

Forecasting refers to the process of using statistical procedures to predict future values of a time series based on historical trends. For businesses, being able to gauge expected outcomes for a given time period is essential for managing marketing, planning, and finances. For example, an advertising agency may want to utilize sales forecasts to identify which future months may require increased marketing expenditures. Companies may also use forecasts to identify which salespeople met their expected targets for a fiscal quarter.

There are a number of techniques that can be utilized to generate quantitative forecasts. Some methods are fairly simple, while others are more robust and incorporate exogenous factors. Regardless of what is utilized, the first step should always be to visualize the data using a line graph. You want to consider how the metric changes over time, whether there is a distinct trend, and whether any recurring patterns are noteworthy.

data <- structure(c(12, 20.5, 21, 15.5, 15.3, 23.5, 24.5, 21.3, 23.5,
                    28, 24, 15.5, 17.3, 25.3, 25, 36.5, 36.5, 29.6, 30.5, 28, 26,
                    21.5, 19.7, 19, 16, 20.7, 26.5, 30.6, 32.3, 29.5, 28.3, 31.3,
                    32.2, 26.4, 23.4, 16.4, 15, 16, 18, 27, 21, 49, 21, 22, 28, 36,
                    40, 3, 21, 29, 62, 65, 46, 44, 33, 62, 22, 12, 24, 3, 5, 14,
                    36, 40, 49, 7, 52, 65, 17, 5, 17, 1),
                  .Dim = c(36L, 2L), .Dimnames = list(NULL, c("Advertising", "Sales")),
                  .Tsp = c(2006, 2008.91666666667, 12), class = c("mts", "ts", "matrix"))
head(data); nrow(data)
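
Following that advice, a quick look at both series before any modeling:

# line graphs of both series; plot.ts handles the mts object directly
plot(data, main="Advertising and Sales, 2006-2008")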

There are several key concepts that we should be cognizant of when describing time series data. These characteristics will inform how we pre-process the data and select the appropriate modeling technique and parameters. Ultimately, the goal is to simplify the patterns in the historical data by removing known sources of variation and making the patterns more consistent across the entire data set. Simpler patterns will generally lead to more accurate forecasts.

Trend: A trend exists when there is a long-term increase or decrease in the data.

Seasonality: A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week.

Autocorrelation: Refers to the phenomenon whereby values of Y at time t are impacted by previous values of Y at t-i. To find the proper lag structure and the nature of autocorrelated values in your data, use the autocorrelation function plot (see the sketch after this list).

Stationarity: A time series is said to be stationary if there is no systematic trend, no systematic change in variance, and if strictly periodic variations or seasonality do not exist.
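
As promised above, here is a quick sketch of the autocorrelation check, using the data object defined earlier:

acf(data[, "Sales"])   # correlogram; significant spikes suggest a lag structure
pacf(data[, "Sales"])  # partial autocorrelations help identify an AR order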

Quantitative forecasting techniques are usually based on regression analysis or time series techniques. Regression approaches examine the relationship between the forecasted variable and other explanatory variables using cross-sectional data. Time series models use historical data that has been collected at regular intervals over time for the target variable to forecast its future values. There isn’t time to cover the theory behind each of these approaches in this post, so I’ve chosen to cover high-level concepts and provide code for performing time series forecasting in R. I strongly suggest understanding the statistical theory behind a technique before running the code.

First, we can use the ma function in the forecast package to perform forecasting using the moving average method. This technique estimates future values at time t by averaging values of the time series within k periods of t. When the time series is stationary, the moving average can be very effective, since neighboring observations carry similar information.

library(forecast)

moving_average = forecast(ma(data[1:31,1], order=3), h=5)
moving_average_accuracy = accuracy(moving_average, data[32:36,1])
moving_average; moving_average_accuracy
plot(moving_average, ylim=c(0,60))

Simple exponential smoothing is also a good option when the data has no trend or seasonal patterns. Unlike a moving average, this technique gives greater weight to the most recent observations of the time series.

exp <- ses(data[1:31,1], 5, initial="simple")
exp_accuracy = accuracy(exp, data[32:36,1])
exp; exp_accuracy
plot(exp, ylim=c(0,60))

In the forecast package, there is an automatic forecasting function, auto.arima, that will run through possible models and select the most appropriate one given the data. This could be an autoregressive model of the first order (AR(1)), an ARIMA model with the right values for p, d, and q, or something else that is more appropriate.

train = data[1:31,1]
test = data[32:36,1]
arma_fit <- auto.arima(train)
arma_forecast <- forecast(arma_fit, h = 5)
arma_fit_accuracy <- accuracy(arma_forecast, test)
arma_fit; arma_forecast; arma_fit_accuracy
plot(arma_forecast, ylim=c(0,60))
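
To decide among the three approaches, one option is to compare their hold-out error using the accuracy matrices computed above; a quick sketch (the forecast package reports a "Test set" row with an RMSE column):

# lower RMSE on the test window indicates a better fit
rbind(moving_average = moving_average_accuracy["Test set", "RMSE"],
      exp_smoothing  = exp_accuracy["Test set", "RMSE"],
      auto_arima     = arma_fit_accuracy["Test set", "RMSE"])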

There you go: a basic, non-technical introduction to forecasting. This should get one familiar with the key concepts and with how to perform some basic forecasting in R.