Applied Statistical Theory: Quantile Regression

This is part two of the ‘applied statistical theory’ series that will cover the bare essentials of various statistical techniques. As analysts, we need to know enough about what we’re doing to be dangerous and explain approaches to others. It’s not enough to say “I used X because the misclassification rate was low.”

Standard linear regression summarizes the average relationship between a set of predictors and the response variable. \beta_1 represents the change in the mean value of Y given a one unit change in X_1 . A single slope is used to describe the relationship. Therefore, linear regression only provides a partial view of the link between the response variable and predictors. This is often inadequate when there is heterogeneous variance between X and Y . In such cases, we need to examine how the relationship between X and Y changes depending on the value of Y . For example, the impact of education on income may be more pronounced for those at higher income levels than for those at lower income levels. Likewise, the effect of parental care on mean infant birth weight can be compared to its effect on other quantiles of infant birth weight. Quantile regression addresses these problems by looking at changes in the different quantiles of the response. The parameter estimates for this technique represent the change in a specified quantile of the response variable produced by a one unit change in the predictor variable. One major benefit of quantile regression is that it makes no assumptions about the error distribution.
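For a quick example, the rq function in the quantreg package fits these models. The snippet below uses the built-in mtcars data and fits the model across a series of quantiles as well as at the median, with bootstrapped standard errors.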



library(quantreg)

frmla <- mpg ~ .

u <- seq(0.1, 0.9, by = 0.1)                 # a series of quantiles (grid assumed; pick what suits the analysis)
mm <- rq(frmla, data = mtcars, tau = u)      # one fit per quantile
mm <- rq(frmla, data = mtcars, tau = 0.50)   # for the median

summ <- summary(mm, se = "boot")             # bootstrapped standard errors


Applied Statistical Theory: Belief Networks

Applied statistical theory is a new series that will cover the basic methodology and framework behind various statistical procedures. As analysts, we need to know enough about what we’re doing to be dangerous and explain approaches to others. It’s not enough to say “I used X because the misclassification rate was low.” At the same time, we don’t need to have doctoral level understanding of approach X. I’m hoping that these posts will provide a simple, succinct middle ground for understanding various statistical techniques.

Probabilistic graphical models represent the conditional dependencies between random variables through a graph structure. Nodes correspond to random variables and edges represent statistical dependencies between the variables. Two variables are said to be conditionally dependent if they have a direct impact on each other’s values. Therefore, a graph with a directed edge from parent A_p to child B_c denotes a causal relationship. Two variables are conditionally independent if the link between them is conditional on another variable. For a graph with directed edges from A to B and from B to C , this suggests that A and C are conditionally independent given variable B . Each node is associated with a probability distribution that depends only on the values of the variables with edges leading into it. For example, the probability distribution for variable C in the following graphic depends only on the value of variable B.


Let’s consider a graphical model with K = (k_1, k_2, ... , k_n) variables and a set of dependencies between the variables, A = (a_1, a_2, ... , a_n) . For each K and A , we denote a set of conditional probability distributions for each K given its parent variables. In the following directed acyclic graph, we see that P(A|B,C) = P(A|B) . This means that the probability of A is conditionally dependent only on B , and knowing the value of C provides no additional information about A once B is known. For belief networks, inference involves computing the probability of each value of a node in the network.
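To make the idea of conditional independence concrete, here is a small sketch with made-up probability tables for the chain A -> B -> C, verifying numerically that P(A|B,C) = P(A|B).

# made-up binary probability tables for the chain A -> B -> C
p_A   <- c(0.6, 0.4)                          # P(A = 0), P(A = 1)
p_BgA <- matrix(c(0.7, 0.3,                   # P(B | A): rows index A, columns index B
                  0.2, 0.8), nrow = 2, byrow = TRUE)
p_CgB <- matrix(c(0.9, 0.1,                   # P(C | B): rows index B, columns index C
                  0.4, 0.6), nrow = 2, byrow = TRUE)

# the joint distribution factorizes along the graph: P(A, B, C) = P(A) * P(B|A) * P(C|B)
joint <- array(0, dim = c(2, 2, 2))
for (a in 1:2) for (b in 1:2) for (k in 1:2)
  joint[a, b, k] <- p_A[a] * p_BgA[a, b] * p_CgB[b, k]

# both quantities come out the same, so C adds nothing about A once B is known
joint[2, 2, 2] / sum(joint[, 2, 2])     # P(A = 1 | B = 1, C = 1)
sum(joint[2, 2, ]) / sum(joint[, 2, ])  # P(A = 1 | B = 1)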

There you go; the absolute basics. And below is a presentation on belief networks that I made last year.


Basic Forecasting

Forecasting refers to the process of using statistical procedures to predict future values of a time series based on historical trends. For businesses, being able to gauge expected outcomes for a given time period is essential for managing marketing, planning, and finances. For example, an advertising agency may want to utilize sales forecasts to identify which future months may require increased marketing expenditures. Companies may also use forecasts to identify which sales persons met their expected targets for a fiscal quarter.

There are a number of techniques that can be utilized to generate quantitative forecasts. Some methods are fairly simple while others are more robust and incorporate exogenous factors. Regardless of what is utilized, the first step should always be to visualize the data using a line graph. You want to consider how the metric changes over time, whether there is a distinct trend, or if there are distinct patterns that are noteworthy.

data <- structure(c(12, 20.5, 21, 15.5, 15.3, 23.5, 24.5, 21.3, 23.5,
                    28, 24, 15.5, 17.3, 25.3, 25, 36.5, 36.5, 29.6, 30.5, 28, 26,
                    21.5, 19.7, 19, 16, 20.7, 26.5, 30.6, 32.3, 29.5, 28.3, 31.3,
                    32.2, 26.4, 23.4, 16.4, 15, 16, 18, 27, 21, 49, 21, 22, 28, 36,
                    40, 3, 21, 29, 62, 65, 46, 44, 33, 62, 22, 12, 24, 3, 5, 14,
                    36, 40, 49, 7, 52, 65, 17, 5, 17, 1),
                  .Dim = c(36L, 2L), .Dimnames = list(NULL, c("Advertising", "Sales")),
                  .Tsp = c(2006, 2008.91666666667, 12), class = c("mts", "ts", "matrix"))
head(data); nrow(data)
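
As suggested above, the first step is to plot the series. A minimal sketch using base R’s plotting method for time series objects:

# line plot of both series before any modeling
plot(data, main = "Monthly Advertising and Sales, 2006-2008")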

There are several key concepts that we should be cognizant of when describing time series data. These characteristics will inform how we pre-process the data and select the appropriate modeling technique and parameters. Ultimately, the goal is to simplify the patterns in the historical data by removing known sources of variation and making the patterns more consistent across the entire data set. Simpler patterns will generally lead to more accurate forecasts.

Trend: A trend exists when there is a long-term increase or decrease in the data.

Seasonality: A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week.

Autocorrelation: Refers to the phenomenon whereby values of Y at time t are impacted by previous values of Y at t-i. To find the proper lag structure and the nature of autocorrelated values in your data, use the autocorrelation function plot (see the short example after this list).

Stationarity: A time series is said to be stationary if there is no systematic trend, no systematic change in variance, and if strictly periodic variations or seasonality do not exist.
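
To check for autocorrelation and seasonality in practice, two quick diagnostics are the autocorrelation function plot and a classical decomposition. Below is a small sketch on the Advertising column of the data defined above, using base R functions only.

# rebuild the Advertising column as a monthly time series
adv <- ts(data[, "Advertising"], start = c(2006, 1), frequency = 12)

acf(adv)              # spikes outside the dashed bands indicate autocorrelation at those lags
plot(decompose(adv))  # splits the series into trend, seasonal, and remainder components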

Quantitative forecasting techniques are usually based on regression analysis or time series techniques. Regression approaches examine the relationship between the forecasted variable and other explanatory variables using cross-sectional data. Time series models use historical data that has been collected at regular intervals over time for the target variable to forecast its future values. There isn’t time to cover the theory behind each of these approaches in this post, so I’ve chosen to cover high level concepts and provide code for performing time series forecasting in R. I strongly suggest understanding the statistical theory behind a technique before running the code.

First, we can use the ma function in the forecast package to perform forecasting using the moving average method. This technique estimates future values at time t by averaging values of the time series within k periods of t. When the time series is stationary, the moving average can be very effective since neighboring observations tend to be similar in value.

library(forecast)

# smooth the first 31 months with a moving average of order 3 and forecast 5 periods ahead
moving_average <- forecast(ma(data[1:31, 1], order = 3), h = 5)
moving_average_accuracy <- accuracy(moving_average, data[32:36, 1])
moving_average; moving_average_accuracy
plot(moving_average, ylim = c(0, 60))

Simple exponential smoothing is also a good choice when the data has no trend or seasonal patterns. Unlike a moving average, this technique gives greater weight to the most recent observations of the time series.

# simple exponential smoothing with a 5-period forecast horizon
exp <- ses(data[1:31, 1], 5, initial = "simple")
exp_accuracy <- accuracy(exp, data[32:36, 1])
exp; exp_accuracy
plot(exp, ylim = c(0, 60))

In the forecast package, there is an automatic forecasting function that will run through possible models and select the most appropriate model given the data. This could be an autoregressive model of the first order (AR(1)), an ARIMA model with the right values for p, d, and q, or something else that is more appropriate.

# hold out the last 5 months for testing
train <- data[1:31, 1]
test <- data[32:36, 1]
arma_fit <- auto.arima(train)
arma_forecast <- forecast(arma_fit, h = 5)
arma_fit_accuracy <- accuracy(arma_forecast, test)
arma_fit; arma_forecast; arma_fit_accuracy
plot(arma_forecast, ylim = c(0, 60))
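
With three candidate models fit on the same training window, one reasonable way to choose among them is to line up their hold-out accuracy measures. A small sketch, assuming the accuracy objects created above:

# compare out-of-sample error across the three approaches
cols <- c("RMSE", "MAE", "MAPE")
rbind(moving_average = moving_average_accuracy["Test set", cols],
      exp_smoothing  = exp_accuracy["Test set", cols],
      auto_arima     = arma_fit_accuracy["Test set", cols])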

There you go, a basic non-technical introduction to forecasting. This should get you familiar with the key concepts and how to perform some basic forecasting in R.

Writing a Resignation Letter in LaTeX

Here’s a resignation letter I wrote earlier this year in LaTeX. Of course, such tasks are much easier done in Microsoft Word or Open Office, but I somehow enjoy using LaTeX to typeset documents. It’s not the most efficient and often not the most practical solution, but if you ever need to quit your job, here’s a quick LaTeX document for that.





Abraham Mathew \\
Address, Apt \ \\
City, ST 55408 \\
720-648-0000 \\ \\ 

Joe Smith \\	 
Director \\
Company AAA \\
1111 North Lewis \\
City, ST 22222 \\


Dear Mr. Smith,


The purpose of this letter is to formally notify you that I am resigning from my position as 
Statistical Analyst at Company AAA. My last day of employment will be March 23.


I'm very appreciative of the opportunities that were granted to me at Company AAA. Thank you 
for the guidance and professional support. Company AAA was a great place to work at; it's 
plenty obvious why the company is regularly voted as one of the best places to work. I wish 
you and the company all the best.


Abraham Mathew \\


My LaTeX Preamble

Over the past few years, I’ve relied less on LaTeX for writing documents. This is primarily because documentation hasn’t been a critical part of the jobs that I’ve had. Another reason is that I’ve found a format and structure that works well, and so I end up using this template over and over for every task that involves document writing. Below is that LaTeX preamble and a brief document that uses my preferred structure.




\raisebox{0.6in}[0in]{\makebox[\textwidth][r]{Abraham Mathew - 10/18/2014}}

{\bfseries\large Tidy Data}



\section*{What is tidy data?}

\begin{itemize}
    \item Tidy data possesses the following characteristics.
    \begin{itemize}
        \item Each variable forms a column.
        \item Each observation forms a row.
        \item Each type of observational unit forms a table.
    \end{itemize}
    \item Tidy data facilitates initial exploration and analysis of the data, and simplifies the development of data analysis tools that work well together.
    \item Tidying refers to the act of structuring datasets to facilitate analysis.
\end{itemize}

\section*{What is messy data?}

\begin{itemize}
    \item Messy data is any dataset that does not conform to the characteristics of a tidy data set.
    \item Common problems associated with messy datasets include the following.
    \begin{itemize}
        \item Column headers are values and possess no variable names.
        \item Multiple variables are stored in one column.
        \item Variables are stored in both rows and columns.
        \item Multiple types of observational units are stored in the same table.
        \item A single observational unit is stored in multiple tables.
    \end{itemize}
    \item A little bit of melting, string splitting, and casting can easily correct these issues.
\end{itemize}
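
The melting and casting mentioned in those notes are easy to do in R. Here is a small sketch with a made-up table (using the reshape2 package), where the column headers are values rather than variable names.

library(reshape2)

# made-up wide table: the year columns are values, not variables
wide <- data.frame(country = c("A", "B"),
                   `2013` = c(10, 20),
                   `2014` = c(15, 25),
                   check.names = FALSE)

# melt into tidy form: one row per country-year observation
tidy <- melt(wide, id.vars = "country",
             variable.name = "year", value.name = "cases")
tidy

# cast back to the wide layout if needed
dcast(tidy, country ~ year, value.var = "cases")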


Do you have a preamble or structure that you use regularly? Please share in the comments below.

A Few Days of Python: Automating Tasks Involving Excel Files

There are plenty of instances where analysts are regularly forwarded xls spreadsheets and tasked with summarizing the data. In many cases, these scenarios can be automated through fairly simple Python scripts. In the following code, I take an Excel spreadsheet with two sheets, summarize each sheet using a pivot table, and add those results to sheets in a new spreadsheet.

import os
import sys
import pandas as pd
import numpy as np
import xlsxwriter

def automate(file1, file2):

    # stop early if the input file is empty
    if os.stat(file1).st_size == 0:
        print("File Status: Empty")
        return

    xl = pd.ExcelFile(file1)
    writer = pd.ExcelWriter(file2, engine='xlsxwriter')

    # Sheet1 pivot: average payout by buyer
    dat1 = xl.parse("Sheet1")
    dat1_result = (dat1[["buyer_aid", "lead_payout"]]
                   .groupby("buyer_aid")
                   .aggregate(np.mean)
                   .reset_index()
                   .sort_values("lead_payout", ascending=False))
    dat1_result.to_excel(writer, sheet_name='Average_Payout', index=False, header=True)

    # Sheet2 pivot: total payout by buyer
    dat2 = xl.parse("Sheet2")
    dat2_result = (dat2[["buyer_aid", "lead_payout"]]
                   .groupby("buyer_aid")
                   .aggregate(np.sum)
                   .reset_index()
                   .sort_values("lead_payout", ascending=False))
    dat2_result.to_excel(writer, sheet_name='Sum_Payout', index=False, header=True)

    # join the two summaries on buyer_aid and write them to a third sheet
    dat3 = pd.merge(dat1_result, dat2_result, on="buyer_aid", how="inner",
                    suffixes=("_mean", "_sum"))
    dat3.to_excel(writer, sheet_name='All_Results', index=False, header=True)

    # save the Excel file
    writer.save()

automate("LP_Data.xls", 'New_Report.xlsx')

A Few Days of Python: Using R in Python

Using R Functions in Python
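
The code below uses the pyper package to pass a pandas column into an R session, fit an auto.arima model with R’s forecast package, and then pull the forecasted values back into Python as a data frame.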

import pandas as pd
import pyper as pr

def zone_func(zone_file):

    dat = pd.read_csv(zone_file, parse_dates=["Row Labels"])

    # start an R session and push the pandas column into it
    r = pr.R()
    r.assign("rsfores", dat["forest"])

    # define and run the forecasting function inside R
    # (assumes the forecast package is installed in the R library)
    r("""
        library(forecast)
        forecast_func <- function(vehicle){
            a_fit <- auto.arima(vehicle)
            a_forecast <- forecast(a_fit, h = 8)
            a_accuracy <- accuracy(a_forecast)   # in-sample accuracy measures
            as.data.frame(a_forecast)            # return the forecasted values
        }
        fores_output <- forecast_func(rsfores)
    """)

    # pull the R result back into a pandas data frame
    pydata = pd.DataFrame(r.get("fores_output"))
    print(pydata)
    return pydata