A Few Days of Python: Automating Tasks Involving Excel Files

There are plenty of instances where analysts are regularly forwarded Excel spreadsheets and tasked with summarizing the data. In many cases, these scenarios can be automated with fairly simple Python scripts. In the following code, I take an Excel spreadsheet with two sheets, summarize each sheet using a pivot-style aggregation, and add those results to sheets in a new spreadsheet.

import os
import pandas as pd
import numpy as np

def automate(file1, file2):

    # stop early if the input file is empty
    if os.stat(file1).st_size == 0:
        print("File Status: Empty")
        return

    xl = pd.ExcelFile(file1)
    writer = pd.ExcelWriter(file2, engine='xlsxwriter')

    # Sheet1 pivot: average payout per buyer
    dat1 = xl.parse("Sheet1")
    dat1_result = (dat1[["buyer_aid", "lead_payout"]]
                   .groupby("buyer_aid").aggregate(np.mean).reset_index()
                   .sort_values("lead_payout", ascending=False))
    dat1_result.to_excel(writer, sheet_name='Average_Payout', index=False, header=True)

    # Sheet2 pivot: total payout per buyer
    dat2 = xl.parse("Sheet2")
    dat2_result = (dat2[["buyer_aid", "lead_payout"]]
                   .groupby("buyer_aid").aggregate(np.sum).reset_index()
                   .sort_values("lead_payout", ascending=False))
    dat2_result.to_excel(writer, sheet_name='Sum_Payout', index=False, header=True)

    # combine both summaries on the buyer id
    dat3 = pd.merge(dat1_result, dat2_result, how="inner", on="buyer_aid")
    dat3.to_excel(writer, sheet_name='All_Results', index=False, header=True)

    # save the Excel file
    writer.save()

automate("LP_Data.xls", "New_Report.xlsx")

A Few Days of Python: Using R in Python

Using R Functions in Python

The pyper package makes it possible to push data from Python into an embedded R session and call R functions on it. In the sketch below, a time series column is handed to R, where the forecast package's auto.arima function produces an eight-period forecast whose point estimates are pulled back into a pandas DataFrame.

import pandas as pd
import pyper as pr

def zone_func(zone_file):

    dat = pd.read_csv(zone_file, parse_dates=["Row Labels"])

    # pass the series to an R session
    r = pr.R()
    r.assign("rsfores", dat["forest"])

    # fit an ARIMA model in R and return the point forecasts
    r("""
        library(forecast)
        forecast_func <- function(vehicle){
            a_fit <- auto.arima(vehicle)
            a_forecast <- forecast(a_fit, h = 8)
            a_accuracy <- accuracy(a_forecast)
            as.numeric(a_forecast$mean)
        }
        fores_output <- forecast_func(rsfores)
    """)

    # pull the R result back into a pandas DataFrame
    pydata = pd.DataFrame(r.get("fores_output"))
    print(pydata)

Working With SEM Keywords in R

The following post was republished from two previous posts that appeared on an older blog of mine that is no longer available. They are from several years ago and relate to two practical questions that I encountered. One, how can I automatically generate hundreds of thousands of keywords for a search engine marketing campaign? Two, how can I develop an effective system for examining keywords based on their characteristics?

Generating PPC Keywords in R

Paid search marketing refers to the process of driving traffic to a website by purchasing ads on search engines. Advertisers bid on certain keywords that users might search for, and that determines when and where their ads appear. For example, an individual who owns an auto dealership would want to bid on automobile-related keywords that a reasonable person would search for on a search engine. In both Google and Bing, advertisers are able to specify which keywords they would like to bid on and at what amount. If an advertiser wants to bid on just a small number of keywords, they can type that information into the dashboard and specify a bid. But what if you want to bid on a very large number of keywords? Instead of typing each and every keyword into the Google or Bing dashboard, you can generate the keywords programmatically in R.

Let’s say that I run an online retail establishment that sells men’s and women’s streetwear and I want to drive more traffic to my online store by placing ads on both Google and Bing. I want to bid on a large number of keywords related to fashion and have created a set of ‘root’ words that will make up the majority of these keywords. To generate the desired keywords, I have written a function which pastes selected root lists together in every pairwise combination.

root1 = c("fashion", "streetwear")
root2 = c("karmaloop", "crooks and castles", "swag")
root3 = c("urban clothing", "fitted hats", "snapbacks")
root4 = c("best", "authentic", "low cost") 

myfunc <- function(){
      lst <- list(root1=c(root1), root2=c(root2), root3=c(root3), root4=c(root4))
      myone <- function(x, y){
            m1 <- do.call(paste, expand.grid(lst[[x]], lst[[y]]))
            mydf <- data.frame(keyword=c(m1))
            return(mydf)
      }
      mydf <- rbind(myone("root4","root1"), myone("root2","root1"))
      return(mydf)
}

mydat <- myfunc()

write.table(mydat, "adppc.txt", quote=FALSE, row.names=FALSE)

This isn’t the prettiest code in the world, but it gets the job done. In fact, similar results could have been achieved with the following code, which is much more concise.

root5 = c("%s fashion")
root6 = c("%s streetwear")
adcam1 = sprintf(root5, root2)
adcam2 = sprintf(root6, root2)
df = data.frame(keywords=c(adcam1, adcam2))

write.table(df, "adppc.txt", quote=FALSE, row.names=FALSE)

If you have any suggestions for improving my R code, please mention it in the comment section below.

Creating Tags For PPC Keywords

When performing search engine marketing, it is usually beneficial to have a system for making sense of keywords and their performance. While one could construct Bayesian belief networks to model the process of consumers clicking on ads, I have found that using ‘tags’ to categorize keywords is just as useful for post-hoc analysis of the effectiveness of marketing campaigns. By ‘tags,’ I mean identifiers which categorize keywords according to their characteristics. For example, in the following data frame, we have six keywords along with the average bid, clicks, and conversions for each, plus tags for state, model, car, auto, save, and cheap. What we want to do now is set each tag to 1 if and only if the corresponding word is mentioned in the keyword.

df = data.frame(keyword=c("best car insurance",
                          "honda auto insurance",
                          "florida car insurance",
                          "cheap insurance online",
                          "free insurance quotes",
                          "iowa drivers save money"),
                average_bid=c(3.12, 2.55, 2.38, 5.99, 4.75, 4.59),
                clicks=c(15, 20, 30, 50, 10, 25),
                conversions=c(5, 2, 10, 15, 3, 5),
                state=0, model=0, car=0, auto=0, save=0, cheap=0)

main <- function(df) {
  state <- c("michigan", "missouri", "florida", "iowa", "kansas")
  model <- c("honda", "toyota", "ford", "acura", "audi")
  car <- c("car")
  auto <- c("auto")
  save <- c("save")
  cheap <- c("cheap")
  for (i in 1:nrow(df)) {
    Words = strsplit(as.character(df[i, 'keyword']), " ")[[1]]
    if(any(Words %in% state)) df[i, 'state'] <- 1
    if(any(Words %in% model)) df[i, 'model'] <- 1
    if(any(Words %in% car)) df[i, 'car'] <- 1
    if(any(Words %in% auto)) df[i, 'auto'] <- 1
    if(any(Words %in% save)) df[i, 'save'] <- 1
    if(any(Words %in% cheap)) df[i, 'cheap'] <- 1
  }
  return(df)
}

one = main(df)

subset(one, state == 1 | model == 1 | auto == 1)




An alternate way to build the same tags is with the stringr package: collapse each word list into a single regular expression and use str_detect to flag the keywords that match it.

library(stringr)

state <- c("michigan", "missouri", "florida", "iowa", "kansas")
model <- c("honda", "toyota", "ford", "acura", "audi")
car <- c("car")
auto <- c("auto")
save <- c("save")
cheap <- c("cheap")

state_match <- str_c(state, collapse = "|")
model_match <- str_c(model, collapse = "|")
car_match <- str_c(car, collapse = "|")
auto_match <- str_c(auto, collapse = "|")
save_match <- str_c(save, collapse = "|")
cheap_match <- str_c(cheap, collapse = "|")

main <- function(df) {
  df$state <- str_detect(df$keyword, state_match)
  df$model <- str_detect(df$keyword, model_match)
  df$car <- str_detect(df$keyword, car_match)
  df$auto <- str_detect(df$keyword, auto_match)
  df$save <- str_detect(df$keyword, save_match)
  df$cheap <- str_detect(df$keyword, cheap_match)
  return(df)
}

two = main(df)

subset(two, state==TRUE | model==TRUE | auto==TRUE)

By now, some of you are probably wondering why we don’t just select keywords directly from the original data frame based on the desired characteristic. That works too, although I’ve found that the marketing professionals I’ve worked with prefer the ‘tagging’ method.

## Alternate approach - SELECT DIRECTLY


main <- function(df) {
  model <- c("honda", "toyota", "ford", "acura", "audi")
  matches <- df[0, 1:4]
  for (i in 1:nrow(df)) {
    Words <- strsplit(as.character(df[i, 'keyword']), " ")[[1]]
    if(any(Words %in% model)) matches <- rbind(matches, df[i, 1:4])
  }
  return(matches)
}

three = main(df)

So there you have it, a method of ‘tagging’ strings according to a specified set of characteristics. The benefit of using ‘tags’ is that they provide a systematic way to document how the presence of certain words or phrases impacts performance.



Examining Website Pathing Data Using Markov Chains

A Markov model can be used to examine a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Let’s define a stochastic process (X_{n}, n = 0, 1, 2, ...)  that takes on a finite number of possible values, which we take to be nonnegative integers. Each X_{n} represents the state of the process in time period n . If the probability of being in X_{n+1} depends only on X_{n} , the process is said to have the first-order Markov property. We are interested in estimating P_{ij} , the fixed probability that a process in state i  will next move to state j . The n-step transition probabilities can be calculated through the Chapman-Kolmogorov equations, which relate the joint probability distributions of different sets of coordinates on a stochastic process. A Markov chain is generally represented as a state diagram or as a transition matrix P , in which every row is a conditional probability mass function.

Let’s consider an example using pathing data from an ecommerce website. The set of possible outcomes, or sample space, is defined below; each state S_{i} corresponds to a page on the site. This setup may well violate the first-order Markov property, since the next page a visitor requests can depend on more than just the current page, but let’s proceed anyway.

S = (Home, About, Shoes, Denim, Cart)

Given a series of clickstreams, a Markov chain can be fit in order to predict the next page visited. Below are the state diagram and transition matrix for this data. They suggest that, from the Home state, there is an 83% probability that the Shoes state will be visited next.

[Figure: Markov chain state diagram and transition matrix for the five page states.]
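As a rough illustration of how such a transition matrix can be estimated, here is a minimal R sketch; the clicks vector below is a made-up clickstream, and a real analysis would use logged sessions from the site.

# Hypothetical clickstream: one sequence of page visits
clicks <- c("Home", "Shoes", "Cart", "Home", "About", "Home",
            "Shoes", "Denim", "Cart", "Home", "Shoes", "Cart")
states <- c("Home", "About", "Shoes", "Denim", "Cart")

# Count transitions from each page to the next page visited
trans_counts <- table(factor(head(clicks, -1), levels = states),
                      factor(tail(clicks, -1), levels = states))

# Convert the counts to row-wise conditional probabilities
trans_matrix <- prop.table(trans_counts, margin = 1)
round(trans_matrix, 2)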

To be honest, I’m not certain whether this is the best technique for modeling how consumers navigate a website. Markov chains seem promising, but I’m a little uneasy about whether the underlying assumptions are met in such scenarios. If you have ideas for tackling such problems, please comment below.

Statistics Refresher

Let’s face it, a good statistics refresher is always worthwhile. There are times we all forget basic concepts and calculations. Therefore, I put together a document that can act as a statistics refresher and thought that I’d share it with the world. This is part one of a two-part document that is still being completed. The refresher is based on Principles of Statistics by Bulmer and Statistics in Plain English by Brightman.

The Two Concepts of Probability

Statistical Probability

  • Statistical probability pertains to the relative frequency with which an event occurs in the long run.
  • Example:
    Let’s say we flip a coin twice. What is the probability of getting two heads?
    If we flip a fair coin twice, there are four equally likely outcomes, [(H,H), (H,T), (T,H), (T,T)] .
    Therefore, the probability of flipping two heads is \frac{1}{2}*\frac{1}{2} = \frac{1}{4} . A quick simulation check follows below.
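As a sanity check, here is a small R simulation (the seed and number of replications are arbitrary) that estimates the same probability:

# Simulate many pairs of fair coin flips and estimate P(two heads)
set.seed(1)
two_heads <- replicate(100000, all(sample(c("H", "T"), 2, replace = TRUE) == "H"))
mean(two_heads)   # should be close to 0.25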

Inductive Probability

  • Inductive probability pertains to the degree of belief which it is reasonable to place on a proposition given the evidence.
  • Example:
    I’m 95\% certain that the answer to 1 + 1 is between 1.5 and 2.5 .

The Two Laws of Probability

Law of Addition

  • If A and B are mutually exclusive events, the probability that either A  or B will occur is equal to the sum of their separate probabilities.

\displaystyle P(A \space or \space B) = P(A) + P(B)

Law of Multiplication

  • If A and B are two events, the probability that both A and B will occur is equal to the probability that A will occur multiplied by the conditional probability that B  will occur given that A has occurred.

P(A \space and \space B) = P(A) * P(B|A)

Conditional Probability

  • The probability of B  given A , or P(B|A) , is the probability that B will occur if we consider only those occasions on which A also occurs. This is defined as \frac{n(A \space and \space B)}{n(A)} . A short worked example follows below.
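For a quick worked example with hypothetical counts: suppose A occurs on 40 of 100 trials, and A and B occur together on 10 of those trials. Then

P(B|A) = \frac{n(A \space and \space B)}{n(A)} = \frac{10}{40} = 0.25

P(A \space and \space B) = P(A) * P(B|A) = 0.40 * 0.25 = 0.10

which agrees with the 10 joint occurrences out of 100 trials.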

Random Variables and Probability Distributions

Discrete Variables

  • Variables which arise from counting and can only take integral values (0, 1, 2, \ldots) .
  • A frequency distribution represents the number of occurrences of each possible value of a variable. This can be represented in a table or graphically as a probability distribution.
  • Associated with any discrete random variable, X , is a corresponding probability function which tells us the probability with which X takes any value. A particular value that X  can take is denoted by x , and the probability that X takes the value x is given by the probability function P(x) .
  • The cumulative probability function specifies the probability that X is less than or equal to some particular value, x . This is denoted by F(x) and is calculated by summing the probabilities of all values less than or equal to x , as formalized and illustrated below.

F(x) = Prob[X \leq x]

F(x) = P(0) + P(1) + \ldots + P(x) = \sum_{u \leq x} p(u)
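As a concrete illustration, here is how the probability function and cumulative probability function of a fair six-sided die (a made-up example) can be tabulated in R:

# Probability function P(x) and cumulative probability function F(x) for a fair die
x <- 1:6
px <- rep(1/6, 6)     # P(x)
Fx <- cumsum(px)      # F(x) = Prob[X <= x]
data.frame(x, Px = round(px, 3), Fx = round(Fx, 3))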

Continuous Variables

  • Variables which arise from measuring and can take any value within a given range.
  • Continuous variables are best graphically represented by a histogram, where the area of each rectangle represents the proportion of observations falling in that interval.
  • The probability density function, f(x) , refers to the smooth continuous curve that describes the relative likelihood of the random variable taking on a given value. f(x) can also be used to find the probability that the random variable lies between x_1 and x_2 .
  • A continuous probability distribution can also be represented by its cumulative probability function, F(x) , which specifies the probability that X  is less than or equal to x .
  • A continuous random variable is said to be uniformly distributed between 0 and 1 if it is equally likely to lie anywhere in this interval but cannot lie outside it.

Multivariate Distributions

  • The joint frequency distribution of two random variables is called a bivariate distribution. P(x,y) denotes the probability that simultaneously X will be x and Y will be y . This is expressed through a bivariate distribution table.

P(x,y) = Prob[X = x \space and \space Y = y]

  • In a bivariate distribution table, the right hand margin sums the probabilities in different rows. It expresses the overall probability distribution of x , regardless of the value of y .

p(x) = Prob[X = x] = \sum_{y} p(x,y)

  • In a bivariate distribution table, the bottom margin sums the probabilities in different columns. It expresses the overall probability distribution of y , regardless of the value of x . A short R illustration follows below.

p(y) = Prob[Y = y] = \sum_{x} p(x,y)
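As a small illustration, the two margins of a bivariate distribution table can be computed in R; the probabilities below are made up.

# Hypothetical bivariate distribution table: rows are values of X, columns are values of Y
pxy <- matrix(c(0.10, 0.20, 0.15,
                0.25, 0.20, 0.10),
              nrow = 2, byrow = TRUE,
              dimnames = list(X = c("0", "1"), Y = c("0", "1", "2")))
rowSums(pxy)   # p(x): marginal distribution of X
colSums(pxy)   # p(y): marginal distribution of Y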

Properties of Distributions

Measures of Central Tendency

  • The mean is the sum of the observations divided by the number of observations.

\bar{x} = \frac{x_1 + x_2 + \ldots + x_n}{n} = \sum_{i=1}^n \frac{x_i}{n}

  • The median is the middle observation when the values are ordered by size. If the number of observations is even, the median is the average of the two middle observations.
  • The mode refers to the most frequent observation.
  • The main question of interest is whether the sample mean, median, or mode provides the most accurate estimate of central tendency within the population. A quick R illustration of all three follows below.
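A quick R illustration with a made-up sample:

obs <- c(2, 3, 3, 5, 7, 8, 9)                # hypothetical observations
mean(obs)                                    # arithmetic mean
median(obs)                                  # middle observation
as.numeric(names(which.max(table(obs))))     # mode: the most frequent value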

Measures of Dispersion

  • The standard deviation of a set of observations is the square root of the average of the squared deviations from the mean. The average of the squared deviations from the mean is called the variance. Both are written out below.
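In symbols, for observations x_1, \ldots, x_n with mean \bar{x} (note that the sample versions often divide by n - 1 rather than n ):

s^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2

s = \sqrt{ \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 }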

The Shape of Distributions

  • Unimodal distributions have only one peak while multimodal distributions have several peaks.
  • A distribution that is skewed to the right contains a few large values which result in a long tail towards the right hand side of the chart.
  • A distribution that is skewed to the left contains a few small values which result in a long tail towards the left hand side of the chart.
  • The kurtosis of a distribution refers to the degree of peakedness of a distribution.

The Binomial, Poisson, and Exponential Distributions

Binomial Distribution

  • Think of a repeated process with two possible outcomes, failure (F ) and success (S ). After repeating the experiment n times, we will have a sequence of outcomes that includes both failures and successes, such as SFFFSF . The primary metric of interest is the total number of successes.
  • What is the probability of obtaining x  successes and n-x failures in n  repetitions of the experiment? The binomial probability function below gives the answer.
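If each repetition succeeds with probability p independently of the others, the probability of exactly x successes (for x = 0, 1, \ldots, n ) is:

P(x) = \binom{n}{x} p^x (1-p)^{n-x}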

Poisson Distribution

  • The Poisson distribution is the limiting form of the binomial distribution when there is a large number of trials but only a small probability of success at each of them; its probability function is shown below.
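Writing \lambda = np for the mean number of successes, the probability of observing x events (for x = 0, 1, 2, \ldots ) is:

P(x) = \frac{e^{-\lambda} \lambda^x}{x!}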

Exponential Distribution

  • A continuous, positive random variable is said to follow an exponential distribution if its probability density function decreases as x goes from 0  to \infty , so that the probability is highest at the smallest values of x . The density is shown below.
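With rate parameter \lambda > 0 and x \geq 0 , the exponential density is:

f(x) = \lambda e^{-\lambda x}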

The Normal Distribution

Properties of the Normal Distribution

  • The real reason for the importance of the normal distribution lies in the central limit theorem, which states that the sum of a large number of independent random variables will be approximately normally distributed regardless of their individual distributions.
  • A normal distribution is defined by its mean, \mu , and standard deviation, \sigma . A change in the mean shifts the distribution along the x-axis. A change in the standard deviation flattens or compresses the curve while leaving its centre in the same position. The total area under the curve is one, and the mean sits at the middle and divides the area into halves.
  • One standard deviation above and below the mean of a normal distribution covers roughly 68% of the observations for that variable. For two standard deviations that figure is roughly 95%, and for three standard deviations it is roughly 99.7%. These figures are easy to verify in R, as shown below.
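The check uses the standard normal distribution function:

# Proportion of a normal distribution within 1, 2, and 3 standard deviations of the mean
pnorm(1) - pnorm(-1)   # roughly 0.68
pnorm(2) - pnorm(-2)   # roughly 0.95
pnorm(3) - pnorm(-3)   # roughly 0.997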

There you have it, a quick review of basic concepts in statistics and probability. Please leave comments or suggestions below. If you’re looking to hire a marketing scientist, please contact me at mathewanalytics@gmail.com

Logistic Regression in R – Part Two

My previous post covered the basics of logistic regression. We must now examine the model to understand how well it fits the data and generalizes to other observations. The evaluation process involves the assessment of three distinct areas – goodness of fit, tests of individual predictors, and validation of predicted values – in order to produce the most useful model. While the following content isn’t exhaustive, it should provide a compact ‘cheat sheet’ and guide for the modeling process.

Goodness of Fit: Likelihood Ratio Test
A logistic regression is said to provide a better fit to the data if it demonstrates an improvement over a model with fewer predictors. The likelihood ratio test assesses this by comparing the likelihood of the data under the full model against the likelihood of the data under the model with fewer predictors. The null hypothesis, H_0 , holds that the reduced model is true, so a p-value for the overall model fit statistic that is less than 0.05  would compel us to reject H_0 .

library(lmtest)
mod_fit_one <- glm(Class ~ Age + ForeignWorker + Property.RealEstate + Housing.Own +
                   CreditHistory.Critical, data=training, family="binomial")
mod_fit_two <- glm(Class ~ Age + ForeignWorker, data=training, family="binomial")
lrtest(mod_fit_one, mod_fit_two)


Goodness of Fit: Pseudo R^2
With linear regression, the R^2 statistic tells us the proportion of variance in the dependent variable that is explained by the predictors. While no exact equivalent exists for logistic regression, there are a number of pseudo- R^2 measures that can be of value. Most notable is McFadden’s R^2 , which is defined as 1 - \frac{ ln(L_M) }{ ln(L_0) } where ln(L_M) is the log likelihood value for the fitted model and ln(L_0) is the log likelihood for the null model with only an intercept as a predictor. The measure ranges from 0 to just under 1 , with values closer to zero indicating that the model has no predictive power.

library(pscl)
pR2(mod_fit_one)  # look for 'McFadden'

Goodness of Fit: Hosmer-Lemeshow Test
The Hosmer-Lemeshow test examines whether the observed proportions of events are similar to the predicted probabilities of occurrence in subgroups of the dataset, using a Pearson chi-square statistic computed from the 2 x g table of observed and expected frequencies. Small statistics with large p-values indicate a good fit to the data, while large statistics with p-values below 0.05 indicate a poor fit. The null hypothesis holds that the model fits the data, and in the example below we would reject H_0 .


library(MKmisc)
library(ResourceSelection)
HLgof.test(fit = fitted(mod_fit_one), obs = training$Class)
hoslem.test(training$Class, fitted(mod_fit_one), g=10)

Tests of Individual Predictors: Wald Test
A Wald test is used to evaluate the statistical significance of each coefficient in the model. It is calculated by taking the ratio of the square of the regression coefficient to the square of the standard error of the coefficient. The idea is to test the hypothesis that the coefficient of an independent variable in the model is not significantly different from zero. If the test fails to reject the null hypothesis, this suggests that removing the variable from the model will not substantially harm the fit of that model.

library(survey)
regTermTest(mod_fit_one, "ForeignWorker")
regTermTest(mod_fit_one, "CreditHistory.Critical")


Tests of Individual Predictors: Variable Importance
To assess the relative importance of individual predictors in the model, we can also look at the absolute value of the t-statistic for each model parameter. This technique is utilized by the varImp function in the caret package for general and generalized linear models. The t-statistic for each model parameter helps us determine if it’s significantly different from zero.

library(caret)
mod_fit <- train(Class ~ Age + ForeignWorker + Property.RealEstate + Housing.Own +
                 CreditHistory.Critical, data=training, method="glm", family="binomial")
varImp(mod_fit)

Validation of Predicted Values: Classification Rate
With predictive models, the most critical metric is how well the model predicts the target variable on out-of-sample observations. The process involves using the fitted model to predict values on the test set and then comparing the predicted values of the target variable against the observed values for each observation.

pred <- predict(mod_fit, newdata=testing)
accuracy <- table(pred, testing[,"Class"])
sum(diag(accuracy)) / sum(accuracy)   # overall classification rate
confusionMatrix(data=pred, testing$Class)


Validation of Predicted Values: ROC Curve
The receiver operating characteristic (ROC) curve is a measure of classifier performance. It’s based on the proportion of positive data points that are correctly classified as positive, TPR = \frac{TP}{n(Y=1)} , and the proportion of negative data points that are correctly classified as negative, TNR = \frac{TN}{n(Y=0)} ; the curve plots the true positive rate against the false positive rate (1 - TNR), showing the trade-off between the two. Ultimately, we’re concerned with the area under the ROC curve, or AUROC. That metric ranges from 0.50 to 1.00 , and values above 0.80 indicate that the model does a good job of discriminating between the two categories that comprise our target variable.

library(pROC)
library(ROCR)
# Compute AUC for predicting Class with the variable CreditHistory.Critical
f1 = roc(Class ~ CreditHistory.Critical, data=training)
plot(f1, col="red")
# Compute AUC for predicting Class with the model
prob <- predict(mod_fit_one, newdata=testing, type="response")
pred <- prediction(prob, testing$Class)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)
auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
auc


This post has provided a quick overview of how to evaluate logistic regression models in R. If you have any comments or corrections, please comment below.