As data analysts, we’re frequently handed comma-separated value (CSV) files and asked to report insights. While it’s tempting to import that data directly into R or Python for data munging and exploratory data analysis, a number of utilities can examine, fix, slice, transform, and summarize data from the command line. In particular, csvkit is a suite of Python-based utilities for working with CSV files from the terminal. For this post, we will grab data using wget, subset rows containing a particular value, and summarize the data in different ways. The goal is to take data on criminal activity, group by a particular offense type, and develop counts to understand the frequency distribution. Let’s start by installing csvkit. Go to your command line and type in the following commands.
$ pip install csvkit
One: Set the working directory.
$ cd /home/abraham/Blog/Chicago_Analysis
Two: Use the wget command to grab data and export it as a
A Markov model can be used to examine a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Let’s define a stochastic process that takes on a finite number of possible values, which are nonnegative integers. Each state, $X_t$, represents its value in time period $t$. If the probability of being in $X_{t+1}$ depends only on $X_t$, this is referred to as the first-order Markov property. We are interested in estimating $P_{ij}$, which is the fixed probability that state $i$ at time $t$ will be followed by state $j$. These $n$-step transition probabilities are calculated through the Chapman-Kolmogorov equations, which relate the joint probability distributions of different sets of coordinates on a stochastic process. Markov chains are generally represented as a state diagram or transition matrix where every row of the matrix, $P$, is a conditional probability mass function. Let’s consider an example using website pathing data from an ecommerce website. The set
Let’s face it, a good statistics refresher is always worthwhile. There are times we all forget basic concepts and calculations. Therefore, I put together a document that could act as a statistics refresher and thought that I’d share it with the world. This is part one of a two-part document that is still being completed. This refresher is based on Principles of Statistics by Bulmer and Statistics in Plain English by Brightman.
The Two Concepts of Probability
Statistical Probability
Statistical probability pertains to the relative frequency with which an event occurs in the long run. Example: Let’s say we flip a coin twice. What is the probability of getting two heads? If we flip a coin twice, there are four possible outcomes, $\{HH, HT, TH, TT\}$. Therefore, the probability of flipping two heads is $1/4$.
Inductive Probability
Inductive probability pertains to the degree of belief which it is reasonable to place on a proposition given evidence. Example: I’m certain that the answer to is between and .
The Two Laws
The rms package offers a variety of tools to build and evaluate regression models in R. Originally named ‘Design’, the package accompanies the book “Regression Modeling Strategies” by Frank Harrell, which is essential reading for anyone who works in the ‘data science’ space. Over the past year or so, I have transitioned my personal modeling scripts to rms, as it makes tasks such as bootstrapping, model validation, and plotting predicted probabilities easier. While the package is fairly well documented, I wanted to put together a ‘simpler’ and more accessible introduction that would explain to R beginners how they could start using the rms package. For those with limited statistics training, I strongly suggest reading “Clinical Prediction Models” and working your way up to “Regression Modeling Strategies”. We start this introduction to the rms package with the datadist function, which computes statistical summaries of predictors to automate estimation and plotting of effects. The user will generally supply the final data
Anyone who has regularly worked with Google Trends data has had to deal with the slightly tedious task of grabbing keyword-level data and reformatting the spreadsheet provided by Google. After looking for a seamless way to pull the data, I came upon the PyTrends library on GitHub and put together some quick user-defined functions to manage the task of pulling daily and weekly trends data.
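As a rough illustration of what such a helper might look like, here is a minimal sketch. It assumes the third-party pytrends package is installed and uses its `TrendReq`, `build_payload`, and `interest_over_time` calls; the function names `timeframe_string` and `pull_trends` are hypothetical and are not taken from the original post. Google Trends, rather than the caller, decides whether a window comes back as daily or weekly data, generally based on how long the requested window is.

```python
from datetime import date


def timeframe_string(start: date, end: date) -> str:
    """Build the 'YYYY-MM-DD YYYY-MM-DD' window string that pytrends accepts."""
    if start > end:
        raise ValueError("start must not be after end")
    return f"{start.isoformat()} {end.isoformat()}"


def pull_trends(keywords, start: date, end: date, geo: str = ""):
    """Return interest-over-time for the keywords as a pandas DataFrame.

    Hypothetical helper: needs network access and the pytrends package.
    """
    # Imported lazily so the pure helper above works without pytrends installed.
    from pytrends.request import TrendReq

    client = TrendReq(hl="en-US", tz=360)
    client.build_payload(
        kw_list=list(keywords),
        timeframe=timeframe_string(start, end),
        geo=geo,
    )
    frame = client.interest_over_time()
    # Drop the bookkeeping column so only the keyword series remain.
    return frame.drop(columns=["isPartial"], errors="ignore")
```

A call such as `pull_trends(["csvkit"], date(2020, 1, 1), date(2020, 3, 31))` would request a three-month window; the keyword here is only a placeholder.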
Given a data frame with multiple columns that contain time series data, let’s say that we are interested in executing an automatic forecasting algorithm on a number of columns. Furthermore, we want to train the model on a particular number of observations and assess how well it forecasts future values. Based upon those testing procedures, we will estimate the full model. This is a fairly simple undertaking, but let’s walk through this task. My preference for such procedures is to loop through each column and append the results into a nested list. First, let’s create some data.
ddat <- data.frame(date = seq(as.Date("2010/01/01"), as.Date("2010/03/02"), by = 1),
                   value1 = abs(round(rnorm(61), 2)),
                   value2 = abs(round(rnorm(61), 2)),
                   value3 = abs(round(rnorm(61), 2)))
head(ddat)
tail(ddat)
We want to forecast future values of the three columns. Because we want to save the results of these models into a list, let’s begin by creating a list that contains the same number of elements as our data frame.
lst.names <-
For those of us who received statistical training outside of statistics departments, that training often emphasized procedures over principles. This meant that we learned about various statistical techniques and how to perform analysis in a particular statistical software package, but glossed over the mechanisms and mathematical statistics underlying these practices. While that training methodology (hereafter referred to as the ‘heuristic method’) has value, it has many drawbacks when the ultimate goal is to perform sound statistical analysis that is valid and thorough. Even in my current role as a data scientist at a technology company in the San Francisco Bay Area, I have had to go back and understand various procedures and metrics instead of just “doing data analysis”. Given this realization, I have dedicated hours outside of work over the last couple of years to “re-training” myself on many of the important concepts in both descriptive and inferential statistics. This post will briefly mention the books that have
I’ve been putting together a basic SQL cheat sheet that could be used as a reference guide. Here is a series of common procedures that should be of use to anyone who uses SQL to extract data. No explanations are provided, as they should largely be known to the end user.
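To give a flavor of the kind of procedure such a cheat sheet covers, here is a small, self-contained sketch that runs two common extraction patterns (filtering with WHERE and ORDER BY, and aggregating with GROUP BY) against an in-memory SQLite database from Python. The `orders` table, its columns, and its rows are made up for illustration and do not come from the cheat sheet itself.

```python
import sqlite3


def demo_queries():
    """Run a filter query and an aggregate query on a toy table."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table used only for this illustration.
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer, amount) VALUES (?, ?)",
        [("ann", 20.0), ("bob", 35.5), ("ann", 15.0), ("cat", 50.0)],
    )

    # Filter rows and sort the result, largest amount first.
    filtered = conn.execute(
        "SELECT customer, amount FROM orders WHERE amount > ? ORDER BY amount DESC",
        (18,),
    ).fetchall()

    # Count orders and total the amount per customer.
    totals = conn.execute(
        "SELECT customer, COUNT(*), SUM(amount) FROM orders "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()

    conn.close()
    return filtered, totals


filtered, totals = demo_queries()
```

Passing values through `?` placeholders rather than string formatting keeps the queries safe to reuse with user-supplied input.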
Let’s say that we estimated a linear regression model on time series data with lagged predictors. The goal is to estimate sales as a function of inventory, search volume, and media spend from two months ago. After using the lm function to perform linear regression, we predict sales using values from two months ago. If this model is estimated weekly or monthly, we will eventually want to understand how well our model did in predicting actual sales from month to month. To perform this task, we must regularly maintain a spreadsheet or data structure (RDS object) with actual and predicted sales figures for each time period. That data can be used to create line graphs that visualize the actual versus predicted values. Here is what the original spreadsheet looked like. Transform that data into long format using whatever package you prefer. This will provide a data frame with three columns. We can utilize the ggplot2 package to create visualizations. Above
Number 1: Importing Multiple Packages
Anyone who has used R for some time has written code that required the use of multiple packages. In most cases, this will be done by using the library or require function to bring in the appropriate extensions. That’s nice and gets the desired result, but can’t we just import all the packages we need in one or two lines? Yes we can, and here is the one line of code to do that.
Number 2: User Input
One side project that I hope to start on is a process whereby I can interact with R and select options that will result in particular outcomes. For example, let’s say you’re trying to put together a script that manages a weekly list. A good first step would be a list of options that the user would see, followed by a prompt to select one. Here is how R can be used to get user input in