Data.Table by Example – Part 3

For this final post, I will cover some advanced topics and discuss how to use data tables within user-defined functions. Once again, let’s use the Chicago crime data.

dat = fread("rows.csv")
names(dat) <- gsub(" ", "_", names(dat))
dat[, c("value1", "value2", "value3") := sample(1:50, nrow(dat), replace=TRUE)]
dat[1:3]

Let’s start by subsetting the data. The following code takes the first 50,000 rows of the dat dataset, selects four columns, creates three new columns derived from the date, and then removes the original Date column. The output is saved to a new variable, and you can view the first few rows of the new data table using brackets or the head function. Note that the date parsing below relies on the lubridate package.

ddat = dat[1:50000, .(Date, value1, value2, value3)][,
               c("year", "month", "day") :=
                        .(year(mdy_hms(Date)),
                          month(mdy_hms(Date)),
                          day(mdy_hms(Date)))][,-c("Date")]

ddat[1:3]  # same as head(ddat, 3)

We can now perform intermediate calculations inside braces and suppress their output; only the value of the final expression is returned.

unique(ddat$month)
ddat[, { avg_val1 = mean(value1)
         new_val1 = mean(abs(value2-avg_val1))
         new_val2 = new_val1^2 }, by=month][order(month)]

In this simple case, we have taken the mean of value1 for each month, subtracted it from value2, averaged the absolute differences, and then squared that result. The output shows only the final calculation from the braces, which is the squared value. Note that I also ordered the results by month by chaining another set of brackets.

That’s all very nice and convenient, but what if we want to create user-defined functions that automate these tasks for future work? Let’s start with a function for extracting each component of the date column.

ddat = dat[1:50000, .(Date, value1, value2, value3)]

add_engineered_dates <- function(mydt, date_col="Date"){

   new_dt = copy(mydt)

   new_dt[, paste0(date_col, "_year") := year(mdy_hms(get(date_col)))]
   new_dt[, paste0(date_col, "_month") := month(mdy_hms(get(date_col)))]
   new_dt[, paste0(date_col, "_day") := day(mdy_hms(get(date_col)))] 

   new_dt[, c(date_col) := NULL]

   new_dt[]
}

We’ve created a new function that takes two arguments: the data table and the name of the column containing the date values. The first step is to copy the data table so that we are not directly making changes to the original data. Because the goal is to add three new columns that extract the year, month, and day from the date column, we use the lubridate package to parse the date column and pull out the desired values. Each new column is named so that it clearly combines the original date column name with the component it contains. The next step removes the original date column by setting it to NULL, and the final line, new_dt[], returns the result so it prints when called. Notice that the get function is used to turn the character string holding the column name into the actual column inside the function.
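
If the role of get() is unclear, here is a quick illustration using the Date column of our sample data; get() turns the character string holding a column name into the column itself:

date_col = "Date"
ddat[1:3, get(date_col)]   # returns the same values as ddat[1:3, Date]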

result = add_engineered_dates(ddat)
result

Let’s now apply the earlier brace-based calculation, this time grouped by the year and month columns that the function created.

result[, { avg_val1 = mean(value1)
           new_val1 = mean(abs(value2-avg_val1))
           new_val2 = new_val1^2 }, by=.(Date_year,Date_month)][
                  order(Date_year,Date_month)][1:12]

Perfect!

I’m hoping that these three posts have convinced beginners to the R language that the data.table package is beautiful and worth checking out. The syntax may be scary at first, but it offers an array of powerful tools that everyone should become familiar with. So roll your sleeves up and have fun.

If you have any comments or would like me to cover other specific topics, feel free to comment below. You can also contact me at mathewanalytics@gmail.com or reach me through LinkedIn.


Data.Table by Example – Part 2

In part one, I provided an initial walk-through of some nice features that are available within the data.table package. In particular, we saw how to filter data and get a count of rows by date.

dat = fread("rows.csv")
names(dat) <- gsub(" ", "_", names(dat))
dat[1:3]

Let us now add a few columns to our dataset on reported crimes in the city of Chicago. There are many ways to do this, but they involve the use of the := operator. Since data.table updates values by reference, we do not need to save the result to another variable. This is a very desirable feature.

dat[, c("value1", "value2", "value3") := sample(1:50, nrow(dat), replace=TRUE)]

dat[, `:=`(value1 = sample(1:50, nrow(dat), replace=TRUE),
           value2 = sample(1:50, nrow(dat), replace=TRUE),
           value3 = sample(1:50, nrow(dat), replace=TRUE))]

You can also just use the traditional base R solution for adding new columns as data.tables are also data frames.

dat$value1 <- sample(1:50, nrow(dat), replace=TRUE)
dat$value2 <- sample(1:50, nrow(dat), replace=TRUE)
dat$value3 <- sample(1:50, nrow(dat), replace=TRUE)

In any case, we now have three new columns with randomly selected values between 1 and 50. We can now summarize these values and see how they differ across the primary crime type and other categorical variables.

dat[, .(mean = mean(value1, na.rm = TRUE),
        median = median(value1, na.rm = TRUE),
        min = min(value1, na.rm = TRUE),
        max = max(value1, na.rm = TRUE))]

dat[Primary_Type=="PROSTITUTION",
              .(mean = mean(value1, na.rm = TRUE),
                median = median(value1, na.rm = TRUE),
                min = min(value1, na.rm = TRUE),
                max = max(value1, na.rm = TRUE))]

The above code gives us the mean, median, min, and max of the value1 column, first for all rows and then for rows where the primary type is prostitution. To get these summaries across multiple columns, we can use lapply, .SD, and .SDcols. Notice that in the second line below, we pass Primary_Type to the by operator so that we get the mean for each of the crime type categories.

dat[, lapply(.SD, mean, na.rm=TRUE), .SDcols=c("value1", "value2", "value3")]

dat[, lapply(.SD, mean, na.rm=TRUE), by=Primary_Type, .SDcols=c("value1", "value2", "value3")][1:4]

But wait, that only gives us the mean of each column. If you want to apply multiple functions to those columns, you need to write a helper function that can then be used within lapply.

my.summary <- function(x){
                   c(mean = mean(x),
                     median = median(x),
                     min = min(x),
                     max = max(x))
}

dat[, lapply(.SD, my.summary), .SDcols=c("value1", "value2", "value3")]

dat[, lapply(.SD, my.summary), by=.(Primary_Type), .SDcols=c("value1", "value2", "value3")]

Perfect! As you can see, the syntax is concise and is very easy to work with.

In the next part of this series, I’ll cover a few advanced features like get() and {}.

If you have any questions or comments, feel free to comment below. You can also contact me at mathewanalytics@gmail.com or reach me through LinkedIn.

 

Data.Table by Example – Part 1

For many years, I actively avoided the data.table package and preferred to utilize the tools available in either base R or dplyr for data aggregation and exploration. However, over the past year, I have come to realize that this was a mistake. Data tables are incredible and provide R users with a syntactically concise and efficient data structure for working with small, medium, or large datasets. While the package is well documented, I wanted to put together a series of posts that could be useful for those who want an introduction to the data.table package in a more task-oriented format.

For this series of posts, I will be working with data that comes from the Chicago Police Department’s Citizen Law Enforcement Analysis and Reporting system. This dataset contains information on reported incidents of crime that occurred in the city of Chicago from 2001 to present. You can use the wget command in the terminal to download it as a csv file.

$ wget --no-check-certificate --progress=dot -O rows.csv "https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD"

The file will be saved in your working directory as “rows.csv”. We will import the data into R with the fread function and look at the first few rows and the structure of the data.

dat = fread("rows.csv")

dat[1:3]
str(dat)

Notice that each of the string variables in the data set was imported as a character and not a factor. With base R functions like read.csv, we would have to set the stringsAsFactors argument to FALSE to get the same behavior.
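
For comparison, a base R import that keeps strings as characters might look like this (just a sketch; fread is also considerably faster on a file of this size):

dat_base <- read.csv("rows.csv", stringsAsFactors = FALSE)

Next, we replace the spaces in the column names with underscores so the columns are easier to reference.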

names(dat) <- gsub(" ", "_", names(dat))

Let’s say that we want to see the frequency distribution of several of these variables. This can be done by using .N in conjunction with the by operator.

dat[, .N, by=.(Arrest)]

In the code below, you can also see how to chain operations together. We start by finding the count of each value in the variable, order the counts in descending order, and then select only those that occurred at least 200,000 times.

dat[, .N, by=.(Primary_Type)][order(-N)][N>=200000]

dat[, .N, by=.(Location_Description)][order(-N)][N>=200000]

Let’s say that we want to get a count of prostitution incidents by month. To get the desired results, we will need to truncate the date values to the month, filter rows in which the primary type is “PROSTITUTION”, and then count the rows for each month.

dat[, date2 := paste(substr(as.Date(dat$Date, format="%m/%d/%Y"),1,7), "01", sep="-")][
         Primary_Type=="PROSTITUTION", .N, by=.(date2)][, date2 := as.Date(date2)][order(date2)][]

If you want to plot the results as a line graph, just add another chain which executes the visualization, or use the magrittr %>% operator.

dat[, date2 := paste(substr(as.Date(dat$Date, format="%m/%d/%Y"),1,7), "01", sep="-")][
             Primary_Type=="PROSTITUTION", .N, by=.(date2)][, date2 := as.Date(date2)][order(date2)][,
          plot(N, type="l")]

dat[, date2 := paste(substr(as.Date(dat$Date, format="%m/%d/%Y"),1,7), "01", sep="-")][
            Primary_Type=="PROSTITUTION", .N, by=.(date2)][, date2 := as.Date(date2)][order(date2)] %>%
          ggplot(., aes(date2, N)) + geom_line(group=1)

I’ve obviously skipped over a lot, and some of the code presented here is more verbose than needed. Even so, beginners to R will hopefully find this useful, and it may pique your interest in the data.table package. Future posts will cover more of the goodies available in data.table such as get(), set(), {}, and so forth.

If you have any questions or comments, feel free to comment below. You can also contact me at mathewanalytics@gmail.com or reach me through LinkedIn.

R Programming Notes – Part 2

In an older post, I discussed a number of functions that are useful for programming in R. I wanted to expand on that topic by covering other functions, packages, and tools. Over the past year, I have been working as an R programmer, and these are some of the lessons that have become fundamental in my work.

IS TRUE and IS FALSE

isTRUE is a function that can be very useful for checking whether a condition or variable has been set to true. Let’s say that we are writing a script that runs a generalized linear regression when the parameter run_mod is set to true. The conditional portion of the script can be written as either if(isTRUE(run_mod)) or if(run_mod). I am partial to isTRUE, but this is entirely a matter of personal preference. Users should also be aware of the isFALSE function, which is part of the BBmisc package.


run_mod = TRUE 

if(isTRUE(run_mod)){
	tryCatch(
         GLM_Model(full_data=full.df, train_data=train.df, 
                      test_data=test.df), 
    error = function(e) {
         print("error occured")
         print(e)
    })
}

if(BBmisc::isFALSE(check_one) & BBmisc::isFALSE(check_two)){
	data_output.tmp$score = 0.8
}

INVISIBLE

The invisible function can be used to return an object that is not printed out, which is useful in a number of circumstances. For example, it’s handy when you have helper functions that will be utilized within other functions to do calculations; in those cases, it’s often not desirable to print those results. I generally use invisible when I’m checking function arguments. For example, consider a function that takes two arguments, where you need to check whether the input meets certain requirements before proceeding.


if(!check_input(response, type='character', length=1)) {
    stop('something is wrong')
}

The check_input function is something I created; it has a few lines which use invisible. The idea is for check_input to return true or false based on the inputs so that execution stops when needed.


if(is.null(response) || length(response) == 0) {
    return(FALSE)
} else {
    return(invisible(TRUE))
}
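
For context, here is a minimal sketch of what a check_input helper along these lines might look like. This is my reconstruction rather than the original function; the type and length arguments simply mirror the call shown earlier.

check_input <- function(x, type = "character", length = 1) {
    # Fail fast on missing or empty input
    if (is.null(x) || base::length(x) == 0) {
        return(FALSE)
    }
    # Verify that the input has the expected class and length
    ok <- inherits(x, type) && base::length(x) == length
    # Return invisibly so a passing check prints nothing
    invisible(ok)
}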

DEBUG

When I’m putting together new classes or have multiple functions that interact with one another, I ensure that the code includes a comprehensive debugging process. This means that I’m checking my code at various stages so that I can identify when issues arise. Consider that I’m putting together a function that will go through a number of columns in a data frame, summarize those variables, and save the results as a nested list. To put the code together without issues, I make sure the function takes a debug argument and runs extra checks when it is set to true. In the code below, it will print out values at different stages, and the final block will check the resulting data structure.


DSummary_Variable <- function(data_obj, var, debug=TRUE){
             ......
}

if(debug) message('|==========>>>  Processing of the variable. \n')  

if(debug){ 
  if(!missing(var_summary)){ 
      message('|==========>>>  var_summary has been created and 
              has a length of ', length(var_summary), ' and the 
              nested list has a length of ', 
              length(var_summary[['var_properties']]), ' \n')  
  } else {
      stop("var_summary is missing. Please investigate")
  }
}

If you have multiple functions that interact with one another, it’s a good idea to preface the printed message with the name of the function.


add_func <- function(a,b) a + b

mult_func <- function(a,b) a * b

main_func <- function(data, cola, colb, debug=TRUE){

	if(debug){
		message("mult_func: checking inputs to be used")
	}

	mult_func(data[, cola], data[, colb])

	if(debug){
		message("add_func: checking inputs to be used")
	}

	add_func(data[, cola], data[, colb])
}

Stay tuned for part three, where I’ll talk about the testthat and assertive packages.

R Programming Notes

I’ve been on a note-taking binge recently. This post covers a variety of topics related to programming in R. The contents were gathered from many sources and structured so that they serve as a useful reference guide for a number of handy R functions.

DO.CALL

The do.call function executes a function call on a list of arguments.

do.call("R_Function", "List_of_Arguments")

This is equivilant to telling R which arguments the function should operate on.

R_Function( "List_of_Arguments" ){
  ...
}
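
For instance, the following two calls produce the same output:

paste("chicago", "crime", sep = "_")
do.call("paste", list("chicago", "crime", sep = "_"))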

Consider the following four vectors. We can use do.call to find the total sum across all of them or to bind them together row by row.

x1 <- c(1,2,5)
x2 <- c(1,3,6)
x3 <- c(1,4,7)
x4 <- c(1,5,8)
 
do.call(sum, list(x1,x2,x3,x4))  # sum all list elements
do.call(rbind, list(x1,x2,x3,x4))  # rbind the list elements

Let’s consider a scenario where we have a small data frame and want to run a linear model on different combinations of attributes. One solution to this problem is to create the formula and data as a list within a function, and then utilize do.call to run the model on the desired attribute.

dat <- data.frame(x1 = rnorm(100, m=50), x2 = rnorm(100, m=50), x3 = rnorm(100, m=50), y = rnorm(100))
 
new_mod <- function(form){
  lstt = list(formula=as.formula(paste("y ~ ", form, sep="")), data=as.name("dat"))
  summary(do.call("lm", lstt))
}
 
new_mod("x1")
new_mod("x2")
new_mod("x3")

EVAL

The eval function evaluates an expression, an object that represents an action that can be performed in R. An expression is different from its evaluation, which refers to actually executing that action. In the following example, we assign a value to a variable and perform an operation using that variable; both statements are evaluated immediately.

x <- 4
y <- x * 10
y

The expression and quote functions take an expression as an argument and return it without evaluating it.

ee = expression(y~x1)
ee
 
z <- quote(y <- x * 10)
z
 
is.expression(ee)
is.call(z)
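
To see the difference in action, evaluating the quoted call with eval actually carries out the assignment:

x <- 4
eval(z)   # runs y <- x * 10
y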

In the above example, z is a call object. Call objects can also be created directly with the call function. For example, the following code pieces together a model call and evaluates it.

mycall <- call("lm", quote(y ~ x1), data=quote(dat))
mod <- eval(mycall)
summary(mod)

SUBSTITUTE

Another common procedure is to replace certain variables within an expression. This can be achieved with the substitute function. The code below replaces x1 and x2 in the expression with the names from the list.

replace <- list(x1 = as.name("alphabet"), x2 = as.name("zoology"))
substitute(expression(x1 + x2 + log(x1) + x3), replace)

SETNAMES

There are a number of commands for working with column and row names in R. It’s generally suggested that the setNames function be used when modifying the names of a data structure within a function.

names_new <- c("one","two","three","four")
new_dat <- expression(data.frame(x1 = rnorm(100, m=50), x2 = rnorm(100, m=50), x3 = rnorm(100, m=50), y = rnorm(100)))
head(setNames(eval(new_dat), names_new))
 
my_lst <- list(lname = as.name("xls"))
setNames(my_lst, "x1")

When working with data within a function, it will often be useful to write code that creates the names and data structure, and then evaluates the pieces together.
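
For example, here is a small sketch along those lines (my own illustration rather than code from the original notes) that evaluates the stored expression and attaches the supplied names inside a single helper:

build_df <- function(col_names, df_expr){
  # Evaluate the stored expression and attach the supplied names in one step
  setNames(eval(df_expr), col_names)
}

head(build_df(names_new, new_dat))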

MISSING

Use the missing function to check whether an argument was supplied by the user.

func <- function(x){
    if(missing(x)){
        x = 5
    }
    y = x*x; y
}
 
func(2)
func()

STOP

The stop function is used to halt a function, and is usually used when a particular condition is met.

guess <- function(n){
  if( n >= 6 ){
      stop("WRONG!")
}}
 
guess(5)
guess(10)

WARNING

Use the warning function to issue a warning message when a condition is met. It does not halt the function.

guess <- function(n){
  if( n >= 6 ){
      warning("BE CAREFUL!")
}}
 
guess(5)
guess(10)

ELLIPSIS (…)

When a function is called, R matches each argument it recognizes to the function’s formal arguments. When an ellipsis is added as a function argument, it allows any other arguments to be passed through to the function. This technique is frequently used when plotting and should be used when the function is designed to take any number of named or unnamed arguments.

plotter <- function(x, y, ...){
  plot(x, y, ...)
}

To make fuller use of the ellipsis, it’s suggested that you capture the dots and turn them into a list, since some of the arguments in the dots may be intended for several different functions; a small sketch of this is shown below.
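
For instance, a routing wrapper might look something like this (my own sketch; the choice of which arguments go to plot versus points is illustrative, not from the original post):

plotter2 <- function(x, y, ...){
  dots <- list(...)
  # Send point-specific arguments to points(); everything else goes to plot()
  point_args <- dots[names(dots) %in% c("pch", "cex")]
  plot_args  <- dots[!names(dots) %in% c("pch", "cex")]
  do.call(plot, c(list(x = x, y = y), plot_args))
  do.call(points, c(list(x = x, y = y), point_args))
}

plotter2(1:10, rnorm(10), col = "blue", pch = 19)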

 

Working With SEM Keywords in R

The following post combines two previous posts from an older blog of mine that is no longer available. They are from several years ago and relate to two critical questions that I encountered. One, how can I automatically generate hundreds of thousands of keywords for a search engine marketing campaign? Two, how can I develop an effective system for examining keywords based on different characteristics?

Generating PPC Keywords in R

Paid search marketing refers to the process of driving traffic to a website by purchasing ads on search engines. Advertisers bid on certain keywords that users might search for, and that determines when and where their ads appear. For example, an individual who owns an auto dealership would want to bid on keywords relating to automobiles that people would plausibly search for on a search engine. In both Google and Bing, advertisers are able to specify which keywords they would like to bid on and at what amount. If the user decides to bid on just a small number of keywords, they can type that information in and specify a bid. However, what if you want to bid on a significant number of keywords? Instead of typing each and every keyword into the Google or Bing dashboard, you could programmatically generate the keywords in R.

Let’s say that I run an online retail establishment that sells men’s and women’s streetwear and I want to drive more traffic to my online store by placing ads on both Google and Bing. I want to bid on a number of keywords related to fashion and have created several ‘root’ words that will comprise the majority of these keywords. To generate my desired keywords, I have written a function which pastes together combinations of the root words.

root1 = c("fashion", "streetwear")
root2 = c("karmaloop", "crooks and castles", "swag")
root3 = c("urban clothing", "fitted hats", "snapbacks")
root4 = c("best", "authentic", "low cost") 

myfunc <- function(){
      lst <- list(root1=c(root1), root2=c(root2), root3=c(root3),
            root4=c(root4))
      myone <- function(x, y){
            m1 <- do.call(paste, expand.grid(lst[[x]], lst[[y]]))
            mydf <- data.frame(keyword=c(m1))
      }
      mydf <- rbind(myone("root4","root1"), myone("root2","root1"))
      }

mydat <- myfunc()
mydat

write.table(mydat, "adppc.txt", quote=FALSE, row.names=FALSE)

This isn’t the prettiest code in the world, but it gets the job done. In fact, similar results could have been achieved using the following code, which is more concise.

root5 = c("%s fashion")
root6 = c("%s streetwear")
adcam1 = sprintf(root5, root2)
adcam2 = sprintf(root6, root2)
df = data.frame(keywords=c(adcam1, adcam2))

write.table(df, "adppc.txt", quote=FALSE, row.names=FALSE)

If you have any suggestions for improving my R code, please mention it in the comment section below.

Creating Tags For PPC Keywords

When performing search engine marketing, it is usually beneficial to construct a system for making sense of keywords and their performance. While one could construct Bayesian belief networks to model the process of consumers clicking on ads, I have found that using ‘tags’ to categorize keywords is just as useful for conducting post-hoc analysis on the effectiveness of marketing campaigns. By ‘tags,’ I mean identifiers which categorize keywords according to their characteristics. For example, in the following data frame, we have six keywords, our average bids, numbers of clicks and conversions, and tags for state, model, car, auto, save, and cheap. What we want to do now is set the flag for each tag to 1 if and only if that tag is mentioned in the keyword.

# CREATE SOME DATA
df = data.frame(keyword=c("best car insurance",
                          "honda auto insurance",
                          "florida car insurance",
                          "cheap insurance online",
                          "free insurance quotes",
                          "iowa drivers save money"),
                average_bid=c(3.12, 2.55, 2.38, 5.99, 4.75, 4.59),
                clicks=c(15, 20, 30, 50, 10, 25),
                conversions=c(5, 2, 10, 15, 3, 5),
                state=0, model=0, car=0, auto=0, save=0, cheap=0)
df

# FUNCTION WHICH SETS EACH TAG TO 1 IF THE SPECIFIED TAG IS PRESENT IN THE KEYWORD
main <- function(df) {
  state <- c("michigan", "missouri", "florida", "iowa", "kansas")
  model <- c("honda", "toyota", "ford", "acura", "audi")
  car <- c("car")
  auto <- c("auto")
  save <- c("save")
  cheap <- c("cheap")
  for (i in 1:nrow(df)) {
    Words = strsplit(as.character(df[i, 'keyword']), " ")[[1]]
    if(any(Words %in% state)) df[i, 'state'] <- 1
    if(any(Words %in% model)) df[i, 'model'] <- 1 
    if(any(Words %in% car)) df[i, 'car'] <- 1
    if(any(Words %in% auto)) df[i, 'auto'] <- 1     
    if(any(Words %in% save)) df[i, 'save'] <- 1
    if(any(Words %in% cheap)) df[i, 'cheap'] <- 1
  }
  return(df)
}

one = main(df)

subset(one, state==TRUE | model==TRUE | auto==TRUE)

# AN ALTERNATE METHOD USING THE STRINGR PACKAGE

df

library(stringr)

# CREATE EACH TAG
state <- c("michigan", "missouri", "florida", "iowa", "kansas")
model <- c("honda", "toyota", "ford", "acura", "audi")
car <- c("car")
auto <- c("auto")
save <- c("save")
cheap <- c("cheap")

state_match <- str_c(state, collapse = "|")
model_match <- str_c(model, collapse = "|")
car_match <- str_c(car, collapse = "|")
auto_match <- str_c(auto, collapse = "|")
save_match <- str_c(save, collapse = "|")
cheap_match <- str_c(cheap, collapse = "|")

#FUNCTION TO SET TAG IF PRESENT IN THE KEYWORD
main <- function(df) {
  df$state <- str_detect(df$keyword, state_match)
  df$model <- str_detect(df$keyword, model_match)
  df$car <- str_detect(df$keyword, car_match)
  df$auto <- str_detect(df$keyword, auto_match)
  df$save <- str_detect(df$keyword, save_match)
  df$cheap <- str_detect(df$keyword, cheap_match)
  df
}

two = main(df)

subset(two, state==TRUE | model==TRUE | auto==TRUE)

By now, some of you are probably wondering why we don’t just select the keywords directly from the original data frame based on the desired characteristic. Well, that works too, although I’ve found that the marketing professionals I’ve worked with have preferred the ‘tagging’ method.

## Alternate approach - SELECT DIRECTLY

df

main <- function(df) {
  model <- c("honda", "toyota", "ford", "acura", "audi")
  # Keep every row whose keyword mentions one of the model words
  keep <- sapply(strsplit(as.character(df$keyword), " "),
                 function(Words) any(Words %in% model))
  df[keep, 1:4]
}

three = main(df)

So there you have it, a method of ‘tagging’ strings according to a certain set of specified characteristics. The benefit of using ‘tags’ is that it provides you with a systematic way to document how the presence of certain words or phrases impacts performance.
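
As a quick illustration of that kind of post-hoc analysis (my own sketch using the tagged data frame created above), you can compare average clicks and conversions for keywords with and without a given tag:

# Average clicks and conversions for keywords with vs. without the model tag
aggregate(cbind(clicks, conversions) ~ model, data = one, FUN = mean)

# The same comparison for the cheap tag
aggregate(cbind(clicks, conversions) ~ cheap, data = one, FUN = mean)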

 

 

Using csvkit to Summarize Data: A Quick Example

As data analysts, we’re frequently presented with comma-separated value files and tasked with reporting insights. While it’s tempting to import that data directly into R or Python in order to perform data munging and exploratory data analysis, there are also a number of utilities for examining, fixing, slicing, transforming, and summarizing data through the command line. In particular, csvkit is a suite of Python-based utilities for working with CSV files from the terminal. For this post, we will grab data using wget, subset rows containing a particular value, and summarize the data in different ways. The goal is to take data on criminal activity, group it by a particular offense type, and develop counts to understand the frequency distribution.

Let’s start by installing csvkit. Go to your command line and type in the following command.

$ pip install csvkit

One: Set the working directory.

$ cd /home/abraham/Blog/Chicago_Analysis

Two: Use the wget command to grab the data and save it as a csv file entitled rows.csv.

$ wget --no-check-certificate --progress=dot -O rows.csv "https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD"

This dataset contains information on reported incidents of crime that occurred in the city of Chicago from 2001 to present. The data comes from the Chicago Police Department’s Citizen Law Enforcement Analysis and Reporting system.

Three: Let’s check to see which files are now in the working directory and how many rows that file contains. We will also use the csvcut command to identify the names of each column within that file.

$ ls
$ wc -l rows.csv
$ csvcut -n rows.csv

Four: Using csvsql, let’s find what unique values are in the sixth column of the file, Primary Type. Since we’re interested in incidents of prostitution, those observations will be subset using the csvgrep command and transferred into a csv file entitled rows_pros.csv.

$ csvsql --query "SELECT [Primary Type], COUNT(*) FROM rows GROUP BY [Primary Type]" rows.csv | csvlook
$ csvgrep -c 6 -m PROSTITUTION rows.csv > rows_pros.csv

Five: Use csvlook and head to have a look at the first few rows of the new csv file. The ‘Primary Type’ column should now only contain incidents of crime that involve prostitution.

$ wc -l rows_pros.csv
$ csvlook rows_pros.csv | head

Six: We’ve now got the data we need. So let’s do a quick count of each description that is associated with the prostitution offense. This is done using the csvsql and csvlook command line tools.

$ csvsql --query "SELECT [Primary Type], Description, COUNT(*) FROM rows_pros GROUP BY Description" rows_pros.csv | csvlook


This has been a quick example of how the various csvkit utilities can be used to take a large csv file, extract specific observations, and generate summary statistics by executing a SQL query on that data. While this same analysis could have been performed in R or Python in a more efficient manner, it’s important for analysts to remember that the command line offers a variety of important utilities that can simplify their daily job responsibilities.