Using csvkit to Summarize Data: A Quick Example

As data analysts, we’re frequently handed comma-separated value files and asked to report insights. While it’s tempting to import that data directly into R or Python for data munging and exploratory data analysis, there are also a number of utilities for examining, fixing, slicing, transforming, and summarizing data from the command line. In particular, csvkit is a suite of Python-based utilities for working with CSV files from the terminal. In this post, we will grab data using wget, subset rows containing a particular value, and summarize the data in a few different ways. The goal is to take data on criminal activity, group it by a particular offense type, and develop counts to understand the frequency distribution.

Let’s start by installing csvkit. Go to your command line and type in the following command.

$ pip install csvkit

One: Set the working directory.

$ cd /home/abraham/Blog/Chicago_Analysis

Two: Use the wget command to grab the data and save it as a csv file entitled rows.csv.

$ wget --no-check-certificate --progress=dot -O rows.csv "https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD"

This dataset contains information on reported incidents of crime that occurred in the city of Chicago from 2001 to the present. The data comes from the Chicago Police Department’s Citizen Law Enforcement Analysis and Reporting system.

Three: Let’s check to see which files are now in the working directory and how many rows that file contains. We will also use the csvcut command to identify the names of each column within that file.

$ ls
$ wc -l rows.csv
$ csvcut -n rows.csv

Four: Using csvsql, let’s find the unique values in the sixth column of the file, Primary Type. Since we’re interested in incidents of prostitution, those observations will be subset using the csvgrep command and written to a new csv file entitled rows_pros.

$ csvsql --query "SELECT [Primary Type], COUNT(*) FROM rows GROUP BY [Primary Type]" rows.csv | csvlook
$ csvgrep -c 6 -m PROSTITUTION rows.csv > rows_pros.csv
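The keep-the-header-then-filter pattern that csvgrep performs can also be sketched with plain shell tools. Here is a minimal sketch on a made-up toy file — the file name and values below are invented for illustration and are not drawn from the Chicago data:

```shell
# build a tiny stand-in for rows.csv (made-up values)
printf 'ID,Primary Type\n1,THEFT\n2,PROSTITUTION\n3,BATTERY\n4,PROSTITUTION\n' > toy.csv

# keep the header, then append only the rows whose second field matches exactly
head -n 1 toy.csv > toy_pros.csv
awk -F, '$2 == "PROSTITUTION"' toy.csv >> toy_pros.csv

wc -l toy_pros.csv    # header plus the two matching rows
```

Note that csvgrep handles quoted fields with embedded commas correctly, which the naive awk split above does not — one reason to reach for csvkit on real-world CSVs.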

Five: Use csvlook and head to look at the first few rows of the new csv file. The ‘Primary Type’ column should now only contain incidents of crime that involved prostitution.

$ wc -l rows_pros.csv
$ csvlook rows_pros.csv | head

Six: We’ve now got the data we need, so let’s do a quick count of each description associated with the prostitution offense. This is done using the csvsql and csvlook command-line tools.

$ csvsql --query "SELECT [Primary Type], Description, COUNT(*) FROM rows_pros GROUP BY Description" rows_pros.csv | csvlook
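The GROUP BY/COUNT(*) query above has a classic plain-shell analogue in sort | uniq -c. A minimal sketch on a made-up column of descriptions (the file and values are invented for illustration):

```shell
# a made-up Description column, header included
printf 'Description\nSOLICITATION\nSOLICITATION\nOTHER\n' > desc.csv

# skip the header, then count identical lines, most frequent first
tail -n +2 desc.csv | sort | uniq -c | sort -rn
```

The sort before uniq -c matters: uniq only collapses adjacent duplicate lines.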


This has been a quick example of how the various csvkit utilities can be used to take a large csv file, extract specific observations, and generate summary statistics by executing a SQL query on that data. While this same analysis could have been performed in R or Python in a more efficient manner, it’s important for analysts to remember that the command line offers a variety of important utilities that can simplify their daily job responsibilities.


The Command Line is Your Friend: A Quick Introduction

The command line can be a scary place for people who are accustomed to point-and-click mechanisms for executing tasks on their computer. While the idea of interacting with files and software via text may seem terrifying, the terminal is a powerful tool that can boost productivity and give users greater control of their system. For data analysts, the command line provides tools for a wide array of tasks, including file manipulation and exploratory data analysis. Getting accustomed to these capabilities will make users more competent in their interactions with the computer.
Working Directory:
The working directory is the directory in which the user is currently working. It is usually expressed as a hierarchical path and can be printed using the pwd (‘print working directory’) command. The working directory can be changed from the command line using the cd (‘change directory’) command. Once a working directory has been set, use ls to list the contents of the current directory.
$ pwd
/Users/abraham.mathew
$ cd /Users/abraham.mathew/Movies/
$ ls
DDC - Model Visits.xlsx                    ILM Leads.xlsx
DDC - Page Type Views.xlsx               OBI Velocity-Day Supply.xlsx
...
Files and Folders:
The command line offers numerous tools for interacting with files and folders. For example, the mkdir (‘make directory’) command can be used to create an empty directory. Commands like mv and cp can then be used to move or rename files and to copy files to a new location. One can use the rm command to delete a file and rmdir to delete an empty directory.
$ mkdir Test_Dir_One
$ mkdir Test_Dir_Two
$ cp history.txt history_new.txt
cp: history.txt: No such file or directory
$ history > history.txt
$ cp history.txt history_new.txt
$ ls
...
$ cp history.txt /Users/abraham.mathew/movies/history_new_two.txt
$ pwd
/Users/abraham.mathew/Movies
$ rm history_new.txt
$ rmdir Test_Dir_Two
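The mv command mentioned above does not appear in the transcript, so here is a small sketch of its two common uses; the scratch file and directory names are made up for illustration:

```shell
# create a scratch directory and file to play with
mkdir -p Test_Dir_Three
printf 'some notes\n' > notes.txt

# mv renames a file in place...
mv notes.txt notes_old.txt

# ...and also moves files into a directory
mv notes_old.txt Test_Dir_Three/
ls Test_Dir_Three
```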
Interacting with Files:
The head and tail commands can be used to print the beginning and ending contents of a text or csv file. Furthermore, use the wc (‘word count’) command to find the number of lines, words, and characters in a file. The grep command can be used to find lines matching a regular expression within a file. To combine files side by side, one can use the paste command. The cat command, which is typically used to print out the contents of a file, can also be used to concatenate a number of files together.
$ head -n 5 Iris_Data.csv
,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa
4,4.6,3.1,1.5,0.2,setosa
$ head -n 5 Iris_Data.csv > Iris_Subset_One.txt
$ tail -n 5 Iris_Data.csv > Iris_Subset_two.txt
$ wc Iris_Data.csv
     151     151    4209 Iris_Data.csv
$ wc -l Iris_Data.csv
     151 Iris_Data.csv
$ grep "setosa" Iris_Data.csv | wc -l
      50
$ ls -l | grep "Iris"
-rw-r--r--   1 abraham.mathew  1892468438     4209 Nov  3 15:23 Iris_Data.csv
-rw-r--r--   1 abraham.mathew  1892468438      784 Nov  3 15:48 Iris_Subset.csv
-rw-r--r--   1 abraham.mathew  1892468438      157 Nov  3 21:37 Iris_Subset_One.txt
-rw-r--r--   1 abraham.mathew  1892468438      140 Nov  3 21:37 Iris_Subset_two.txt
$ paste Iris_Subset_One.txt Iris_Subset_Two.txt
,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species     146,6.7,3,5.2,2.3,virginica
1,5.1,3.5,1.4,0.2,setosa     147,6.3,2.5,5,1.9,virginica
2,4.9,3,1.4,0.2,setosa     148,6.5,3,5.2,2,virginica
3,4.7,3.2,1.3,0.2,setosa     149,6.2,3.4,5.4,2.3,virginica
4,4.6,3.1,1.5,0.2,setosa     150,5.9,3,5.1,1.8,virginica
$ cat Iris_Subset_One.txt Iris_Subset_Two.txt > Iris_New.txt
Other Tools:
In many cases, the user will need to execute multiple commands on one line. This can be done with the semicolon, which acts as a separator between Unix commands. Another important tool is the pipe operator, which takes the output of one command and feeds it to another command. For example, a user looking for all files within a directory that contain a particular string could pipe together the ls and grep commands to get the desired output. Redirection is performed with the greater-than sign, which sends the output of a command to a new file.
$ head -n 3 Iris_New.txt ; wc Iris_New.txt
,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3,1.4,0.2,setosa
      10      10     297 Iris_New.txt
$ ls -l | grep "Iris"
-rw-r--r--   1 abraham.mathew  1892468438     4209 Nov  3 15:23 Iris_Data.csv
-rw-r--r--   1 abraham.mathew  1892468438      297 Nov  3 21:45 Iris_New.txt
-rw-r--r--   1 abraham.mathew  1892468438      784 Nov  3 15:48 Iris_Subset.csv
-rw-r--r--   1 abraham.mathew  1892468438      157 Nov  3 21:37 Iris_Subset_One.txt
-rw-r--r--   1 abraham.mathew  1892468438      140 Nov  3 21:37 Iris_Subset_two.txt
$ head -n 10 Iris_Data.csv > Iris_Redirection.txt
$ head -n 10 Iris_Redirection.txt
,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa
4,4.6,3.1,1.5,0.2,setosa
5,5,3.6,1.4,0.2,setosa
6,5.4,3.9,1.7,0.4,setosa
7,4.6,3.4,1.4,0.3,setosa
8,5,3.4,1.5,0.2,setosa
9,4.4,2.9,1.4,0.2,setosa
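One redirection detail worth adding: a single > overwrites (or creates) its target, while >> appends to it. A quick sketch, using a throwaway file name:

```shell
printf 'one\n' > log.txt      # > creates or overwrites the file
printf 'two\n' >> log.txt     # >> appends a second line instead
printf 'three\n' > log.txt    # overwritten again: only one line remains
wc -l < log.txt
```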
There you have it: the basics for getting acquainted with the command line. While there are many other important command-line tools, including curl, sed, awk, and wget, the procedures mentioned in this post will provide users with the essential building blocks. There is a steep learning curve, but the long-term benefits of using the command line are well worth the short-term costs.
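As a small taste of two of the tools named above, here is a hedged sketch of sed and awk on a throwaway two-column file (the file name and contents are invented for illustration):

```shell
printf 'id,species\n1,setosa\n2,virginica\n' > mini.csv

# sed edits a stream: rewrite the second header field on line 1 only
sed '1s/species/flower/' mini.csv

# awk splits each line on a delimiter and selects fields: print column 2
awk -F, '{ print $2 }' mini.csv
```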

Graphviz by Example: Part One

Introduction
Graphviz is an open-source software package developed by AT&T Labs for generating directed graphs and flowcharts. Outputs are created using DOT, a plain-text graph description language that is part of the Graphviz package. Graphviz is a powerful application that allows users to create appealing flowcharts without getting hung up on the layout or positioning of the nodes. Graphs in Graphviz are composed of nodes and edges (lines), and can be directed (lines with arrows) or undirected (lines without arrows). To create flowcharts with dot, the user must specify the type of graph, the name of the graph, and then define its contents within braces.

Basics
To generate visualizations using Graphviz, a user must create a DOT source file and save it with a .dot extension. In the terminal, set the current working directory to the location of that file and launch dot by running ‘dot -Tpng -O file_name.dot’, which renders the graph as a PNG next to the source file.

Example One:

graph name1 {
     Node1 [label="Node_1"];
}

blog_one.dot
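Assuming Graphviz is installed, Example One can be saved and rendered entirely from the shell; the sketch below guards the render step so it is skipped on machines without dot on the PATH:

```shell
# save the minimal graph from Example One as blog_one.dot
cat > blog_one.dot <<'EOF'
graph name1 {
     Node1 [label="Node_1"];
}
EOF

# -Tpng selects the output format; -O derives the output name
# automatically, writing blog_one.dot.png next to the source
if command -v dot >/dev/null; then
    dot -Tpng -O blog_one.dot
fi
```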

Example Two:

digraph name1 { 
      Node1 [label="Node_1"];
      Node2 [label="Node_2"];
      Node1 -> Node2;
}

blog_two.dot

Example Three:

digraph name1 { 
     Node1 [label="Node_1"];
     Node2 [label="Node_2"];
     Node3 [label="Node_3"];
     Node1 -> Node2 -> Node3;
     Node1 -> Node3;
     Node2 -> Node1;
}

blog_three.dot

Example Four:

digraph name1 { 
   Node1 [label="Node_1"];
   Node2 [label="Node_2", shape=diamond];
   Node3 [label="Node_3", fontcolor=Navy, shape=box, style=filled,
                color="#d3edea"]; 
   Node4 [label="Node_4"];
   Node1 -> Node2 -> Node3;
   Node1 -> Node3 [color=red, label="Best Choice", fontsize=10];
   Node2 -> Node1;
   Node2 -> Node4;
}

blog_four.dot