The Command Line is Your Friend: A Quick Introduction

The command line can be a scary place for people who are accustomed to point-and-click mechanisms for executing tasks on their computer. While the idea of interacting with files and software via text may seem terrifying, the terminal is a powerful tool that can boost productivity and provide users with greater control of their system. For data analysts, the command line provides tools to perform a wide array of tasks, including file exploration and exploratory data analysis. Getting accustomed to these capabilities will enable users to become more competent in their interactions with the computer.
Working Directory:
The working directory is the folder in which commands are currently being run. It is usually expressed as a hierarchical path and can be found using the pwd (‘print working directory’) command. The working directory can be changed from the command line using the cd (‘change directory’) command. Once a working directory has been set, use ls to list its contents.
$ pwd
$ cd /Users/abraham.mathew/Movies/
$ ls
DDC - Model Visits.xlsx                    ILM Leads.xlsx
DDC - Page Type Views.xlsx               OBI Velocity-Day Supply.xlsx
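As a rough sketch, the same three commands can be tried in a throwaway directory so that nothing depends on a real folder. The directory created by mktemp and the two file names are placeholders for illustration, not paths from this post.

```shell
# Create a scratch directory (hypothetical; it stands in for any folder).
scratch=$(mktemp -d)
touch "$scratch/a.csv" "$scratch/b.csv"

cd "$scratch"        # change the working directory
here=$(pwd -P)       # capture it as an absolute path (-P resolves symlinks)
listing=$(ls)        # capture the directory contents, one name per line

echo "$here"
echo "$listing"
```

Capturing the output in variables is optional; running pwd and ls on their own prints the same information to the screen.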
Files and Folders:
The command line offers numerous tools for interacting with files and folders. For example, the mkdir (‘make directory’) command can be used to create an empty directory. Commands like mv and cp can then be used to rename or move a file, or to copy it to a new location. One can use the rm command to delete a file and rmdir to delete an empty directory.
$ mkdir Test_Dir_One
$ mkdir Test_Dir_Two
$ cp history.txt history_new.txt
cp: history.txt: No such file or directory
$ history > history.txt
$ cp history.txt history_new.txt
$ ls
$ cp history.txt /Users/abraham.mathew/movies/history_new_two.txt
$ pwd
$ rm history_new.txt
$ rmdir Test_Dir_Two
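The transcript above does not show mv, or what rmdir does with a directory that still has files in it. Here is a minimal sketch of both, assuming a scratch directory and made-up file names (reports, archive, notes.txt) that are not from this post.

```shell
# Work in a scratch directory so nothing real is touched (all names are hypothetical).
scratch=$(mktemp -d)
cd "$scratch"

mkdir reports archive              # create two empty directories
echo "draft" > notes.txt           # create a small file to work with

cp notes.txt reports/notes.txt     # copy: the original stays in place
mv notes.txt notes_old.txt         # rename: the original name disappears
mv notes_old.txt archive/          # move: same command, with a directory as the target

rm reports/notes.txt               # delete a file
rmdir reports                      # succeeds because the directory is now empty

# rmdir refuses to delete a directory that still has contents.
if rmdir archive 2>/dev/null; then
  archive_removed=yes
else
  archive_removed=no
fi

remaining=$(ls)
echo "archive removed: $archive_removed"
echo "remaining: $remaining"
```

Because archive still holds notes_old.txt, rmdir leaves it alone; the file would have to be deleted with rm first.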
Interacting with Files:
The head and tail commands can be used to print the beginning and ending contents of a text or csv file. Furthermore, use the wc (‘word count’) command to find the number of lines, words, and bytes in a file (bytes equal characters for plain ASCII text). The grep command can be used to find lines within a file that match a regular expression. To combine files side by side, one can use the paste command. The cat command, which is typically used to print out the contents of a file, can also be used to concatenate a number of files together.
$ head -n 5 Iris_Data.csv
$ head -n 5 Iris_Data.csv > Iris_Subset_One.txt
$ tail -n 5 Iris_Data.csv > Iris_Subset_two.txt
$ wc Iris_Data.csv
     151     151    4209 Iris_Data.csv
$ wc -l Iris_Data.csv
     151 Iris_Data.csv
$ grep "setosa" Iris_Data.csv | wc -l
$ ls -l | grep "Iris"
-rw-r--r--   1 abraham.mathew  1892468438     4209 Nov  3 15:23 Iris_Data.csv
-rw-r--r--   1 abraham.mathew  1892468438      784 Nov  3 15:48 Iris_Subset.csv
-rw-r--r--   1 abraham.mathew  1892468438      157 Nov  3 21:37 Iris_Subset_One.txt
-rw-r--r--   1 abraham.mathew  1892468438      140 Nov  3 21:37 Iris_Subset_two.txt
$ paste Iris_Subset_One.txt Iris_Subset_two.txt
,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species     146,6.7,3,5.2,2.3,virginica
1,5.1,3.5,1.4,0.2,setosa     147,6.3,2.5,5,1.9,virginica
2,4.9,3,1.4,0.2,setosa     148,6.5,3,5.2,2,virginica
3,4.7,3.2,1.3,0.2,setosa     149,6.2,3.4,5.4,2.3,virginica
4,4.6,3.1,1.5,0.2,setosa     150,5.9,3,5.1,1.8,virginica
$ cat Iris_Subset_One.txt Iris_Subset_two.txt > Iris_New.txt
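For readers without the iris file on hand, the commands above can be sketched against a tiny made-up CSV. The file sample.csv and its contents are assumptions for illustration only. The sketch also uses grep -c, a standard option that counts matching lines directly instead of piping into wc -l.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Hypothetical six-line CSV: one header row plus five records.
printf '%s\n' 'id,species' '1,setosa' '2,setosa' '3,versicolor' '4,virginica' '5,virginica' > sample.csv

head -n 2 sample.csv > first.txt    # header plus the first record
tail -n 2 sample.csv > last.txt     # the last two records

# Reading from stdin makes wc omit the file name; tr strips padding some systems add.
line_count=$(wc -l < sample.csv | tr -d ' ')
# -c makes grep print the number of matching lines rather than the lines themselves.
setosa_count=$(grep -c "setosa" sample.csv)

paste first.txt last.txt > side_by_side.txt   # join line by line, separated by a tab
cat first.txt last.txt > stacked.txt          # put one file after the other

echo "lines: $line_count, setosa rows: $setosa_count"
cat side_by_side.txt
```

Note the difference between the last two commands: paste produces two lines with both files on each line, while cat produces four lines.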
Other Tools:
In many cases, the user will need to run multiple commands in one line. This can be done with the semicolon, which acts as a separator between Unix commands. Another important tool is the pipe operator (|), which takes the output of one command and passes it as input to another command. For example, if a user were looking for all files within a directory whose names contained a particular string, they could pipe together the ls and grep commands to get the desired output. Redirection is performed using the greater-than sign (>), which sends the output of a command to a file, creating the file if needed and overwriting it if it already exists.
$ head -n 3 Iris_New.txt ; wc Iris_New.txt
      10      10     297 Iris_New.txt
$ ls -l | grep "Iris"
-rw-r--r--   1 abraham.mathew  1892468438     4209 Nov  3 15:23 Iris_Data.csv
-rw-r--r--   1 abraham.mathew  1892468438      297 Nov  3 21:45 Iris_New.txt
-rw-r--r--   1 abraham.mathew  1892468438      784 Nov  3 15:48 Iris_Subset.csv
-rw-r--r--   1 abraham.mathew  1892468438      157 Nov  3 21:37 Iris_Subset_One.txt
-rw-r--r--   1 abraham.mathew  1892468438      140 Nov  3 21:37 Iris_Subset_two.txt
$ head -n 10 Iris_Data.csv > Iris_Redirection.txt
$ head -n 10 Iris_Redirection.txt
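The three operators can be seen side by side in a small sketch. The scratch directory and the file names (iris_a.csv, iris_b.csv, other.txt, listing.txt) are hypothetical and only there to make the example self-contained.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch iris_a.csv iris_b.csv other.txt     # hypothetical file names

# Semicolon: run two commands one after the other on a single line.
both=$(echo "first"; echo "second")

# Pipe: feed the output of ls into grep, keeping only names that contain "iris".
matches=$(ls | grep "iris")

# Redirection: > writes a command's output to a file, replacing any existing contents.
ls > listing.txt
echo "overwritten" > listing.txt
contents=$(cat listing.txt)

echo "$both"
echo "$matches"
echo "$contents"
```

The second redirection replaces what the first one wrote, so listing.txt ends up holding a single line.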
There you have it: the basics for getting acquainted with the command line. While there are many other important command line tools, including curl, sed, awk, and wget, the commands mentioned in this post will provide users with the essential building blocks. There is a steep learning curve, but the long-term benefits of using the command line are well worth the short-term costs.

Examining Email Addresses in R

I don’t normally work with personally identifiable information such as email addresses. However, the recent data dump from Ashley Madison got me thinking about how I’d examine a data set composed of email addresses. What are the characteristics of an email address that I’d look to extract? How would I perform that task in R? Here’s some quick R code to extract the host, address type, and other information from a set of email strings. From there, we can summarize the data according to any number of email characteristics. I’d love to dive into the Ashley Madison email dump to find which companies and industries had the highest ratio of executives on that site, but that’s a little beyond my technical skills given the sheer size of the data set. Hopefully someone will complete that analysis soon enough.

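library(stringr)  # provides str_c() and str_detect()
df = data.frame(email = c("","","","","",
# Local part: everything before the "@"
df$one <- sub("@.*$", "", df$email )
# Host: everything after the "@"
df$two <- sub('.*@', '', df$email )
# Top-level domain: everything after the last "."
df$three <- sub('.*\\.', '', df$email )
# Flag addresses that contain a digit
num <- c(0:9); num
num_match <- str_c(num, collapse = "|"); num_match
df$num_yn <- as.numeric(str_detect(df$email, num_match))
# Flag addresses that contain an underscore
und <- c("_"); und
und_match <- str_c(und, collapse = "|"); und_match
df$und_yn <- as.numeric(str_detect(df$email, und_match))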
> df
             email    one        two three num_yn und_yn
1    one   com      0      0
2 two132   com      1      0
3  three   com      0      0
4   four   com      0      0
5   five   net      0      0
6    ten   com      0      0
7 four99   com      1      0
8    two   gov      0      0
9  f_ive   com      0      1
10    six   gov      0      0
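For comparison, here’s a rough sketch of the same extraction with sed and grep at the command line. This is an alternative to the R code above, not a translation I’ve benchmarked, and the three addresses are made up on the reserved example.com and example.net domains rather than taken from any real data set.

```shell
# Hypothetical addresses on reserved example domains; not data from this post.
emails='one@example.com
two132@example.com
f_ive@example.net'

# Local part: drop the "@" and everything after it.
local_part=$(printf '%s\n' "$emails" | sed 's/@.*$//')
# Host: drop everything up to and including the "@".
host=$(printf '%s\n' "$emails" | sed 's/.*@//')
# Top-level domain: drop everything up to and including the last ".".
tld=$(printf '%s\n' "$emails" | sed 's/.*\.//')
# Count the addresses that contain a digit, and those that contain an underscore.
with_digit=$(printf '%s\n' "$emails" | grep -c '[0-9]')
with_underscore=$(printf '%s\n' "$emails" | grep -c '_')

echo "$local_part"
echo "$host"
echo "$tld"
echo "digits: $with_digit, underscores: $with_underscore"
```

The sed patterns are the same regular expressions passed to sub() in the R version, so the two approaches should agree on well-formed addresses.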

What about you? If you regularly work with email addresses and have some useful insights for the rest of us, please leave a comment below. How do you usually attack a data set where it’s just a large number of email addresses?

Homework during the hiring process…no thanks!

Not too long ago, I was on the job market looking for work as an applied statistician or data scientist within the online marketing industry. One thing I’ve come to expect with almost every company is some sort of homework assignment or challenge where a spreadsheet is presented along with some guidelines on what type of analysis they would like. Sometimes it’s very open ended and at other times, there are specific tasks and questions which are put forth. Initially, I saw these assignments as something fun where I could showcase my skill set. However, since last month, I’ve come to see them as a nuisance which can’t possibly be a good indicator of whether someone is ‘worth hiring’ or not. I get it, companies often get inundated with resumes and they need effective processes to sift through them. And I get the value of getting some document which outlines how an applicant thought about a problem and generated some valuable insights.

With all that said, do we seriously think that homework assignments and challenges during the hiring process are the most effective way of getting the “best candidate” (whatever that means)? I don’t have any data to suggest either way, but am inclined to believe that companies and analytics hiring managers need to develop better ways of assessing the quality of candidates. If these assignments are really about assessing who is serious enough about a role to spend a few hours of their free time answering some ‘simple’ questions and putting together some basic lines of R or Python code, then so be it. But I think a better process can be put forth that allows companies to find the right candidate.

I’ve been part of the hiring process and I’ve also gone through months of looking for employment. Based on my experiences on both sides of the table, here’s my view of what is most effective when looking for analytics professionals, applied statisticians, or data scientists. Ultimately, my feeling is that the only way to assess whether a candidate is worth hiring is by testing prospective candidates in a more formal manner. The key is to have the applicant complete these tasks during the interview, as that would keep them from becoming a take-home homework assignment.

Part 1: Quantitative Skills
To assess a candidate’s quantitative proficiency, here are some techniques that work well based on my previous experience.
a. Put together a document with an existing business problem and some of the analysis that’s been put together to answer it. Ask the applicant for suggestions on the limitations of the current approach and what they’d do if that project was handed to them.
b. Put together a basic statistics test which inquires about simple probability theory and inferential statistical principles. Ask the candidate to answer those questions in an informal setting to ascertain what they know and how they work through problems when they don’t know the answer.
c. Ask the applicant to read a statistically demanding document and then request a summary plus feedback from the candidate. This should also tell us something about what the candidate knows about statistics and whether they can summarize the relevant parts in a satisfactory manner.

Part 2: Technical Skills
To assess a candidate’s technical proficiency, here are some techniques that work well based on my previous experience.
a. Show an applicant some imperfect code that is unnecessarily long or could be improved. Ask them to look it over and provide their suggestions on how they’d do things differently.
b. Put together several small code snippets in various programming languages that the candidate may or may not know. Ask them to go through the code, identify what is happening at each step, and explain the final result.
c. Have the applicant share an interesting work or non-work-related project that they did recently. They can talk about specific aspects of their code and consider if there is anything they’d do differently now.

The possibilities are endless, but there have to be better ways to assess the quality of candidates for analytics roles than the ‘homework assignment.’ In any case, I’ll be refusing to do any more assignments as a part of the hiring process.


PS: In time, I finally opened up to the notion of ‘technical assignments.’ Don’t get me wrong, I have never used them and never will when I am in a hiring capacity, but I finally accepted that I’d have to do a few homework assignments here and there.

PPS: Oddly enough, I’ve received a few rough/rude emails from hiring managers regarding this post. If you don’t like my perspective, feel free to write your own blog post about it. Personally, I don’t see the need for someone to send a mouthy email to me just because they have a different perspective.



Wikipedia and the Fashion Weeks: A Look at Usage Patterns

Unlike many of the entries on Wikipedia relating to statistics or computer science, fashion related topics have not been thoroughly documented. For example, the entries on Martin Margiela and Rei Kawakubo pale in comparison to the breadth of content on Thomas Bayes, structural equation modeling, or R. In light of this, I wanted to investigate whether people were using particular fashion related entries on Wikipedia and see how usage patterns had evolved over time. My focus was on the four major fashion weeks given that they are central events within the industry and are followed by tens of millions of people. This analysis is ultimately exploratory and we’re unable to make any inferences about whether an adequate number of people are using the fashion week entries on Wikipedia or if low usage is the result of them not being thoroughly documented. At the end of the day, millions of people use Wikipedia and there’s no doubt that the fashion community needs to be more proactive in ensuring that the fashion related entries on the site are covered in a more cohesive manner.


Unsurprisingly, there are two spikes each year in and around the months where the Fall/Winter and Spring/Summer collections are shown. However, the spikes since 2013 have been less pronounced and there has been a gradual downward trend in visits. This is surprising given the increasing interest in fashion over the past five years. The same downward trend also exists in Google search volume on fashion related queries over the past few years.

The line graphs showing visits to the Wikipedia pages for the four major fashion weeks are presented below. They each have their own characteristics and, because these trend charts are exploratory, there are really no major conclusions to be gleaned from them.

[Figures: Milan Fashion Week, Paris Fashion Week, New York Fashion Week, London Fashion Week]

Ultimately, there’s no doubt that high fashion is more popular today than ever before. This is evidenced by sales patterns, amount of media exposure, and the explosion in fashion blogging. This post sought to identify whether people were using Wikipedia to inform themselves about the major fashion weeks and how that trend has changed over time. While those patterns have seen slight increases or remained stagnant, that does not minimize the emergence of high fashion into American popular culture.