Statistics Refresher – Part One


Let’s face it, a good statistics refresher is always worthwhile. There are times we all forget basic concepts and calculations. Therefore, I put together a document that could act as a statistics refresher and thought that I’d share it with the world. This is part one of a two part document that is still being completed. This refresher is based on Principles of Statistics by Balmer and Statistics in Plain English by Brightman.

The Two Concepts of Probability

Statistical Probability

  • Statistical probability pertains to the relative frequency with which an event occurs in the long run.
  • Example:
    Let’s say we flip a coin twice. What is the probability of getting two heads?
    If we flip a coin twice, there are four possible outcomes, [(H,H), (H,T), (T,H), (T,T)] .
    Therefore, the probability of flipping two heads is \frac{n(H,H)}{N} = \frac{1}{4} , which is the same result as \frac{1}{2}*\frac{1}{2} .
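    Probability as a long-run relative frequency can be checked with a quick simulation. The following is a minimal sketch in base R; the 100,000 repetitions are an arbitrary choice.

# Estimate the probability of two heads as a long-run relative frequency
set.seed(42)
n <- 100000                                       # number of repetitions of the two-flip experiment
first <- sample(c("H", "T"), n, replace = TRUE)   # first flip
second <- sample(c("H", "T"), n, replace = TRUE)  # second flip
mean(first == "H" & second == "H")                # should be close to 1/4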

Inductive Probability

  • Inductive probability pertains to the degree of belief which is reasonable to place on a proposition given evidence.
  • Example:
    I’m 95\% certain that the answer to 1 + 1 is between 1.5 and 2.5 .

The Two Laws of Probability

Law of Addition

  • If A and B are mutually exclusive events, the probability that either A  or B will occur is equal to the sum of their separate probabilities.

\displaystyle P(A \space or \space B) = P(A) + P(B)

Law of Multiplication

  • If A and B are two events, the probability that both A and B will occur is equal to the probability that A will occur multiplied by the conditional probability that B will occur given that A has occurred.

P(A \space and \space B) = P(A) * P(B|A)

Conditional Probability

  • The probability of B given A , or P(B|A) , is the probability that B will occur if we consider only those occasions on which A also occurs. This is defined as \frac{n(A \space and \space B)}{n(A)} . A quick sketch in R follows.
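As an illustration of the multiplication law and the definition of conditional probability, here is a minimal sketch in R using two simulated dice; the events A and B are chosen arbitrarily for the example.

# Check that P(A and B) = P(A) * P(B|A) using two simulated dice
set.seed(1)
n <- 100000
die1 <- sample(1:6, n, replace = TRUE)
die2 <- sample(1:6, n, replace = TRUE)
A <- die1 == 6                      # event A: the first die shows a six
B <- (die1 + die2) >= 10            # event B: the total is at least ten
mean(A & B)                         # P(A and B)
mean(A) * (sum(A & B) / sum(A))     # P(A) times P(B|A), where P(B|A) = n(A and B) / n(A)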

Random Variables and Probability Distributions

Discrete Variables

  • Variables which arise from counting and can only take integral values (0, 1, 2, \ldots) .
  • A frequency distribution represents the number of occurrences of each possible value of a variable. This can be presented in a table or graphically as a probability distribution.
  • Associated with any discrete random variable, X , is a corresponding probability function which tells us the probability with which X takes any value. A particular value that X can take is denoted by x , and the probability that X takes the value x is given by the probability function, P(x) .
  • The cumulative probability function specifies the probability that X is less than or equal to some particular value, x . This is denoted by F(x) . The cumulative probability function can be calculated by summing the probabilities of all values less than or equal to x .

F(x) = Prob[X \leq x]

F(x) = P(0) + P(1) + \ldots + P(x) = \sum_{u \leq x} P(u)
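To make the probability function and its cumulative counterpart concrete, here is a minimal sketch in R for a single roll of a fair die, with both functions built by hand.

# Probability function P(x) and cumulative probability function F(x) for a fair die
x <- 1:6
P_x <- rep(1/6, 6)           # P(x): each face is equally likely
F_x <- cumsum(P_x)           # F(x): running sum of P(u) for all u <= x
data.frame(x, P_x, F_x)
F_x[4]                       # Prob[X <= 4] = 4/6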

Continuous Variables

  • Variables which arise from measuring and can take any value within a given range.
  • Continuous variables are best graphically represented by a histogram, where the area of each rectangle represents the proportion of observations falling in that interval.
  • The probability density function, f(x) , refers to the smooth continuous curve used to describe the relative likelihood of a random variable taking on a given value. The area under f(x) between x_1 and x_2 gives the probability that the random variable will lie between x_1 and x_2 .
  • A continuous probability distribution can also be represented by its cumulative probability function, F(x) , which specifies the probability that X is less than or equal to x .
  • A continuous random variable is said to be uniformly distributed between 0 and 1 if it is equally likely to lie anywhere in this interval but cannot lie outside it.
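A minimal sketch of these ideas in R using the built-in uniform distribution functions: dunif gives the density f(x), punif gives the cumulative probability, and a histogram of simulated draws approximates the density.

# Density and cumulative probability for a uniform(0, 1) random variable
dunif(0.5, min = 0, max = 1)       # f(x) is flat at 1 inside the interval [0, 1]
punif(0.75, min = 0, max = 1)      # Prob[X <= 0.75] = 0.75
punif(0.75) - punif(0.25)          # probability that X lies between 0.25 and 0.75
hist(runif(10000), breaks = 20, freq = FALSE,
     main = "Uniform(0, 1) sample", xlab = "x")   # the histogram approximates f(x)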

Multivariate Distributions

  • The joint frequency distribution of two random variables is called a bivariate distribution. P(x,y) denotes the probability that simultaneously X will be x and Y will be y . This is expressed through a bivariate distribution table.

P(x,y) = Prob[X = x \space and \space Y = y]

  • In a bivariate distribution table, the right hand margin sums the probabilities in different rows. It expresses the overall probability distribution of X , regardless of the value of Y .

p(x) = Prob[X = x] = \sum_{y} p(x,y)

  • In a bivariate distribution table, the bottom margin sums the probabilities in different columns. It expresses the overall probability distribution of Y , regardless of the value of X (see the sketch in R below).

p(y) = Prob[Y = y] = \sum_{x} p(x,y)
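A bivariate distribution table and its margins can be built directly in R; this is a minimal sketch using two made-up discrete variables.

# Build a bivariate probability table and its marginal distributions
set.seed(7)
x <- sample(0:2, 1000, replace = TRUE)                        # made-up discrete variable X
y <- sample(0:1, 1000, replace = TRUE, prob = c(0.7, 0.3))    # made-up discrete variable Y
joint <- table(x, y) / 1000       # p(x, y): joint relative frequencies
joint
rowSums(joint)                    # right hand margin: p(x), summing over y
colSums(joint)                    # bottom margin: p(y), summing over x
addmargins(joint)                 # the full bivariate distribution table with its margins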

Properties of Distributions

Measures of Central Tendency

  • The mean is measured by taking the sum divided by the number of observations.

\bar{x} = \frac{x_1 + x_2 + \ldots + x_n}{n} = \sum_{i=1}^n \frac{x_i}{n}

  • The median is the middle observation in a series of numbers. If the number of observations is even, the median is the average of the two middle observations, which is their sum divided by two.
  • The mode refers to the most frequent observation.
  • The main question of interest is whether the sample mean, median, or mode provides the most accurate estimate of central tendency within the population.

Measures of Dispersion

  • The standard deviation of a set of observations is the square root of the average of the squared deviations from the mean. The average of the squared deviations from the mean is called the variance. A quick sketch of these measures in R follows.
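These summary measures map directly onto base R functions. The following is a minimal sketch on a small made-up vector; note that R has no built-in mode function, so the mode is read off a frequency table, and that var and sd use the n - 1 sample denominator rather than n.

# Measures of central tendency and dispersion on a small made-up sample
obs <- c(2, 4, 4, 5, 7, 9, 9, 9, 12)
mean(obs)                          # arithmetic mean
median(obs)                        # middle observation
names(which.max(table(obs)))       # mode: the most frequent observation
var(obs)                           # variance (average squared deviation, with an n - 1 denominator)
sd(obs)                            # standard deviation: the square root of the variance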

The Shape of Distributions

  • Unimodal distributions have only one peak while multimodal distributions have several peaks.
  • A distribution that is skewed to the right contains a few large values, which results in a long tail towards the right hand side of the chart.
  • A distribution that is skewed to the left contains a few small values, which results in a long tail towards the left hand side of the chart.
  • The kurtosis of a distribution refers to its degree of peakedness.
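Skewness and kurtosis are not in base R, but the moments package (one of several options) provides them. A minimal sketch on simulated data:

library(moments)              # skewness() and kurtosis() come from this package
set.seed(3)
right_skewed <- rexp(10000)   # a few large values create a long tail to the right
skewness(right_skewed)        # positive values indicate right skew
kurtosis(right_skewed)        # larger values indicate a more peaked, heavy-tailed shape
skewness(rnorm(10000))        # roughly zero for a symmetric distribution
kurtosis(rnorm(10000))        # roughly 3 for a normal distribution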

The Binomial, Poisson, and Exponential Distributions

Binomial Distribution

  • Think of a repeated process with two possible outcomes, failure (F ) and success (S ). After repeating the experiment n times, we will have a sequence of outcomes that includes both failures and successes, such as SFFFSF . The primary metric of interest is the total number of successes.
  • What is the probability of obtaining x  successes and n-x failures in n  repetitions of the experiment?
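The answer is given by the binomial probability function, P(x) = \binom{n}{x} p^x (1-p)^{n-x} , where p is the probability of success on a single trial. Here is a minimal sketch in R, with n = 10 and p = 0.3 chosen purely for illustration.

# Probability of x successes in n trials of a binomial experiment
n <- 10; p <- 0.3; x <- 4                  # illustrative values
choose(n, x) * p^x * (1 - p)^(n - x)       # the binomial probability function by hand
dbinom(x, size = n, prob = p)              # the same value from the built-in function
sum(dbinom(0:n, size = n, prob = p))       # the probabilities over all possible x sum to one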

Poisson Distribution

  • The Poisson distribution is the limiting form of the binomial distribution when there are a large number of trials but only a small probability of success at each of them.
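This limiting relationship is easy to check numerically. Here is a minimal sketch comparing binomial probabilities for a large n and small p against a Poisson distribution with the same mean, \lambda = np ; the particular n and p are arbitrary.

# The Poisson as the limiting form of the binomial (large n, small p, lambda = n * p)
n <- 1000; p <- 0.003
lambda <- n * p
round(dbinom(0:5, size = n, prob = p), 5)  # binomial probabilities for 0 to 5 successes
round(dpois(0:5, lambda = lambda), 5)      # Poisson probabilities are nearly identical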

Exponential Distribution

  • A continuous, positive random variable is said to follow an exponential distribution if its probability density function decreases as the values of x go from 0 to \infty . The density is at its highest near x = 0 and declines steadily from there.
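A minimal sketch of the exponential density in base R; the rate parameter of 1 is chosen only for illustration.

# Exponential probability density function with an arbitrary rate of 1
curve(dexp(x, rate = 1), from = 0, to = 5,
      ylab = "f(x)", main = "Exponential density")   # highest at zero, declining thereafter
pexp(2, rate = 1)                                    # Prob[X <= 2]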

The Normal Distribution

Properties of the Normal Distribution

  • The real reason for the importance of the normal distribution lies in the central limit theorem, which states that the sum of a large number of independent random variables will be approximately normally distributed regardless of their individual distributions.
  • A normal distribution is defined by its mean, \mu , and standard deviation, \sigma . A change in the mean shifts the distribution along the x-axis. A change in the standard deviation flattens or compresses the distribution while leaving its centre in the same position. The total area under the curve is one, and the mean sits at the middle and divides the area into halves.
  • One standard deviation above and below the mean of a normal distribution will include roughly 68% of the observations for that variable. For two standard deviations, that value is roughly 95%, and for three standard deviations, it is roughly 99.7%.
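These proportions follow directly from the normal cumulative distribution function, which R exposes as pnorm. A minimal check:

# Proportion of a normal distribution within 1, 2, and 3 standard deviations of the mean
pnorm(1) - pnorm(-1)     # about 0.68
pnorm(2) - pnorm(-2)     # about 0.95
pnorm(3) - pnorm(-3)     # about 0.997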

Stay tuned for part two, which will be up next week and cover things like significance tests and elementary inferential statistics. Please leave comments or suggestions below. If you’re looking to hire a marketing scientist, please contact me at mathewanalytics@gmail.com

Logistic Regression in R – Part Two

My previous post covered the basics of logistic regression. We must now examine the model to understand how well it fits the data and generalizes to other observations. The evaluation process involves the assessment of three distinct areas – goodness of fit, tests of individual predictors, and validation of predicted values – in order to produce the most useful model. While the following content isn’t exhaustive, it should provide a compact ‘cheat sheet’ and guide for the modeling process.

Goodness of Fit: Likelihood Ratio Test
A logistic regression is said to provide a better fit to the data if it demonstrates an improvement over a model with fewer predictors. This is assessed by comparing the likelihood of the data under the full model against the likelihood of the data under the model with fewer predictors. The null hypothesis, H_0 , holds that the reduced model is true, so a p-value for the overall model fit statistic that is below 0.05 would compel us to reject H_0 .

mod_fit_one <- glm(Class ~ Age + ForeignWorker + Property.RealEstate + Housing.Own +
CreditHistory.Critical, data=training, family="binomial")
 
mod_fit_two <- glm(Class ~ Age + ForeignWorker, data=training, family="binomial")
 
library(lmtest)
lrtest(mod_fit_one, mod_fit_two)

 

Goodness of Fit: Pseudo R^2
With linear regression, the R^2 statistic tells us the proportion of variance in the dependent variable that is explained by the predictors. While no exact equivalent exists for logistic regression, there are a number of pseudo-R^2 measures that can be useful. Most notable is McFadden’s R^2 , which is defined as 1 - \frac{ ln(L_M) }{ ln(L_0) } where ln(L_M) is the log likelihood value for the fitted model and ln(L_0) is the log likelihood for the null model with only an intercept as a predictor. The measure ranges from 0 to just under 1 , with values closer to zero indicating that the model has little predictive power.

library(pscl)
pR2(mod_fit_one) # look for 'McFadden'

Goodness of Fit: Hosmer-Lemeshow Test
The Hosmer-Lemeshow test examines whether the observed proportions of events are similar to the predicted probabilities of occurrence in subgroups of the dataset, using a Pearson chi-square statistic from the 2 x g table of observed and expected frequencies. Small statistics with large p-values indicate a good fit to the data, while large statistics with p-values below 0.05 indicate a poor fit. The null hypothesis holds that the model fits the data, and in the example below we would reject H_0 .

 

library(MKmisc)
# Class is a factor ("Bad"/"Good"), so convert it to 0/1 before running the tests
class_numeric <- ifelse(training$Class == "Good", 1, 0)
HLgof.test(fit = fitted(mod_fit_one), obs = class_numeric)
 
library(ResourceSelection)
hoslem.test(class_numeric, fitted(mod_fit_one), g=10)

Tests of Individual Predictors: Wald Test
A Wald test is used to evaluate the statistical significance of each coefficient in the model and is calculated by taking the ratio of the square of the regression coefficient to the square of the standard error of the coefficient. The idea is to test the hypothesis that the coefficient of an independent variable in the model is not significantly different from zero. If the test fails to reject the null hypothesis, this suggests that removing the variable from the model will not substantially harm the fit of that model.

library(survey)
 
regTermTest(mod_fit_one, "ForeignWorker")
regTermTest(mod_fit_one, "CreditHistory.Critical")

 

Tests of Individual Predictors: Variable Importance
To assess the relative importance of individual predictors in the model, we can also look at the absolute value of the t-statistic for each model parameter. This technique is utilized by the varImp function in the caret package for general and generalized linear models. The t-statistic for each model parameter helps us determine if it’s significantly different from zero.

mod_fit <- train(Class ~ Age + ForeignWorker + Property.RealEstate + Housing.Own +
CreditHistory.Critical, data=training, method="glm", family="binomial")
 
varImp(mod_fit)

Validation of Predicted Values: Classification Rate
With predictive models, the most critical metric is how well the model predicts the target variable on out of sample observations. The process involves using the model estimates to predict values on the testing set and then comparing the predicted values of the target variable against the observed values for each observation.

pred = predict(mod_fit, newdata=testing)
accuracy <- table(pred, testing[,"Class"])
sum(diag(accuracy))/sum(accuracy)
 
pred = predict(mod_fit, newdata=testing)
confusionMatrix(data=pred, testing$Class)

 

Validation of Predicted Values: ROC Curve
The receiver operating characteristic (ROC) curve is a measure of classifier performance. It’s based on the proportion of positive data points that are correctly classified as positive, TPR = \frac{TP}{n(Y=1)} , and the proportion of negative data points that are incorrectly classified as positive, FPR = \frac{FP}{n(Y=0)} . The curve shows the trade off between these two rates across classification thresholds. Ultimately, we’re concerned with the area under the ROC curve, or AUROC. That metric ranges from 0.50 to 1.00 , and values above 0.80 indicate that the model does a good job of discriminating between the two categories which comprise our target variable.

library(pROC)
# Compute AUC for predicting Class with the variable CreditHistory.Critical
f1 = roc(Class ~ CreditHistory.Critical, data=training)
plot(f1, col="red")
 
library(ROCR)
# Compute AUC for predicting Class with the model
prob <- predict(mod_fit_one, newdata=testing, type="response")
pred <- prediction(prob, testing$Class)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)
 
auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
auc

 

This post has provided a quick overview of how to evaluate logistic regression models in R. If you have any comments or corrections, please comment below.

 

Logistic Regression in R – Part One

Logistic regression is used to analyze the relationship between a dichotomous dependent variable and one or more categorical or continuous independent variables. It specifies the likelihood of the response variable as a function of various predictors. The model is expressed as log(odds) = \beta_0 + \beta_1*x_1 + ... + \beta_n*x_n , where \beta refers to the parameters and x_i represents the independent variables. The log(odds), also known as the logit, is defined as ln[\frac{p}{1-p}]. It expresses the natural logarithm of the ratio between the probability that the event will occur, p(Y=1), and the probability that it will not occur, p(Y=0).

The model’s estimates, \beta, express the relationship between the independent and dependent variables on a log-odds scale. A coefficient of 0.020 would indicate that a one unit increase in x_i is associated with an increase of 0.020 in the log-odds of Y occurring. To get a clearer understanding of the constant effect of a predictor on the likelihood that an outcome will occur, odds ratios can be calculated. These come from the exponentiated form of the model, odds(Y) = \exp(\beta_0 + \beta_1*x_1 + ... + \beta_n*x_n) . Alongside the odds ratios, it’s often worth calculating predicted probabilities of Y at specific values of key predictors. This is done through p = \frac{1}{1 + e^{-z}} , where z refers to the log(odds) regression equation.
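To make these conversions concrete, here is a minimal sketch in R using made-up coefficient values rather than estimates from any fitted model: exponentiating a coefficient gives the odds ratio, and the inverse logit turns the linear predictor into a predicted probability.

# Converting log-odds to odds ratios and predicted probabilities (hypothetical coefficients)
b0 <- -1.5; b1 <- 0.020          # made-up intercept and slope
exp(b1)                          # odds ratio for a one unit increase in x1
x1 <- 40
z <- b0 + b1 * x1                # the log(odds) regression equation
1 / (1 + exp(-z))                # predicted probability p
plogis(z)                        # the same value via the built-in inverse logit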

Using the GermanCredit dataset in the caret package, we will construct a logistic regression model to estimate the likelihood of a consumer being a good loan applicant based on a number of predictor variables.

library(caret)
data(GermanCredit)
 
# split the data into training and testing datasets 
Train <- createDataPartition(GermanCredit$Class, p=0.6, list=FALSE)
training <- GermanCredit[ Train, ]
testing <- GermanCredit[ -Train, ]
 
# use glm to train the model on the training dataset. make sure to set family to "binomial"
mod_fit_one <- glm(Class ~ Age + ForeignWorker + Property.RealEstate + Housing.Own +
CreditHistory.Critical, data=training, family="binomial")
 
summary(mod_fit_one) # estimates 
exp(coef(mod_fit_one)) # odds ratios
predict(mod_fit_one, newdata=testing, type="response") # predicted probabilities

Great, we’re all done, right? Not just yet. There are some critical questions that still remain. Is the model any good? How well does the model fit the data? Which predictors are most important? Are the predictions accurate? In the next post, I’ll provide an overview of how to evaluate logistic regression models in R.

The Command Line is Your Friend: A Quick Introduction

The command line can be a scary place for people who are traditionally accustomed to using point-and-click mechanisms for executing tasks on their computer. While the idea of interacting with files and software via text may seem like a terrifying concept, the terminal is a powerful tool that can boost productivity and provide users with greater control of their system. For data analysts, the command line provides tools to perform a wide array of tasks, including file manipulation and exploratory data analysis. Getting accustomed to these capabilities will enable users to become more competent in their interactions with the computer.
Working Directory:
The working directory refers to the folder in which commands are currently being executed. This is usually expressed as a hierarchical path and can be found using the pwd (‘print working directory’) command. The working directory can be changed from the command line using the cd (‘change directory’) command. Once a working directory has been set, use ls to list the contents of the current directory.
$ pwd
/Users/abraham.mathew
$ cd /Users/abraham.mathew/Movies/
$ ls
DDC - Model Visits.xlsx                    ILM Leads.xlsx
DDC - Page Type Views.xlsx               OBI Velocity-Day Supply.xlsx
...
Files and Folders:
The command line offers numerous tools for interacting with files and folders. For example, the mkdir (‘make directory’) command can be used to create an empty directory. Commands like mv and cp can then be used to rename files or copy the file into a new location. One can use the rm command to delete a file and rmdir to delete a directory.
$ mkdir Test_Dir_One
$ mkdir Test_Dir_Two
$ cp history.txt history_new.txt
cp: history.txt: No such file or directory
$ history > history.txt
$ cp history.txt history_new.txt
$ ls
...
$ cp history.txt /Users/abraham.mathew/movies/history_new_two.txt
$ pwd
/Users/abraham.mathew/Movies
$ rm history_new.txt
$ rmdir Test_Dir_Two
Interacting with Files:
The head and tail commands can be used to print the beginning and ending contents of a text or csv file. Furthermore, use the wc (‘word count’) command to find the number of lines, words, and characters in a file. The grep command can be used to find certain elements within a file using regular expressions. To combine files side by side, one can use the paste command. The cat command, which is typically used to print out the contents of a file, can also be used to concatenate a number of files together.
$ head -n 5 Iris_Data.csv
,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa
4,4.6,3.1,1.5,0.2,setosa
$ head -n 5 Iris_Data.csv > Iris_Subset_One.txt
$ tail -n 5 Iris_Data.csv > Iris_Subset_two.txt
$ wc Iris_Data.csv
     151     151    4209 Iris_Data.csv
$ wc -l Iris_Data.csv
     151 Iris_Data.csv
$ grep "setosa" Iris_Data.csv | wc -l
      50
$ ls -l | grep "Iris"
-rw-r--r--   1 abraham.mathew  1892468438     4209 Nov  3 15:23 Iris_Data.csv
-rw-r--r--   1 abraham.mathew  1892468438      784 Nov  3 15:48 Iris_Subset.csv
-rw-r--r--   1 abraham.mathew  1892468438      157 Nov  3 21:37 Iris_Subset_One.txt
-rw-r--r--   1 abraham.mathew  1892468438      140 Nov  3 21:37 Iris_Subset_two.txt
$ paste Iris_Subset_One.txt Iris_Subset_Two.txt
,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species     146,6.7,3,5.2,2.3,virginica
1,5.1,3.5,1.4,0.2,setosa     147,6.3,2.5,5,1.9,virginica
2,4.9,3,1.4,0.2,setosa     148,6.5,3,5.2,2,virginica
3,4.7,3.2,1.3,0.2,setosa     149,6.2,3.4,5.4,2.3,virginica
4,4.6,3.1,1.5,0.2,setosa     150,5.9,3,5.1,1.8,virginica
$ cat Iris_Subset_One.txt Iris_Subset_Two.txt > Iris_New.txt
Other Tools:
In many cases, the user will need to run multiple commands on one line. This can be done with the semicolon, which acts as a separator between Unix commands. Another important tool is the pipe operator, which takes the output of one command and uses it as the input to another command. For example, if a user were looking for all files within a directory that contained a particular string, they could pipe together the ls and grep commands in order to get the desired output. Redirection tasks are performed using the greater than sign, which is used to send the output of a command to a new file.
$ head -n 3 Iris_New.txt ; wc Iris_New.txt
,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3,1.4,0.2,setosa
      10      10     297 Iris_New.txt
$ ls -l | grep "Iris"
-rw-r--r--   1 abraham.mathew  1892468438     4209 Nov  3 15:23 Iris_Data.csv
-rw-r--r--   1 abraham.mathew  1892468438      297 Nov  3 21:45 Iris_New.txt
-rw-r--r--   1 abraham.mathew  1892468438      784 Nov  3 15:48 Iris_Subset.csv
-rw-r--r--   1 abraham.mathew  1892468438      157 Nov  3 21:37 Iris_Subset_One.txt
-rw-r--r--   1 abraham.mathew  1892468438      140 Nov  3 21:37 Iris_Subset_two.txt
$ head -n 10 Iris_Data.csv > Iris_Redirection.txt
$ head -n 10 Iris_Redirection.txt
,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa
4,4.6,3.1,1.5,0.2,setosa
5,5,3.6,1.4,0.2,setosa
6,5.4,3.9,1.7,0.4,setosa
7,4.6,3.4,1.4,0.3,setosa
8,5,3.4,1.5,0.2,setosa
9,4.4,2.9,1.4,0.2,setosa
There you have it, the basics for getting acquainted with the command line. While there are many other important command line tools, including curl, sed, awk, and wget, the procedures mentioned in this post will provide users with the essential building blocks. There is a steep learning curve, but the long term benefits of using the command line are well worth the short term costs.

Examining Email Addresses in R

I don’t normally work with personally identifiable information such as emails. However, the recent data dump from Ashley Madison got me thinking about how I’d examine a data set composed of email addresses. What are the characteristics of an email that I’d look to extract? How would I perform that task in R? Here’s some quick R code to extract the host, address type, and other information from a set of email strings. From there, we can obviously summarize the data according to a number of desired email characteristics. I’d love to dive into the Ashley Madison email dump to find which companies and industries had the highest ratio of executives on that site, but that’s a little beyond my technical skills given the sheer size of the data set. Hopefully someone will complete that analysis soon enough.

library(stringr)   # provides str_c and str_detect
 
df = data.frame(email = c("one@gkn.com","two132@wern.com","three@fu.com","four@huo.com","five@hoi.net",
                          "ten@hoinse.com","four99@huo.com","two@wern.gov","f_ive@hoi.com","six@ihoio.gov"),
                stringsAsFactors = FALSE)   # keep the emails as character strings
 
df$one <- sub("@.*$", "", df$email )      # local part: everything before the @
df$two <- sub('.*@', '', df$email )       # host: everything after the @
df$three <- sub('.*\\.', '', df$email )   # top-level domain: everything after the last dot
 
num <- c(0:9); num
num_match <- str_c(num, collapse = "|"); num_match
df$num_yn <- as.numeric(str_detect(df$email, num_match))   # 1 if the address contains a digit
und <- c("_"); und
und_match <- str_c(und, collapse = "|"); und_match
df$und_yn <- as.numeric(str_detect(df$email, und_match))   # 1 if the address contains an underscore
 
> df
             email    one        two three num_yn und_yn
1      one@gkn.com    one    gkn.com   com      0      0
2  two132@wern.com two132   wern.com   com      1      0
3     three@fu.com  three     fu.com   com      0      0
4     four@huo.com   four    huo.com   com      0      0
5     five@hoi.net   five    hoi.net   net      0      0
6   ten@hoinse.com    ten hoinse.com   com      0      0
7   four99@huo.com four99    huo.com   com      1      0
8     two@wern.gov    two   wern.gov   gov      0      0
9    f_ive@hoi.com  f_ive    hoi.com   com      0      1
10   six@ihoio.gov    six  ihoio.gov   gov      0      0

What about you? If you regularly work with email addresses and have some useful insights for the rest of us, please leave a comment below. How do you usually attack a data set where it’s just a large number of email addresses?

Homework during the hiring process…no thanks!

For the past four months, I’ve been on the job market looking for work as an applied statistician or data scientist within the online marketing industry. One thing I’ve come to expect with almost every company is some sort of homework assignment or challenge where a spreadsheet is presented along with some guidelines on what type of analysis they would like. Sometimes it’s very open ended and at other times, there are specific tasks and questions which are put forth. Initially, I saw these assignments as something fun where I could showcase my skill set. However, since last month, I’ve come to see them as a nuisance which can’t possibly be a good indicator of whether someone is ‘worth hiring’ or not. I get it, companies often get inundated with resumes and they need effective processes to sift through them. And I get the value of getting some document which outlines how an applicant thought about a problem and generated some valuable insights.

With all that said, do we seriously think that homework assignments and challenges during the hiring process are the most effective way of getting the “best candidate” (whatever that means)? I don’t have any data to suggest either way, but I am inclined to believe that companies and analytics hiring managers need to develop better ways of assessing the quality of candidates. If these assignments are really about assessing who is serious enough about a role to spend a few hours of their free time answering some ‘simple’ questions and putting together some basic lines of R or Python code, then so be it. But I think a better process can be put forth that allows companies to find the right candidate.

I’ve been part of the hiring process and I’ve also gone through months of looking for employment. Based on my experiences on both sides of the table, here’s my view of what is most effective when looking for analytics professionals, applied statisticians, or data scientists. Ultimately, my feeling is that the only way to assess whether a candidate is worth hiring is by testing prospective candidates in a more formal manner. The key is to have the applicant complete this work during the interview, which would keep the task from becoming a take-home homework assignment.

Part 1: Quantitative Skills
To assess a candidate’s quantitative proficiency, here are some techniques that work well based on my previous experience.
a. Put together a document with an existing business problem and some of the analysis that’s been done to answer it. Ask the applicant for suggestions on the limitations of the current approach and what they’d do if that project were handed to them.
b. Put together a basic statistics test which covers simple probability theory and inferential statistical principles. Ask the candidate to answer those questions in an informal setting to ascertain what they know and how they work through problems when they don’t know the answer.
c. Ask the applicant to read a statistically demanding document and then request a summary and feedback from the candidate. This should also tell us something about what the candidate knows about statistics and whether they can summarize the relevant parts in a satisfactory manner.

Part 2: Technical Skills
To assess a candidate’s technical proficiency, here are some techniques that work well based on my previous experience.
a. Show an applicant some imperfect code that is unnecessarily long or could be improved. Ask them to look it over and provide their suggestions on how they’d do things differently.
b. Put together several small code snippets in various programming languages that the candidate may or may not know. Ask them to go through the code, identify what is happening at each step, and explain the final result.
c. Have the applicant share their work on an interesting work or non-work related project that they completed recently. They can talk about specific aspects of their code and consider whether there is anything they’d do differently now.

The possibilities are endless, but there have to be better ways to assess the quality of candidates for analytics roles than the ‘homework assignment.’ In any case, I’ll be refusing to do any more assignments as part of the hiring process.

 

 

Wikipedia and the Fashion Weeks: A Look at Usage Patterns

Unlike many of the entries on Wikipedia relating to statistics or computer science, fashion related topics have not been thoroughly documented. For example, the entries on Martin Margiela and Rei Kawakubo pale in comparison to the breadth of content on John Bayes, structural equation modeling, or R. In light of this, I wanted to investigate whether people were using particular fashion related entries on Wikipedia and see how usage patterns had evolved over time. My focus was on the four major fashion weeks given that they are central events within the industry and are paid attention to by tens of millions of people. This analysis is ultimately exploratory, and we’re unable to make any inferences about whether an adequate number of people are using the fashion week entries on Wikipedia or whether low usage is the result of those entries not being thoroughly documented. At the end of the day, millions of people use Wikipedia, and there’s no doubt that the fashion community needs to be more progressive in ensuring that the fashion related entries on the site are covered in a more cohesive manner.

[Figure: Wikipedia page views for the Fashion Week entry]

Unsurprisingly, there are two spikes each year in and around the months when the Fall/Winter and Spring/Summer collections are shown. Since 2013, however, the spikes have been less pronounced and there has been a gradual downward trend in visits. This is surprising given the increasing interest in fashion that has occurred over the past five years. The same downward trend also exists in Google search volume for fashion related queries over the past few years.

The line graphs showing visits to the Wikipedia pages for the four major fashion weeks are presented below. They each have their own characteristics, and because these trend charts are exploratory, there are really no major conclusions to be gleaned from them.

[Figures: Wikipedia page views for the Milan, Paris, New York, and London Fashion Week entries]

Ultimately, there’s no doubt that high fashion is more popular today than ever before. This is evidenced by sales patterns, amount of media exposure, and the explosion in fashion blogging. This post sought to identify whether people were using Wikipedia to inform themselves about the major fashion weeks and how that trend has changed over time. While those patterns have seen slight increases or remained stagnant, that does not minimize the emergence of high fashion into American popular culture.