I don’t normally work with personally identifiable information such as email addresses. However, the recent data dump from Ashley Madison got me thinking about how I’d examine a data set composed of email addresses. Which characteristics of an email address would I want to extract, and how would I do that in R? Here’s some quick R code that pulls the local part (the text before the @), the host, and the top-level domain out of a set of email strings, and flags addresses that contain digits or underscores. From there, we can summarize the data by whichever of those characteristics we care about. I’d love to dive into the Ashley Madison email dump to find which companies and industries had the highest ratio of executives on that site, but that’s a little beyond my technical skills given the sheer size of the data set. Hopefully someone will complete that analysis soon enough.
    library(stringr)  # provides str_c() and str_detect()

    # Sample addresses (matching the columns in the printed output below)
    df = data.frame(email = c("one@gkn.com", "two132@wern.com", "three@fu.com", "four@huo.com", "five@hoi.net",
                              "ten@hoinse.com", "four99@huo.com", "two@wern.gov", "f_ive@hoi.com", "six@ihoio.gov"))

    df$one <- sub("@.*$", "", df$email )     # local part: everything before the @
    df$two <- sub('.*@', '', df$email )      # host: everything after the @
    df$three <- sub('.*\\.', '', df$email )  # top-level domain: everything after the last dot

    # Flag addresses that contain a digit
    num <- c(0:9); num
    num_match <- str_c(num, collapse = "|"); num_match
    df$num_yn <- as.numeric(str_detect(df$email, num_match))

    # Flag addresses that contain an underscore
    und <- c("_"); und
    und_match <- str_c(und, collapse = "|"); und_match
    df$und_yn <- as.numeric(str_detect(df$email, und_match))

    > df
                 email    one        two three num_yn und_yn
    1      one@gkn.com    one    gkn.com   com      0      0
    2  two132@wern.com two132   wern.com   com      1      0
    3     three@fu.com  three     fu.com   com      0      0
    4     four@huo.com   four    huo.com   com      0      0
    5     five@hoi.net   five    hoi.net   net      0      0
    6   ten@hoinse.com    ten hoinse.com   com      0      0
    7   four99@huo.com four99    huo.com   com      1      0
    8     two@wern.gov    two   wern.gov   gov      0      0
    9    f_ive@hoi.com  f_ive    hoi.com   com      0      1
    10   six@ihoio.gov    six  ihoio.gov   gov      0      0
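For the summarizing step, here is a minimal sketch of one way it might look, using only base R. It rebuilds the ten sample addresses from the printed output above so that it runs on its own. The column names host and tld and the objects tld_counts, host_counts, and digit_share are my own choices for illustration, not part of the code above, and grepl("[0-9]", ...) is swapped in for the stringr digit check.

```r
# Sample addresses rebuilt from the printed output above
emails <- c("one@gkn.com", "two132@wern.com", "three@fu.com", "four@huo.com",
            "five@hoi.net", "ten@hoinse.com", "four99@huo.com", "two@wern.gov",
            "f_ive@hoi.com", "six@ihoio.gov")

df <- data.frame(email = emails, stringsAsFactors = FALSE)
df$host   <- sub(".*@", "", df$email)               # everything after the @
df$tld    <- sub(".*\\.", "", df$email)             # everything after the last dot
df$num_yn <- as.numeric(grepl("[0-9]", df$email))   # 1 if the address contains a digit

# Number of addresses per top-level domain, most common first
tld_counts <- sort(table(df$tld), decreasing = TRUE)

# Number of addresses per host, most common first
host_counts <- sort(table(df$host), decreasing = TRUE)

# Share of addresses containing a digit, by top-level domain
digit_share <- aggregate(num_yn ~ tld, data = df, FUN = mean)

print(tld_counts)
print(host_counts)
print(digit_share)
```

On this toy sample, .com accounts for 7 of the 10 addresses and huo.com is the only host that appears twice. On a real dump, the same table() and aggregate() calls would be the starting point for ranking hosts or comparing address types across domains.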
What about you? If you regularly work with email addresses and have some useful insights for the rest of us, please leave a comment below. How do you usually approach a data set that is nothing more than a large number of email addresses?