I don’t normally work with personally identifiable information such as email addresses. However, the recent data dump from Ashley Madison got me thinking about how I’d examine a data set composed of email addresses. What are the characteristics of an email address that I’d look to extract? How would I perform that task in R? Here’s some quick R code to extract the host, the address type (the top-level domain, such as com or gov), and other information, such as whether the address contains a digit or an underscore, from a set of email strings. From there, we can summarize the data by whichever of those characteristics we care about. I’d love to dive into the Ashley Madison email dump to find which companies and industries had the highest ratio of executives on that site, but that’s a little beyond my technical skills given the sheer size of the data set. Hopefully someone will complete that analysis soon enough.
    library(stringr)  # provides str_c() and str_detect()

    df <- data.frame(email = c("one@gkn.com", "two132@wern.com", "three@fu.com",
                               "four@huo.com", "five@hoi.net", "ten@hoinse.com",
                               "four99@huo.com", "two@wern.gov", "f_ive@hoi.com",
                               "six@ihoio.gov"),
                     stringsAsFactors = FALSE)

    df$one   <- sub("@.*$", "", df$email)    # everything before the @
    df$two   <- sub(".*@", "", df$email)     # host: everything after the @
    df$three <- sub(".*\\.", "", df$email)   # address type: text after the last dot

    # Flag addresses that contain a digit
    num <- c(0:9)
    num_match <- str_c(num, collapse = "|")  # "0|1|2|3|4|5|6|7|8|9"
    df$num_yn <- as.numeric(str_detect(df$email, num_match))

    # Flag addresses that contain an underscore
    und <- c("_")
    und_match <- str_c(und, collapse = "|")
    df$und_yn <- as.numeric(str_detect(df$email, und_match))

    > df
                 email    one        two three num_yn und_yn
    1      one@gkn.com    one    gkn.com   com      0      0
    2  two132@wern.com two132   wern.com   com      1      0
    3     three@fu.com  three     fu.com   com      0      0
    4     four@huo.com   four    huo.com   com      0      0
    5     five@hoi.net   five    hoi.net   net      0      0
    6   ten@hoinse.com    ten hoinse.com   com      0      0
    7   four99@huo.com four99    huo.com   com      1      0
    8     two@wern.gov    two   wern.gov   gov      0      0
    9    f_ive@hoi.com  f_ive    hoi.com   com      0      1
    10   six@ihoio.gov    six  ihoio.gov   gov      0      0
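As a rough sketch of the summarizing step, here is one way to roll the extracted columns up using only base R. The sample addresses are assumed, pieced together from the `one` and `two` columns of the output above, and the names `host`, `tld`, `tld_counts`, `digit_share`, and `repeat_hosts` are my own labels for illustration rather than anything from the code above.

```r
# Assumed sample addresses, rebuilt from the `one` and `two` output columns
emails <- c("one@gkn.com", "two132@wern.com", "three@fu.com", "four@huo.com",
            "five@hoi.net", "ten@hoinse.com", "four99@huo.com", "two@wern.gov",
            "f_ive@hoi.com", "six@ihoio.gov")

df <- data.frame(email = emails, stringsAsFactors = FALSE)
df$host   <- sub(".*@", "", df$email)              # everything after the @
df$tld    <- sub(".*\\.", "", df$email)            # text after the last dot
df$num_yn <- as.numeric(grepl("[0-9]", df$email))  # 1 if the address has a digit

# Number of addresses per top-level domain, most common first
tld_counts <- sort(table(df$tld), decreasing = TRUE)

# Share of addresses containing a digit, by top-level domain
digit_share <- aggregate(num_yn ~ tld, data = df, FUN = mean)

# Hosts that show up more than once in the data
host_counts  <- table(df$host)
repeat_hosts <- names(host_counts[host_counts > 1])

print(tld_counts)
print(digit_share)
print(repeat_hosts)
```

On a real dump you would probably want to lowercase the addresses first, since hosts are case-insensitive and the same host can otherwise be counted under several spellings.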
What about you? If you regularly work with email addresses and have some useful insights for the rest of us, please leave a comment below. How do you usually attack a data set that is just a large number of email addresses?