I don’t normally work with personally identifiable information such as email addresses. However, the recent data dump from Ashley Madison got me thinking about how I’d examine a data set composed of email addresses. What characteristics of an email address would I look to extract? How would I perform that task in R? Here’s some quick R code to extract the local part (the text before the @), the host, the address type (the top-level domain, such as com or gov), and a couple of other flags from a set of email strings. From there, we can summarize the data by any of those characteristics. I’d love to dive into the Ashley Madison email dump to find which companies and industries had the highest ratio of executives on that site, but that’s a little beyond my technical skills given the sheer size of the data set. Hopefully someone will complete that analysis soon enough.
library(stringr)  # provides str_c() and str_detect()

df = data.frame(email = c("one@gkn.com", "two132@wern.com", "three@fu.com", "four@huo.com", "five@hoi.net",
                          "ten@hoinse.com", "four99@huo.com", "two@wern.gov", "f_ive@hoi.com", "six@ihoio.gov"))

# Local part: everything before the @
df$one <- sub("@.*$", "", df$email)
# Host: everything after the @
df$two <- sub('.*@', '', df$email)
# Top-level domain: everything after the last dot
df$three <- sub('.*\\.', '', df$email)

# Flag addresses that contain a digit
num <- c(0:9); num
num_match <- str_c(num, collapse = "|"); num_match
df$num_yn <- as.numeric(str_detect(df$email, num_match))

# Flag addresses that contain an underscore
und <- c("_"); und
und_match <- str_c(und, collapse = "|"); und_match
df$und_yn <- as.numeric(str_detect(df$email, und_match))

> df
             email    one        two three num_yn und_yn
1      one@gkn.com    one    gkn.com   com      0      0
2  two132@wern.com two132   wern.com   com      1      0
3     three@fu.com  three     fu.com   com      0      0
4     four@huo.com   four    huo.com   com      0      0
5     five@hoi.net   five    hoi.net   net      0      0
6   ten@hoinse.com    ten hoinse.com   com      0      0
7   four99@huo.com four99    huo.com   com      1      0
8     two@wern.gov    two   wern.gov   gov      0      0
9    f_ive@hoi.com  f_ive    hoi.com   com      0      1
10   six@ihoio.gov    six  ihoio.gov   gov      0      0
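As a rough sketch of the summarizing step mentioned earlier, here is one way the extracted fields could be tallied using only base R. The sample addresses are rebuilt from the local-part and host columns of the output above, and the names `host`, `tld`, `tld_counts`, `host_counts`, and `digit_share` are my own for illustration; treat this as one possible approach rather than the definitive one.

```r
# Sample addresses rebuilt from the output table above (illustrative only)
emails <- c("one@gkn.com", "two132@wern.com", "three@fu.com", "four@huo.com",
            "five@hoi.net", "ten@hoinse.com", "four99@huo.com", "two@wern.gov",
            "f_ive@hoi.com", "six@ihoio.gov")

df <- data.frame(email = emails, stringsAsFactors = FALSE)
df$host   <- sub(".*@", "", df$email)              # everything after the @
df$tld    <- sub(".*\\.", "", df$email)            # everything after the last dot
df$num_yn <- as.numeric(grepl("[0-9]", df$email))  # 1 if the address has a digit

# Number of addresses per top-level domain
tld_counts <- table(df$tld)

# Number of addresses per host, most common first
host_counts <- sort(table(df$host), decreasing = TRUE)

# Share of addresses containing a digit, by top-level domain
digit_share <- aggregate(num_yn ~ tld, data = df, FUN = mean)

print(tld_counts)
print(host_counts)
print(digit_share)
```

On a real dump, the same `table()` and `aggregate()` calls would work unchanged; only the `emails` vector would need to be read from a file instead.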
What about you? If you regularly work with email addresses and have some useful insights for the rest of us, please leave a comment below. How do you usually approach a data set that is nothing more than a large list of email addresses?