Examining Email Addresses in R

August 22, 2015 (updated August 23, 2015)

I don’t normally work with personally identifiable information such as email addresses. However, the recent data dump from Ashley Madison got me thinking about how I’d examine a data set composed of email addresses. Which characteristics of an email address would I want to extract, and how would I do that in R? Here’s some quick R code that extracts the user name, the host, the top-level domain, and a couple of other features (whether the address contains a digit or an underscore) from a set of email strings. From there, we can summarize the data by any of those characteristics. I’d love to dive into the Ashley Madison email dump to find which companies and industries had the highest ratio of executives on that site, but that’s a little beyond my technical skills given the sheer size of the data set. Hopefully someone will complete that analysis soon.

library(stringr)  # provides str_c() and str_detect()

df <- data.frame(email = c("one@gkn.com","two132@wern.com","three@fu.com","four@huo.com","five@hoi.net",
                           "ten@hoinse.com","four99@huo.com","two@wern.gov","f_ive@hoi.com","six@ihoio.gov"),
                 stringsAsFactors = FALSE)
 
df$one <- sub("@.*$", "", df$email)     # user name: everything before the "@"
df$two <- sub('.*@', '', df$email)      # host: everything after the "@"
df$three <- sub('.*\\.', '', df$email)  # top-level domain: everything after the last "."
 
# Flag addresses that contain a digit (pattern is "0|1|...|9")
num <- c(0:9); num
num_match <- str_c(num, collapse = "|"); num_match
df$num_yn <- as.numeric(str_detect(df$email, num_match))
# Flag addresses that contain an underscore
und <- c("_"); und
und_match <- str_c(und, collapse = "|"); und_match
df$und_yn <- as.numeric(str_detect(df$email, und_match))
 
> df
             email    one        two three num_yn und_yn
1      one@gkn.com    one    gkn.com   com      0      0
2  two132@wern.com two132   wern.com   com      1      0
3     three@fu.com  three     fu.com   com      0      0
4     four@huo.com   four    huo.com   com      0      0
5     five@hoi.net   five    hoi.net   net      0      0
6   ten@hoinse.com    ten hoinse.com   com      0      0
7   four99@huo.com four99    huo.com   com      1      0
8     two@wern.gov    two   wern.gov   gov      0      0
9    f_ive@hoi.com  f_ive    hoi.com   com      0      1
10   six@ihoio.gov    six  ihoio.gov   gov      0      0
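
The summarizing step mentioned earlier could look something like the sketch below. It uses base R only and rebuilds the sample addresses from the `one` and `two` columns of the output above. The variable names (`domain_counts`, `tld_counts`, `digit_share`) are made up for illustration, and the digit check here looks only at the user name rather than the whole address.

```r
# Sample addresses, rebuilt from the user-name and host columns shown above
emails <- c("one@gkn.com", "two132@wern.com", "three@fu.com", "four@huo.com",
            "five@hoi.net", "ten@hoinse.com", "four99@huo.com", "two@wern.gov",
            "f_ive@hoi.com", "six@ihoio.gov")

user   <- sub("@.*$", "", emails)    # everything before the "@"
domain <- sub(".*@", "", emails)     # everything after the "@"
tld    <- sub(".*\\.", "", domain)   # everything after the last "."

# Frequency tables, most common value first
domain_counts <- sort(table(domain), decreasing = TRUE)
tld_counts    <- sort(table(tld), decreasing = TRUE)

# Share of addresses per top-level domain whose user name contains a digit
has_digit   <- grepl("[0-9]", user)
digit_share <- tapply(has_digit, tld, mean)

print(domain_counts)
print(tld_counts)
print(digit_share)
```

On a real data set, the same `table()` and `tapply()` calls would work on the full columns of addresses; only the sample vector changes.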

What about you? If you regularly work with email addresses and have some useful insights for the rest of us, please leave a comment below. How do you usually approach a data set that is nothing more than a large number of email addresses?

8 thoughts on “Examining Email Addresses in R”

Rewarp says:

August 23, 2015 at 3:34 AM

What a coincidence. I just learnt how to use regexpr() and regmatches() to find and output patterns in a character string. Your method seems a lot more efficient. Could you explain the str_c part of your code though? That doesn’t appear to be part of the base packages.

atmathew says:

August 23, 2015 at 6:29 PM

I haven’t done anything with the data and don’t plan to. However, I am curious which private sector email domains are in there. Sure, it’s shaming, but completely valid in my book. It’s not about insights per se, and there are issues with the emails not being verified… but I see it as a way to highlight how employees at major corporations use their work computers.

patricio fuenmayor says:

August 24, 2015 at 3:37 PM

Hi, I work in a data quality area, and we have a process to validate emails:

1. Normalize the email (convert to upper case)
2. Count how often each address is repeated
3. Detect invalid characters
4. Analyze the structure (3 regular expressions to validate)
5. Validate the domain
6. Validate default users
7. Validate the intended domain: for example, someone meant to write HOTMAIL.COM but wrote HOTAMAIL.COM; this is detected with a string distance.

# PACKAGES
require(data.table,quietly=TRUE)
require(rsqlserver,quietly= TRUE)
require(stringr,quietly=TRUE)
require(splitstackshape)
require(stringdist)

# SQL SERVER CONNECTION
chn01 <- dbConnect(rsqlserver::SqlServer(),url="Server=ECBPPRQ75\\Q75,10500;Database=BDWH_SOR;Trusted_Connection=True;")

email01 <- data.table(dbGetQuery(chn01,
"select
ltrim(rtrim(eml_usr_id)) as email
from eml_adr",stringsAsFactors=FALSE))

email01[,":="(email=toupper(email))]
email02 <- email01[,.N,by=email][order(-N)]
setnames(email02,c("EMAIL","FRQ"))

# EMAIL VALIDATION
rgx03 <- "[#= $%&\$\$+!¡?,:;\\’¿\\/\\{\\}\\[\\]\\*\\ÁÉÍÓÚÀÈÌÒÙÄËÏÖÜÑ`´]"
rgx05a <- "^[\\w!#$%&'*+/=?`{|}~^-]+(?:\\.[\\w!#$%&'*+/=?`{|}~^-]+)*@(?:[A-Z0-9-]+\\.)+[A-Z]{2,6}$"
rgx05b <- "^[-!#$%&'*+/0-9=?A-Z^_{|}~](\\.?[-!#$%&'*+/0-9=?A-Z^_{|}~])*@[A-Z](-?[A-Z0-9])*(\\.[A-Z](-?[A-Z0-9])*)+$"
rgx05c =7,6,NA))][,”:=”(VLD=sum(c(ERR03,ERR05,ERR06),na.rm=TRUE)),by=EMAIL]
email02[,ERR_COD_LS:=gsub("(\\|NA){1,}|NA\\|","",paste(ERR03,ERR05,ERR06,sep="|")),by=EMAIL]
email02[,.(.N,FRQ=sum(FRQ)),by=ERR_COD_LS]

email03 <- email02[ERR_COD_LS %in% c("NA","6"),.(EMAIL,FRQ,ERR_COD_LS)]
email04 <- cSplit(email03,"EMAIL","@",drop=FALSE,type.convert=FALSE)
setnames(email04,c("EMAIL","FRQ","ERR_COD_LS","USER","DOMAIN"))

# VALIDATE THE INTENDED DOMAIN
dmn01 <- email04[,.(FRQ=sum(FRQ)),by="DOMAIN"][order(-FRQ)]
dmn02 <- c("HOTMAIL.COM","DOMINIO.COM","GMAIL.COM","YAHOO.COM","HOTMAIL.ES","YAHOO.ES","PICHINCHA.COM",
"OUTLOOK.COM","LIVE.COM","OUTLOOK.ES","ANDINANET.NET","MSN.COM","LATINMAIL.COM","UIO.SATNET.NET",
"YAHOO.COM.MX","YAHOO.COM.AR","HOTMAIL.IT","AOL.COM","INTERACTIVE.NET.EC")

for(col in dmn02) email04[,(col):=stringdist((col),DOMAIN)]

With this, we can infer a domain …
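
One way the inference step might look is sketched below. This is not the code above: it swaps in base R's `adist()` (generalized Levenshtein distance) for `stringdist()`, uses a small made-up set of observed domains (`seen`), and applies a hypothetical cut-off `max_dist` that would need tuning on real data.

```r
# Reference domains (a subset of the list above) and some observed domains.
# `seen` is invented for illustration.
known <- c("HOTMAIL.COM", "GMAIL.COM", "YAHOO.COM", "OUTLOOK.COM")
seen  <- c("HOTAMAIL.COM", "GMIAL.COM", "YAHOO.COM", "PICHINCHA.COM")

# Distance matrix: one row per observed domain, one column per known domain
d <- adist(seen, known)

# Closest known domain and its distance for every observed domain
nearest  <- known[apply(d, 1, which.min)]
distance <- apply(d, 1, min)

# Hypothetical threshold: only suggest a correction for near misses.
# A distance of 0 means the domain is already a known one.
max_dist   <- 2
suggestion <- ifelse(distance > 0 & distance <= max_dist, nearest, NA)

print(data.frame(seen, nearest, distance, suggestion))
```

Domains that are far from every reference domain (here PICHINCHA.COM, which is absent from the shortened `known` list) get no suggestion, so legitimate but unlisted domains are left alone.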

Greetings…

atmathew says:

August 26, 2015 at 1:28 PM

Wow! Thanks for sharing.

atmathew says:

August 26, 2015 at 4:05 PM

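It’s in the stringr package, which makes working with strings and regular expressions a lot easier. str_c() joins strings together; here it collapses the digits into a single pattern of alternatives separated by “|”.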
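
For anyone without stringr installed, here is a rough base R equivalent of that step, as a sketch: `paste()` with `collapse` stands in for `str_c()`, and `grepl()` stands in for `str_detect()`. The three sample addresses are taken from the post; the variable names are only for illustration.

```r
digits <- 0:9

# Base R counterpart of stringr's str_c(digits, collapse = "|")
num_match <- paste(digits, collapse = "|")
print(num_match)   # "0|1|2|3|4|5|6|7|8|9"

emails <- c("one@gkn.com", "two132@wern.com", "f_ive@hoi.com")

# grepl() takes the pattern first and the strings second,
# the reverse of str_detect(string, pattern)
via_alternation <- grepl(num_match, emails)
via_char_class  <- grepl("[0-9]", emails)   # shorter pattern, same matches

print(via_alternation)
print(identical(via_alternation, via_char_class))
```

The character class `[0-9]` says the same thing as the ten-way alternation, so the collapsed pattern is mostly useful when the alternatives are whole words rather than single characters.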

jep says:

September 1, 2015 at 12:15 AM

patricio fuenmayor, I am really new to R and I cannot get your code to run starting with this line:

rgx05c =7,6,NA))][,”:=”(VLD=sum(c(ERR03,ERR05,ERR06),na.rm=TRUE)),by=EMAIL]

I get this: Error: unexpected ‘,’ in “rgx05c =7,”
