Examining Email Addresses in R

I don’t normally work with personally identifiable information such as email addresses. However, the recent data dump from Ashley Madison got me thinking about how I’d examine a data set composed of email addresses. Which characteristics of an email address would I want to extract, and how would I do that in R? Here’s some quick R code that extracts the local part, the host, the top-level domain, and a couple of character flags from a set of email strings. From there, we can summarize the data by whichever of those characteristics we care about. I’d love to dive into the Ashley Madison email dump to find which companies and industries had the highest ratio of executives on that site, but that’s a little beyond my technical skills given the sheer size of the data set. Hopefully someone will complete that analysis soon enough.

library(stringr)  # provides str_c() and str_detect()

df <- data.frame(email = c("one@gkn.com","two132@wern.com","three@fu.com","four@huo.com","five@hoi.net",
                           "ten@hoinse.com","four99@huo.com","two@wern.gov","f_ive@hoi.com","six@ihoio.gov"),
                 stringsAsFactors = FALSE)
 
# Local part (before the @), host (after the @), top-level domain (after the last dot)
df$one <- sub("@.*$", "", df$email)
df$two <- sub(".*@", "", df$email)
df$three <- sub(".*\\.", "", df$email)
 
# str_c() collapses the digits into the alternation pattern "0|1|2|...|9"
num <- 0:9
num_match <- str_c(num, collapse = "|")
df$num_yn <- as.numeric(str_detect(df$email, num_match))  # 1 if the address contains a digit
# Same approach for the underscore
und <- "_"
und_match <- str_c(und, collapse = "|")
df$und_yn <- as.numeric(str_detect(df$email, und_match))  # 1 if the address contains an underscore
 
> df
             email    one        two three num_yn und_yn
1      one@gkn.com    one    gkn.com   com      0      0
2  two132@wern.com two132   wern.com   com      1      0
3     three@fu.com  three     fu.com   com      0      0
4     four@huo.com   four    huo.com   com      0      0
5     five@hoi.net   five    hoi.net   net      0      0
6   ten@hoinse.com    ten hoinse.com   com      0      0
7   four99@huo.com four99    huo.com   com      1      0
8     two@wern.gov    two   wern.gov   gov      0      0
9    f_ive@hoi.com  f_ive    hoi.com   com      0      1
10   six@ihoio.gov    six  ihoio.gov   gov      0      0
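
Once those columns exist, summarizing is mostly a matter of counting. Here is a minimal sketch in base R, assuming the ten sample addresses implied by the output above; the column names `host`, `tld`, and `has_digit` are my own illustrative choices, not part of the original code.

```r
# Sample addresses rebuilt from the output table above (illustrative data only)
emails <- c("one@gkn.com", "two132@wern.com", "three@fu.com", "four@huo.com", "five@hoi.net",
            "ten@hoinse.com", "four99@huo.com", "two@wern.gov", "f_ive@hoi.com", "six@ihoio.gov")
df <- data.frame(email = emails, stringsAsFactors = FALSE)

# Hypothetical column names: host, top-level domain, and a digit flag
df$host <- sub(".*@", "", df$email)
df$tld <- sub(".*\\.", "", df$email)
df$has_digit <- grepl("[0-9]", df$email)

# Number of addresses per top-level domain, most common first
tld_counts <- sort(table(df$tld), decreasing = TRUE)
print(tld_counts)

# Share of addresses containing a digit, by top-level domain
digit_share <- tapply(df$has_digit, df$tld, mean)
print(digit_share)
```

The same pattern works for the host column, which is probably the more interesting grouping if the goal is to compare companies.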

What about you? If you regularly work with email addresses and have some useful insights for the rest of us, please leave a comment below. How do you usually attack a data set that consists of nothing but a large number of email addresses?

8 thoughts on “Examining Email Addresses in R”

  1. What a coincidence. I just learnt how to use regexpr() and regmatches() to find and output patterns in a character string. Your method seems a lot more efficient. Could you explain the str_c part of your code though? That doesn’t appear to be part of the base packages.


  2. I haven’t done anything with the data and don’t plan to. However, I am curious which private-sector email domains are in there. Sure, it’s shaming, but completely valid in my book. It’s not about insights per se, and there is the issue that the emails were not verified, but I see it as a way to highlight how employees at major corporations use their work computers.


  3. Hi, I work in a data quality area, and we have a process to validate emails:

    1. Standardize the email (convert to upper case)
    2. Count the frequency of repetition
    3. Detect invalid characters
    4. Analyze the structure (3 regular expressions to validate)
    5. Validate the domain
    6. Validate default users
    7. Validate intent: for example, someone meant to write HOTMAIL.COM but wrote HOTAMAIL.COM; this is detected with string distance.

    # PACKAGES
    require(data.table,quietly=TRUE)
    require(rsqlserver,quietly= TRUE)
    require(stringr,quietly=TRUE)
    require(splitstackshape)
    require(stringdist)

    # SQL SERVER CONNECTION
    chn01 <- dbConnect(rsqlserver::SqlServer(),url="Server=ECBPPRQ75\\Q75,10500;Database=BDWH_SOR;Trusted_Connection=True;")

    email01 <- data.table(dbGetQuery(chn01,
    "select
    ltrim(rtrim(eml_usr_id)) as email
    from eml_adr",stringsAsFactors=FALSE))

    email01[,":="(email=toupper(email))]
    email02 <- email01[,.N,by=email][order(-N)]
    setnames(email02,c("EMAIL","FRQ"))

    # EMAIL VALIDATION
    rgx03 <- "[#= $%&\\(\\)+!¡?,:;\\'¿\\/\\{\\}\\[\\]\\*\\ÁÉÍÓÚÀÈÌÒÙÄËÏÖÜÑ`´]"
    rgx05a <- "^[\\w!#$%&'*+/=?`{|}~^-]+(?:\\.[\\w!#$%&'*+/=?`{|}~^-]+)*@(?:[A-Z0-9-]+\\.)+[A-Z]{2,6}$"
    rgx05b <- "^[-!#$%&'*+/0-9=?A-Z^_{|}~](\\.?[-!#$%&'*+/0-9=?A-Z^_{|}~])*@[A-Z](-?[A-Z0-9])*(\\.[A-Z](-?[A-Z0-9])*)+$"
    rgx05c =7,6,NA))][,”:=”(VLD=sum(c(ERR03,ERR05,ERR06),na.rm=TRUE)),by=EMAIL]
    email02[,ERR_COD_LS:=gsub("(\\|NA){1,}|NA\\|","",paste(ERR03,ERR05,ERR06,sep="|")),by=EMAIL]
    email02[,.(.N,FRQ=sum(FRQ)),by=ERR_COD_LS]

    email03 <- email02[ERR_COD_LS %in% c("NA","6"),.(EMAIL,FRQ,ERR_COD_LS)]
    email04 <- cSplit(email03,"EMAIL","@",drop=FALSE,type.convert=FALSE)
    setnames(email04,c("EMAIL","FRQ","ERR_COD_LS","USER","DOMAIN"))

    # VALIDATE INTENT (STRING DISTANCE TO KNOWN DOMAINS)
    dmn01 <- email04[,.(FRQ=sum(FRQ)),by="DOMAIN"][order(-FRQ)]
    dmn02 <- c("HOTMAIL.COM","DOMINIO.COM","GMAIL.COM","YAHOO.COM","HOTMAIL.ES","YAHOO.ES","PICHINCHA.COM",
    "OUTLOOK.COM","LIVE.COM","OUTLOOK.ES","ANDINANET.NET","MSN.COM","LATINMAIL.COM","UIO.SATNET.NET",
    "YAHOO.COM.MX","YAHOO.COM.AR","HOTMAIL.IT","AOL.COM","INTERACTIVE.NET.EC")

    for(col in dmn02) email04[,(col):=stringdist((col),DOMAIN)]

    With this, we can infer a domain …
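
    A minimal sketch of that inference step, under stated assumptions: it uses base R's `adist()` (generalized Levenshtein distance) rather than the `stringdist` call above, the domain lists are made-up examples, and the cutoff of 2 edits is an arbitrary choice that would need tuning on real data.

    ```r
    # Hypothetical reference list and observed domains (illustrative values only)
    known <- c("HOTMAIL.COM", "GMAIL.COM", "YAHOO.COM", "OUTLOOK.COM")
    observed <- c("HOTAMAIL.COM", "GMIAL.COM", "YAHOO.COM", "EXAMPLE.ORG")

    # Edit-distance matrix: one row per observed domain, one column per known domain
    d <- adist(observed, known)

    # Closest known domain and its distance for each observed domain
    nearest <- known[apply(d, 1, which.min)]
    distance <- apply(d, 1, min)

    # Suggest a correction only for small, non-zero distances (cutoff of 2 is an assumption)
    suggested <- ifelse(distance > 0 & distance <= 2, nearest, NA_character_)
    print(data.frame(observed, nearest, distance, suggested))
    ```

    Exact matches and domains far from every known one get no suggestion, so only likely typos are flagged for review.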

    Greetings…


  4. patricio fuenmayor, I am really new to R and I cannot get your code to run starting with this line:

    rgx05c =7,6,NA))][,”:=”(VLD=sum(c(ERR03,ERR05,ERR06),na.rm=TRUE)),by=EMAIL]

    I get this: Error: unexpected ‘,’ in “rgx05c =7,”
