Examining Email Addresses in R

I don’t normally work with personally identifiable information such as email addresses. However, the recent data dump from Ashley Madison got me thinking about how I’d examine a data set composed of email addresses. What are the characteristics of an email address that I’d look to extract? How would I perform that task in R? Here’s some quick R code to extract the local part, the host, the top-level domain, and a couple of other flags from a set of email strings. From there, we can summarize the data by whichever of those characteristics we care about. I’d love to dive into the Ashley Madison email dump to find which companies and industries had the highest ratio of executives on that site, but that’s a little beyond my technical skills given the sheer size of the data set. Hopefully someone will complete that analysis soon enough.

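library(stringr)  # provides str_c() and str_detect()

df <- data.frame(email = c("one@gkn.com","two132@wern.com","three@fu.com","four@huo.com","five@hoi.net",
                           "ten@hoinse.com","four99@huo.com","two@wern.gov","f_ive@hoi.com","six@ihoio.gov"),
                 stringsAsFactors = FALSE)
 
# one = local part (before the @), two = domain, three = top-level domain
df$one <- sub("@.*$", "", df$email )
df$two <- sub('.*@', '', df$email )
df$three <- sub('.*\\.', '', df$email )
 
# str_c() with collapse = "|" joins the digits into a single regex: "0|1|...|9"
num <- c(0:9); num
num_match <- str_c(num, collapse = "|"); num_match
# 1 if the address contains any digit, 0 otherwise
df$num_yn <- as.numeric(str_detect(df$email, num_match))
# Same approach for underscores
und <- c("_"); und
und_match <- str_c(und, collapse = "|"); und_match
df$und_yn <- as.numeric(str_detect(df$email, und_match))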
 
> df
             email    one        two three num_yn und_yn
1      one@gkn.com    one    gkn.com   com      0      0
2  two132@wern.com two132   wern.com   com      1      0
3     three@fu.com  three     fu.com   com      0      0
4     four@huo.com   four    huo.com   com      0      0
5     five@hoi.net   five    hoi.net   net      0      0
6   ten@hoinse.com    ten hoinse.com   com      0      0
7   four99@huo.com four99    huo.com   com      1      0
8     two@wern.gov    two   wern.gov   gov      0      0
9    f_ive@hoi.com  f_ive    hoi.com   com      0      1
10   six@ihoio.gov    six  ihoio.gov   gov      0      0
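
As a rough illustration of the summarizing step mentioned above, here is a minimal sketch that counts addresses per domain and per top-level domain, and computes the share of addresses containing a digit within each top-level domain. It uses base R only and rebuilds the same toy data so it runs on its own; the names domain_counts, tld_counts, and digit_share are just illustrative choices, not part of the original code.

```r
# Rebuild the example data so this block runs on its own (base R only)
emails <- c("one@gkn.com", "two132@wern.com", "three@fu.com", "four@huo.com",
            "five@hoi.net", "ten@hoinse.com", "four99@huo.com", "two@wern.gov",
            "f_ive@hoi.com", "six@ihoio.gov")
df <- data.frame(email = emails, stringsAsFactors = FALSE)
df$two    <- sub(".*@", "", df$email)               # domain
df$three  <- sub(".*\\.", "", df$email)             # top-level domain
df$num_yn <- as.numeric(grepl("[0-9]", df$email))   # 1 if the address contains a digit

# Addresses per domain, most common first
domain_counts <- sort(table(df$two), decreasing = TRUE)

# Addresses per top-level domain
tld_counts <- table(df$three)

# Share of addresses containing a digit, by top-level domain
digit_share <- tapply(df$num_yn, df$three, mean)

print(domain_counts)
print(tld_counts)
print(digit_share)
```

On a real dump the same three calls would work unchanged, although for millions of rows a package such as data.table would likely be faster than table() and tapply().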

What about you? If you regularly work with email addresses and have some useful insights for the rest of us, please leave a comment below. How do you usually attack a data set where it’s just a large number of email addresses?

8 thoughts on “Examining Email Addresses in R”

  1. Pingback: Examining Email Addresses in R | Mubashir Qasim

  2. What a coincidence. I just learnt how to use regexpr() and regmatches() to find and output patterns in a character string. Your method seems a lot more efficient. Could you explain the str_c part of your code though? That doesn’t appear to be part of the base packages.

  3. Pingback: Distilled News | Data Analytics & R

  4. I haven’t done anything with the data and don’t plan to. However, I am curious which private-sector email domains are in there. Sure, it’s shaming, but completely valid in my book. It’s not about insights per se, and there are issues with the emails not being verified… but I see it as a way to highlight how employees at major corporations use their work computers.

  5. Hi, I work in data quality, and we have a process to validate emails:

    1. Normalize the email (to upper case)
    2. Count the frequency of repetition
    3. Detect invalid characters
    4. Analyze the structure (3 regular expressions to validate)
    5. Validate the domain
    6. Validate default users
    7. Validate intent using string distance: for example, someone meant to write HOTMAIL.COM but wrote HOTAMAIL.COM.

    # PACKAGES
    require(data.table,quietly=TRUE)
    require(rsqlserver,quietly= TRUE)
    require(stringr,quietly=TRUE)
    require(splitstackshape)
    require(stringdist)

    # SQL SERVER CONNECTION
    chn01 <- dbConnect(rsqlserver::SqlServer(),url="Server=ECBPPRQ75\\Q75,10500;Database=BDWH_SOR;Trusted_Connection=True;")

    email01 <- data.table(dbGetQuery(chn01,
    "select
    ltrim(rtrim(eml_usr_id)) as email
    from eml_adr",stringsAsFactors=FALSE))

    email01[,":="(email=toupper(email))]
    email02 <- email01[,.N,by=email][order(-N)]
    setnames(email02,c("EMAIL","FRQ"))

    # EMAIL VALIDATION
    rgx03 <- "[#= $%&\\(\\)+!¡?,:;\\’¿\\/\\{\\}\\[\\]\\*\\ÁÉÍÓÚÀÈÌÒÙÄËÏÖÜÑ`´]"
    rgx05a <- "^[\\w!#$%&'*+/=?`{|}~^-]+(?:\\.[\\w!#$%&'*+/=?`{|}~^-]+)*@(?:[A-Z0-9-]+\\.)+[A-Z]{2,6}$"
    rgx05b <- "^[-!#$%&'*+/0-9=?A-Z^_{|}~](\\.?[-!#$%&'*+/0-9=?A-Z^_{|}~])*@[A-Z](-?[A-Z0-9])*(\\.[A-Z](-?[A-Z0-9])*)+$"
    rgx05c =7,6,NA))][,":="(VLD=sum(c(ERR03,ERR05,ERR06),na.rm=TRUE)),by=EMAIL]
    email02[,ERR_COD_LS:=gsub("(\\|NA){1,}|NA\\|","",paste(ERR03,ERR05,ERR06,sep="|")),by=EMAIL]
    email02[,.(.N,FRQ=sum(FRQ)),by=ERR_COD_LS]

    email03 <- email02[ERR_COD_LS %in% c("NA","6"),.(EMAIL,FRQ,ERR_COD_LS)]
    email04 <- cSplit(email03,"EMAIL","@",drop=FALSE,type.convert=FALSE)
    setnames(email04,c("EMAIL","FRQ","ERR_COD_LS","USER","DOMAIN"))

    # VALIDATE INTENT (LIKELY TYPOS IN THE DOMAIN)
    dmn01 <- email04[,.(FRQ=sum(FRQ)),by="DOMAIN"][order(-FRQ)]
    dmn02 <- c("HOTMAIL.COM","DOMINIO.COM","GMAIL.COM","YAHOO.COM","HOTMAIL.ES","YAHOO.ES","PICHINCHA.COM",
    "OUTLOOK.COM","LIVE.COM","OUTLOOK.ES","ANDINANET.NET","MSN.COM","LATINMAIL.COM","UIO.SATNET.NET",
    "YAHOO.COM.MX","YAHOO.COM.AR","HOTMAIL.IT","AOL.COM","INTERACTIVE.NET.EC")

    for(col in dmn02) email04[,(col):=stringdist((col),DOMAIN)]

    With this, we can infer a domain …
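
A minimal sketch of how that inference could go, under some loud assumptions: it uses base R's adist() (Levenshtein distance) instead of the stringdist package, the known_domains and observed vectors are made-up illustration data, and the max_dist threshold of 2 is arbitrary. For each observed domain it finds the nearest known domain and suggests it only when the distance is small but non-zero.

```r
# Hypothetical reference list and observed domains, for illustration only
known_domains <- c("HOTMAIL.COM", "GMAIL.COM", "YAHOO.COM", "OUTLOOK.COM")
observed      <- c("HOTAMAIL.COM", "GMIAL.COM", "YAHOO.COM", "PICHINCHA.COM")

# adist() (utils, base R) returns a matrix of Levenshtein distances:
# one row per observed domain, one column per known domain
d <- adist(observed, known_domains)

# Closest known domain and its distance for every observed domain
nearest  <- known_domains[apply(d, 1, which.min)]
distance <- apply(d, 1, min)

# Small, non-zero distances are treated as likely typos (threshold is an assumption)
max_dist   <- 2
suggestion <- ifelse(distance > 0 & distance <= max_dist, nearest, NA)

result <- data.frame(observed, nearest, distance, suggestion,
                     stringsAsFactors = FALSE)
print(result)
```

Exact matches get no suggestion because their distance is zero, and domains far from every known entry are left alone rather than forced onto the nearest match.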

    Greetings…

  6. patricio fuenmayor, I am really new to R and I cannot get your code to run starting with this line:

    rgx05c =7,6,NA))][,”:=”(VLD=sum(c(ERR03,ERR05,ERR06),na.rm=TRUE)),by=EMAIL]

    I get this: Error: unexpected ‘,’ in “rgx05c =7,”
