# Data.Table by Example – Part 2

In part one, I provided an initial walkthrough of some nice features available in the data.table package. In particular, we saw how to filter data and get a count of rows by date.

dat = fread("rows.csv")
names(dat) <- gsub(" ", "_", names(dat))
dat[1:3]


Let us now add a few columns to our dataset on reported crimes in the city of Chicago. There are many ways to do this, but most involve the := operator. Since data.table updates values by reference, we do not need to assign the result back to a variable. This is a very desirable feature.

dat[, c("value1", "value2", "value3") := .(sample(1:50, nrow(dat), replace=TRUE),
                                           sample(1:50, nrow(dat), replace=TRUE),
                                           sample(1:50, nrow(dat), replace=TRUE))]

dat[, `:=`(value1 = sample(1:50, nrow(dat), replace=TRUE),
           value2 = sample(1:50, nrow(dat), replace=TRUE),
           value3 = sample(1:50, nrow(dat), replace=TRUE))]


You can also just use the traditional base R approach for adding new columns, as data.tables are also data frames.

dat$value1 <- sample(1:50, nrow(dat), replace=TRUE)
dat$value2 <- sample(1:50, nrow(dat), replace=TRUE)
dat$value3 <- sample(1:50, nrow(dat), replace=TRUE)


In any case, we now have three new columns with randomly selected values between 1 and 50. We can now summarize these values and see how they differ across the primary arrest type and other categorical variables.

dat[, .(mean = mean(value1, na.rm = TRUE),
        median = median(value1, na.rm = TRUE),
        min = min(value1, na.rm = TRUE),
        max = max(value1, na.rm = TRUE))]

dat[Primary_Type=="PROSTITUTION",
    .(mean = mean(value1, na.rm = TRUE),
      median = median(value1, na.rm = TRUE),
      min = min(value1, na.rm = TRUE),
      max = max(value1, na.rm = TRUE))]


The above code gives us the mean, median, min, and max of the value1 column, first for all rows and then only for rows where the primary type is prostitution. To get these summaries across multiple columns, we can use lapply, .SD, and .SDcols. Notice that in the second line below, we passed Primary_Type to the by operator so that we get the mean of each value column for each of the arrest type categories.

dat[, lapply(.SD, mean, na.rm=TRUE), .SDcols=c("value1", "value2", "value3")]

dat[, lapply(.SD, mean, na.rm=TRUE), by=Primary_Type,
    .SDcols=c("value1", "value2", "value3")][1:4]


But wait, that only applies a single function to each value column. If you want to apply multiple functions to each column, you need to write a function that can then be used within lapply.

my.summary <- function(x){
  c(mean = mean(x), median = median(x), min = min(x), max = max(x))
}

dat[, lapply(.SD, my.summary), .SDcols=c("value1", "value2", "value3")]

dat[, lapply(.SD, my.summary), by=.(Primary_Type),
    .SDcols=c("value1", "value2", "value3")]


Perfect! As you can see, the syntax is concise and very easy to work with. In the next part of this series, I'll cover a few advanced features like get() and {}. If you have any questions or comments, feel free to comment below.
You can also contact me at mathewanalytics@gmail.com or reach me through LinkedIn.

# Data.Table by Example – Part 1

For many years, I actively avoided the data.table package and preferred to utilize the tools available in either base R or dplyr for data aggregation and exploration. However, over the past year, I have come to realize that this was a mistake. Data tables are incredible and provide R users with a syntactically concise and efficient data structure for working with small, medium, or large datasets. While the package is well documented, I wanted to put together a series of posts that could be useful for those who want to get introduced to the data.table package in a more task-oriented format.

For this series of posts, I will be working with data that comes from the Chicago Police Department's Citizen Law Enforcement Analysis and Reporting system. This dataset contains information on reported incidents of crime that occurred in the city of Chicago from 2001 to present. You can use the wget command in the terminal to download it as a csv file.

$ wget --no-check-certificate --progress=dot -O rows.csv https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD


This file can be found in your working directory and was saved as “rows.csv”. We will import the data into R with the fread function and look at the first few rows and structure of the data.

dat = fread("rows.csv")

dat[1:3]
str(dat)


Notice that each of the string variables in the data set was imported as a character and not a factor. With base R functions like read.csv, we would have to set the stringsAsFactors argument to FALSE to get this result.

names(dat) <- gsub(" ", "_", names(dat))


Let’s say that we want to see the frequency distribution of several of these variables. This can be done by using .N in conjunction with the by operator.

dat[, .N, by=.(Arrest)]


In the code below, you can also see how to chain operations together. We start by finding the count of each response for the variable, order the counts in descending order, and then select only those which occurred at least 200,000 times.

dat[, .N, by=.(Primary_Type)][order(-N)][N>=200000]

dat[, .N, by=.(Location_Description)][order(-N)][N>=200000]


Let’s say that we want to get a count of prostitution incidents by month. To get the desired results, we will need to modify the date values, filter instances in which the primary type was “prostitution”, and then get a count by each date.

dat[, date2 := paste(substr(as.Date(dat$Date, format="%m/%d/%Y"),1,7), "01", sep="-")][
  Primary_Type=="PROSTITUTION", .N, by=.(date2)][
  , date2 := as.Date(date2)][order(date2)][]


If you want to plot the results as a line graph, just add another chain which executes the visualization, or use the magrittr %>% operator.

dat[, date2 := paste(substr(as.Date(dat$Date, format="%m/%d/%Y"),1,7), "01", sep="-")][
  Primary_Type=="PROSTITUTION", .N, by=.(date2)][
  , date2 := as.Date(date2)][order(date2)][, plot(N, type="l")]

dat[, date2 := paste(substr(as.Date(dat$Date, format="%m/%d/%Y"),1,7), "01", sep="-")][
  Primary_Type=="PROSTITUTION", .N, by=.(date2)][
  , date2 := as.Date(date2)][order(date2)] %>%
  ggplot(., aes(date2, N)) + geom_line(group=1)


I've obviously skipped over a lot, and some of the code presented here is more verbose than needed. Even so, beginners to R will hopefully find this useful, and it will pique your interest in the beauty of the data.table package. Future posts will cover more of the goodies available in the data.table package, such as get(), set(), {}, and so forth. If you have any questions or comments, feel free to comment below. You can also contact me at mathewanalytics@gmail.com or reach me through LinkedIn.

# Working With SEM Keywords in R

The following post was republished from two posts that appeared on an older blog of mine that is no longer available. They are from several years ago and relate to two practical questions that I encountered. One, how can I automatically generate hundreds of thousands of keywords for a search engine marketing campaign? Two, how can I develop an effective system for examining keywords based on different characteristics?

### Generating PPC Keywords in R

Paid search marketing refers to the process of driving traffic to a website by purchasing ads on search engines. Advertisers bid on certain keywords that users might search for, and that determines when and where their ads appear. For example, an individual who owns an auto dealership would want to bid on keywords relating to automobiles that a reasonable person would search for on a search engine. In both Google and Bing, advertisers are able to specify which keywords they would like to bid on and at what amount. If the user decides to bid on just a small number of keywords, they can type that information in and specify a bid. However, what if you want to bid on a significant number of keywords?
Instead of typing each and every keyword into the Google or Bing dashboard, you could programmatically generate the keywords in R. Let's say that I run an online retail establishment that sells mens and womens streetwear and I want to drive more traffic to my online store by placing ads on both Google and Bing. I want to bid on a number of keywords related to fashion and have created a number of 'root' words that will comprise the majority of these keywords. To generate my desired keywords, I have written a function which will take every permutation of the root words.

root1 = c("fashion", "streetwear")
root2 = c("karmaloop", "crooks and castles", "swag")
root3 = c("urban clothing", "fitted hats", "snapbacks")
root4 = c("best", "authentic", "low cost")

myfunc <- function(){
  lst <- list(root1=c(root1), root2=c(root2), root3=c(root3), root4=c(root4))
  myone <- function(x, y){
    m1 <- do.call(paste, expand.grid(lst[[x]], lst[[y]]))
    mydf <- data.frame(keyword=c(m1))
  }
  mydf <- rbind(myone("root4","root1"), myone("root2","root1"))
}

mydat <- myfunc()
mydat
write.table(mydat, "adppc.txt", quote=FALSE, row.names=FALSE)


This isn't the prettiest code in the world, but it gets the job done. In fact, the same results could have been achieved using the following code, which is much more efficient.

root5 = c("%s fashion")
root6 = c("%s streetwear")
adcam1 = sprintf(root5, root2)
adcam2 = sprintf(root6, root2)

df = data.frame(keywords=c(adcam1, adcam2))
write.table(df, "adppc.txt", quote=FALSE, row.names=FALSE)


If you have any suggestions for improving my R code, please mention them in the comment section below.

### Creating Tags For PPC Keywords

When performing search engine marketing, it is usually beneficial to construct a system for making sense of keywords and their performance.
While one could construct Bayesian belief networks to model the process of consumers clicking on ads, I have found that using 'tags' to categorize keywords is just as useful for conducting post-hoc analysis on the effectiveness of marketing campaigns. By 'tags,' I mean identifiers which categorize keywords according to their characteristics. For example, in the following data frame, we have six keywords, our average bids, numbers of clicks, conversions, and tags for state, model, car, auto, save, and cheap. What we want to do now is set the boolean for each tag to 1 if and only if that tag is mentioned in the keyword.

# CREATE SOME DATA
df = data.frame(keyword=c("best car insurance", "honda auto insurance", "florida car insurance",
                          "cheap insurance online", "free insurance quotes", "iowa drivers save money"),
                average_bid=c(3.12, 2.55, 2.38, 5.99, 4.75, 4.59),
                clicks=c(15, 20, 30, 50, 10, 25),
                conversions=c(5, 2, 10, 15, 3, 5),
                state=0, model=0, car=0, auto=0, save=0, cheap=0)
df

# FUNCTION WHICH SETS EACH TAG TO 1 IF THE SPECIFIED TAG IS PRESENT IN THE KEYWORD
main <- function(df) {
  state <- c("michigan", "missouri", "florida", "iowa", "kansas")
  model <- c("honda", "toyota", "ford", "acura", "audi")
  car <- c("car")
  auto <- c("auto")
  save <- c("save")
  cheap <- c("cheap")
  for (i in 1:nrow(df)) {
    Words = strsplit(as.character(df[i, 'keyword']), " ")[[1]]
    if(any(Words %in% state)) df[i, 'state'] <- 1
    if(any(Words %in% model)) df[i, 'model'] <- 1
    if(any(Words %in% car)) df[i, 'car'] <- 1
    if(any(Words %in% auto)) df[i, 'auto'] <- 1
    if(any(Words %in% save)) df[i, 'save'] <- 1
    if(any(Words %in% cheap)) df[i, 'cheap'] <- 1
  }
  return(df)
}

one = main(df)
subset(one, state==TRUE | model==TRUE | auto==TRUE)

# AN ALTERNATE METHOD USING THE STRINGR PACKAGE
library(stringr)

# CREATE EACH TAG
state <- c("michigan", "missouri", "florida", "iowa", "kansas")
model <- c("honda", "toyota", "ford", "acura", "audi")
car <- c("car")
auto <- c("auto")
save <- c("save")
cheap <- c("cheap")
state_match <- str_c(state, collapse = "|")
model_match <- str_c(model, collapse = "|")
car_match <- str_c(car, collapse = "|")
auto_match <- str_c(auto, collapse = "|")
save_match <- str_c(save, collapse = "|")
cheap_match <- str_c(cheap, collapse = "|")

# FUNCTION TO SET TAG IF PRESENT IN THE KEYWORD
main <- function(df) {
  df$state <- str_detect(df$keyword, state_match)
  df$model <- str_detect(df$keyword, model_match)
  df$car <- str_detect(df$keyword, car_match)
  df$auto <- str_detect(df$keyword, auto_match)
  df$save <- str_detect(df$keyword, save_match)
  df$cheap <- str_detect(df$keyword, cheap_match)
  df
}

two = main(df)
subset(two, state==TRUE | model==TRUE | auto==TRUE)


By now, some of you are probably wondering why we don't just select the keywords directly from the original data frame based on the desired characteristic. Well, that works too, albeit I've found that the marketing professionals that I've worked with have preferred the 'tagging' method.

## Alternate approach - SELECT DIRECTLY
main <- function(df) {
  model <- c("honda", "toyota", "ford", "acura", "audi")
  for (i in 1:nrow(df)) {
    Words = strsplit(as.character(df[i, 'keyword']), " ")[[1]]
    if(any(Words %in% model)) return(df[i, c(1:4)])
  }
}

three = main(df)


So there you have it, a method of 'tagging' strings according to a certain set of specified characteristics. The benefit of using 'tags' is that it provides you with a systematic way to document how the presence of certain words or phrases impacts performance.

# Using csvkit to Summarize Data: A Quick Example

As data analysts, we're frequently presented with comma-separated value files and tasked with reporting insights. While it's tempting to import that data directly into R or Python in order to perform data munging and exploratory data analysis, there are also a number of utilities to examine, fix, slice, transform, and summarize data through the command line.
In particular, csvkit is a suite of Python-based utilities for working with CSV files from the terminal. For this post, we will grab data using wget, subset rows containing a particular value, and summarize the data in different ways. The goal is to take data on criminal activity, group by a particular offense type, and develop counts to understand the frequency distribution.

Let's start by installing csvkit. Go to your command line and type in the following command.

$ pip install csvkit

One: Set the working directory.

$ cd /home/abraham/Blog/Chicago_Analysis

Two: Use the wget command to grab data and export it as a csv file entitled rows.csv.

$ wget --no-check-certificate --progress=dot -O rows.csv https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD

This dataset contains information on reported incidents of crime that occurred in the city of Chicago from 2001 to present. Data comes from the Chicago Police Department's Citizen Law Enforcement Analysis and Reporting system.

Three: Let’s check to see which files are now in the working directory and how many rows that file contains. We will also use the csvcut command to identify the names of each column within that file.

$ ls
$ wc -l rows.csv
$ csvcut -n rows.csv

Four: Using csvsql, let's find what unique values are in the sixth column of the file, primary type. Since we're interested in incidents of prostitution, those observations will be subset using the csvgrep command and transferred into a csv file entitled rows_pros.csv.

$ csvsql --query "SELECT [Primary Type], COUNT(*) FROM rows GROUP BY [Primary Type]" rows.csv | csvlook
$ csvgrep -c 6 -m PROSTITUTION rows.csv > rows_pros.csv

Five: Use csvlook and head to have a look at the first few rows of the new csv file. The 'Primary Type' column should only contain incidents of crime that involved prostitution.

$ wc -l rows_pros.csv
$ csvlook rows_pros.csv | head

Six: We've now got the data we need. So let's do a quick count of each description that is associated with the prostitution offense. This is done using the csvsql and csvlook command line tools.

$ csvsql --query "SELECT [Primary Type], Description, COUNT(*) FROM rows_pros GROUP BY Description" rows_pros.csv | csvlook

This has been a quick example of how the various csvkit utilities can be used to take a large csv file, extract specific observations, and generate summary statistics by executing a SQL query on that data. While this same analysis could have been performed in R or Python in a more efficient manner, it’s important for analysts to remember that the command line offers a variety of important utilities that can simplify their daily job responsibilities.

# Examining Website Pathing Data Using Markov Chains

A Markov model can be used to examine a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Let's define a stochastic process $(X_{n}, n = 0, 1, 2, \ldots)$ that takes on a finite number of possible values, which are nonnegative integers. Each state, $X_{n}$, represents its value in time period $n$. If the probability of being in $X_{n+1}$ depends only on $X_{n}$, the process is said to have the first-order Markov property. We are interested in estimating $P_{ij}$, the fixed probability that a process currently in state $i$ will next be in state $j$. The $n$-step transition probabilities are calculated through the Chapman-Kolmogorov equations, which relate the joint probability distributions of different sets of coordinates on a stochastic process. Markov chains are generally represented as a state diagram or a transition matrix, where every row of the matrix, $P$, is a conditional probability mass function.

Let's consider an example using website pathing data from an ecommerce website. The set of possible outcomes, or sample space, is defined below. For this example, $S_{i}$ takes the values of each page on the site. This likely violates the Markov property, since which page a visitor views next on an ecommerce website generally depends on more than just the current page, but let's proceed anyway.

$S = (Home, About, Shoes, Denim, Cart)$

Given a series of clickstreams, a Markov chain can be fit in order to predict the next page visited. Below are the state diagram and transition matrix for this data. It suggests that from the home state, there is an 83% probability that a visit to the shoes state will be next.
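The original post showed the fitted state diagram and transition matrix as images. As a rough sketch of how such a matrix can be estimated from raw clickstream data with base R (the page sequence below is made up for illustration, not the post's actual data):

```r
# Hypothetical clickstream: each element is one page view, in order
clicks <- c("Home", "Shoes", "Cart", "Home", "Shoes", "Denim",
            "Cart", "Home", "About", "Home", "Shoes", "Cart")

# Pair each state with its successor and count the transitions
trans <- table(head(clicks, -1), tail(clicks, -1))

# Convert counts to row-wise conditional probabilities:
# P[i, j] = Pr(next page = j | current page = i)
P <- prop.table(trans, margin = 1)
round(P, 2)

# Sanity check: every row of a transition matrix sums to one
rowSums(P)
```

Packages such as markovchain wrap this estimation step, but the counting itself is just a cross-tabulation of each state against its successor.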

To be honest, I'm not certain whether this is the best technique to model how consumers utilize a website. Markov chains seem promising, but I'm a little uneasy about whether their assumptions are met in such scenarios. If you have ideas for such problems, please comment below.

# Statistics Refresher

Let's face it, a good statistics refresher is always worthwhile. There are times we all forget basic concepts and calculations. Therefore, I put together a document that could act as a statistics refresher and thought that I'd share it with the world. This is part one of a two-part document that is still being completed. This refresher is based on Principles of Statistics by Bulmer and Statistics in Plain English by Brightman.

### The Two Concepts of Probability

#### Statistical Probability

• Statistical probability pertains to the relative frequency with which an event occurs in the long run.
• Example:
Let’s say we flip a coin twice. What is the probability of getting two heads?
If we flip a coin twice, there are four possible outcomes, $[(H,H), (H,T), (T,H), (T,T)]$.
Therefore, the probability of flipping two heads is $\frac{n(H,H)}{N} = \frac{1}{2}*\frac{1}{2} = \frac{1}{4}$
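The long-run-frequency interpretation can be checked with a quick simulation (a sketch; the number of replications is arbitrary):

```r
set.seed(1)
# Simulate many pairs of fair coin flips and record whether both landed heads
flips <- replicate(100000, all(sample(c("H", "T"), 2, replace = TRUE) == "H"))
mean(flips)  # close to the theoretical 1/4
```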

#### Inductive Probability

• Inductive probability pertains to the degree of belief which is reasonable to place on a proposition given evidence.
• Example:
I’m $95\%$ certain that the answer to $1 + 1$ is between $1.5$ and $2.5$.

### The Two Laws of Probability

#### Law of Addition

• If $A$ and $B$ are mutually exclusive events, the probability that either $A$ or $B$ will occur is equal to the sum of their separate probabilities.

$\displaystyle P(A \space or \space B) = P(A) + P(B)$

#### Law of Multiplication

• If $A$ and $B$ are two events, the probability that both $A$ and $B$ will occur is equal to the probability that $A$ will occur multiplied by the conditional probability that $B$ will occur given that $A$ has occurred.

$P(A \space and \space B) = P(A) * P(B|A)$

#### Conditional Probability

• The probability of $B$ given $A$, or $P(B|A)$, is the probability that $B$ will occur if we consider only those occasions on which $A$ also occurs. This is defined as $\frac{n(A \space and \space B)}{n(A)}$.

### Random Variables and Probability Distributions

#### Discrete Variables

• Variables which arise from counting and can only take integral values $(0, 1, 2, \ldots)$.
• A frequency distribution represents the number of occurrences of each possible value of a variable. This can be represented in a table or graphically as a probability distribution.
• Associated with any discrete random variable, $X$, is a corresponding probability function which tells us the probability with which $X$ takes any particular value $x$. This measure is the probability function and is denoted by $P(x)$.
• The cumulative probability function specifies the probability that $X$ is less than or equal to some particular value, $x$. This is denoted by $F(x)$. The cumulative probability function can be calculated by summing the probabilities of all values less than or equal to $x$.

$F(x) = Prob[X \leq x]$

$F(x) = P(0) + P(1) + \ldots + P(x) = \sum_{u \leq x} p(u)$
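As a concrete sketch, the cumulative probability function of a binomial random variable (say, the number of heads in five fair coin flips) can be built by summing the probability function, and it matches R's built-in cumulative distribution function:

```r
x <- 0:5
p <- dbinom(x, size = 5, prob = 0.5)  # probability function P(x)
F <- cumsum(p)                        # cumulative probability function F(x)
rbind(P = round(p, 3), F = round(F, 3))

# F(x) agrees with the built-in cumulative distribution function
all.equal(F, pbinom(x, size = 5, prob = 0.5))
```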

#### Continuous Variables

• Variables which arise from measuring and can take any value within a given range.
• Continuous variables are best graphically represented by a histogram, where the area of each rectangle represents the proportion of observations falling in that interval.
• The probability density function, $f(x)$, refers to the smooth continuous curve that is used to describe the relative likelihood that a random variable takes on a given value. The area under $f(x)$ between $x_1$ and $x_2$ gives the probability that the random variable will lie between $x_1$ and $x_2$.
• A continuous probability distribution can also be represented by its cumulative probability function, $F(x)$, which specifies the probability that $X$ is less than or equal to $x$.
• A continuous random variable is said to be uniformly distributed between $0$ and $1$ if it is equally likely to lie anywhere in this interval but cannot lie outside it.

#### Multivariate Distributions

• The joint frequency distribution of two random variables is called a bivariate distribution. $P(x,y)$ denotes the probability that simultaneously $X$ will be $x$ and $Y$ will be $y$. This is expressed through a bivariate distribution table.

$P(x,y) = Prob[X == x \space and \space Y == y]$

• In a bivariate distribution table, the right hand margin sums the probabilities in different rows. It expresses the overall probability distribution of $x$, regardless of the value of $y$.

$p(x) = Prob[X == x] = \sum_{y} p(x,y)$

• In a bivariate distribution table, the bottom margin sums the probabilities in different columns. It expresses the overall probability distribution of $y$, regardless of the value of $x$.

$p(y) = Prob[Y == y] = \sum_{x} p(x,y)$
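A small sketch with a made-up joint distribution table shows how these margins are computed in R:

```r
# Hypothetical bivariate distribution of X (rows) and Y (columns)
pxy <- matrix(c(0.10, 0.20,
                0.30, 0.40), nrow = 2, byrow = TRUE,
              dimnames = list(x = c("x1", "x2"), y = c("y1", "y2")))

rowSums(pxy)  # marginal distribution p(x), the right-hand margin
colSums(pxy)  # marginal distribution p(y), the bottom margin
sum(pxy)      # the whole table sums to one
```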

### Properties of Distributions

#### Measures of Central Tendency

• The mean is measured by taking the sum divided by the number of observations.

$\bar{x} = \frac{x_1 + x_2 + \ldots + x_n}{n} = \sum_{i=1}^n \frac{x_i}{n}$

• The median is the middle observation in a series of numbers. If the number of observations is even, the median is the average of the two middle observations.
• The mode refers to the most frequent observation.
• The main question of interest is whether the sample mean, median, or mode provides the most accurate estimate of central tendency within the population.

#### Measures of Dispersion

• The standard deviation of a set of observations is the square root of the average of the squared deviations from the mean. The average of the squared deviations from the mean is called the variance.
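Note that in R, the built-in var and sd functions use the sample (n-1) denominator, whereas the definition above averages over all n observations. A quick sketch with a small made-up vector:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

pop_var <- mean((x - mean(x))^2)  # average squared deviation from the mean
pop_sd  <- sqrt(pop_var)

pop_var  # 4
pop_sd   # 2

# R's var() and sd() divide by n - 1 instead; rescaling recovers the above
all.equal(var(x) * (n - 1) / n, pop_var)
```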

#### The Shape of Distributions

• Unimodal distributions have only one peak while multimodal distributions have several peaks.
• A distribution that is skewed to the right contains a few large values, which result in a long tail towards the right hand side of the chart.
• A distribution that is skewed to the left contains a few small values, which result in a long tail towards the left hand side of the chart.
• The kurtosis of a distribution refers to the degree of peakedness of a distribution.

### The Binomial, Poisson, and Exponential Distributions

#### Binomial Distribution

• Think of a repeated process with two possible outcomes, failure ($F$) and success ($S$). After repeating the experiment $n$ times, we will have a sequence of outcomes that includes both failures and successes, such as $SFFFSF$. The primary metric of interest is the total number of successes.
• What is the probability of obtaining $x$ successes and $n-x$ failures in $n$ repetitions of the experiment?

#### Poisson Distribution

• The Poisson distribution is the limiting form of the binomial distribution when there are a large number of trials but only a small probability of success at each of them.
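This limiting behavior is easy to verify numerically. With a large $n$ and small $p$ (chosen arbitrarily here), the binomial probabilities are closely approximated by Poisson probabilities with rate $\lambda = np$:

```r
n <- 1000; p <- 0.002           # many trials, small success probability
lambda <- n * p                 # Poisson rate, here 2

x <- 0:8
binom_probs <- dbinom(x, size = n, prob = p)
pois_probs  <- dpois(x, lambda = lambda)

round(cbind(x, binomial = binom_probs, poisson = pois_probs), 4)
max(abs(binom_probs - pois_probs))  # the discrepancy is tiny
```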

#### Exponential Distribution

• A continuous, positive random variable is said to follow an exponential distribution if its probability density function decreases as the values of $x$ go from $0$ to $\infty$. The probability declines from its highest levels at the initial values of $x$.

### The Normal Distribution

#### Properties of the Normal Distribution

• The real reason for the importance of the normal distribution lies in the central limit theorem, which states that the sum of a large number of independent random variables will be approximately normally distributed regardless of their individual distributions.
• A normal distribution is defined by its mean, $\mu$, and standard deviation, $\sigma$. A change in the mean shifts the distribution along the x-axis. A change in the standard deviation flattens it or compresses it while leaving its centre in the same position. The total area under the curve is one, and the mean is at the middle and divides the area into halves.
• One standard deviation above and below the mean of a normal distribution will include 68% of the observations for that variable. For two standard deviations, that value will be 95%, and for three standard deviations, that value will be about 99.7%.
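Both claims can be checked directly in R: pnorm gives the areas for the 68-95-99.7 rule, and summing many independent uniform variables produces an approximately normal shape (the choice of 40 uniforms and 10,000 replications is arbitrary):

```r
# Area within one, two, and three standard deviations of the mean
sds <- 1:3
round(pnorm(sds) - pnorm(-sds), 4)  # roughly 0.68, 0.95, 0.997

# Central limit theorem in action: sums of 40 uniforms look bell-shaped
set.seed(42)
sums <- replicate(10000, sum(runif(40)))
hist(sums, breaks = 50, main = "Sums of 40 uniform variables")
```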

There you have it, a quick review of basic concepts in statistics and probability. Please leave comments or suggestions below. If you’re looking to hire a marketing scientist, please contact me at mathewanalytics@gmail.com

# Introduction to the RMS Package

The rms package offers a variety of tools to build and evaluate regression models in R. Originally named ‘Design’, the package accompanies the book “Regression Modeling Strategies” by Frank Harrell, which is essential reading for anyone who works in the ‘data science’ space. Over the past year or so, I have transitioned my personal modeling scripts to rms as it makes things such as bootstrapping, model validation, and plotting predicted probabilities easier to do. While the package is fairly well documented, I wanted to put together a ‘simpler’ and more accessible introduction that would explain to R-beginners how they could start using the rms package. For those with limited statistics training, I strongly suggest reading “Clinical Prediction Models” and working your way up to “Regression Modeling Strategies”.

We start this introduction to the rms package with the datadist function, which computes statistical summaries of predictors to automate estimation and plotting of effects. The user will generally supply the final data frame to the datadist function and set the data distribution using the options function. Note that if you modify the data in your data frame, then you will need to reset the distribution summaries using datadist.


library(rms)

my.df = data.frame(one=c(rnorm(100)), two=c(rnorm(100)), y=rnorm(100))
ddd <- datadist(my.df)
options(datadist="ddd")

my.df_new = subset(my.df, two >= 0.05)
ddd <- datadist(my.df_new)
options(datadist="ddd")



The main functions to estimate models in rms are ols for linear models and lrm for logistic regression or ordinal logistic regression. There are also a few other functions for performing survival analysis, but they will not be covered in this post.


getHdata(prostate)  # the prostate dataset is downloaded via the Hmisc package



Using the prostate data, we build a linear model using ordinary least squares estimation. The arguments x and y must be set to TRUE if we plan on evaluating the model in later stages using the validate and calibrate functions. Because we haven't altered the default contrasts or incorporated any smoothing splines, the coefficients and standard errors should be identical to the results of lm. Use the model variable name to see the estimates from the model, and use the summary function to get an overview of the effects of each predictor on the response variable. One important thing to note is that the effect point estimates in the summary.rms output relate to the estimated effect of an inter-quartile-range increase in the predictor variable.


lmod = ols(wt ~ age + sbp + rx, data=prostate, x=TRUE, y=TRUE)
lmod
summary(lmod)
summary(lmod, age=c(50,70))



This may not seem like anything to write home about. But what makes the rms package special is that it makes the modeling process significantly easier. For the above linear regression model, let’s plot the predicted values and perform internal bootstrapped validation of the model. In the following code, the validate function is used to assess model fit and calibrate is used to assess if the observed responses agree with predicted responses.


plot(anova(lmod), what='proportion chisq') # relative importance
plot(Predict(lmod)) # predicted values
rms::validate(lmod, method="boot", B=500) # bootstrapped validation
my.calib <- rms::calibrate(lmod, method="boot", B=500) # model calibration
plot(my.calib, las=1)
vif(lmod) # test for multicollinearity
Predict(lmod)



Let us now build a logistic regression model using the lrm function, plot the predicted probabilities, and evaluate the model. We also use the pentrace function to perform logistic regression with penalized maximum likelihood estimation.


mod1 = lrm(as.factor(bm) ~ age + sbp + rx, data=prostate, x=TRUE, y=TRUE)
mod1
summary(mod1)

plot(anova(mod1), what='proportion chisq') # relative importance
plot(Predict(mod1, fun=plogis)) # predicted values
rms::validate(mod1, method="boot", B=500) # bootstrapped validation
my.calib <- rms::calibrate(mod1, method="boot", B=500) # model calibration
plot(my.calib, las=1)

penalty <- pentrace(mod1, penalty=c(0.5,1,2,3,4,6,8,12,16,24), maxit=25)
mod1_pen <- update(mod1, penalty=penalty$penalty)
effective.df(mod1_pen)
mod1_pen



There you have it, a very basic introduction to the rms package for beginners to the R programming language. Once again, I strongly suggest that readers who are not trained statisticians read and fully comprehend "Clinical Prediction Models" and "Regression Modeling Strategies" by Frank Harrell. You can also access a number of handouts and lecture notes here.