The rms package offers a variety of tools to build and evaluate regression models in R. Originally named ‘Design’, the package accompanies the book “Regression Modeling Strategies” by Frank Harrell, which is essential reading for anyone who works in the ‘data science’ space. Over the past year or so, I have transitioned my personal modeling scripts to rms as it makes things such as bootstrapping, model validation, and plotting predicted probabilities easier to do. While the package is fairly well documented, I wanted to put together a ‘simpler’ and more accessible introduction that would explain to R-beginners how they could start using the rms package. For those with limited statistics training, I strongly suggest reading “Clinical Prediction Models” and working your way up to “Regression Modeling Strategies”. We start this introduction to the rms package with the datadist function, which computes statistical summaries of predictors to automate estimation and plotting of effects. The user will generally supply the final data

# R

## Batch Forecasting in R

Given a data frame with multiple columns that contain time series data, let’s say that we are interested in executing an automatic forecasting algorithm on a number of columns. Furthermore, we want to train the model on a particular number of observations and assess how well it forecasts future values. Based upon those testing procedures, we will estimate the full model. This is a fairly simple undertaking, but let’s walk through this task. My preference for such procedures is to loop through each column and append the results into a nested list.

First, let’s create some data.

    ddat <- data.frame(date = seq(as.Date("2010/01/01"), as.Date("2010/03/02"), by = 1),
                       value1 = abs(round(rnorm(61), 2)),
                       value2 = abs(round(rnorm(61), 2)),
                       value3 = abs(round(rnorm(61), 2)))
    head(ddat)
    tail(ddat)

We want to forecast future values of the three columns. Because we want to save the results of these models into a list, let’s begin by creating a list that contains the same number of elements as our data frame.

    lst.names <-
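The loop-and-nested-list approach described in this excerpt could look roughly like the sketch below. It is only an illustration under several assumptions: the 50-observation training window, the names `n_train`, `series`, and `results`, and the use of a base R AR(1) fit from `arima()` as a stand-in for the automatic forecasting algorithm are all choices made here, not taken from the original post.

```r
set.seed(42)  # reproducible illustrative data

ddat <- data.frame(
  date   = seq(as.Date("2010-01-01"), as.Date("2010-03-02"), by = 1),
  value1 = abs(round(rnorm(61), 2)),
  value2 = abs(round(rnorm(61), 2)),
  value3 = abs(round(rnorm(61), 2))
)

n_train <- 50                                  # assumed size of the training window
series  <- setdiff(names(ddat), "date")        # columns to forecast

# One list element per series; each element is itself a list of results.
results <- setNames(vector("list", length(series)), series)

for (nm in series) {
  train <- ddat[[nm]][seq_len(n_train)]
  test  <- ddat[[nm]][-seq_len(n_train)]

  # Stand-in model: AR(1) with a mean term (an assumption, not the post's algorithm).
  fit_train <- arima(train, order = c(1, 0, 0), method = "ML")
  fc <- as.numeric(predict(fit_train, n.ahead = length(test))$pred)

  results[[nm]] <- list(
    holdout_forecast = fc,
    holdout_mae      = mean(abs(test - fc)),   # accuracy on the held-out values
    full_model       = arima(ddat[[nm]], order = c(1, 0, 0), method = "ML")
  )
}

# Holdout error per series
mae <- sapply(results, function(r) r$holdout_mae)
print(mae)
```

Keeping the holdout forecast, its error, and the full-data model together under each column name makes it easy to inspect one series at a time, for example `results$value2$full_model`.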

## Statistical Reading Rainbow

For those of us who received statistical training outside of statistics departments, that training often emphasized procedures over principles. We learned about various statistical techniques and how to perform analysis in a particular statistical software, but glossed over the mechanisms and mathematical statistics underlying these practices. While that training methodology (hereafter referred to as the ‘heuristic method’) has value, it has many drawbacks when the ultimate goal is to perform sound statistical analysis that is valid and thorough. Even in my current role as a data scientist at a technology company in the San Francisco Bay Area, I have had to go back and understand various procedures and metrics instead of just “doing data analysis”. Given this realization, I have dedicated hours of time outside of work over the last couple of years to “re-training” myself on many of the important concepts in both descriptive and inferential statistics. This post will give brief mention to the books that have

## Weekly R-Tips: Visualizing Predictions

Let’s say that we estimated a linear regression model on time series data with lagged predictors. The goal is to estimate sales as a function of inventory, search volume, and media spend from two months ago. After using the lm function to perform linear regression, we predict sales using values from two months ago. If this model is estimated weekly or monthly, we will eventually want to understand how well our model did in predicting actual sales from month to month. To perform this task, we must regularly maintain a spreadsheet or data structure (RDS object) with actual and predicted sales figures for each time period. That data can be used to create line graphs that compare the actual and predicted values. Here is what the original spreadsheet looked like. Transform that data into long format using whatever package you prefer. This will provide a data frame with three columns. We can utilize the ggplot2 package to create visualizations. Above
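A rough sketch of the reshape-and-plot steps described in this excerpt follows. The figures are made up for illustration, and the column names `month`, `actual`, `predicted`, `series`, and `sales` are hypothetical. The reshape uses base R, and the plot is only drawn if ggplot2 happens to be installed.

```r
# Made-up wide-format data: one row per month, one column per series.
sales_wide <- data.frame(
  month     = seq(as.Date("2015-01-01"), by = "month", length.out = 6),
  actual    = c(120, 135, 128, 150, 162, 158),
  predicted = c(118, 130, 133, 145, 155, 163)
)

# Long format: three columns (time period, series label, value).
sales_long <- data.frame(
  month  = rep(sales_wide$month, times = 2),
  series = rep(c("actual", "predicted"), each = nrow(sales_wide)),
  sales  = c(sales_wide$actual, sales_wide$predicted)
)

# Line graph of actual versus predicted values, one line per series.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(sales_long,
                       ggplot2::aes(x = month, y = sales, colour = series)) +
    ggplot2::geom_line() +
    ggplot2::labs(x = "Month", y = "Sales", colour = NULL)
  print(p)
}
```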

## Weekly R-Tips: Importing Packages and User Inputs

Number 1: Importing Multiple Packages

Anyone who has used R for some time has written code that required the use of multiple packages. In most cases, this will be done by using the library or require function to bring in the appropriate extensions. That’s nice and gets the desired result, but can’t we just import all the packages we need in one or two lines? Yes, we can, and here is the one line of code to do that.

Number 2: User Input

One side project that I hope to start on is a process whereby I can interact with R and select options that will result in particular outcomes. For example, let’s say you’re trying to put together a script that manages a weekly list. A good first step would be a list of options that the user would see and be prompted to select an option. Here is how R can be used to get user input in
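The original snippets are not included in this excerpt, so the following is only one plausible way to do both tips in base R. The package vector uses packages that ship with R so that the example runs anywhere, and the option labels and the helper `ask_choice()` are hypothetical.

```r
# Tip 1: attach several packages in one line.
pkgs <- c("stats", "utils", "tools")            # swap in the packages you need
invisible(lapply(pkgs, library, character.only = TRUE))

# Tip 2: show a list of options and ask the user to pick one.
options_list <- c("Add item", "Remove item", "Show list")

ask_choice <- function(choices) {
  # menu() only works in an interactive session, so return NA otherwise.
  if (!interactive()) return(NA_integer_)
  menu(choices, title = "Select an option:")
}

choice <- ask_choice(options_list)
if (!is.na(choice) && choice > 0) {
  cat("You selected:", options_list[choice], "\n")
}
```

In an interactive session `menu()` returns the index of the selected option, or 0 if the user exits without choosing.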

## Applied Statistical Theory: Quantile Regression

This is part two of the ‘applied statistical theory’ series that will cover the bare essentials of various statistical techniques. As analysts, we need to know enough about what we’re doing to be dangerous and explain approaches to others. It’s not enough to say “I used X because the misclassification rate was low.” Standard linear regression summarizes the average relationship between a set of predictors and the response variable. $\beta$ represents the change in the mean value of $y$ given a one unit change in $x$. A single slope is used to describe the relationship. Therefore, linear regression only provides a partial view of the link between the response variable and predictors. This is often inadequate when there is heterogeneous variance between $x$ and $y$. In such cases, we need to examine how the relationship between $x$ and $y$ changes depending on the value of $y$. For example, the impact of education on income may be more pronounced for those at higher income levels than
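To make the idea of quantile-specific slopes concrete, here is a small base R sketch on simulated data. It fits each quantile by numerically minimizing the check (pinball) loss with `optim()`, which is a teaching shortcut assumed here rather than the method a dedicated quantile regression package would use. The data-generating process and the names `check_loss` and `fit_quantile` are hypothetical.

```r
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
y <- 1 + 2 * x + x * rnorm(n)     # spread of y grows with x (heterogeneous variance)

# Check loss for a line with intercept beta[1] and slope beta[2] at quantile tau.
check_loss <- function(beta, tau) {
  u <- y - (beta[1] + beta[2] * x)
  sum(u * (tau - (u < 0)))
}

# Minimize the check loss, starting from the ordinary least squares estimates.
fit_quantile <- function(tau) {
  start <- coef(lm(y ~ x))
  optim(start, check_loss, tau = tau)$par
}

taus   <- c(0.1, 0.5, 0.9)
fits   <- sapply(taus, fit_quantile)   # 2 x 3 matrix: intercepts and slopes
slopes <- fits[2, ]

print(rbind(tau = taus, slope = round(slopes, 2)))
```

With data like these the slope should increase with the quantile, which is exactly the pattern a single mean regression slope hides.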

## Applied Statistical Theory: Belief Networks

Applied statistical theory is a new series that will cover the basic methodology and framework behind various statistical procedures. As analysts, we need to know enough about what we’re doing to be dangerous and explain approaches to others. It’s not enough to say “I used X because the misclassification rate was low.” At the same time, we don’t need to have doctoral level understanding of approach X. I’m hoping that these posts will provide a simple, succinct middle ground for understanding various statistical techniques. Probabilistic graphical models represent the conditional dependencies between random variables through a graph structure. Nodes correspond to random variables and edges represent statistical dependencies between the variables. Two variables are said to be conditionally dependent if they have a direct impact on each other’s values. Therefore, a graph with directed edges from parent to child denotes a causal relationship. Two variables are conditionally independent if the link between those variables is conditional on another variable. For a
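Conditional independence can be checked numerically on a tiny network. The sketch below assumes a three-node chain A → B → C with binary variables and made-up probabilities; none of the numbers come from the original post. It builds the joint distribution from the parent-to-child tables and verifies that A and C are dependent on their own but independent once B is known.

```r
# Hypothetical probability tables for the chain A -> B -> C (states "0" and "1").
states      <- c("0", "1")
p_a         <- c(0.3, 0.7)                            # P(A)
p_b_given_a <- rbind(c(0.9, 0.1), c(0.2, 0.8))        # rows: A, columns: B
p_c_given_b <- rbind(c(0.7, 0.3), c(0.1, 0.9))        # rows: B, columns: C

# Joint distribution P(A, B, C) = P(A) * P(B | A) * P(C | B).
joint <- array(0, dim = c(2, 2, 2),
               dimnames = list(A = states, B = states, C = states))
for (a in 1:2) for (b in 1:2) for (k in 1:2) {
  joint[a, b, k] <- p_a[a] * p_b_given_a[a, b] * p_c_given_b[b, k]
}

# Marginally, A and C are dependent: P(A, C) differs from P(A) * P(C).
p_ac         <- apply(joint, c(1, 3), sum)
marginal_gap <- max(abs(p_ac - outer(rowSums(p_ac), colSums(p_ac))))

# Given B, they are independent: P(A, C | B) equals P(A | B) * P(C | B).
conditional_gap <- sapply(1:2, function(b) {
  p_ac_b <- joint[, b, ] / sum(joint[, b, ])
  max(abs(p_ac_b - outer(rowSums(p_ac_b), colSums(p_ac_b))))
})

print(marginal_gap)      # clearly above zero
print(conditional_gap)   # zero up to floating point error
```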

## Basic Forecasting

Forecasting refers to the process of using statistical procedures to predict future values of a time series based on historical trends. For businesses, being able to gauge expected outcomes for a given time period is essential for managing marketing, planning, and finances. For example, an advertising agency may want to use sales forecasts to identify which future months may require increased marketing expenditures. Companies may also use forecasts to identify which salespeople met their expected targets for a fiscal quarter. There are a number of techniques that can be utilized to generate quantitative forecasts. Some methods are fairly simple while others are more robust and incorporate exogenous factors. Regardless of what is utilized, the first step should always be to visualize the data using a line graph. You want to consider how the metric changes over time, whether there is a distinct trend, or if there are distinct patterns that are noteworthy. There are several key concepts that we should
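As a small illustration of that first step, the sketch below draws a line graph of a monthly series and overlays a 12-month moving average to make the trend easier to see. It uses the `AirPassengers` dataset that ships with R purely as a convenient example; the post itself does not say which data it uses, and the moving average is an addition assumed here.

```r
series <- AirPassengers   # built-in monthly series, used only as example data

# Line graph of the raw series.
plot(series,
     main = "Monthly series with 12-month moving average",
     xlab = "Year", ylab = "Monthly total")

# 12-month moving average to highlight the long-run trend.
trend <- stats::filter(series, rep(1 / 12, 12), sides = 2)
lines(trend, col = "red", lwd = 2)
```

The raw line shows the repeating seasonal pattern, while the smoothed line isolates the upward trend.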

## A Few Days of Python: Using R in Python

Using R Functions in Python

## Logistic Regression in R – Part Two

My previous post covered the basics of logistic regression. We must now examine the model to understand how well it fits the data and generalizes to other observations. The evaluation process involves the assessment of three distinct areas – goodness of fit, tests of individual predictors, and validation of predicted values – in order to produce the most useful model. While the following content isn’t exhaustive, it should provide a compact ‘cheat sheet’ and guide for the modeling process.

Goodness of Fit: Likelihood Ratio Test

A logistic regression is said to provide a better fit to the data if it demonstrates an improvement over a model with fewer predictors. This is assessed by comparing the likelihood of the data under the full model against the likelihood of the data under a model with fewer predictors. The null hypothesis, $H_0$, holds that the reduced model is true, so a p-value for the overall model fit statistic that is less than the chosen significance level $\alpha$ would compel us to reject $H_0$.
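A likelihood ratio test of this kind can be sketched in base R as follows. The data are simulated and the variable names (`x1`, `x2`, `full`, `reduced`) are hypothetical. The test statistic is the drop in deviance from the reduced model to the full model, compared against a chi-squared distribution with degrees of freedom equal to the number of extra parameters.

```r
set.seed(123)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(0.5 + 1 * x1))   # x2 has no true effect

full    <- glm(y ~ x1 + x2, family = binomial)   # full model
reduced <- glm(y ~ 1,       family = binomial)   # intercept-only model

# Likelihood ratio statistic: difference in deviances (-2 * log-likelihood).
lr_stat <- deviance(reduced) - deviance(full)
lr_df   <- df.residual(reduced) - df.residual(full)
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

cat("LR statistic:", round(lr_stat, 2), "on", lr_df, "df; p-value:", p_value, "\n")

# The same comparison using the built-in helper.
lr_table <- anova(reduced, full, test = "Chisq")
print(lr_table)
```

A p-value below the chosen significance level leads us to reject the reduced model in favor of the full one.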