Logistic Regression in R – Part One

Logistic regression is used to analyze the relationship between a dichotomous dependent variable and one or more categorical or continuous independent variables. It specifies the likelihood of the response variable as a function of various predictors. The model is expressed as $latex \log(odds) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n $, where $latex \beta_i$ refers to the parameters and $latex x_i$ represents the independent variables. The $latex \log(odds)$, or log of the odds, is defined as $latex \ln\left[\frac{p}{1-p}\right]$. It expresses the natural logarithm of the ratio of the probability that an event will occur, $latex p(Y=1)$, to the probability that it will not occur, $latex p(Y=0)$.
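
To make the definition concrete, here is a small sketch in base R that converts a probability to odds and log-odds and back again. The probability of 0.8 is an arbitrary value chosen for illustration, not something taken from the data used later in this post.

```r
# Convert a probability to odds and log-odds, then back again.
p <- 0.8                      # illustrative probability that Y = 1 (assumed value)
odds <- p / (1 - p)           # 4, up to floating-point rounding
log_odds <- log(odds)         # natural log of the odds (the logit)

# Base R's qlogis() and plogis() are the logit and inverse-logit functions.
stopifnot(isTRUE(all.equal(log_odds, qlogis(p))))
p_back <- plogis(log_odds)    # recovers the original probability

print(c(p = p, odds = odds, log_odds = log_odds, p_back = p_back))
```

An odds value of 4 means the event is four times as likely to occur as not to occur.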

The model's estimates, $latex \beta_i$, express the relationship between the independent and dependent variables on a log-odds scale. A coefficient of $latex 0.020$ would indicate that a one-unit increase in $latex x_i$ is associated with an increase of $latex 0.020$ in the log-odds of $latex Y$ occurring. To get a clearer understanding of the constant effect of a predictor on the likelihood that an outcome will occur, odds ratios can be calculated. Exponentiating the model gives $latex odds(Y) = \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n) $, so $latex \exp(\beta_i)$ is the odds ratio associated with a one-unit increase in $latex x_i$, holding the other predictors fixed. Alongside the odds ratios, it’s often worth calculating predicted probabilities of $latex Y$ at specific values of key predictors. This is done through $latex p = \frac{1}{1 + \exp(-z)} $, where $latex z$ refers to the $latex \log(odds)$ regression equation.
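
As a quick worked example of both conversions, the base R sketch below turns the coefficient of 0.020 into an odds ratio and then computes predicted probabilities from a linear predictor. The intercept and the values of x are made up purely for illustration; only the 0.020 coefficient comes from the text above.

```r
beta_i <- 0.020                 # the example coefficient from the text
odds_ratio <- exp(beta_i)       # multiplicative change in the odds per one-unit increase in x_i
print(round(odds_ratio, 4))     # 1.0202, i.e. about a 2% increase in the odds

# Hypothetical single-predictor model: the intercept and x values are assumptions.
beta_0 <- -1.5
x <- c(20, 40, 60)

z <- beta_0 + beta_i * x        # linear predictor (log-odds)
p <- 1 / (1 + exp(-z))          # inverse logit; same as plogis(z)

print(data.frame(x = x, log_odds = z, probability = p))
```

The odds ratio is the same for every one-unit step in x, but the change in probability depends on where along the curve you start, which is why it helps to report both.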

Using the GermanCredit dataset from the caret package, we will construct a logistic regression model to estimate the likelihood that a consumer is a good loan applicant based on a number of predictor variables.

library(caret)
data(GermanCredit)

# split the data into training (60%) and testing (40%) datasets
set.seed(123) # arbitrary seed so the random split is reproducible
Train <- createDataPartition(GermanCredit$Class, p=0.6, list=FALSE)
training <- GermanCredit[ Train, ]
testing <- GermanCredit[ -Train, ]

# use glm to train the model on the training dataset; make sure to set family to "binomial"
# Class is a factor with levels "Bad" and "Good", so the model estimates the probability of "Good"
mod_fit_one <- glm(Class ~ Age + ForeignWorker + Property.RealEstate + Housing.Own +
                   CreditHistory.Critical, data=training, family="binomial")

summary(mod_fit_one) # estimates (log-odds scale)
exp(coef(mod_fit_one)) # odds ratios
predict(mod_fit_one, newdata=testing, type="response") # predicted probabilities
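
The call above predicts for every row of the testing set. To get predicted probabilities at specific values of key predictors, as discussed earlier, one option is to pass a small hand-built data frame to predict(). The sketch below is one way to do that; the three applicant profiles are hypothetical, and the model is refit on the full dataset (under the name fit, which is not used elsewhere in this post) only so the snippet runs on its own.

```r
library(caret)   # provides the GermanCredit data
data(GermanCredit)

# Refit the same predictors on the full data so this sketch stands alone
fit <- glm(Class ~ Age + ForeignWorker + Property.RealEstate + Housing.Own +
             CreditHistory.Critical,
           data = GermanCredit, family = "binomial")

# Hypothetical applicant profiles: vary Age, hold the 0/1 dummy variables fixed
profiles <- data.frame(
  Age = c(25, 45, 65),
  ForeignWorker = 1,
  Property.RealEstate = 0,
  Housing.Own = 1,
  CreditHistory.Critical = 0
)

# type = "response" returns the probability of the second factor level ("Good")
profiles$p_good <- predict(fit, newdata = profiles, type = "response")
print(profiles)

# The same numbers by hand: inverse logit of the linear predictor
z <- predict(fit, newdata = profiles, type = "link")
stopifnot(isTRUE(all.equal(plogis(z), profiles$p_good, check.attributes = FALSE)))
```

Holding the other predictors fixed while varying one makes it easier to see how the predicted probability moves with that predictor alone.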

Great, we’re all done, right? Not just yet. There are some critical questions that still remain. Is the model any good? How well does the model fit the data? Which predictors are most important? Are the predictions accurate? In the next post, I’ll provide an overview of how to evaluate logistic regression models in R.

9 thoughts on “Logistic Regression in R – Part One”

  1. Pingback: Logistic Regression in R – Part One | Mubashir Qasim

  2. Hello,
    I installed the caret package and its dependencies. I then ran your code as it appears. After “exp(coef(mod_fit$finalModel)) # odds ratios” I received the error “Error in coef(mod_fit$finalModel) : object ‘mod_fit’ not found”.

    1. Try exp(coef(mod_fit_one))

      The original code was for getting the odds ratios from a model trained using the caret package, and I had called it mod_fit. With that package, you have to specify model$finalModel.

      Thanks for reading and catching that error.

  3. This is a great post, are you planning on comparing the 400 testing data against the 400 values predicted by the model and showing them on the same graphic to see how close the predicted are from the real values? If so, that would be awesome.
