For those of us who received statistical training outside of statistics departments, it often emphasized procedures over principles. This entailed that we learned about various statistical techniques and how to perform analysis in a particular statistical software, but glossed over the mechanisms and mathematical statistics underlying these practices. While that training methodology (hereby referred to as the ‘heuristic method’) has value, it has many drawbacks when the ultimate goal is to perform sound statistical analysis that is valid and thorough. Even in my current role as a data scientist at a technology company in the San Francisco Bay Area, I have had to go back and understand various procedures and metrics instead of just “doing data analysis”.
Given this realization, I have dedicated hours of time outside of work over the last couple years to “re-training” myself on many of the important concepts in both descriptive and inferential statistics. This post will give brief mention to the books that have been most useful is helping me develop a fuller understanding of the statistical sciences. These books have also helped me fill in the gaps and deficiencies from my statistical training in university and graduate school. Furthermore, these are the texts that I often revisit when I need a reference on some statistical topic of interest. This is at a minimum a a six year journey, so I have a long way to go until I am able to stand solidly in my understanding of statistics. While I am sacrificing a lot of my free time to this undertaking, it will certainly improve my knowledge and help prepare me for graduate school (PhD) in biostatistics, which I hope to attend in around five to six years.
Please note that I am not taking issue with the ‘heuristic method’ of statistical training. It certainly has its place and provides students with the immediate knowledge required to satisfactorily prepare for work in private industry. In fact, I prefer the ‘heuristic method’ and still rely on straight forward rules in my day to day work as that ensures that best practices are followed and satisfactory analysis is performed. Furthermore, I certainly believe that it is superior to the hack-ey nature of data mining and data science education, but that is a different story.
Statistics in Plain English – Urdan
Clear, concise, and covers all the fundamental items that one would need to know. Everything from descriptive statistics to linear regression are covered, with many good examples. Even if you never use ANOVA or factor analysis, this is a good book to review and one that I strongly recommend to people who are interested in data science.
Principles of Statistics – Balmer
This is a classic text that offers a good treatment of probability theory, distributions, and statistical inference. The text contains a bit more math than ‘Statistics in Plain English’, so I think it should be read after completing the previous book.
Fundamentals of Modern Statistical Methods – Wilcox
This book reviews ‘traditional’ parametric statistics and provides a good overview of robust statistical methods. There is a fair amount on the historical evolution of various techniques, and I found that a bit unnecessary. But overall, this is still a solid introductory text to learn about statistical inference using robust techniques.
Mostly Harmless Econometrics – Angrist
While I don’t regularly work with instrumental variables, generalized methods of moments, or regression discontinuity, this book is a great high level introduction to econometrics. The chapters on regression theory and quantile regression are phenomenal.
Regression Modeling Strategies – Harrell
This is my most referenced book and the one that really helped in my overall development as an applied statistician. All the important topics are covered, from imputation, regression splines, and so forth. This book includes R code for performing analysis using the RMS package. I end up citing this book quite a lot. For example, in a recent work email, I mentioned that Harrell “also says on page 61 that “narrowly distributed predictor variables will require higher sample sizes.”” Essential reading in my opinion.
Data Analysis Using Regression and Multilevel/Hierarchical Models – Gelman and Hill
The first half of this book cover statistical inference using single level models and the second half is dedicated to multilevel methods. Given that I am rarely work with panel data, I use the first half of this book a reference for things that I may need a quick refresher on. It is very accessible and has plenty of examples with R code.
Semiparametric Regression for the Social Sciences – Keele
This is one of my favorite statistical books. Well written and easy to comprehend, but still rigorous. Covers local regression, splines, and generalized additive models. There is also a solid chapter on the use of bootstrapping with semiparametric and nonparametric models.
Statistical Learning from a Regression Perspective – Berk
As a skeptic who is wary of every hype machine, I really enjoyed Berks preface in which he discusses the “dizzying array of new statistical procedures” that have been introduced over the past several decades with “the hype of a big-budget movie.” I got this text for its treatment of topics such as boosting, bagging, random forest, and support vector machines. I will probably need to reread this book several more times before I fully comprehend everything.
Time Series: a Biostatistical Introduction – Diggle
The lack of quality time series books is really infuriating. Don’t get me wrong, there are some good texts on forecasting, such as the free online book from Hyndaman. However, I’ve yet to find a really good intermediate level treatment of time series analysis besides this one. Contains good coverage of repeated measurements, ARIMA modeling, and forecasting.
Statistical Rethinking – McElreath
While I was introduced to robust techniques and nonparametric statistics in graduate school, there was nothing on Bayesian methods. Due to a fear of the topic, I avoided learning about it until this past year. This book by McElreath has been great as it is very accessible and provides code for understanding various principles. Over the next year, I am hoping to dive deeper into Bayesian techniques and this was a good first step.