Tutorial description
Ultimately, data analysis is about understanding relationships among
variables. Exploring data with multiple variables requires new, more
complex tools, but enables a richer set of comparisons. In this
tutorial, you will learn how to describe relationships between two
numerical quantities. You will characterize these relationships
graphically, in the form of summary statistics, and through simple
linear regression models.
In this tutorial you’ll also take your skills with simple linear
regression to the next level. By learning multiple and logistic
regression techniques you will gain the skills to model and predict both
numeric and categorical outcomes using multiple input variables. You’ll
also learn how to fit, visualize, and interpret these models. Then
you’ll apply your skills to learn about Italian restaurants in New York
City!
Learning objectives
- Visualize, measure, and characterize bivariate relationships
- Fit, interpret, and assess simple linear regression models
- Measure and interpret model fit
- Identify and attend to the disparate impact that unusual data
observations may have on a regression model
- Compute with lm objects in R
- Compute and visualize model predictions
- Visualize, fit, interpret, and assess a variety of multiple
regression models, including those with interaction terms
- Visualize, fit, interpret, and assess logistic regression
models
- Understand the relationship between R modeling syntax and geometric
and mathematical specifications for models
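As a rough illustration of how R modeling syntax lines up with the model types named in these objectives, here is a minimal sketch. It assumes the built-in mtcars data as a stand-in for the tutorial's own data sets; the object names (cars, mod_slr, and so on) are made up for this example.

```r
# Assumption: mtcars (shipped with R) stands in for the tutorial's data.
cars <- mtcars
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

# Simple linear regression: a line, mpg = b0 + b1 * wt
mod_slr <- lm(mpg ~ wt, data = cars)

# Parallel slopes: one numeric and one categorical explanatory variable,
# giving two lines with a common slope and different intercepts
mod_parallel <- lm(mpg ~ wt + am, data = cars)

# Interaction: the slope on wt may differ by transmission type;
# wt * am expands to wt + am + wt:am
mod_interact <- lm(mpg ~ wt * am, data = cars)

# Two numeric explanatory variables: a plane instead of a line
mod_plane <- lm(mpg ~ wt + hp, data = cars)

# Logistic regression: a binary outcome modeled on the log-odds scale
mod_logit <- glm(vs ~ wt, data = cars, family = binomial)

# Each term in the formula corresponds to one or more fitted coefficients
coef(mod_parallel)
coef(mod_interact)
```

Reading the formula on the right-hand side of each call next to the names returned by coef() is one way to connect the syntax to the mathematical form of the model.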
Lessons
- Quantify the strength of a linear relationship
- Compute and interpret the Pearson product-moment correlation
- Identify spurious correlations
- Interpret the meaning of coefficients in a regression model
- Understand the impact of units and scales
- Work with lm objects in R
- Make predictions from regression models
- Overlay a regression model on a scatterplot
- Assess the quality of fit of a regression model
- Interpret \(R^2\)
- Measure leverage and influence
- Identify and attend to outliers
- Visualize, fit, and interpret a parallel slopes model, which has one
numeric and one categorical explanatory variable
- Describe a model in three different ways: mathematically,
graphically, and through R syntax
- Assess and interpret model fit
- Compute residuals and predictions
- Fit and interpret interaction models
- Recognize Simpson’s paradox
- Visualize, fit, and interpret a multiple regression model with two
numeric explanatory variables
- Visualize, fit, and interpret a parallel planes model with two
numeric explanatory variables and a categorical variable
- Fit and interpret multiple regression models in higher
dimensions
- Understand and identify multicollinearity
- Visualize, fit, and interpret logistic regression models
- Interpret coefficients on three different scales
- Make predictions from a logistic regression model
- Explore the relationship between price and the quality of food,
service, and decor for Italian restaurants in New York
City
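Several of the lesson topics above can be previewed with base R alone. The sketch below is only an illustration: it assumes the built-in mtcars data rather than the restaurant data used in the tutorial, and new_cars is a made-up data frame for prediction. Reading the "three different scales" for logistic regression as log-odds, odds, and probability is also an assumption.

```r
# Assumption: mtcars (shipped with R) stands in for the tutorial's data.

# Strength of a linear relationship: Pearson correlation
r <- cor(mtcars$wt, mtcars$mpg)

# Fit a simple linear regression and work with the resulting lm object
mod <- lm(mpg ~ wt, data = mtcars)
coef(mod)                              # intercept and slope
r_squared <- summary(mod)$r.squared    # equals r^2 in simple regression

# Predictions for new observations (hypothetical weights)
new_cars <- data.frame(wt = c(2.5, 3.5))
predict(mod, newdata = new_cars)

# Leverage and influence of each observation
lev <- hatvalues(mod)
infl <- cooks.distance(mod)
head(sort(infl, decreasing = TRUE), 3)  # most influential observations

# Logistic regression predictions on three scales
mod_logit <- glm(vs ~ wt, data = mtcars, family = binomial)
log_odds <- predict(mod_logit, newdata = new_cars)  # default: link scale
prob <- predict(mod_logit, newdata = new_cars, type = "response")
odds <- prob / (1 - prob)
```

The three logistic quantities describe the same fitted model: odds is exp(log_odds), and prob is odds / (1 + odds).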
Instructor
Benjamin S. Baumer
Smith College
Benjamin S. Baumer is an associate professor in the Statistical & Data
Sciences program at Smith College. He has been a practicing data
scientist since 2004, when he became the first full-time statistical
analyst for the New York Mets. Ben is a co-author of The Sabermetric
Revolution, Modern Data Science with R, and the second edition of
Analyzing Baseball Data with R. He received his Ph.D. in Mathematics
from the City University of New York in 2012 and is accredited as a
professional statistician by the American Statistical Association. His
research interests include sports analytics, data science, statistics
and data science education, statistical computing, and network
science.
Ben won the Waller Education Award from the ASA Section on Statistics
and Data Science Education and the Significant Contributor Award from
the ASA Section on Statistics in Sports in 2019. He shared the 2016
Contemporary Baseball Analysis Award from the Society for American
Baseball Research. Currently, Ben is the principal investigator on a
three-year, nine-institution, $1.2 million award from the National
Science Foundation for workforce development under the Data Science
Corps program.