What they forgot to teach you in your introduction to linear regression with R
June 29, 2025
There exists a handful of functions in base-R that are closely related to the typical introductory linear regression curriculum. More often than not, I feel that these functions are not exposed to students in any introductory linear regression course where R is used for the computational component. For students who are taking or have recently finished an introductory linear regression course, if there were any instances where you came across some code that made you think, "there must be an easier, cleaner way to do this", you were probably right!
There are a number of reasons why I have wanted to make this post for a while:
- The functions that I will discuss in this post can be extended to other contexts, such as GLMs, with one particular function being applicable to even more general contexts.
- It encourages you early on to write clean code and manage your workspace effectively. When you come across code that feels busy, inefficient, or that clutters up your workspace with unnecessary transitory variables, take the initiative to seek out cleaner alternatives.
- Follow the philosophy of "work smarter, not harder".
- I have met a lot of students who are just starting to use R and already dread using/learning it. What they don't know is that this is often the result of inefficient use of the language by instructors who barely know it themselves (something I have seen more often than I would have hoped). Therefore, I hope this post gives you an ounce of hope that R is not as messy/ugly as you might have been shown, and that you do not become a hateR.
Setup
We will be using the {dplyr} package for a tiny amount of data wrangling. The same wrangling can be done in base-R, if that is your preference.
library(dplyr)
If you do not have {dplyr} installed, you can install it by calling install.packages("dplyr")
in the console.
For the following demonstrations, we will be using the mtcars data set. This data set is included
in base-R.
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
For our base linear model, we will use mpg as the response and disp as a predictor.
base_model <- lm(mpg ~ disp, data=mtcars)
summary(base_model)
##
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8922 -2.2022 -0.9631 1.6272 7.2305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.599855 1.229720 24.070 < 2e-16 ***
## disp -0.041215 0.004712 -8.747 9.38e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared: 0.7183, Adjusted R-squared: 0.709
## F-statistic: 76.51 on 1 and 30 DF, p-value: 9.38e-10
update()
The update() function allows you to add or modify arguments used to construct a linear model and
subsequently re-fit the model using these new values. This can include modifying the response
variable, the covariates, the data set used, or even other options used to fit the model. There are
two benefits to using update() rather than building a new model from scratch.
- You specify only the arguments whose values should change. As such, all other aspects of the existing model that should remain unchanged do not need to be re-typed.
- Since the update() function requires the user to specify an object to update, from a workflow perspective, the code reveals the evolutionary relationship between models in your workspace by highlighting what is changing between model builds, and provides a "story" for the model-building process.
Suppose we wanted to rebuild our base_model using a subset of the mtcars data set that includes
only four- and eight-cylinder cars.
mtcars_4or8cyl <- mtcars %>%
filter(cyl %in% c(4, 8))
# Or in base-R
# mtcars_4or8cyl <- mtcars[mtcars$cyl %in% c(4, 8),]
Instead of building a linear model from scratch again by calling:
second_model <- lm(mpg ~ disp, data=mtcars_4or8cyl)
We can call:
second_model <- update(base_model, data=mtcars_4or8cyl)
If we wanted to update the formula instead of the data set, we can take advantage of the
.-shorthand, which means to keep everything from the respective side of the formula.
third_model <- update(base_model, . ~ . + hp)
Since the formula supplied to base_model was mpg ~ disp, . ~ . + hp will translate to
mpg ~ disp + hp.
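The shorthand works on either side of the formula. As a sketch (these particular calls are illustrative and are not used in the analysis that follows), we can transform the response or remove a term while keeping everything else as-is:

```r
# Log-transform the response, keeping the right-hand side unchanged
log_model <- update(base_model, log(mpg) ~ .)

# Remove a term with the same shorthand
back_to_base <- update(third_model, . ~ . - hp)
```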
add1() / drop1()
The add1() and drop1() functions perform the hypothesis tests required to add or drop a covariate
from a set of candidate covariates. Too often have I seen the
following workflow recommended to students for performing a single variable addition:
model4a <- lm(mpg ~ disp + hp, data=mtcars)
model4b <- lm(mpg ~ disp + drat, data=mtcars)
model4c <- lm(mpg ~ disp + wt, data=mtcars)
model4d <- lm(mpg ~ disp + qsec, data=mtcars)
anova(model4a, base_model)
anova(model4b, base_model)
anova(model4c, base_model) # Has the smallest p-value
anova(model4d, base_model)
model4 <- model4c
You do not need to do all this work! In addition to being a lot of typing (or copying, pasting, and replacing, which is error-prone), you also end up cluttering your workspace with four additional models, three of which you will never use again. After deciding on a fourth model, if you had decided you wanted to proceed with adding a third covariate, you'd continue cluttering up your workspace with even more models!
The following is all you need:
add1(base_model, ~ . + hp + drat + wt + qsec, test="F")
## Single term additions
##
## Model:
## mpg ~ disp
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 317.16 77.397
## hp 1 33.665 283.49 75.806 3.4438 0.073679 .
## drat 1 14.263 302.90 77.925 1.3655 0.252097
## wt 1 70.476 246.68 71.356 8.2852 0.007431 **
## qsec 1 3.622 313.54 79.030 0.3350 0.567196
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This single line of code builds the required models and returns the p-value of the corresponding
F-test for each candidate. From the output, we can see that wt has the smallest p-value and should be added to the
model. We can then use the update() function to create the new model.
fourth_model <- update(base_model, . ~ . + wt)
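If we wanted to continue the forward selection, we could call add1() again on the new model with the remaining candidates. A sketch of the next iteration:

```r
# Next round of forward selection: hp, drat, and qsec remain as candidates
add1(fourth_model, ~ . + hp + drat + qsec, test="F")
```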
For single variable deletions, a similar procedure can be followed using drop1(). For this, I
build what I consider to be a "full model", which has all the covariates of interest: disp, hp,
drat, wt, and qsec.
full_model <- lm(mpg ~ disp + hp + drat + wt + qsec, data=mtcars)
If all covariates are eligible to be dropped, we can call:
drop1(full_model, ~ ., test="F")
## Single term deletions
##
## Model:
## mpg ~ disp + hp + drat + wt + qsec
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 170.13 65.466
## disp 1 3.974 174.10 64.205 0.6074 0.442812
## hp 1 11.886 182.01 65.627 1.8165 0.189360
## drat 1 15.506 185.63 66.258 2.3697 0.135791
## wt 1 81.394 251.52 75.978 12.4390 0.001584 **
## qsec 1 12.708 182.84 65.772 1.9422 0.175233
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If we only want to consider dropping one of either, for example, wt or qsec, we can call:
drop1(full_model, ~ wt + qsec, test="F")
## Single term deletions
##
## Model:
## mpg ~ disp + hp + drat + wt + qsec
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 170.13 65.466
## wt 1 81.394 251.52 75.978 12.4390 0.001584 **
## qsec 1 12.708 182.84 65.772 1.9422 0.175233
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As before, we can finalise the model by making a call to update():
fifth_model <- update(full_model, . ~ . - qsec)
Optional tidy extensions
The
{broom} package from
{tidymodels} contains tools for tidying model output into
tibbles (enhanced data frames) for when you require further (possibly programmatic) wrangling of
model output.
library(broom)
When calling summary() on a linear model, a lot of information is returned:
summary(full_model)
##
## Call:
## lm(formula = mpg ~ disp + hp + drat + wt + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5404 -1.6701 -0.4264 1.1320 5.4996
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.53357 10.96423 1.508 0.14362
## disp 0.00872 0.01119 0.779 0.44281
## hp -0.02060 0.01528 -1.348 0.18936
## drat 2.01578 1.30946 1.539 0.13579
## wt -4.38546 1.24343 -3.527 0.00158 **
## qsec 0.64015 0.45934 1.394 0.17523
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.558 on 26 degrees of freedom
## Multiple R-squared: 0.8489, Adjusted R-squared: 0.8199
## F-statistic: 29.22 on 5 and 26 DF, p-value: 6.892e-10
This information can be returned as two separate tibbles via tidy() and glance().
tidy()
We can call tidy() to obtain the coefficient table, standard errors, values of the test
statistics, and p-values.
tidy(full_model)
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 16.5 11.0 1.51 0.144
## 2 disp 0.00872 0.0112 0.779 0.443
## 3 hp -0.0206 0.0153 -1.35 0.189
## 4 drat 2.02 1.31 1.54 0.136
## 5 wt -4.39 1.24 -3.53 0.00158
## 6 qsec 0.640 0.459 1.39 0.175
Optionally, with tidy(), we can also obtain the confidence intervals:
tidy(full_model, conf.int=TRUE)
## # A tibble: 6 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 16.5 11.0 1.51 0.144 -6.00 39.1
## 2 disp 0.00872 0.0112 0.779 0.443 -0.0143 0.0317
## 3 hp -0.0206 0.0153 -1.35 0.189 -0.0520 0.0108
## 4 drat 2.02 1.31 1.54 0.136 -0.676 4.71
## 5 wt -4.39 1.24 -3.53 0.00158 -6.94 -1.83
## 6 qsec 0.640 0.459 1.39 0.175 -0.304 1.58
This is useful since base-R's confint() function actually returns a matrix:
confint(full_model)
## 2.5 % 97.5 %
## (Intercept) -6.00372864 39.07086783
## disp -0.01427933 0.03171968
## hp -0.05201291 0.01081675
## drat -0.67585793 4.70740704
## wt -6.94137515 -1.82955263
## qsec -0.30404593 1.58434573
class(confint(full_model))
## [1] "matrix" "array"
Compared to base-R's confint(x), tidy(x, conf.int = TRUE) saves us the steps of converting the
matrix of confidence intervals to a data frame (or tibble) and then binding the results
back to the coefficient table so that they can be viewed together.
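For comparison, a rough base-R equivalent of tidy(x, conf.int = TRUE) might look something like the following sketch (the column names will differ from {broom}'s):

```r
# Coefficient table and confidence intervals, bound together manually
coef_table <- as.data.frame(coef(summary(full_model)))
ci <- as.data.frame(confint(full_model))
cbind(term = rownames(coef_table), coef_table, ci)
```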
glance()
To obtain other miscellaneous attributes of the fitted linear model that we usually see in the
output of summary(), we can pass it to glance():
glance(full_model)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.849 0.820 2.56 29.2 6.89e-10 5 -72.1 158. 169.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
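Because glance() returns a one-row tibble, it is easy to stack the summaries of several models for comparison. A sketch using the models built above:

```r
# Stack one-row model summaries; .id labels each row with its list name
bind_rows(
  base = glance(base_model),
  full = glance(full_model),
  .id = "model"
) %>%
  select(model, r.squared, adj.r.squared, AIC, BIC)
```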
augment()
The augment() function is a quick way to get the fitted values column-bound to the data frame
used to fit the model, i.e. you get all covariate values plus the fitted values, plus additional
diagnostic values.
augment(full_model)
## # A tibble: 32 × 13
## .rownames mpg disp hp drat wt qsec .fitted .resid .hat .sigma
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 160 110 3.9 2.62 16.5 22.6 -1.57 0.148 2.59
## 2 Mazda RX4 W… 21 160 110 3.9 2.88 17.0 21.8 -0.812 0.131 2.60
## 3 Datsun 710 22.8 108 93 3.85 2.32 18.6 25.1 -2.26 0.0674 2.57
## 4 Hornet 4 Dr… 21.4 258 110 3.08 3.22 19.4 21.1 0.329 0.143 2.61
## 5 Hornet Spor… 18.7 360 175 3.15 3.44 17.0 18.2 0.473 0.175 2.61
## 6 Valiant 18.1 225 105 2.76 3.46 20.2 19.7 -1.57 0.219 2.58
## 7 Duster 360 14.3 360 245 3.21 3.57 15.8 15.6 -1.28 0.153 2.59
## 8 Merc 240D 24.4 147. 62 3.69 3.19 20 22.8 1.61 0.129 2.59
## 9 Merc 230 22.8 141. 95 3.92 3.15 22.9 24.6 -1.75 0.488 2.56
## 10 Merc 280 19.2 168. 123 3.92 3.44 18.3 20.0 -0.792 0.147 2.60
## # ℹ 22 more rows
## # ℹ 2 more variables: .cooksd <dbl>, .std.resid <dbl>
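The augmented tibble makes residual diagnostics straightforward. A sketch using base graphics (the 4/n threshold for Cook's distance is one common rule of thumb, not a hard cutoff):

```r
diagnostics <- augment(full_model)

# Residuals vs fitted values
plot(diagnostics$.fitted, diagnostics$.resid,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Flag observations with large Cook's distance
diagnostics$.rownames[diagnostics$.cooksd > 4 / nrow(diagnostics)]
```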
You can also obtain confidence or prediction intervals for each observation:
augment(full_model, interval="confidence")
## # A tibble: 32 × 15
## .rownames mpg disp hp drat wt qsec .fitted .lower .upper .resid
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 160 110 3.9 2.62 16.5 22.6 20.5 24.6 -1.57
## 2 Mazda RX4 W… 21 160 110 3.9 2.88 17.0 21.8 19.9 23.7 -0.812
## 3 Datsun 710 22.8 108 93 3.85 2.32 18.6 25.1 23.7 26.4 -2.26
## 4 Hornet 4 Dr… 21.4 258 110 3.08 3.22 19.4 21.1 19.1 23.1 0.329
## 5 Hornet Spor… 18.7 360 175 3.15 3.44 17.0 18.2 16.0 20.4 0.473
## 6 Valiant 18.1 225 105 2.76 3.46 20.2 19.7 17.2 22.1 -1.57
## 7 Duster 360 14.3 360 245 3.21 3.57 15.8 15.6 13.5 17.6 -1.28
## 8 Merc 240D 24.4 147. 62 3.69 3.19 20 22.8 20.9 24.7 1.61
## 9 Merc 230 22.8 141. 95 3.92 3.15 22.9 24.6 20.9 28.2 -1.75
## 10 Merc 280 19.2 168. 123 3.92 3.44 18.3 20.0 18.0 22.0 -0.792
## # ℹ 22 more rows
## # ℹ 4 more variables: .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
Finally, you can also use augment() to obtain fitted values for unseen data, column-bound to that
data, by passing the new data to the newdata argument:
fake_cars <- tibble(
mpg = c(26, 25),
disp = c(170, 265),
hp = c(100, 105),
drat = c(2.7, 3.11),
wt = c(3.10, 2.98),
qsec = c(19.1, 20.3)
)
augment(full_model, newdata=fake_cars)
## # A tibble: 2 × 8
## mpg disp hp drat wt qsec .fitted .resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 26 170 100 2.7 3.1 19.1 20.0 5.97
## 2 25 265 105 3.11 2.98 20.3 22.9 2.12
If we know the true values of the response variable (in this case, mpg), we can subsequently
compare the mpg values with the .fitted values to evaluate predictive performance via the
yardstick package. For example,
we might do something like:
more_unseen_data <- tibble(
. . .
)
augment(full_model, newdata=more_unseen_data) %>%
yardstick::rmse(truth=mpg, estimate=.fitted)
Supplementary notes
- update(), add1(), and drop1() also work with GLMs!
- update() is a generic method. It works with lm objects, glm objects, and any other objects that have a call element. An update() method can often be found in other modelling packages for use with those package-specific models to perform updates in a similar fashion.
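As a quick illustration of the first note, the same workflow carries over to GLMs. A sketch (this particular model is illustrative, not meant to be meaningful):

```r
# A logistic regression for the transmission type (am)
glm_model <- glm(am ~ disp, data = mtcars, family = binomial)

# Candidate single-term additions; likelihood-ratio tests suit GLMs
add1(glm_model, ~ . + hp + wt, test = "LRT")

# Re-fit with an added term, exactly as before
glm_model2 <- update(glm_model, . ~ . + wt)
```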