In this chapter, we are going to talk about two (continuous) numerical variables:

- y: the dependent (response) variable.
- x: the independent (explanatory) variable, also called the predictor.
Throughout the chapter, we’ll practice on several datasets from the openintro package, listed below.

- ncbirths: a random sample of 1,000 cases taken from a larger dataset collected in 2004, describing the birth of a single child born in North Carolina along with various characteristics of the child.
- mammals: contains information about 62 species of mammals, including their body weight, brain weight, gestation time, etc.
- mlbBat10: contains batting statistics for 1,199 Major League Baseball players during the 2010 season.
- bdims: contains body girth and skeletal diameter measurements for 507 physically active individuals.
- smoking: contains information on the smoking habits of 1,691 citizens of the United Kingdom.

Using the ncbirths dataset, make a scatterplot using ggplot() to illustrate how the birth weight of these babies varies according to the number of weeks of gestation.
library(ggplot2)
library(openintro)
library(dplyr)
# Use geom_point() to make a scatterplot.
ggplot(ncbirths, aes(x = weeks, y = weight)) +
geom_point()
# Use cut() to break a continuous variable into discrete bins.
ggplot(data = ncbirths,
aes(x = cut(weeks, breaks = 5), y = weight)) +
geom_boxplot()
In the scatterplot, there seems to be a positive relationship between weeks and weight.
However, after using cut() to discretize the continuous variable and drawing boxplots, the relationship no longer looks strictly linear.
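To back up the boxplots with numbers, here is a small supplementary sketch (it reuses cut(weeks, breaks = 5) from above; the column name weeks_bin is just an illustrative choice):

# Median birth weight within each of the five gestation bins,
# mirroring the boxplot above (rows with missing weeks are dropped).
ncbirths %>%
  filter(!is.na(weeks)) %>%
  group_by(weeks_bin = cut(weeks, breaks = 5)) %>%
  summarize(N = n(), median_weight = median(weight, na.rm = TRUE))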
Using the mammals
dataset, create a scatterplot illustrating how the brain weight of a mammal varies as a function of its body weight.
# Scatterplot with original scale.
ggplot(mammals, aes(x = BodyWt, y = BrainWt)) +
geom_point()
# Scatterplot with coord_trans().
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
geom_point() +
coord_trans(x = "log10", y = "log10")
# Scatterplot with scale_x_log10() and scale_y_log10().
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
geom_point() +
scale_x_log10() + scale_y_log10()
Because of extreme values in this dataset, it’s hard to see the relationship between BodyWt and BrainWt.
Once we put these variables on a log scale, the relationship becomes much clearer.

- coord_trans(): a coord_*() function that puts the plot on a log10 scale.
- scale_x_log10() and scale_y_log10() (or equivalently scale_x_continuous(trans = "log10")): scale_*() functions that do the same thing but with different labels on the axes.

Using the mlbBat10 dataset, create a scatterplot illustrating how the slugging percentage (SLG) of a player varies as a function of his on-base percentage (OBP).
# Scatterplot with outliers.
ggplot(mlbBat10, aes(x = OBP, y = SLG)) +
geom_point()
# Remove the outliers by keeping only players with at least 200 at-bats.
mlbBat10 %>%
filter(AB >= 200) %>%
ggplot(aes(x = OBP, y = SLG)) +
geom_point()
In the first scatterplot, most of the points were clustered in the lower left corner of the plot, making it difficult to see the general pattern of the majority of the data. This difficulty was caused by a few outlying players whose on-base percentages (OBPs) were exceptionally high.
After removing these outliers, we can finally see the general pattern in the data. We’ll discuss the influence of outliers later.
The correlation coefficient is always a number between -1 and 1.

- cor(x, y) computes the Pearson product-moment correlation between the variables x and y; the order of the arguments doesn’t matter.
- By default, cor() returns NA if there are missing values.
- Set use = "pairwise.complete.obs" when there are missing values in the dataset, and cor() will ignore those values.

ncbirths %>%
summarize(N = n(), r = cor(weight, weeks, use = "pairwise.complete.obs"))
## N r
## 1 1000 0.6701013
# Correlation among mammals, with and without log
mammals %>%
summarize(N = n(),
r = cor(BodyWt, BrainWt),
r_log = cor(log(BodyWt), log(BrainWt)))
## N r r_log
## 1 62 0.9341638 0.9595748
In 1973, Francis Anscombe famously created four datasets with remarkably similar numerical properties but obviously different graphical relationships.
For the code that tidies the built-in anscombe data into the Anscombe data frame used here, see the RMD source of this course.
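If you don’t want to dig through the RMD, here is one possible way to do that reshaping (a sketch using tidyr; the course’s own code may differ):

library(tidyr)
# Reshape anscombe (columns x1..x4, y1..y4) into one row per (set, x, y).
Anscombe <- anscombe %>%
  pivot_longer(everything(),
               names_to = c(".value", "set"),
               names_pattern = "(.)(.)") %>%
  mutate(set = as.integer(set)) %>%
  arrange(set)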
# Plot Anscombe, faceting the panels by set.
ggplot(data = Anscombe, aes(x = x, y = y)) +
geom_point() +
facet_wrap(~ set)
Anscombe %>%
group_by(set) %>%
summarize(N = n(), mean(x), sd(x), mean(y), sd(y), cor(x, y))
## # A tibble: 4 x 7
## set N `mean(x)` `sd(x)` `mean(y)` `sd(y)` `cor(x, y)`
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 11 9 3.32 7.50 2.03 0.816
## 2 2 11 9 3.32 7.50 2.03 0.816
## 3 3 11 9 3.32 7.5 2.03 0.816
## 4 4 11 9 3.32 7.50 2.03 0.817
Although their plots look totally different, all of the measures are identical (ignoring rounding error) across the four different sets.
In this section, we want to find the ‘best fit’ line using a method called linear regression.
The simple linear regression model for a numeric response as a function of a numeric explanatory variable can be visualized on the corresponding scatterplot by a straight line. This is a “best fit” line that cuts through the data in a way that minimizes the distance between the line and the data points.
response = f(explanatory) + noise
response = intercept + (slope * explanatory) + noise
In this section we’ll use least squares as the method of fitting the linear regression.

- Residual: the difference between an observed value (y) and the expected (fitted) value (y-hat).
- The point (mean(x), mean(y)) is always on the least squares regression line.
Fortunately, we don’t need to calculate those values ourselves.
In ggplot2 we can simply call geom_smooth(method = 'lm') to draw the ‘best fit’ line in a blink; method = 'lm' means a linear model is used.
# Scatterplot with regression line using geom_smooth()
ggplot(data = bdims, aes(x = hgt, y = wgt)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE)
Besides using geom_smooth() to get the ‘best fit’ line, we can construct it manually.
If we choose to calculate it ourselves, we first need to know how a regression line is determined.
Y = m ⋅ X + k
m = r(X,Y) ⋅ SD(Y)/SD(X)
We can compute the slope (m) and intercept (k) of a simple linear regression model from some basic summary statistics.

- r(X,Y) represents the correlation of X and Y. Use cor(x, y) to compute the correlation coefficient.
- SD(X) and SD(Y) represent the standard deviations of X and Y. Use sd() to compute a standard deviation.
- The point (mean(X), mean(Y)) is always on the least squares regression line, which gives the intercept: k = mean(Y) - m * mean(X).
- Use geom_abline(slope = *, intercept = *) to add a line to the plot manually.

# Calculate summary statistics
to add a line on the plot manually.# Calculate summary statistics
bdims_summary <- summarize(bdims, N = n(), r = cor(hgt, wgt),
mean_hgt = mean(hgt), sd_hgt = sd(hgt),
mean_wgt = mean(wgt), sd_wgt = sd(wgt))
bdims_summary <- mutate(bdims_summary, slope = r*sd_wgt/sd_hgt,
intercept = mean_wgt - r*sd_wgt/sd_hgt*mean_hgt)
bdims_summary
## N r mean_hgt sd_hgt mean_wgt sd_wgt slope intercept
## 1 507 0.7173011 171.1438 9.407205 69.14753 13.34576 1.017617 -105.0113
# Scatterplot with regression line using geom_abline()
ggplot(data = bdims, aes(x = hgt, y = wgt)) +
geom_point() +
geom_abline(slope = as.numeric(bdims_summary$slope), intercept = as.numeric(bdims_summary$intercept))
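As a quick sanity check (a sketch), the manually constructed line does pass through (mean(hgt), mean(wgt)), as promised by the least squares property above:

# Plugging mean(hgt) into the manual line should return mean(wgt) (about 69.15).
with(bdims_summary, intercept + slope * mean_hgt)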
Regression to the mean is a concept attributed to Sir Francis Galton.
The basic idea is that extreme random observations will tend to be less extreme upon a second trial.
library(HistData)
# Height of children vs. parent
ggplot(data = Galton, aes(x = parent, y = child)) +
geom_jitter(alpha = 0.3) +
# Add a diagonal line with slope = 1
geom_abline(slope = 1, intercept = 0) +
geom_smooth(method = 'lm', se = FALSE)
Because the slope of the regression line is smaller than 1 (the slope of the diagonal line), we can verify Sir Francis Galton’s regression to the mean concept!
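We can confirm this numerically (a quick sketch):

# The coefficient on parent (the slope) comes out around 0.65,
# comfortably below 1, consistent with regression to the mean.
coef(lm(child ~ parent, data = Galton))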
“Regression to the mean is so powerful that once-in-a-generation talent basically never sires once-in-a-generation talent. It explains why Michael Jordan’s sons were middling college basketball players and Jakob Dylan wrote two good songs. It is why there are no American parent-child pairs among Hall of Fame players in any major professional sports league.”
The author is arguing that because of regression to the mean, an outstanding basketball player is likely to have sons that are good at basketball, but not as good as him.
While geom_smooth(method = "lm") is useful for drawing linear models on a scatterplot, it doesn’t actually return the characteristics of the model. The lm() function, on the other hand, does!

lm(Response(y) ~ Explanatory(x), data = *)

- (y ~ x): a formula that specifies the model.
- data: the argument for the data frame that contains the data you want to use to fit the model.
- lm() returns a model object of class "lm".
- coef(model): displays the coefficients of the model.
- summary(model): displays the full regression output of the model.

Residuals = Observations - Expected

- Observations are the values in the dataset.
- Expected are the fitted values from the model.
- fitted.values(model): returns the expected (fitted) values of the model.
- residuals(model): returns the residuals of the model.
- augment(model): returns a data frame containing the data on which the model was fit, along with several quantities specific to the regression model. augment() comes from the broom package.

# Linear model for weight as a function of height
mod <- lm(wgt ~ hgt, data = bdims)
mod
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Coefficients:
## (Intercept) hgt
## -105.011 1.018
Recall that a general linear model looks like this:
Y = m ⋅ X + k
- (Intercept): refers to the intercept (k) of the model.
- hgt: refers to the slope (m) of the model in this example.

# Show the coefficients
coef(mod)
## (Intercept) hgt
## -105.011254 1.017617
# Show the full output
summary(mod)
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.743 -6.402 -1.231 5.059 41.103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -105.01125 7.53941 -13.93 <2e-16 ***
## hgt 1.01762 0.04399 23.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared: 0.5145, Adjusted R-squared: 0.5136
## F-statistic: 535.2 on 1 and 505 DF, p-value: < 2.2e-16
# Mean of weights equal to mean of fitted values?
mean(bdims$wgt) == mean(fitted.values(mod))
## [1] TRUE
Not surprisingly, mean(observations) equals mean(fitted.values); if it didn’t, why would they be called the values “fitted” to the model?
# Mean of the residuals
mean(residuals(mod))
## [1] -1.266971e-15
The least squares fitting procedure guarantees that the mean of the residuals is zero (numerical instability may result in the computed values not being exactly zero).
library(broom)
# Create bdims_tidy
bdims_tidy <- augment(mod)
# Glimpse the resulting data frame
glimpse(bdims_tidy)
## Observations: 507
## Variables: 9
## $ wgt <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62....
## $ hgt <dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 18...
## $ .fitted <dbl> 72.05406, 73.37697, 91.89759, 84.77427, 85.48661, 7...
## $ .se.fit <dbl> 0.4320546, 0.4520060, 1.0667332, 0.7919264, 0.81834...
## $ .resid <dbl> -6.4540648, -1.5769666, -11.1975919, -12.1742745, -...
## $ .hat <dbl> 0.002154570, 0.002358152, 0.013133942, 0.007238576,...
## $ .sigma <dbl> 9.312824, 9.317005, 9.303732, 9.301360, 9.312471, 9...
## $ .cooksd <dbl> 5.201807e-04, 3.400330e-05, 9.758463e-03, 6.282074e...
## $ .std.resid <dbl> -0.69413418, -0.16961994, -1.21098084, -1.31269063,...
Once we have fit the model, we can compute expected values for observations that were not present in the data on which the model was fit.
predict(model, newdata = *)
ben <- data.frame('wgt' = 74.8, 'hgt' = 182.8)
predict(mod, newdata = ben)
## 1
## 81.00909
According to this model, Ben should weigh about 81 kg.
However, he weighs only 74.8 kg, so the residual here equals 74.8 - 81 = -6.2.
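We can also compute that residual directly instead of by hand (a small sketch):

# Observed minus predicted: Ben's residual, about -6.21 kg.
ben$wgt - predict(mod, newdata = ben)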
Here we’re going to use geom_abline()
again.
Now it’s time to combine what we’ve learned so far.
ggplot(data = bdims, aes(x = hgt, y = wgt)) +
geom_point() +
geom_abline(intercept = coef(mod)[1], slope = coef(mod)[2], color = "dodgerblue")
After fitting a model to our data, how can we assess its quality?
One way to assess strength of fit is to consider how far off the model is for a typical case. That is, for some observations, the fitted value will be very close to the actual value, while for others it will not. The magnitude of a typical residual gives us a sense of generally how close our estimates are. When fitting a linear model, what we are actually doing is finding the line with the smallest sum of squared deviations.
For each observation, which is represented by a point on the scatterplot,
the residual is simply the vertical distance between that point and the line.
In this example, we’ve highlighted residuals with gray arrows.
If we can find a line that makes the sum of residuals ^ 2 smaller, it is a better model.
We cannot simply minimize the plain sum of residuals, because in the ‘best fit’ model that sum is always zero: the positive and negative residuals offset each other when added up.
In fact, we’ve already seen this before: mean(residuals) is (numerically) zero.
Conventionally, the sum of squared deviations is called the Sum of Squared Errors (SSE).
It can easily be computed after applying augment() to our model.
SSE = sum(.resid ^2) = (n()-1) * var(.resid)
augment(mod) %>%
summarize(SSE = (n()-1) * var(.resid),
SSE_also = sum(.resid ^2))
## # A tibble: 1 x 2
## SSE SSE_also
## <dbl> <dbl>
## 1 43753. 43753.
SSE is a single number that captures how much our model missed by.
Unfortunately, it’s hard to interpret, since its units have been squared.
Besides SSE, another common measure of the accuracy of a model is RMSE.
RMSE is basically the standard deviation of the residuals.
RMSE = sqrt(sum(.resid ^ 2) / d.f.) = sqrt(SSE / d.f.)
summary(mod)
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.743 -6.402 -1.231 5.059 41.103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -105.01125 7.53941 -13.93 <2e-16 ***
## hgt 1.01762 0.04399 23.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared: 0.5145, Adjusted R-squared: 0.5136
## F-statistic: 535.2 on 1 and 505 DF, p-value: < 2.2e-16
Using summary(mod), we can easily get the value of RMSE, which is labeled Residual standard error in the summary.
Conveniently, RMSE is expressed in the units of the response.
In this case, RMSE = 9.30804, which means:
The typical difference between the observed wgt and the wgt predicted by the model is about 9.30804 kg.
We can also calculate RMSE manually using the formula above.
To get the degrees of freedom, we can use df.residual(model). The meaning of degrees of freedom will not be explained here.
# Compute RMSE manually
sqrt(sum(residuals(mod)^2) / df.residual(mod))
## [1] 9.30804
If you had to make a prediction without any fitted model, what would your prediction be?
For most people, the answer is the mean, and that is the Null Model.
y-hat(Expected) = mean(y)
# A Linear Model of bdims (same as before)
mod_smooth <- lm(wgt ~ hgt, data = bdims)
mod_smooth %>%
augment() %>%
summarise(SSE = sum(.resid ^ 2))
## # A tibble: 1 x 1
## SSE
## <dbl>
## 1 43753.
# A Null Model of bdims
mod_null <- lm(wgt ~ 1, data = bdims)
mod_null %>%
augment() %>%
summarise(SSE = sum(.resid ^ 2))
## # A tibble: 1 x 1
## SSE
## <dbl>
## 1 90123.
Since the SSE of mod_smooth is much smaller than that of mod_null, mod_smooth is a better fit to bdims.
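As a side check (a sketch), the null model really does predict the mean of wgt for every observation, matching the formula above:

# Every fitted value of the null model equals mean(wgt).
all.equal(unname(fitted.values(mod_null)),
          rep(mean(bdims$wgt), nrow(bdims)))
## [1] TRUE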
A related measure of the quality of a fit is the coefficient of determination, R^2:
R^2 = 1 - (SSE/SST) = 1 - var(e)/var(y)
For a simple linear regression, the squared correlation between x and y equals R^2:
r^2(x,y) = R^2
R^2 is also stored in summary(model). Let’s have a look again.
summary(mod)
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.743 -6.402 -1.231 5.059 41.103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -105.01125 7.53941 -13.93 <2e-16 ***
## hgt 1.01762 0.04399 23.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared: 0.5145, Adjusted R-squared: 0.5136
## F-statistic: 535.2 on 1 and 505 DF, p-value: < 2.2e-16
R^2 is labeled ‘Multiple R-squared’ in the summary.
In this case, we can say that about 51% of the variability in wgt can be explained by hgt.
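We can recover that R-squared by hand from the two SSE values above, and also from the correlation (a sketch):

# R^2 = 1 - SSE/SST, where SST is the SSE of the null model.
SSE <- sum(residuals(mod)^2)
SST <- sum(residuals(mod_null)^2)
1 - SSE / SST          # about 0.5145, matching Multiple R-squared
# Equivalently, the squared correlation between hgt and wgt:
cor(bdims$hgt, bdims$wgt)^2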
You don’t need to understand the formula for leverage in detail, but you should recognize that:

- The leverage of an observation depends only on its value of the explanatory variable. Leverage is computed by augment() and named .hat.
- As noted previously, observations of high leverage may or may not be influential.
- The influence of an observation depends on both its leverage and the size of its residual, i.e. the difference between the observed value (y) and the fitted value (y-hat).
- Influence is measured by Cook’s distance, which is also computed by augment() and named .cooksd.

Let’s take a look at both leverage and influence.
lm(formula = SLG ~ OBP, data = filter(mlbBat10, AB >= 10)) %>%
augment() %>%
arrange(desc(.hat), .cooksd) %>%
select(SLG, OBP, .fitted, .hat, .cooksd) %>%
head()
## # A tibble: 6 x 5
## SLG OBP .fitted .hat .cooksd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 -0.0374 0.0194 0.00277
## 2 0 0 -0.0374 0.0194 0.00277
## 3 0 0 -0.0374 0.0194 0.00277
## 4 0.308 0.55 0.690 0.0164 0.243
## 5 0 0.037 0.0115 0.0150 0.000202
## 6 0.038 0.038 0.0128 0.0149 0.000953
Observations can be outliers for a number of different reasons.
Statisticians must always be careful—and more importantly, transparent—when dealing with outliers.
Sometimes, a better model fit can be achieved by simply removing outliers and re-fitting the model.
However, one must have strong justification for doing this.
A desire to have a higher R^2 is not a good enough reason!
We’re looking for the best model, not merely the best-fitting model!
Here’s a simple example that removes ‘Bobby Scales’ from mlbBat10, but it’s just an illustration, since there is nothing to suggest that his observation is not a valid data point.
# Create nontrivial_players
nontrivial_players <- mlbBat10 %>%
filter(AB >= 10 & OBP < 0.5)
# Fit model to new data
mod_cleaner <- lm(SLG ~ OBP, data = nontrivial_players)
# View model summary
summary(mod_cleaner)
##
## Call:
## lm(formula = SLG ~ OBP, data = nontrivial_players)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31383 -0.04165 -0.00261 0.03992 0.35819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.043326 0.009823 -4.411 1.18e-05 ***
## OBP 1.345816 0.033012 40.768 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07011 on 734 degrees of freedom
## Multiple R-squared: 0.6937, Adjusted R-squared: 0.6932
## F-statistic: 1662 on 1 and 734 DF, p-value: < 2.2e-16
# Visualize new model
ggplot(nontrivial_players, aes(x = OBP, y = SLG)) +
geom_point() +
geom_smooth(method = 'lm')