Visualizing Bivariate Relationships

In this chapter, we are going to talk about relationships between two (continuous) numerical variables.


Throughout the chapter, we'll practice on several datasets from the openintro package, listed in the examples below.


Example1: ncbirths

Using the ncbirths dataset, make a scatterplot using ggplot() to illustrate how the birth weight of these babies varies according to the number of weeks of gestation.

library(ggplot2)
library(openintro)
library(dplyr)
# Use geom_point() to make a scatterplot.
ggplot(ncbirths, aes(x = weeks, y = weight)) +
  geom_point()

# Use cut() to break a continuous variable into a discrete one.
ggplot(data = ncbirths, 
       aes(x = cut(weeks, breaks = 5), y = weight)) + 
  geom_boxplot()

In the scatterplot, there seems to be a positive relationship between weeks and weight.
However, after discretizing weeks with cut() and drawing boxplots, the relationship no longer looks strictly linear.


Example2: mammals

Using the mammals dataset, create a scatterplot illustrating how the brain weight of a mammal varies as a function of its body weight.

# Scatterplot with original scale.
ggplot(mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point()

# Scatterplot with coord_trans().
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point() + 
  coord_trans(x = "log10", y = "log10")

# Scatterplot with scale_x_log10() and scale_y_log10().
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point() +
  scale_x_log10() + scale_y_log10()

Because of extreme values in this dataset, it's hard to see the relationship between BodyWt and BrainWt on the original scale.
Putting both variables on a log scale makes the relationship much clearer.


Example3: mlbBat10

Using the mlbBat10 dataset, create a scatterplot illustrating how the slugging percentage (SLG) of a player varies as a function of his on-base percentage (OBP).

# Scatterplot with outliers.
ggplot(mlbBat10, aes(x = OBP, y = SLG)) +
  geom_point()

# Keep only players with at least 200 at-bats, removing those outliers.
mlbBat10 %>%
  filter(AB >= 200) %>%
  ggplot(aes(x = OBP, y = SLG)) +
  geom_point()

In the first scatterplot, most of the points were clustered in the lower left corner of the plot, making it difficult to see the general pattern of the majority of the data. This difficulty was caused by a few outlying players whose on-base percentages (OBPs) were exceptionally high.

After removing these outliers, we can finally see the general pattern in the data. We'll discuss the influence of outliers later in this chapter.


Correlation

What is correlation?

  • The correlation coefficient is a number between -1 and 1.
  • Sign -> direction of the relationship.
  • Magnitude -> strength of the relationship.
  • cor(x, y) computes the Pearson product-moment correlation between the variables x and y; the order of the arguments doesn't matter.
    • By default, cor() returns NA if there are missing values.
    • Pass use = "pairwise.complete.obs" when there are missing values in the dataset, and cor() will ignore those values (see the short sketch below).
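
As a quick illustration of these points, here is a minimal sketch using toy vectors (not from any of the course datasets):

x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 5, 4, 10)

cor(x, y)                                 # NA, because x contains a missing value
cor(x, y, use = "pairwise.complete.obs")  # computed on the complete pairs only
cor(y, x, use = "pairwise.complete.obs")  # identical: argument order doesn't matter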


Example1: ncbirths

ncbirths %>%
  summarize(N = n(), r = cor(weight, weeks, use = "pairwise.complete.obs"))
##      N         r
## 1 1000 0.6701013

Example2: mammals

# Correlation among mammals, with and without log
mammals %>%
  summarize(N = n(), 
            r = cor(BodyWt, BrainWt), 
            r_log = cor(log(BodyWt), log(BrainWt)))
##    N         r     r_log
## 1 62 0.9341638 0.9595748


Anscombe

In 1973, Francis Anscombe famously created four datasets with remarkably similar numerical properties but obviously different graphical patterns.
For how the built-in anscombe data is tidied into the Anscombe data frame used here, see the Rmd source of this course.
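
One possible way to do that tidying (a minimal sketch assuming the tidyr package; the exact code in the course Rmd may differ) is to reshape the wide anscombe data into one row per (set, x, y):

library(tidyr)
# anscombe has columns x1..x4 and y1..y4; split each name into a value column
# (x or y) and a set number, then stack the four sets on top of each other.
Anscombe <- anscombe %>%
  pivot_longer(everything(),
               names_to = c(".value", "set"),
               names_pattern = "(.)(.)") %>%
  mutate(set = as.integer(set))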

# Plot Anscombe, faceting by set with facet_wrap().
ggplot(data = Anscombe, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ set)

Anscombe %>%
  group_by(set) %>%
  summarize(N = n(), mean(x), sd(x), mean(y), sd(y), cor(x, y))
## # A tibble: 4 x 7
##     set     N `mean(x)` `sd(x)` `mean(y)` `sd(y)` `cor(x, y)`
##   <int> <int>     <dbl>   <dbl>     <dbl>   <dbl>       <dbl>
## 1     1    11         9    3.32      7.50    2.03       0.816
## 2     2    11         9    3.32      7.50    2.03       0.816
## 3     3    11         9    3.32      7.5     2.03       0.816
## 4     4    11         9    3.32      7.50    2.03       0.817

Although their plots look totally different, all of the measures are identical (ignoring rounding error) across the four different sets.


Simple Linear Regression

Method of Linear Regression

In this section, we want to find the ‘best fit’ line using a method called linear regression.
The simple linear regression model for a numeric response as a function of a numeric explanatory variable can be visualized on the corresponding scatterplot by a straight line. This is a “best fit” line that cuts through the data in a way that minimizes the distance between the line and the data points.

Generic Statistical Model

response = f(explanatory) + noise

Generic Linear Model

response = intercept + (slope * explanatory) + noise
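
To make this concrete, here is a small simulation sketch (the intercept of 3, slope of 1.5, and noise level are made up purely for illustration):

set.seed(1)
explanatory <- runif(100, min = 0, max = 10)
noise <- rnorm(100, mean = 0, sd = 2)
response <- 3 + 1.5 * explanatory + noise   # intercept + (slope * explanatory) + noise

ggplot(data.frame(explanatory, response), aes(x = explanatory, y = response)) +
  geom_point() +
  geom_abline(intercept = 3, slope = 1.5)   # the 'true' line underneath the noise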

Least Squares

In this section we’ll use Least Squares as a method of linear regression.

  • Easy, deterministic, unique solutions.
  • Residuals sum to zero.
    • Residual = observation (y) - expected value (y-hat).
    • Y-hat is the expected value of Y given the corresponding X.
    • Beta-hats are estimates of the true, unknown betas.
    • Residuals are estimates of the true, unknown epsilons.
      • ‘Error’ is a misleading name; ‘noise’ would be better.
    • Reference: Chi-Squared Test
  • The line must pass through (mean(x), mean(y)) (see the quick check below).
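
As a quick check of two of these properties (a sketch using the bdims data from the openintro package, which appears again later in this chapter):

mod_check <- lm(wgt ~ hgt, data = bdims)

sum(residuals(mod_check))   # essentially zero: the residuals sum to zero

# The fitted value at mean(hgt) equals mean(wgt),
# i.e. the line passes through (mean(x), mean(y)).
predict(mod_check, newdata = data.frame(hgt = mean(bdims$hgt)))
mean(bdims$wgt)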


Automatic - Smooth

Fortunately, we don't need to calculate those values on our own.
In ggplot2 we can simply add geom_smooth(method = 'lm') to draw the ‘best fit’ line in a blink.


Example: bdims

# Scatterplot with regression line using geom_smooth()
ggplot(data = bdims, aes(x = hgt, y = wgt)) + 
  geom_point() + 
  geom_smooth(method = 'lm', se = FALSE)


Manual - Abline

Besides using geom_smooth() to get the ‘best fit’ line, we can add it to the plot manually.
If we choose to calculate it ourselves, we first need to know how the regression line is determined.

Linear Model

Y = m ⋅ X + k
m = r(X,Y) ⋅ SD(Y)/SD(X)
k = mean(Y) - m ⋅ mean(X)

  • Two facts let us compute the slope (m) and intercept (k) of a simple linear regression model from basic summary statistics.
  • r(X,Y) represents the correlation between X and Y.
    • cor(x, y) computes the correlation coefficient.
  • SD(X) and SD(Y) represent the standard deviations of X and Y.
    • sd() computes the standard deviation.
  • (mean(X), mean(Y)) is always on the least squares regression line.
  • Use geom_abline(slope = *, intercept = *) to add a line to the plot manually.


Example: bdims

# Calculate summary statistics
bdims_summary <- summarize(bdims, N = n(), r = cor(hgt, wgt),
                           mean_hgt = mean(hgt), sd_hgt = sd(hgt),
                           mean_wgt = mean(wgt), sd_wgt = sd(wgt))
bdims_summary <- mutate(bdims_summary, slope = r*sd_wgt/sd_hgt, 
                        intercept = mean_wgt - r*sd_wgt/sd_hgt*mean_hgt)
bdims_summary
##     N         r mean_hgt   sd_hgt mean_wgt   sd_wgt    slope intercept
## 1 507 0.7173011 171.1438 9.407205 69.14753 13.34576 1.017617 -105.0113
# Scatterplot with regression line using geom_abline()
ggplot(data = bdims, aes(x = hgt, y = wgt)) + 
  geom_point() + 
  geom_abline(slope = as.numeric(bdims_summary$slope), intercept = as.numeric(bdims_summary$intercept))


Regression to the Mean

Regression to the mean is a concept attributed to Sir Francis Galton.
The basic idea is that extreme random observations will tend to be less extreme upon a second trial.


Example1: Galton

library(HistData)
# Height of children vs. parent
ggplot(data = Galton, aes(x = parent, y = child)) +
  geom_jitter(alpha = 0.3) + 
  # Add a diagonal line with slope = 1
  geom_abline(slope = 1, intercept = 0) + 
  geom_smooth(method = 'lm', se = FALSE)

Because the slope of the regression line is smaller than 1 (the slope of the diagonal line), we can see Sir Francis Galton's regression to the mean at work!


Example2: The New York Times in 2015

“Regression to the mean is so powerful that once-in-a-generation talent basically never sires once-in-a-generation talent. It explains why Michael Jordan’s sons were middling college basketball players and Jakob Dylan wrote two good songs. It is why there are no American parent-child pairs among Hall of Fame players in any major professional sports league.”

The author is arguing that, because of regression to the mean, an outstanding basketball player is likely to have sons who are good at basketball, but not as good as he was.


Coefficients in Regression Models

Linear Model - lm()

While geom_smooth(method = "lm") is useful for drawing linear models on a scatterplot, it doesn't actually return the characteristics of the model. The lm() function, on the other hand, does!

  • lm(Response(y) ~ Explanatory(x), data = *)
    • (y ~ x): A formula that specifies the model.
    • data: Argument for the data frame that contains the data you want to use to fit the model.

  • Returns a model object of class "lm".

  • Contains lots of information about your regression model.
    • The data used to fit the model.
    • The specification of the model.
    • The fitted values and residuals, etc.


  • coef(model): Display the coefficients of the model.

  • summary(model): Display the full regression output of the model.

  • Recall that residuals = observations - expected values.
    • Observations are the actual values in the dataset.
    • Expected values are the fitted values from the model.
    • fitted.values(model): Returns the expected(fitted) values of the model.
    • residuals(model): Returns the residuals of the model.

  • Tidy linear model:
    • augment(model): Returns a dataframe containing the data on which the model was fit and several quantities specific to the regression model.
    • augment(model) is from the broom package.


Example: bdims

# Linear model for weight as a function of height
mod <- lm(wgt ~ hgt, data = bdims)
mod
## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Coefficients:
## (Intercept)          hgt  
##    -105.011        1.018

Recall that a general linear model looks like this:

Y = m ⋅ X + k

  • (Intercept): refers to the intercept (k) of the model.
  • hgt: refers to the slope (m) of the model in this example.
# Show the coefficients
coef(mod)
## (Intercept)         hgt 
## -105.011254    1.017617
# Show the full output
summary(mod)
## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.743  -6.402  -1.231   5.059  41.103 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
## hgt            1.01762    0.04399   23.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared:  0.5145, Adjusted R-squared:  0.5136 
## F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16
# Mean of weights equal to mean of fitted values?
mean(bdims$wgt) == mean(fitted.values(mod))
## [1] TRUE

Not surprisingly, the mean of the observations equals the mean of the fitted values.
If it didn't, why would they be called the “fitted.values” of the model?

# Mean of the residuals
mean(residuals(mod))
## [1] -1.266971e-15

The least squares fitting procedure guarantees that the mean of the residuals is zero (numerical instability may result in the computed values not being exactly zero).

library(broom)
# Create bdims_tidy
bdims_tidy <- augment(mod)

# Glimpse the resulting data frame
glimpse(bdims_tidy)
## Observations: 507
## Variables: 9
## $ wgt        <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62....
## $ hgt        <dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 18...
## $ .fitted    <dbl> 72.05406, 73.37697, 91.89759, 84.77427, 85.48661, 7...
## $ .se.fit    <dbl> 0.4320546, 0.4520060, 1.0667332, 0.7919264, 0.81834...
## $ .resid     <dbl> -6.4540648, -1.5769666, -11.1975919, -12.1742745, -...
## $ .hat       <dbl> 0.002154570, 0.002358152, 0.013133942, 0.007238576,...
## $ .sigma     <dbl> 9.312824, 9.317005, 9.303732, 9.301360, 9.312471, 9...
## $ .cooksd    <dbl> 5.201807e-04, 3.400330e-05, 9.758463e-03, 6.282074e...
## $ .std.resid <dbl> -0.69413418, -0.16961994, -1.21098084, -1.31269063,...


Use of Regression Model

Prediction: Out-of-Sample

Once we have fit the model, we can now compute expected values for observations that were not present in the data on which the model was fit.

  • predict(model, newdata = *)
ben <- data.frame('wgt' = 74.8, 'hgt' = 182.8)
predict(mod, newdata = ben)
##        1 
## 81.00909

According to this model, Ben should weigh about 81 kg.
However, he weighs only 74.8 kg, so the residual here is about 74.8 - 81 = -6.2.
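
The same residual can be computed directly in code (a quick sketch using the ben data frame defined above):

# Observed weight minus the model's prediction: about -6.2 kg.
ben$wgt - predict(mod, newdata = ben)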


Adding Regression Line to Plot Manually

Here we’re going to use geom_abline() again.
Now it's time to combine what we've learned so far.

ggplot(data = bdims, aes(x = hgt, y = wgt)) + 
  geom_point() + 
  geom_abline(intercept = coef(mod)[1], slope = coef(mod)[2],  color = "dodgerblue")


Model Fit

Assess Model Fit

After fitting a model to our data, how can we assess the quality of the fit?

One way to assess strength of fit is to consider how far off the model is for a typical case. That is, for some observations, the fitted value will be very close to the actual value, while for others it will not. The magnitude of a typical residual can give us a sense of generally how close our estimates are.

SSE: Sum of Squared Errors (Residuals)

When fitting a linear model, what we actually do is find the line with the smallest sum of squared deviations.

For each observation, which is represented by a point on the scatterplot,
the residual is simply the vertical distance between that point and the line.

Picture each residual as a vertical segment (or arrow) from the point to the line.
If one line makes the sum of the squared residuals smaller than another, it is the better-fitting model.

Why not simply minimize the sum of the residuals themselves? Because for the ‘best fit’ line that sum is always zero:
the positive and negative residuals offset each other when added up.
In fact, we've already seen this above with mean(residuals(mod)).

Conventionally, this sum of squared deviations is called the Sum of Squared Errors (SSE).
It can easily be computed after applying augment() to our model.

SSE = sum(.resid ^2) = (n()-1) * var(.resid)

augment(mod) %>%
  summarize(SSE = (n()-1) * var(.resid),
            SSE_also = sum(.resid ^2))
## # A tibble: 1 x 2
##      SSE SSE_also
##    <dbl>    <dbl>
## 1 43753.   43753.

SSE is a single number that captures how much our model missed by.
Unfortunately, it's hard to interpret, since its units are the square of the response units.


RMSE: Root Mean Squared Error

Besides SSE, another common measure of the accuracy of a model is RMSE.
RMSE is basically the standard deviation of the residuals.

RMSE = sqrt(sum(.resid ^ 2) / d.f.) = sqrt(SSE / d.f.)

summary(mod)
## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.743  -6.402  -1.231   5.059  41.103 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
## hgt            1.01762    0.04399   23.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared:  0.5145, Adjusted R-squared:  0.5136 
## F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16

Using summary(mod), we can easily read off the RMSE, which is labeled ‘Residual standard error’ in the summary.
Conveniently, RMSE is expressed in the units of the response.
In this case, RMSE = 9.308, which means:
the typical difference between the observed wgt and the wgt predicted by the model is about 9.3 kg.

We can also calculate RMSE manually using the formula above.
To get the degrees of freedom, use df.residual(model); the meaning of degrees of freedom will not be explained here.

# Compute RMSE manually
sqrt(sum(residuals(mod)^2) / df.residual(mod))
## [1] 9.30804

Compare Model Fit

Null (Average) Model

If you had to make a prediction without any regression model, what would your prediction be?
For most people, the answer is the mean, and that is the null model.

  • For all observations:
    • y-hat (expected) = mean(y)
# A Linear Model of bdims (same as before)
mod_smooth <- lm(wgt ~ hgt, data = bdims)
mod_smooth %>%
  augment() %>%
  summarise(SSE = sum(.resid ^ 2))
## # A tibble: 1 x 1
##      SSE
##    <dbl>
## 1 43753.
# A Null Model of bdims
mod_null <- lm(wgt ~ 1, data = bdims)
mod_null %>%
  augment() %>%
  summarise(SSE = sum(.resid ^ 2))
## # A tibble: 1 x 1
##      SSE
##    <dbl>
## 1 90123.

Since the SSE in mod_smooth is much smaller than the one in mod_null, mod_smooth is a better fit to bdims.


R^2: Coefficient of Determination

R^2 = 1 - (SSE/SST) = 1 - var(e)/var(y)

  • R^2 measures the proportion of variability in the response variable that is explained by the model.
  • The SSE for the null model is often called SST: the Total Sum of Squares.
  • It's a common measure of the fit of a regression model.
  • For a simple linear model, cor(x, y)^2 = R^2 (verified in the sketch below).
  • A high/low R^2 alone doesn't mean that the model is excellent/lousy.
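
Here is a quick sketch verifying both identities on the bdims model mod (wgt ~ hgt) fitted earlier:

# R^2 from the variance decomposition, and from squaring the correlation.
# Both should equal the 'Multiple R-squared' reported by summary(mod).
bdims %>%
  summarize(R2_from_var = 1 - var(residuals(mod)) / var(wgt),
            R2_from_cor = cor(hgt, wgt)^2)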

The R^2 is also shown in summary(model). Let's have a look again.

summary(mod)
## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.743  -6.402  -1.231   5.059  41.103 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
## hgt            1.01762    0.04399   23.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared:  0.5145, Adjusted R-squared:  0.5136 
## F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16

R^2 is labeled ‘Multiple R-squared’ in the summary.
In this case, we can interpret it as follows: about 51% of the variability in wgt can be explained by hgt.


Unusual Points

Leverage

You don't need to know the leverage formula in detail, but you should recognize that:

  • The leverage score of an observation measures how far its explanatory value is from the mean of the explanatory variable.
    • Points close to the horizontal center (mean of x) of the scatterplot have low leverage.
    • Points far from the horizontal center (mean of x) of the scatterplot have high leverage.
    • Leverage has nothing to do with the y-coordinate.
  • Points with high leverage may or may not be influential.
  • Leverage can be retrieved with augment() and is stored in the .hat column (see the sketch below).
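
As a sketch of that idea: for a simple regression, the leverage of observation i is h_i = 1/n + (x_i - mean(x))^2 / sum((x - mean(x))^2), which depends only on the explanatory variable. We can check this against the .hat column that augment() returns for the bdims model mod (wgt ~ hgt):

# Compare hand-computed leverage with augment()'s .hat column.
bdims %>%
  mutate(h_manual = 1 / n() + (hgt - mean(hgt))^2 / sum((hgt - mean(hgt))^2),
         h_augment = augment(mod)$.hat) %>%
  summarize(max_abs_diff = max(abs(h_manual - h_augment)))   # essentially zero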


Influence

As noted previously, observations of high leverage may or may not be influential.

  • The influence of an observation depends on:
    • Its leverage, which takes into account the explanatory variable.
    • The magnitude of its residual, which depends on the response value (y) and the fitted value (y-hat).
  • Influential points are likely to have high leverage and to deviate from the general relationship between the two variables.
  • Influence can be retrieved with augment() and is stored in the .cooksd column (Cook's distance).

Let’s take a look at both leverage & influence.

lm(formula = SLG ~ OBP, data = filter(mlbBat10, AB >= 10)) %>%
  augment() %>%
  arrange(desc(.hat), .cooksd) %>%
  select(SLG, OBP, .fitted, .hat, .cooksd) %>%
  head()
## # A tibble: 6 x 5
##     SLG   OBP .fitted   .hat  .cooksd
##   <dbl> <dbl>   <dbl>  <dbl>    <dbl>
## 1 0     0     -0.0374 0.0194 0.00277 
## 2 0     0     -0.0374 0.0194 0.00277 
## 3 0     0     -0.0374 0.0194 0.00277 
## 4 0.308 0.55   0.690  0.0164 0.243   
## 5 0     0.037  0.0115 0.0150 0.000202
## 6 0.038 0.038  0.0128 0.0149 0.000953


Removing Outliers

Observations can be outliers for a number of different reasons.
Statisticians must always be careful—and more importantly, transparent—when dealing with outliers.
Sometimes, a better model fit can be achieved by simply removing outliers and re-fitting the model.
However, one must have strong justification for doing this.
A desire to have a higher R^2 is not a good enough reason!
We're looking for the best model, not merely the best-fitting one!

Here's a simple example that removes Bobby Scales from mlbBat10,
but it is only an illustration, since there is nothing to suggest that his record is not a valid data point.

# Create nontrivial_players
nontrivial_players <- mlbBat10 %>%
  filter(AB >= 10 & OBP < 0.5)

# Fit model to new data
mod_cleaner <- lm(SLG ~ OBP, data = nontrivial_players)

# View model summary
summary(mod_cleaner)
## 
## Call:
## lm(formula = SLG ~ OBP, data = nontrivial_players)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31383 -0.04165 -0.00261  0.03992  0.35819 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.043326   0.009823  -4.411 1.18e-05 ***
## OBP          1.345816   0.033012  40.768  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07011 on 734 degrees of freedom
## Multiple R-squared:  0.6937, Adjusted R-squared:  0.6932 
## F-statistic:  1662 on 1 and 734 DF,  p-value: < 2.2e-16
# Visualize new model
ggplot(nontrivial_players, aes(x = OBP, y = SLG)) +
  geom_point() +
  geom_smooth(method = 'lm')