Statistics

Two categories of Statistics Layers Functions:

Stats & Geoms

As we seen before in Histograms, we’ve already encountered stats functions when we used geom_histogram().
Recall that under the hood, this function called stat = 'bin' to summarize the total count & proportion in each group.
And we can get those inner data by using ..count.. or ..density...
Similarly, as we used geom_bar(), its default stats is set to 'bin'.

Let’s take a look at this example, we got 3 same plots with different codes.
As usual, the default of bin equals range/30. As binwidth becomes larger, the graph becomes smoother.

library(ggplot2)
plot <- ggplot(iris, aes(x = Sepal.Width))
plot + geom_histogram()
plot + geom_bar()
plot + stat_bin()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Here are some other examples:

stat_ geom_
stat_bin() geom_bar()
stat_bin() geom_histogram()
stat_bin() geom_freqpoly()
stat_smooth() geom_smooth()
stat_boxplot() geom_boxplot()
stat_bindplot() geom_dotplot()
stat_bin2d() geom_bin2d()
stat_binhex() geom_hex()
stat_contour() geom_contour()
stat_quantile() geom_quantile()
stat_sum() geom_count()

Smooth

  • When we use geom_smooth(), we call stat_smmoth() simultaneously.
  • se: The standard error shown as a gray ribbon behind our smooth, is by default the 95 percent confidence interval.
    • se = Fales: To remove this gray ribbon.
  • fullrange: Whether to show the prediction according your dataset. Error increases the further away from our dataset.
  • Default method = 'auto'
    • For largest group < 1000: method = 'loess'
    • For largest group > 1000: method = 'gam'
  • Method = 'loess'
    • Works like passing as a sliding window along the x axis.
    • Within each window, a weighted mean is calculated for the model.
    • span: Controls alpha (the degree of smoothing or windows size), smaller spans are more noisy.

Example1 : mtcars

# Customize color
library(RColorBrewer)
myColors <- c(brewer.pal(3, "Dark2"), "black")

ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE) +
  stat_smooth(method = "loess", 
              aes(group = 1, col="All"), 
              se = FALSE, span = 0.7) +
  # Add correct arguments to scale_color_manual
  scale_color_manual('Cylinders', values = myColors)

Example2: Vocab

library(car)
ggplot(Vocab, aes(x = education, y = vocabulary, col = year)) +
  geom_jitter(alpha = 0.1, size = 1) +
  stat_smooth(method = 'lm', se = F)

When mapping onto color you can sometimes treat a continuous scale, like year, as an ordinal variable, but only if it is a regular series.
The better alternative is to leave it as a continuous variable and use the group aesthetic as a factor to make sure your plot is drawn correctly.

# Specify the invisible group aesthetic so that our linear models are still calculated appropriately.
ggplot(Vocab, aes(x = education, y = vocabulary, col = year, group = factor(year))) +
  stat_smooth(method = "lm", se = FALSE, alpha = 0.6, size = 2) +
  scale_color_gradientn(colors = brewer.pal(9, "YlOrRd"))

We used the Vocab dataset and applied linear models describing vocabulary by education for different years.

Quantile

  • stat_quantile() calculates quantiles according ypur data and draw them in linear models.
  • By default it calculates the 1st, 2nd (i.e. median), and 3rd quartiles.

Example: Vocab

ggplot(Vocab, aes(x = education, y = vocabulary, col = year, group = factor(year))) +
  # Set the quantiles argument to 0.5 so that only the median is shown.
  stat_quantile(alpha = 0.6, size = 2, quantiles = 0.5) +
  scale_color_gradientn(colors = brewer.pal(9,"YlOrRd"))

Sum

  • stat_sum() calculates the total number of overlapping observations and is another good alternative to overplotting.
  • This maps the overall count of each dot onto size.

Example: Vocab

ggplot(Vocab, aes(x = education, y = vocabulary)) +
  geom_jitter(size = 0.1, alpha = 0.1) +
  stat_sum() + # sum statistic
  scale_size(range = c(1, 10)) +  # set size scale
  stat_smooth(method = "loess", aes(col = "LOESS"), se = F) +
  stat_smooth(method = "lm", aes(col = "lm"), se = F) +
  scale_color_discrete("Model")

Independent Stats

As mentioned before, there are two categorical function in Statistics.
It’s time to look at the second group of stats function: Independent Stats.

stat_ Discription
stat_summary() Summarise y values at distinct x values.
stat_function() Compute y values from a function of x values.
stat_qq() Perform calculations for a quantile-quantile plot.

Summary

Here we’ll look at stat_summary() in action.

  • Default geom = 'pointrange'.
  • fun.data = mean_sdl, fun.args = list(mult = 1): Calculate mean & sd.
  • fun.data = mean_cl_normal: Calculate mean and 95% confidence interval.

Example: mtcars

Before we start, there are some preparation need to be done.

# Require 'Hsimc' package here.
library(Hmisc)
# Convert cyl and am to factors
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$am <- as.factor(mtcars$am)

# Define positions
posn.d <- position_dodge(width = 0.1)
posn.jd <- position_jitterdodge(jitter.width = 0.1, dodge.width = 0.2)
posn.j <- position_jitter(width = 0.2)

# Base layers
wt.cyl.am <- ggplot(mtcars, aes(x = cyl, y = wt, col = am, fill = am, group = am))
# Mean and 95% CI - the easy way
wt.cyl.am +
  geom_point(position = posn.jd, alpha = 0.6) +
  # 95% CI: 'fun.data = mean_cl_normal'
  # Default geom = 'pointrange'
  stat_summary(fun.data = mean_cl_normal, position = posn.d) +
  ggtitle('Mean and 95% CI - the easy way')

# Mean and SD - with T-tipped error bars
wt.cyl.am + 
  # Mean: 'fun.y = mean' -> Uses the y aesthetic
  stat_summary(geom = 'point', fun.y = mean,
               position = posn.d) +
  # sd: 'fun.data = mean_sdl, fun.args = list(mult = 1)'
  # fun.args = list(mult = 1) to have a errorbar that spans over one standard deviation in both directions.
  stat_summary(geom = 'errorbar', fun.data = mean_sdl,
               position = posn.d, fun.args = list(mult = 1), width = 0.1) +
  ggtitle('Mean and SD - with T-tipped error bars')

Notice here, quantile() returns a vector of 5 units: 0%, 25%, 50%, 75%, 100%.
So 1st quartile equals quantile(x)[2], 3rd quartile equals quantile(x)[4].

# Function to save range for use in ggplot
gg_range <- function(x) {
  data.frame(ymin = min(x), # Min
             ymax = max(x)) # Max
}

# Function to get quantiles
med_IQR <- function(x) {
  data.frame(y = median(x), # Median
             ymin = quantile(x)[2], # 1st quartile
             ymax = quantile(x)[4])  # 3rd quartile
}
# Five-number summary(the minimum, 1st quartile, median, 3rd quartile, and the maximum) plot
wt.cyl.am +
  # 1st quartile, 3rd quartile
  stat_summary(geom = 'linerange', fun.data = med_IQR,
               position = posn.d, size = 3) +
  # minimum, maximum
  stat_summary(geom = 'linerange', fun.data = gg_range,
               position = posn.d, size = 3,
               alpha = 0.4) +
  # median
  stat_summary(geom = 'point', fun.y = median,
               position = posn.d, size = 3,
               col = 'black', shape = 'X') +
  ggtitle('Five-number Summary Plot')

Coordinates & Facets

Zooming In

  • scale_x_continuous(limits = c(*,*))
    • Filtered original dataset caused missing data.
    • geom_smooth() might be different.
  • xlim(c(*,*))
    • Similar to the function above, but in a fast & dirty way.
  • coord_cartesin(xlim = c(*,*))
    • Preferred way when zooming in.
    • Truly do simply zoom in on the plot without losing anything.

Example: mtcars

plot2 <- ggplot(mtcars, aes(x = wt, y = hp, col = am)) + geom_point() + geom_smooth()
# Original plot
plot2

When using scale_x_continuous(limits = c(*,*)), you’ll get lots of warnings(do not show here.) and a weird graph.

# Zoom in using scale function.
plot2 +
    scale_x_continuous(limits = c(3, 6), expand = c(0, 0))

coord_cartesin(xlim = c(*,*)) is just what we want.

# Zoom in using coord function.
plot2 + 
  coord_cartesian(xlim = c(3, 6))

Aespect Ratio

Set the aspect ratio of a plot with coord_fixed() or coord_equal(). Both use ratio = 1 as a default.

  • ratio = x/y
  • ratio is not always = 1, there are some exceptions.

Example: iris

base.plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
               geom_jitter() +
               geom_smooth(method = "lm", se = FALSE)

# Plot base.plot: default aspect ratio
base.plot

# Fix aspect ratio (1:1) of base.plot
base.plot +
  coord_equal()

Pie Chart

Use coord_polar(theta = '*axis') to transform a bar char onto a polar coordinate system.

# Create an ordinary bar chart and convert it into pie chart.
wide.bar <- ggplot(mtcars, aes(x = 1, fill = cyl)) +
              geom_bar()
wide.bar +
  coord_polar(theta = 'y')

# Set limits on x_scale and convert it into a 'ring' type pie chart.
thin.bar <- ggplot(mtcars, aes(x = 1, fill = cyl)) +
              geom_bar(width = 0.1) +
              scale_x_continuous(limits = c(0.5, 1.5))
thin.bar + 
  coord_polar(theta = 'y')

Facet

The most straightforward way of using facets is facet_grid().
Here we just need to specify the categorical variable to use on rows and columns using standard R formula notation (rows ~ columns). Remember, when faceting in only one direction us . to specify nothing for the unused direction.

Example - Multivariables: mtcars

Here we’ll present a plot with 6 variables in mtcars

# Create cyl_am
mtcars$cyl_am <- paste(mtcars$cyl, mtcars$am, sep = "_")
# Customize color
myCol <- rbind(brewer.pal(9, "Blues")[c(3,6,8)],
               brewer.pal(9, "Reds")[c(3,6,8)])
# Plot with multi variables.
ggplot(mtcars, aes(x = wt, y = mpg, col = cyl_am)) +
  geom_point() +
  scale_color_manual(values = myCol) +
  facet_grid(gear ~ vs)

Example - Dropping Levels: msleep

The variables of interest here are name, which contains the full popular name of each animal, and vore, the eating behavior. We are going to facet the plot with vore. Each animal can only be classified under one eating habit, so if we facet according to vore, we don’t need to repeat the full list in each sub-plot.
Use scale = "free_y" and space = "free_y" to leave out rows for which there is no data.

library(tidyr)
library(dplyr)
# Take a look at msleep
head(msleep, 3)
## # A tibble: 3 x 11
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl>
## 1 Cheetah Acin~ carni Carn~ lc                  12.1      NA            NA
## 2 Owl mo~ Aotus omni  Prim~ <NA>                17         1.8          NA
## 3 Mounta~ Aplo~ herbi Rode~ nt                  14.4       2.4          NA
## # ... with 3 more variables: awake <dbl>, brainwt <dbl>, bodywt <dbl>
# Tidy data for plot
colnames(msleep)[6:7] <- c('total', 'rem')
mamsleep <- msleep %>%
              gather(sleep, time, total:rem) %>%
              select(vore, name, sleep, time) %>%
              filter(!is.na(vore))
# After tidy up
head(mamsleep, 3)
## # A tibble: 3 x 4
##   vore  name            sleep  time
##   <chr> <chr>           <chr> <dbl>
## 1 carni Cheetah         total  12.1
## 2 omni  Owl monkey      total  17  
## 3 herbi Mountain beaver total  14.4
plot3 <- ggplot(mamsleep, aes(x = time, y = name, col = sleep)) + geom_point()
# Before dropping levels
plot3 + facet_grid(vore ~ .)

# After dropping levels(free up rows)
plot3 + facet_grid(vore ~ ., scale= 'free_y', space = 'free_y')

Themes

And here is our base plot, the most familiar dataset - mtcars.

library(RColorBrewer)
plot4 <- ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  stat_smooth(method = 'lm', se = F) + 
  scale_color_manual('Cylinders', values = c('#C6DBEF','#6BAED6','#2171B5')) +
  geom_jitter(size = 3.5, alpha = 0.7) +
  scale_x_continuous('Weight(lb/1000)') +
  scale_y_continuous('Miles/(US) gallon') +
  facet_grid(.~cyl)
plot4

Rectangles

plot5 <- plot4 +
  # Theme to remove all rectangles
  theme(rect = element_blank()) +
  # Combine custom themes
  theme(plot.background = element_rect(fill = '#F7FBFf', color = '#DEEBF7', size = 1))
plot5

Lines

plot6 <- plot5 +
        # Remove grid
  theme(panel.grid = element_blank(),
        # Set color of axis.line
        axis.line = element_line(color = '#08519C'),
        #Set color of axis.ticks
        axis.ticks = element_line(color = '#08519C'))
plot6

Text

plot7 <- plot6 +
  theme(strip.text = element_text(size = 13, color = '#08519C'),
        # 'hjust = 0' to put the text in the bottom left corner
        axis.title = element_text(color = '#08519C', hjust = 0, face = 'bold'),
        axis.text = element_text(color = '#08519C'),
        legend.text = element_text(color = '#08519C'),
        legend.title = element_text(color = '#08519C'))
plot7

Position

We can only change the direction & the position of the legend.

  • legend.position for position.
    • Move by location: legend.position = c(0.85,0.85)
    • Move by name: legend.position = 'bottom'
  • legend.direction for direction.

As for the elements of the legend, we need to change it in geom layer using scale or aes becasue they are data ink.

We need library(grid) to pass a unit object in order to adjust position.

  • plot.margin for adjusting plot margin: unit(c(top,right,bottom,left), "cm")
  • panel.spacing for space between facets.
library(grid)
plot8 <- plot7 +
  # Change legend location & direction
  theme(legend.position = 'bottom', legend.direction= 'horizontal') +
  # Adjust the plot margin
  theme(plot.margin = unit(c(1,1.5,0.5,1), 'cm')) +
  # Increase spacing between facets
  theme(panel.spacing.x = unit(1, 'cm'))
plot8

Reuse Theme

  • Assign theme() layers to a variable, then you can reuse it without copying many lines of codes.
  • theme_update(): When you call it and assign it to an object, that object stores the current default theme, and the arguments update the default theme.
  • theme_set(): Apply theme with theme_set().
theme_blue <- 
    theme(rect = element_blank(),
          plot.background = element_rect(fill = '#F7FBFf', color = '#DEEBF7', size = 1),
          panel.grid = element_blank(),
          axis.line = element_line(color = '#08519C'),
          axis.ticks = element_line(color = '#08519C'),
          strip.text = element_text(size = 13, color = '#08519C'),
          axis.title = element_text(color = '#08519C', hjust = 0, face = 'bold'),
          axis.text = element_text(color = '#08519C'),
          legend.text = element_text(color = '#08519C'),
          legend.title = element_text(color = '#08519C'),
          legend.position = 'bottom', 
          legend.direction= 'horizontal',
          plot.margin = unit(c(1,1.5,0.5,1), 'cm'),
          panel.spacing.x = unit(1, 'cm'))

Apply the theme by +.

plot4 + theme_blue

Update the default theme using theme_update(), and store the current theme to original.

original <- theme_update(rect = element_blank(),
          plot.background = element_rect(fill = '#F7FBFf', color = '#DEEBF7', size = 1),
          panel.grid = element_blank(),
          axis.line = element_line(color = '#08519C'),
          axis.ticks = element_line(color = '#08519C'),
          strip.text = element_text(size = 13, color = '#08519C'),
          axis.title = element_text(color = '#08519C', hjust = 0, face = 'bold'),
          axis.text = element_text(color = '#08519C'),
          legend.text = element_text(color = '#08519C'),
          legend.title = element_text(color = '#08519C'),
          legend.position = 'bottom', 
          legend.direction= 'horizontal',
          plot.margin = unit(c(1,1.5,0.5,1), 'cm'),
          panel.spacing.x = unit(1, 'cm'))
plot4

Restore back original using theme_set().

theme_set(original)
plot4

ggthemes

There is a package called ggtheme which includes many theme set you can choose from, and here are some examples.
You can combine with theme_update() and theme_set for default setting.

library(ggthemes)
plot4 + theme_bw()

plot4 + theme_classic()

Common Pitfalls

So far, you’ve learned how to use ggplot2, it’s time to have a deeper look of figuring out what is the best way to graphically represent our data.

Dynamite Plots & Alternatives

As mentioned before, there are two kinds of bar plots:

Absolute Counts(stat = 'bin')
Distribution(stat = 'identity') which is same as geom_col() which takes two variables.

The second one is incredibly common and equally terrible.
Let’s turn to mamsleep for example.

Example1: Mamsleep

sleep <- spread(mamsleep, sleep, time)
plot9 <- ggplot(sleep, aes(x = vore, y = total)) +
  scale_y_continuous("Total Sleep Time (h)",
                     limits = c(0, 24),
                     breaks = seq(0, 24, 3),
                     expand = c(0, 0)) +
  scale_x_discrete('Eating Habits', labels = c('Carnivore', 'Herbivore', 'Insectivore', 'Omnivore'))

plot9 + 
  stat_summary(fun.y = mean, geom = 'bar', fill = 'grey50') +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = 'errorbar', width = 0.2) +
  ggtitle('Dynamite Plots of Mammalian Sleep')

  • First, we spread the mamsleep data for plotting and assign it to a variable called sleep.
  • ggplot(sleep, aes(x = vore, y = total)) make the base of this plot.
    • vore is a discrete variable on x-axis which means the eating habbit.
    • total is a continuous variable on y-axis which means how much sleep the animal have.
  • scale_y_continuous(): Rename the y-axis title and set its limits, breaks ,expand.
  • scale_x_discrete(): Rename the x-axis title and variables’ names.
  • Reference: Scale Function
  • The first stat_summary() use fun.y = mean to calculate their means.
  • The second stat_summary() use fun.data = mean_sdl, fun.args = list(mult = 1) to calculates their standard deviations.

However, there are some subtle problems in this graph:

  • We have no idea how many observations we have in each category.
  • This plot is subtly suggesting that our data is normally distributed by using the mean and the standard deviation but it may be not.
  • The real data is hidden from us.
  • The bars give the impression of having data where there is no data: No mammals which sleep zero hours a day!

Bar plots should be avoided in this plot. Let’s change it geom_point() for more meaningful plot.

plot9 +
  geom_point(alpha = 0.6, position = position_jitter(width = 0.2)) +
  stat_summary(fun.y = mean, geom = 'point', col = "#E41A1C") +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = 'errorbar', width = 0.2, col = "#E41A1C")

  • By using geom_point() we can now see the strange patterns in this dataset.
    • We don’t have that much data about insectivores, and wecan say it’s bimodal(雙峰) reluctantly.
    • It looks like omnivores are positive skewed(右曳).
  • Error bars with points are a much cleaner representation of the data.
  • It’s much better that we plot sleep this way ,and bars are simply not necessary!

Example2: mtcars

Just now we saw why “dynamite plots” (bar plots with error bars) are not well suited for their intended purpose of depicting distributions.
If you really want that, you can still get dynamite plots. However, you’ll need to set the positions manually.
A point geom will typically serve you much better.

It would be nice to give an impression of the number of values in each group here.

# Tidy data first.
mtcars.cyl <- mtcars %>%
  group_by(factor(cyl)) %>%
  summarize(wt.avg = mean(wt), sd = sd(wt), n())
colnames(mtcars.cyl)[c(1, 4)] <- c('Cylinders', 'n')
mtcars.cyl <- mutate(mtcars.cyl, prop = n/sum(n))
mtcars.cyl
## # A tibble: 3 x 5
##   Cylinders wt.avg    sd     n  prop
##   <fct>      <dbl> <dbl> <int> <dbl>
## 1 4           2.29 0.570    11 0.344
## 2 6           3.12 0.356     7 0.219
## 3 8           4.00 0.759    14 0.438
ggplot(mtcars.cyl, aes(x = Cylinders, y = wt.avg)) +
  geom_col(width = mtcars.cyl$prop, fill = '#7F95B5') +
  geom_errorbar(aes(ymin = wt.avg - sd, ymax = wt.avg + sd), width = 0.1, col = '#08306B')

Pie Charts & Alternatives

  • In Coordinates - Pie Charts, we’ve discussed how a pie chart is simply a stacked bar chart plotted onto polar coordinates.
  • When it comes to deciphering a pie chart, people will use angle, area or arc length.
  • All of them are all mediocre encoding elements, so none of them are really ideal.
  • We can’t control which one readers may choose which may cause ambiguity.
  • The only instance where pie charts are suitable is for representing large quantitative at most three groups.

To be brief, here are cons of using pie charts:

  • 3+ pie charts -> Unefficient.
  • Many unideal ways to encode -> Cause ambiguity.

Example: HairCol

Here is the original bar blot for hair color.

HairCol <- as.data.frame(HairEyeColor)
ggplot(HairCol, aes(x = Hair, y = Freq, fill = Hair)) +
  scale_fill_manual('Hair', values = c('slategray', 'rosybrown', 'darkred', 'gold')) +
  geom_col(position = 'dodge') +
  facet_grid(.~Sex)

This is the pie chart version of this data.
There are only two charts so it might not seem that much annoying.
However, imagine that you have lots of pie charts, how could you compare the statistics?

plot10 <- ggplot(HairCol, aes(x = 1, y = Freq, fill = Hair)) +
  scale_fill_manual('Hair', values = c('slategray', 'rosybrown', 'darkred', 'gold')) +
  scale_y_continuous('Proportion') +
  theme_classic() +
  theme(rect = element_blank(),
        strip.text = element_text(size = 13, face = 'bold'))
  
plot10 +  
  geom_col(position = 'fill', width = 0.2) +
  scale_x_continuous(limits = c(0.5, 1.1)) +
  coord_polar(theta = 'y') +
  facet_grid(.~Sex) +
  theme(axis.title.y = element_blank(),
        line = element_blank(),
        axis.text = element_blank())

The preferred way is ‘Fill’ bar plot here.
No matter how many plots you have, it’s much easier to make a comparison.

plot10 + 
  geom_col(position = 'fill', width = 0.3) +
  scale_x_continuous(limits = c(0.5, 1.5)) +
  coord_flip() +
  facet_grid(Sex~.) +
  theme(axis.title.y = element_blank(),
        axis.line = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        legend.position = 'bottom', 
        legend.direction= 'horizontal')

Heat Maps & Alternatives

Heat maps are a surprisingly popular choice for data visualization, but they are probably one of the least effective types of visualization.

  • Our perception changes dependent on the neighboring colors.
  • Heat map only has color for variables which is hard to recognize.
  • Look fancy and complicated - signs of poor communication skills.
  • As a result, basically NO meaningful information in heat map.

Let’s see some alternatives.

Example: Barley Yields

The dataset we’re using here is about the yield of some varieties of barley in different site in 1931 & 1932.
We should call it out by library(lattice).

library(lattice)
head(barley)
##      yield   variety year            site
## 1 27.00000 Manchuria 1931 University Farm
## 2 48.86667 Manchuria 1931          Waseca
## 3 27.43334 Manchuria 1931          Morris
## 4 39.93333 Manchuria 1931       Crookston
## 5 32.96667 Manchuria 1931    Grand Rapids
## 6 28.96667 Manchuria 1931          Duluth

Use geom_tile() to make a heat map.
See? It doesn’t provide any meaningful information at first glance.
We use scale_fill_gradientn() to fill in continuous variables.

  • scale_fill_brewer(palette = "*"), scale_fill_manual('col_name', values = *) for discrete variables.
ggplot(barley, aes(x = year, y = variety, fill = yield)) +
  geom_tile() + # Geom layer
  facet_wrap( ~ site, ncol = 1) + # Facet layer
  scale_fill_gradientn(colors = brewer.pal(9, "Reds")) # Adjust colors

First alternative:

  • This line plot shows the change in yield for each variety over two years.
  • We can see the general data trend in different varieties.
ggplot(barley, aes(x = year, y = yield, col = variety, group = variety)) +
  geom_line() +
  facet_wrap(~site, nrow = 1) +
  scale_x_discrete('Year') +
  scale_y_continuous('Yield (bushels/acre)') +
  scale_color_discrete('Variety')

Second Alternative:

  • If we are not interested in the individual varieties, but rather on the farms, we can make these plots.
    • Using geom = 'errorbar' need to add some dodging.
    • Using geom = 'ribbon'
  • Use some summary statistic, like the mean and standard deviation.
ggplot(barley, aes(x = year, y = yield, col = site, fill = site, group = site)) +
  stat_summary(fun.y = mean, geom = 'line', position = position_dodge(width = 0.2)) +
  stat_summary(fun.y = mean, geom = 'point',position = position_dodge(width = 0.2)) +
  # fun.args = list(mult = 1) to have a errorbar that spans over one standard deviation in both directions.
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = 'errorbar', width = 0.1, position = position_dodge(width = 0.2)) +
  scale_x_discrete('Year', expand = c(0.05, 0.05)) +
  scale_y_continuous('Yield (bushels/acre)') +
  scale_color_discrete('Site') +
  scale_fill_discrete('Site') +
  ggtitle('With Errorbar')

ggplot(barley, aes(x = year, y = yield, col = site, fill = site, group = site)) +
  stat_summary(fun.y = mean, geom = 'line') +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = 'ribbon', alpha = 0.1, col = NA) +
  scale_x_discrete('Year', expand = c(0.05, 0.05)) +
  scale_y_continuous('Yield (bushels/acre)') +
  scale_color_discrete('Site') +
  scale_fill_discrete('Site') +
  ggtitle('With Ribbon')

Case Study - Mosaic Plots

We’ll use 2009 CHIS adult-response dataset for data visualization here.
Let’s get familiar with the dataset now.

Column Name Description
RBMI BMI Category description
BMI_P BMI value
RACEHPR2 race
SRSEX sex
SRAGE_P age
MARIT2 Marital status
AB1 General Health Condition
ASTCUR Current Asthma Status
AB51 Type I or Type II Diabetes
POVLL Poverty level

Chi-Squared Test & Mosaic PLots

Observation: The Contingency Table we applied Chi-Squared Test
Expected: The expected values under the null hypothesis of equal proportions.
Residuals: The difference between the observed data and the expected values

Here’s a table comparing gender and political affiliation, and let’s apply Chi-Squared Test on it.

  • Observation
# Make a table manually
party <- matrix(c(762, 327, 468, 484, 239, 477), ncol = 3, byrow = T)
colnames(party) <- c('Democrat', 'Independent', 'Republican')
rownames(party) <- c('F', 'M')
party <- as.table(party)
# Obsrvation
party
##   Democrat Independent Republican
## F      762         327        468
## M      484         239        477
  • Expected
# Expected
chisq.test(party)$expected
##   Democrat Independent Republican
## F 703.6714    319.6453   533.6834
## M 542.3286    246.3547   411.3166
  • Residuals
# Residuals
chisq.test(party)$residuals
##     Democrat Independent Republican
## F  2.1988558   0.4113702 -2.8432397
## M -2.5046695  -0.4685829  3.2386734

When we perform a Chi-squared test, the results are provided to us with the expected & residuals.
And Mosaic Plot is just a method to visualize Chi-Squared Test.

  • Example of Mosaic Plot
library(vcd)
mosaic(party, shade = T, color = T)

  • Conclusion
    • The higher the residual, the more overrepresented a segment is.
    • The lower the residual, the more underrepresented a segment is.

A mosaic plot shows us not only the proportional size of each group in a contingency table, but we can also use it to report on the underlying statistics.