Introduction

Exploratory & Explanatory

  • Explore: Confirm & Analysis -> Data
  • Explain: Inform & Persuade -> Reader

Explore ggplot2

Compare the two graphs below and see how they differ.

library(ggplot2)
ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_point()

ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_point()

The first plot is not quite right: ggplot2 treats cyl as a continuous variable, which gives the impression that there is such a thing as a 5- or 7-cylinder car, and there is not.
After wrapping cyl in factor(), ggplot2 treats it as a categorical variable. This time the x-axis does not contain values like 5 or 7, only the values that are present in the dataset.
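
A quick way to see why the two plots differ is to inspect the variable itself. This is a minimal sketch that assumes the unmodified built-in dataset (read as datasets::mtcars); the names cyl_raw and cyl_fac are illustrative only.

```r
# Read the column from the pristine copy in the datasets package
cyl_raw <- datasets::mtcars$cyl
cyl_fac <- factor(cyl_raw)

class(cyl_raw)    # numeric, so ggplot2 uses a continuous x scale
class(cyl_fac)    # factor, so ggplot2 uses a discrete x scale
levels(cyl_fac)   # only the values that occur in the data
```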

Grammar of Graphics

  • Principles
    • Graphics are made up of distinct layers of grammatical elements.
    • Meaningful plots are built around appropriate aesthetic mappings.
    • ggplot(data,aes) + geom + optional layers
  • Elements
    • 3 required layers and 4 optional ones.
    • Layers are combined with +.
      Required
    • Data: The dataset being plotted.
    • Aesthetics: The scales onto which we map our data.
    • Geometries: The visual elements used for our data.
      Optional
    • Facets: Plotting small multiples.
    • Statistics: Representations of our data to aid understanding.
    • Coordinates: The space on which data will be plotted.
    • Themes: All non-data ink.
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.2) +
  geom_smooth(aes(colour = clarity), se = FALSE) # se: whether to show the confidence band
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
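
To make the seven elements concrete, here is a hedged sketch that uses one function for each of them on mtcars. The particular choices (facet_grid(), stat_smooth(), coord_cartesian(), theme_minimal()) and the y limits are illustrative assumptions, not the only way to fill each slot.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +   # data + aesthetics
  geom_point() +                              # geometry
  facet_grid(. ~ cyl) +                       # facets: one panel per cyl value
  stat_smooth(method = "lm", se = FALSE) +    # statistics: linear fit per panel
  coord_cartesian(ylim = c(10, 35)) +         # coordinates: zoom the y range
  theme_minimal()                             # theme: non-data ink
p
```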

Data

Base plot vs. ggplot2

  • Base plot Limitation

    • Plot doesn’t get redrawn.
    • Plot is drawn as an image.
    • Need to manually add legend.
    • No unified framework for plotting.
  • Advantages of ggplot2

    • Creates plotting objects, which can be manipulated
    • Takes care of a lot of the leg work for you, such as choosing nice color palettes and making legends.
    • Built upon the grammar of graphics plotting philosophy, making it more flexible and intuitive for understanding the relationship between your visuals and your data.
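
The point about plotting objects can be illustrated with a short sketch: a ggplot is an ordinary R object that can be stored, extended and printed later. The object names below are illustrative.

```r
library(ggplot2)

base_plot   <- ggplot(mtcars, aes(x = wt, y = mpg))   # no layers yet
with_points <- base_plot + geom_point()               # add a layer
with_fit    <- with_points + geom_smooth(method = "lm", se = FALSE)

length(base_plot$layers)   # 0
length(with_fit$layers)    # 2
with_fit                   # printing the object draws the plot
```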

Let’s see how complex it will be if we use the base plot().

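# Convert cyl to a factor
mtcars$cyl <- as.factor(mtcars$cyl)
# Example from base R: points coloured by cyl, plus an overall dashed fit
plot(mtcars$wt, mtcars$mpg, col = mtcars$cyl)
abline(lm(mpg ~ wt, data = mtcars), lty = 2)
# One fit per cylinder group; unique() avoids redrawing each line for every row
invisible(lapply(unique(mtcars$cyl), function(x) {
  abline(lm(mpg ~ wt, mtcars, subset = (cyl == x)), col = x)
}))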

Now try the same plot with the ggplot2 package.

ggplot(mtcars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point() + # Points coloured by cyl
  geom_smooth(method = 'lm', se = FALSE) + # One linear fit per cyl group
  geom_smooth(aes(group = 1), method = 'lm', se = FALSE, linetype = 2) # Overall fit, dashed

Tidy Data

Reference: Cleaning Data in R

  • gather(df, key, val, -col): Gather columns into key-value pairs. Columns to keep as they are (typically the categorical variables) are excluded with the - notation.
  • spread(df, key, val): Spread key-value pairs to columns.
  • separate(df, col, into = c("col1", "col2"), sep = ""): Separate one column into multiple.
  • unite(df, col, …(bare names of columns), sep = ""): Unite multiple columns into one.
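
As a small illustration of the four verbs, the sketch below runs them on a toy data frame. The data frame wide and its column names are made up for this example.

```r
library(tidyr)

# Toy data: one id column and two measurement columns named group.axis
wide <- data.frame(id = 1:2, a.x = c(1, 2), a.y = c(3, 4))

long   <- gather(wide, key, val, -id)                          # id, key, val (4 rows)
parts  <- separate(long, key, c("group", "axis"), sep = "\\.") # key -> group + axis
back   <- spread(parts, axis, val)                             # one column per axis value
joined <- unite(parts, key, group, axis, sep = ".")            # group + axis -> key
```

Here back has one row per id again, and joined recovers the original key column of long.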

The kind of graph you want to plot determines how you tidy your messy data.

library(tidyr)
# iris.tidy
iris.tidy <- iris %>%
  gather(key, Value, -Species) %>%
  separate(key, c("Part", "Measure"), "\\.")
head(iris.tidy)
##   Species  Part Measure Value
## 1  setosa Sepal  Length   5.1
## 2  setosa Sepal  Length   4.9
## 3  setosa Sepal  Length   4.7
## 4  setosa Sepal  Length   4.6
## 5  setosa Sepal  Length   5.0
## 6  setosa Sepal  Length   5.4
ggplot(iris.tidy, aes(x = Species, y = Value, col = Part)) +
  geom_jitter() +
  facet_grid(. ~ Measure)

#iris.wide
iris$Flower <- 1:nrow(iris)
iris.wide <- iris %>%
  gather(key, value, -Species, -Flower) %>%
  separate(key, c('Part','Measure'), "\\.") %>%
  spread(Measure, value)
head(iris.wide)
##   Species Flower  Part Length Width
## 1  setosa      1 Petal    1.4   0.2
## 2  setosa      1 Sepal    5.1   3.5
## 3  setosa      2 Petal    1.4   0.2
## 4  setosa      2 Sepal    4.9   3.0
## 5  setosa      3 Petal    1.3   0.2
## 6  setosa      3 Sepal    4.7   3.2
ggplot(iris.wide, aes(x = Length, y = Width, color = Part)) +
  geom_jitter() +
  facet_grid(. ~ Species)

Aesthetics

Typical Aesthetics

  • x: x-axis position.
  • y: y-axis position.
  • colour: colour of dots, outlines or other shapes.
  • fill: fill colour, typically the inside shading.
  • size: the diameter of points, the thickness of lines, and the font size of text.
  • alpha: Transparency (0: transparent, 1: opaque)
  • linetype: line dash patterns.
  • label: text on a plot or axes.
  • shape: shape of a point.

A word about shapes
Shapes in R can have a value from 0 to 25. Shapes 0-20 accept only a color aesthetic, whereas shapes 21-25 have both a color and a fill aesthetic.

A word about hexadecimal colours
Hexadecimal, literally “related to 16”, is a base-16 alphanumeric counting system. Individual values come from the ranges 0-9 and A-F. This means there are 256 possible two-digit values (i.e. 00 - FF). Hexadecimal colours use this system to specify a six-digit code for Red, Green and Blue values ("#RRGGBB").
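
The arithmetic above can be checked in base R. This is a small sketch; the two colour codes are the ones used with scale_fill_manual() later in these notes.

```r
strtoi("FF", base = 16L)                # 255, the largest two-digit value
16^2                                    # 256 possible two-digit values
rgb(228, 26, 28, maxColorValue = 255)   # "#E41A1C"
col2rgb("#377EB8")                      # red 55, green 126, blue 184
```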

Aesthetics and Attributes
All the visible aesthetics can serve either as aesthetics or as attributes. Variables in a data frame are mapped to aesthetics with aes() (e.g. aes(col = cyl)), usually within ggplot(). Fixed visual properties are set as attributes in specific geom layers (e.g. geom_point(col = "red")).

# We're focusing on aesthetic mappings here.
ggplot(mtcars, aes(x = mpg, y = qsec, col = factor(cyl), shape = factor(am), size = (hp/wt))) +
  geom_point()
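
For contrast, here is a hedged sketch that puts a mapping and an attribute side by side. The object names mapped and fixed are illustrative.

```r
library(ggplot2)

# Aesthetic mapping: colour depends on a variable, so a legend is drawn
mapped <- ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point()

# Attribute: every point gets the same fixed colour and there is no legend
fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(col = "red")

mapped
fixed
```

A common slip is to write aes(col = "red"): that maps the constant string "red" as if it were a variable, so the points are not guaranteed to be red.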

Modify Aesthetics

Position

Position specifies how ggplot will adjust for overlapping bars or points in a single layer.

  • identity: default in scatter plot. The value in the data frame is exactly where the value will be positioned in the plot.
  • jitter: Spread out points that would otherwise be overplotted.
  • stack: Place the bars on top of each other. Counts are used. Default in bar plots.
  • dodge: Place the bars next to each other. Counts are used.
  • fill: Place the bars on top of each other, but this time use proportions.
  • jitterdodge: Dodge the groups and jitter the points within each group at the same time.

Scatter Plot: Jitter or Identity?

Jittering adds a small amount of random noise to the data. It is often used to spread out points that would otherwise be overplotted. Overplotting occurs when two or more points are in the same place (or close enough to it) that you can’t tell from the plot how many points are there.

geom_jitter() works well on a small dataset:

plot <- ggplot(iris.wide, aes(x = Length, y = Width, color = Part, fill = Species))
plot + geom_point(shape = 21, alpha = 0.3)

plot + geom_jitter(shape = 21, alpha = 0.3)

# geom_jitter() and geom_point(position = 'jitter') are the same.
# Similarly, geom_point() is equivalent to geom_point(position = 'identity').

It helps much less on a bigger dataset:

plot2 <- ggplot(diamonds, aes(x = carat, y = price))
plot2 + geom_point(alpha = 0.1)

plot2 + geom_jitter(alpha = 0.1)

Bar Plot: Stack, Fill or Dodge?

Overplotting is not limited to scatter plots: bar plots have their own version of the problem when several groups share an x position.
The position options for bars are stack, fill and dodge.
In this example, position = 'dodge' seems to be the best choice.

plot3 <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am)))
plot3 + geom_bar() # The default position is 'stack'.

plot3 + geom_bar(position = 'fill')

plot3 + geom_bar(position = 'dodge')
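
To see which numbers the three positions draw, the following sketch tabulates the same two variables in base R. The names counts and props are illustrative.

```r
# Counts per cyl/am combination: the bar heights for 'stack' and 'dodge'
counts <- table(cyl = mtcars$cyl, am = mtcars$am)

# Proportions within each cyl group: the bar heights for 'fill'
props <- prop.table(counts, margin = 1)

counts
props
rowSums(props)   # each cyl group sums to 1
```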

Scale Function

All the aesthetics we saw earlier have an associated scale function.
The first argument is always the name of the scale; after that, the most common arguments are limits, breaks, expand and labels.
The third part of the function name (e.g. continuous or discrete) must match the type of data we are using.

  • scale_x…
    • scale_x_continuous('Length', limits = c(2, 8), breaks = seq(0, 8, by = 2), expand = c(0, 0))
      • Limits: describe the scale’s limits.
      • Breaks: control the breaks in the guide.
      • Expand: expand the range of the scales so that there is a small gap between the data and the axes.
  • scale_y…
  • scale_color…
    • scale_color_discrete('Species', labels = c('Setosa', 'Versicolor', 'Virginica'))
      • Labels: adjust the category names.
  • scale_fill…
  • scale_shape…
  • scale_linetype…

Let’s redraw the iris.wide and mtcars plots from above, this time adding scale functions.

plot + 
  scale_x_continuous('Length', limits = c(0, 8), breaks = seq(0, 8, by = 2), expand = c(0, 0)) +
  scale_fill_discrete('Species', labels = c('Setosa', 'Versicolor', 'Virginica')) +
  geom_jitter(shape = 21, alpha = 0.3)

plot3 +
  scale_x_discrete("Cylinders") +
  scale_y_continuous("Number") +
  scale_fill_manual('Transmission', 
                    values = c("#E41A1C", "#377EB8"),
                    labels = c("Automatic", "Manual")) + # am: 0 = automatic, 1 = manual
  geom_bar(position = "dodge")

Which Kind of Aesthetics?

For Continuous Variables

For Categorical Variables

Deal with Overplotting

You’ll have to deal with overplotting when you have:

  • Large datasets.
  • Imprecise data, so that points are not clearly separated on your plot (e.g. iris above).
  • Interval data (i.e. data appears at fixed values).
  • Aligned data values on a single axis.

Common Solutions:

  • Use alpha blending (i.e. adding transparency).
  • Use hollow shapes.
  • Use geom_jitter() if it helps.

plot4 <- ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl)))
# Randomly pick 800 rows from 'diamonds'.
plot5 <- ggplot(diamonds[sample(nrow(diamonds), 800), ], aes(x = clarity, y = carat, col = price))

# Hollow circles - an improvement
plot4 + geom_point(shape = 1, size = 4) + scale_color_discrete('Cylinders')

# Add transparency - great
plot4 + geom_point(alpha = 0.3, size = 4) + scale_color_discrete('Cylinders')

# Scatter plot: clarity (x), carat (y), price (color)
plot5 + geom_point(alpha = 0.5)

# Dot plot with jittering
plot5 + geom_point(position = 'jitter', alpha = 0.5)

Geometries

Common Plot Types

  • Scatter Plots: points, jitter, abline.
  • Bar Plots: histogram, bar, errorbar.
  • Line Plots: line

Scatter Plots & Jittering

Jitter

Notice that jitter can be

  • An argument in geom_point(position = 'jitter').
  • A geom itself: geom_jitter().
  • A position function: position_jitter(0.1).
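
The three forms can be compared side by side. This is a sketch on mtcars; the object names and the jitter width of 0.1 with zero height are illustrative choices.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg))

p1 <- p + geom_point(position = "jitter")   # jitter as an argument
p2 <- p + geom_jitter()                     # jitter as a geom
# Position function: jitter horizontally only, by at most 0.1
p3 <- p + geom_point(position = position_jitter(width = 0.1, height = 0))

p3
```

The position function is the most flexible form, because the amount of jitter can be set explicitly and the same position object can be reused across layers.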

As mentioned above, let’s use transparency, shape and jittering to deal with overplotting!

library(car)
ggplot(Vocab, aes(x = education, y = vocabulary)) +
  geom_jitter(alpha = 0.2, shape = 1)

Here are the codes for the different shapes.

Code of Shape

Notice that shapes 21-25 take both colour and fill.

Bar Plots

Histograms

  • Histograms are a specialized version of bar plots, and we will discuss bar plots later.
  • Notice that geom_histogram() doesn’t plot our data directly; it plots a statistic (the bin counts) computed from our dataset.
  • Position: stack, fill or dodge.

The x axis/aesthetic

  • geom_histogram() only requires one aesthetic: x
    • x is a continuous variable of interest.
    • Histogram shows the binned distribution of x.
  • About binwidth
    • geom_histogram() uses stat = "bin" by default.
    • Histograms cut up a continuous variable into discrete bins - that’s what the stat "bin" is doing.
    • binwidth: by default 30 bins are used, so the width is roughly range/30, e.g. diff(range(iris$Sepal.Width))/30
    • Use binwidth = x to adjust this.
    • Reference: The Difference of Different Binwidth
      • As binwidth becomes larger, the graph becomes smoother.
  • X-axis labels sit between the bars, which means they represent interval boundaries and not actual values.
  • There are no spaces between bars, which means it’s a continuous distribution.

The y axis/aesthetic

  • If geom_histogram() only requires x, where does y come from?
    • There is an internal data frame (summary table) in the background where this information is stored.
    • count: When geom_histogram() executes the binning statistic, it counts how many values are in each bin. This is the default for the y-axis.
    • density: The proportional frequency of this bin in relation to the whole data set. geom_histogram(aes(y = ..density..), binwidth = 0.1)
    • .. means that the variable comes from the internal data frame rather than the original one. Since ggplot2 3.4.0 this notation is deprecated in favour of after_stat(density).
ggplot(mtcars, aes(x = mpg, y = ..density..)) +
  geom_histogram(fill = '#377EB8', binwidth = 1)
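
The count and density columns of that internal table can be approximated by hand. This sketch uses base R's hist() with plot = FALSE. The breaks from 10 to 34 are an assumption chosen to cover the range of mpg, and ggplot2 may align its bins differently.

```r
mpg <- datasets::mtcars$mpg
binwidth <- 1

# Bin the data without drawing anything
h <- hist(mpg, breaks = seq(10, 34, by = binwidth), plot = FALSE)

h$counts                    # number of cars per bin (the default y)
h$density                   # counts / (n * binwidth)
sum(h$density * binwidth)   # the bar areas sum to 1

# Bin width implied by the default of 30 bins
diff(range(datasets::iris$Sepal.Width)) / 30
```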

Overlapping Histograms

  • Using geom_histogram(position = 'identity', binwidth = 1, alpha = 0.4).
  • Alternative: Drawing a frequency polygon by geom_freqpoly(position = 'identity', binwidth = 1).
# Overlapping Histograms
ggplot(mtcars, aes(mpg, fill = factor(cyl))) +
  geom_histogram(position = 'identity', binwidth = 1, alpha = 0.4) + 
  scale_fill_discrete('Cylinders')

# Unique solution here: a frequency polygon.
ggplot(mtcars, aes(mpg, col = factor(cyl))) +
  geom_freqpoly(position = 'identity', binwidth = 1) +
  scale_color_discrete('Cylinders')

Bar Plots

  • To make a proper bar plot, we need to use geom_bar().
  • Just like histograms, geom_bar() takes only one variable by default.
  • Stack, fill and dodge are also available in bar plots.
  • Types of bar plots:
    • Absolute counts (stat = 'count', the default).
    • Distribution (stat = 'identity'): bar heights are taken from a y variable in the data. This is the same as geom_col(), which takes two variables.
library(dplyr)
iris_mean <- iris %>%
  group_by(Species) %>%
  summarize(Avg.Sepal.Length = mean(Sepal.Length))
# Compare with geom_bar(stat = 'identity') & geom_col()
ggplot(iris_mean, aes(x = Species, y = Avg.Sepal.Length)) + geom_bar(stat = 'identity')

ggplot(iris_mean, aes(x = Species, y = Avg.Sepal.Length)) + geom_col()

Overlapping Bar Plots

# Overlapping Bar Plots
posn_d <- position_dodge(width = 0.2)
plot3 + geom_bar(position = posn_d, alpha = 0.6)

Color Ramp

  • Bar plots with color ramp: Use RColorBrewer. The default is 'Blues', which has nine colors.
  • colorRampPalette(c("#RRGGBB", "#RRGGBB"))
    • Manually create a color palette that can generate all the colours you need.
    • Input is a character vector of 2 or more colour values.
  • Combine with a scale function, e.g. scale_fill_manual(values = blue_col(11)), where blue_col is a palette function created with colorRampPalette().
library(RColorBrewer)
# Example of how to use colorRampPalette()
new_col <- colorRampPalette(c("#FFFFFF", "#0000FF"))
munsell::plot_hex(new_col(4))

# Default set of 'Blues'
munsell::plot_hex(brewer.pal(9, "Blues"))

# Example of how to use a brewed color palette
ggplot(mtcars, aes(x = cyl, fill = factor(am))) +
  geom_bar() +
  scale_fill_brewer(palette = "Set1")

# Example of combining a scale function with a custom palette
# Creating your own palette
blue_col <- colorRampPalette(c('#F7FBFF', '#08306B'))
# Take a look at your own palette.
munsell::plot_hex(blue_col(11))

#Plot with it
ggplot(Vocab, aes(x = factor(education), fill = factor(vocabulary))) +
  geom_bar(position = "fill") +
  scale_fill_manual('Vocabulary', values = blue_col(11)) +
  scale_x_discrete('Education') +
  scale_y_continuous('Proportion')

Line Plots

Let’s look at some line plots and see how to use these functions directly.
Here are the codes for the different linetypes.

Code of Linetype

Example1: Unemployment & Recession

library(ggplot2)
recess <- get(load('D:/Downloads/recess.RData'))
ggplot(economics, aes(x = date, y = unemploy/pop)) +
  geom_rect(data = recess,
         aes(xmin = begin, xmax = end, ymin = -Inf, ymax = Inf),
         inherit.aes = FALSE, fill = "red", alpha = 0.2) +
  scale_y_continuous('Unemployment Rate') +
  geom_line()

  • geom_line() for line plots.
  • geom_rect() for rectangles, which needs four aesthetics: xmin, xmax, ymin and ymax.

The geom_rect() layer here uses its own data frame, so it shouldn’t inherit aesthetics from the base ggplot() command it belongs to.
You should therefore specify inherit.aes = FALSE in geom_rect().
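
Because recess.RData is a local file, here is a self-contained sketch of the same idea. The data frame shade and its two date windows are placeholders invented for illustration and are not the contents of recess.RData.

```r
library(ggplot2)

# Placeholder shading windows (illustrative dates only)
shade <- data.frame(
  begin = as.Date(c("1980-01-01", "2007-12-01")),
  end   = as.Date(c("1980-07-01", "2009-06-01"))
)

p <- ggplot(economics, aes(x = date, y = unemploy / pop)) +
  # shade has no 'date' or 'unemploy' columns, so the x and y mappings
  # from ggplot() must not be inherited by this layer
  geom_rect(data = shade,
            aes(xmin = begin, xmax = end, ymin = -Inf, ymax = Inf),
            inherit.aes = FALSE, fill = "red", alpha = 0.2) +
  geom_line()
p
```

Without inherit.aes = FALSE, the rectangle layer would try to evaluate date and unemploy / pop on shade and fail when the plot is drawn.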

Example2: Salmon Species

fish <- get(load('D:/Downloads/fish.RData'))
fish.tidy <- gather(fish.species, Species, Capture, -Year)
# Or use gather(fish.species, Species, Capture, Pink:Atlantic)
ggplot(fish.tidy, aes(x = Year, y = Capture, col = Species)) +
  geom_line()

qplot and Wrap-up

qplot

For simple exploratory plots, there are a variety of functions available. ggplot2 offers a powerful and diverse array of functions, but qplot() allows for quick and dirty plots.

For example, let’s look at how to use different functions to plot the same graph.

# Base
plot(mpg ~ wt, data = mtcars) # formula notation

# Or using x, y notation: with(mtcars, plot(wt, mpg))

# ggplot
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# qplot
qplot(wt, mpg, data = mtcars)

Use geom_dotplot() to make a ‘true’ dot plot.

  • The difference is that unlike geom_point(), geom_dotplot() uses a binning statistic.
  • Binning means to cut up a continuous variable.
  • Reference: BinWidth in Histograms.
# qplot with more attributes
# qplot with geom "dotplot", binaxis = "y" and stackdir = "center"
qplot(
  cyl, wt,
  data = mtcars,
  fill = factor(am),
  geom = 'dotplot',
  binaxis = 'y',
  stackdir = "center"
)
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
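
Since qplot() has been deprecated as of ggplot2 3.4.0, the same dot plot can also be sketched with ggplot() directly. The explicit binwidth = 0.1 is an assumption made here for illustration, and it also avoids the stat_bindot() message above. factor(cyl) is used so that each cylinder count gets its own group.

```r
library(ggplot2)

dot_plot <- ggplot(mtcars, aes(x = factor(cyl), y = wt, fill = factor(am))) +
  # Bin along the y axis and stack the dots symmetrically around each x position
  geom_dotplot(binaxis = "y", stackdir = "center", binwidth = 0.1)
dot_plot
```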

Wrap-up & Example

Example1: Chicken Weight

ggplot(ChickWeight, aes(x = Time, y = weight, col = Diet)) +
  # To draw one line per chick, add group = Chick.
  geom_line(aes(group = Chick), alpha = 0.3) +
  geom_smooth(lwd = 2, se = FALSE)

Example2: Titanic

library(titanic)
# Rename the levels of the Survived factor.
factor_titanic_survived <- factor(titanic_train$Survived)
levels(factor_titanic_survived) <- c('Dead', 'Live')
titanic_train$Survived <- factor_titanic_survived
# Position function; arguments are jitter.width, jitter.height, dodge.width
posn.jd <- position_jitterdodge(0.5, 0, 0.6)
ggplot(titanic_train, aes(x = Pclass, y = Age, color = Sex)) +
  geom_point(position = posn.jd, size = 3, alpha = 0.5) +
  facet_grid(.~Survived)