Compare the two graphs below and see what the difference is.
library(ggplot2)
ggplot(mtcars, aes(x = cyl, y = mpg)) +
geom_point()
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
geom_point()
The first plot is not quite right, because ggplot2 treats cyl as a continuous variable, which gives the impression that there is such a thing as a 5- or 7-cylinder car, which there is not.
After wrapping cyl in factor(), ggplot2 treats cyl as a factor. This time the x-axis does not contain values like 5 or 7, only the values that are present in the dataset.
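A quick way to see why the two plots differ is to inspect the column itself. The sketch below uses only base R and the built-in data, read through datasets::mtcars so that later changes to a local copy do not matter; cyl_numeric and cyl_factor are illustrative names.

```r
# cyl is stored as a plain number, so ggplot2 draws a continuous axis for it.
cyl_numeric <- datasets::mtcars$cyl
class(cyl_numeric)    # "numeric"
range(cyl_numeric)    # 4 8, so 5 and 7 fall inside the axis range

# factor() keeps only the distinct values and treats them as categories.
cyl_factor <- factor(cyl_numeric)
levels(cyl_factor)    # "4" "6" "8"
```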
The basic template is ggplot(data, aes) + geom + optional layers, where the layers are joined with +.
ggplot(diamonds, aes(x = carat, y = price)) +
geom_point(alpha = 0.2) +
geom_smooth(aes(colour = clarity), se = FALSE) # se: whether to display the confidence band
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Limitations of base plot
Advantages of ggplot2
Let’s see how complex it will be if we use the base plot().
#Convert cyl to factor
mtcars$cyl <- as.factor(mtcars$cyl)
#Example from base R
plot(mtcars$wt, mtcars$mpg, col = mtcars$cyl)
abline(lm(mpg ~ wt, data = mtcars), lty = 2)
lapply(mtcars$cyl, function(x) {
abline(lm(mpg ~ wt, mtcars, subset = (cyl == x)), col = x)
})
Now try the same plot with the ggplot2 package.
ggplot(mtcars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point() + # points coloured by cyl
  geom_smooth(method = 'lm', se = FALSE) + # one linear fit per cyl group
  geom_smooth(aes(group = 1), method = 'lm', se = FALSE, linetype = 2) # one dashed fit across all cars
gather(df, key, val, -col): Rearranges the data frame into key-value pairs; the columns that are categorical variables are excluded with a - notation.
spread(df, key, val): Spreads key-value pairs into columns.
separate(df, col, into = c("col1", "col2"), sep = ""): Separates one column into multiple.
unite(df, col, ... (bare names of columns), sep = ""): Unites multiple columns into one.
The kind of graph you want to plot determines how you tidy your messy data.
library(tidyr)
#iris.tidy
iris.tidy <- iris %>%
gather(key, Value, -Species) %>%
separate(key, c("Part", "Measure"), "\\.")
head(iris.tidy)
## Species Part Measure Value
## 1 setosa Sepal Length 5.1
## 2 setosa Sepal Length 4.9
## 3 setosa Sepal Length 4.7
## 4 setosa Sepal Length 4.6
## 5 setosa Sepal Length 5.0
## 6 setosa Sepal Length 5.4
ggplot(iris.tidy, aes(x = Species, y = Value, col = Part)) +
geom_jitter() +
facet_grid(. ~ Measure)
#iris.wide
iris$Flower <- 1:nrow(iris)
iris.wide <- iris %>%
gather(key, value, -Species, -Flower) %>%
separate(key, c('Part','Measure'), "\\.") %>%
spread(Measure, value)
head(iris.wide)
## Species Flower Part Length Width
## 1 setosa 1 Petal 1.4 0.2
## 2 setosa 1 Sepal 5.1 3.5
## 3 setosa 2 Petal 1.4 0.2
## 4 setosa 2 Sepal 4.9 3.0
## 5 setosa 3 Petal 1.3 0.2
## 6 setosa 3 Sepal 4.7 3.2
ggplot(iris.wide, aes(x = Length, y = Width, color = Part)) +
geom_jitter() +
facet_grid(. ~ Species)
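gather() and spread() still work, but the tidyr documentation marks them as superseded by pivot_longer() and pivot_wider(). As a sketch, assuming tidyr 1.0.0 or later, the two data frames above could also be built as follows; iris_id, iris_long and iris_part are illustrative names, not objects from the text.

```r
library(tidyr)

# Start from the packaged data so an earlier iris$Flower assignment does not interfere.
iris_id <- datasets::iris
iris_id$Flower <- seq_len(nrow(iris_id))

# Counterpart of iris.tidy: one row per flower, part and measure.
iris_long <- pivot_longer(
  datasets::iris,
  cols = -Species,
  names_to = c("Part", "Measure"),
  names_sep = "\\.",
  values_to = "Value"
)

# Counterpart of iris.wide: ".value" sends the Measure half of each name
# (Length, Width) to its own column, so no separate spread step is needed.
iris_part <- pivot_longer(
  iris_id,
  cols = -c(Species, Flower),
  names_to = c("Part", ".value"),
  names_sep = "\\."
)

head(iris_part)
```

The rows may come out in a different order than with gather() and spread(), but the columns are the same, so the ggplot() calls above should work on these data frames unchanged.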
x: x-axis position.
y: y-axis position.
colour: colour of dots, outlines or other shapes.
fill: fill colour, typically the inside shading.
size: the diameter of points, the thickness of lines, and the font size of text.
alpha: transparency (0: transparent, 1: opaque).
linetype: line dash patterns.
labels: text on a plot or axes.
shape: shape of a point.
A word about shapes
Shapes in R can have a value from 0 to 25. Shapes 0-20 accept only a color aesthetic, but shapes 21-25 have both a color and a fill aesthetic.
A word about hexadecimal colours
Hexadecimal, literally “related to 16”, is a base-16 alphanumeric counting system. Individual digits come from the ranges 0-9 and A-F, so there are 256 possible two-digit values (00 to FF). Hexadecimal colours use this system to specify a six-digit code for the Red, Green and Blue values ("#RRGGBB").
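To make the "#RRGGBB" layout concrete, here is a small base R sketch that takes apart one of the colours used later in these notes and puts it back together. Only base functions (col2rgb(), strtoi(), rgb()) are used.

```r
# Split a hex colour into its red, green and blue components (0-255 each).
col2rgb("#377EB8")

# Two hex digits cover 16 * 16 = 256 values, from 00 (0) to FF (255).
strtoi(c("00", "37", "FF"), base = 16L)

# Going the other way: build the hex code from the three components.
rgb(55, 126, 184, maxColorValue = 255)
```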
Aesthetics and Attributes
Every visible aesthetic can be used either as an aesthetic mapping or as a fixed attribute. Variables in a data frame are mapped to aesthetics inside aes() (e.g. aes(col = cyl)), usually within ggplot(). Fixed visual properties are set as attributes in a specific geom layer (e.g. geom_point(col = "red")).
# We're focusing on aesthetic mappings here.
ggplot(mtcars, aes(x = mpg, y = qsec, col = factor(cyl), shape = factor(am), size = (hp/wt))) +
geom_point()
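The difference between mapping and setting can be checked by looking at the data a layer actually draws. This is a minimal sketch using layer_data() from ggplot2; cars, p_mapped and p_fixed are illustrative names.

```r
library(ggplot2)

cars <- datasets::mtcars

# Aesthetic mapping: colour depends on a variable, so it goes inside aes().
p_mapped <- ggplot(cars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point()

# Attribute: one fixed colour for the whole layer, set outside aes().
p_fixed <- ggplot(cars, aes(x = wt, y = mpg)) +
  geom_point(col = "red")

# layer_data() returns the data a layer draws, after scales are applied.
unique(layer_data(p_mapped)$colour)  # one colour per cyl level
unique(layer_data(p_fixed)$colour)   # just "red"
```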
Position specifies how ggplot will adjust for overlapping bars or points in a single layer.
Jittering adds a small amount of random noise to the data. It is often used to spread out points that would otherwise be overplotted, that is, when several points sit in the same place (or close enough to it) that you can’t tell from the plot how many points there are.
geom_jitter() works well on small datasets.
plot <- ggplot(iris.wide, aes(x = Length, y = Width, color = Part, fill = Species))
plot + geom_point(shape = 21, alpha = 0.3)
plot + geom_jitter(shape = 21, alpha = 0.3)
# geom_jitter() and geom_point(position = 'jitter') are the same.
# Similarly, geom_point() is equivalent to geom_point(position = 'identity').
plot2 <- ggplot(diamonds, aes(x = carat, y = price))
plot2 + geom_point(alpha = 0.1)
plot2 + geom_jitter(alpha = 0.1)
Jittering is for points; bar plots suffer from their own issues of overplotting.
Three positions deal with them: stack, fill and dodge.
In this example, position = 'dodge' seems to be the best choice.
plot3 <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am)))
plot3 + geom_bar() # The default position is 'stack'.
plot3 + geom_bar(position = 'fill')
plot3 + geom_bar(position = 'dodge')
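What the three positions do to the bars can also be read off the computed layer data. The following is a sketch using layer_data() on the same mtcars mapping; bars, stacked, filled and dodged are illustrative names.

```r
library(ggplot2)

bars <- ggplot(datasets::mtcars, aes(x = factor(cyl), fill = factor(am)))

stacked <- layer_data(bars + geom_bar(position = "stack"))
filled  <- layer_data(bars + geom_bar(position = "fill"))
dodged  <- layer_data(bars + geom_bar(position = "dodge"))

max(stacked$ymax)  # height of the tallest stack, i.e. the largest cyl group
max(filled$ymax)   # 1, because every stack is rescaled to proportions
max(dodged$ymax)   # height of the tallest single cyl/am bar
```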
All the aesthetics we saw earlier have an associated scale function.
Scale function names have three parts, as in scale_x_continuous: scale, the aesthetic, and the type of data. The third part must match the type of data we are using.
The first argument is always the name of the scale; after that, the most common arguments are limits, breaks, expand, and labels.
scale_x_continuous('Length', limits = c(2, 8), breaks = seq(0, 8, by = 2), expand = c(0, 0))
scale_color_discrete('Species', labels = c('Setosa', 'Versicolor', 'Virginica'))
Let’s redraw the iris.wide and mtcars plots above with a few more scale commands.
plot +
scale_x_continuous('Length', limits = c(0, 8), breaks = seq(0, 8, by = 2), expand = c(0, 0)) +
scale_fill_discrete('Species', labels = c('Setosa', 'Versicolor', 'Virginica')) +
geom_jitter(shape = 21, alpha = 0.3)
plot3 +
scale_x_discrete("Cylinders") +
scale_y_continuous("Number") +
scale_fill_manual('Transmission',
values = c("#E41A1C", "#377EB8"),
labels = c("Automatic", "Manual")) + # in mtcars, am: 0 = automatic, 1 = manual
geom_bar(position = "dodge")
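You’ll have to deal with overplotting in several situations, for example when many points share the same or nearly the same position (as in the iris plots above).
Common solutions, shown below, are hollow shapes, transparency and jittering: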
plot4 <- ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl)))
plot5 <- ggplot(diamonds[sample(nrow(diamonds), 800), ], aes(x = clarity, y = carat, col = price)) # Randomly sample 800 rows of 'diamonds'.
# Hollow circles - an improvement
plot4 + geom_point(shape = 1, size = 4) + scale_color_discrete('Cylinders')
# Add transparency - great
plot4 + geom_point(alpha = 0.3, size = 4) + scale_color_discrete('Cylinders')
# Scatter plot: clarity (x), carat (y), price (color)
plot5 + geom_point(alpha = 0.5)
# Dot plot with jittering
plot5 + geom_point(position = 'jitter', alpha = 0.5)
Notice that jitter can be specified in three ways:
As an argument: geom_point(position = 'jitter').
As a geom: geom_jitter().
As a position function: position_jitter(0.1).
As mentioned above, let’s use transparency, shape and jittering to deal with overplotting!
library(car)
ggplot(Vocab, aes(x = education, y = vocabulary)) +
geom_jitter(alpha = 0.2, shape = 1)
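The three ways of asking for jitter can be put side by side. This is a sketch on mtcars rather than Vocab, so that it runs without extra packages; the width of 0.1 and the seed are arbitrary illustrative values, and the seed argument assumes ggplot2 3.0.0 or later.

```r
library(ggplot2)

cars <- datasets::mtcars
base <- ggplot(cars, aes(x = cyl, y = mpg))

p_arg  <- base + geom_point(position = "jitter")  # as an argument
p_geom <- base + geom_jitter()                    # as a geom
# As a position function, which exposes width, height and a seed.
p_fun  <- base +
  geom_point(position = position_jitter(width = 0.1, height = 0, seed = 1))

# The jittered x values stay within 0.1 of the original cyl values.
max(abs(layer_data(p_fun)$x - cars$cyl))

# With a fixed seed the jittered positions are reproducible.
identical(layer_data(p_fun)$x, layer_data(p_fun)$x)
```

The position function is the most verbose form, but it is the only one of the three that lets the same jitter settings be reused across several layers.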
And here are the codes for the different shapes.
For shapes 21 ~ 25, both colour & fill are needed.
geom_histogram() doesn’t plot our data directly; it plots a statistic computed from our dataset.
The x axis/aesthetic
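geom_histogram() only requires one aesthetic: x.
x is a continuous variable of interest.
binwidth
geom_histogram() uses the argument stat = "bin" by default.
Binning the data is what stat "bin" is doing. By default it uses 30 bins, so for iris$Sepal.Width the default bin width is diff(range(iris$Sepal.Width))/30.
Use binwidth = x, where x is the width you want, to adjust this.
The y axis/aesthetic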
geom_histogram() only requires x, so where does y come from?
When geom_histogram() executed the binning statistic, it counted how many values are in each bin. This count is the default for the y-axis.
To plot densities instead, use geom_histogram(aes(y = ..density..), binwidth = 0.1).
The .. notation means that the variable comes from the internal data frame computed by the statistic rather than from the original data.
ggplot(mtcars, aes(x = mpg, y = ..density..)) +
geom_histogram(fill = '#377EB8', binwidth = 1)
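Recent ggplot2 versions recommend after_stat(density) in place of the ..density.. notation. The sketch below assumes ggplot2 3.3.0 or later and uses layer_data() to look at the internal data frame; p_density and bins are illustrative names.

```r
library(ggplot2)

# after_stat() refers to a variable computed by the stat; it plays the same
# role as the ..density.. notation.
p_density <- ggplot(datasets::mtcars, aes(x = mpg, y = after_stat(density))) +
  geom_histogram(fill = "#377EB8", binwidth = 1)

# The stat's internal data frame holds both count and density for each bin.
bins <- layer_data(p_density)
head(bins[, c("x", "count", "density")])

# Densities are scaled so that the bar areas sum to 1.
sum(bins$density * (bins$xmax - bins$xmin))
```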
Overlapping Histograms
Semi-transparent bars: geom_histogram(position = 'identity', binwidth = 1, alpha = 0.4).
Frequency polygon: geom_freqpoly(position = 'identity', binwidth = 1).
# Overlapping Histograms
ggplot(mtcars, aes(mpg, fill = factor(cyl))) +
geom_histogram(position = 'identity', binwidth = 1, alpha = 0.4) +
scale_fill_discrete('Cylinders')
# Unique solution here: a frequency polygon.
ggplot(mtcars, aes(mpg, col = factor(cyl))) +
geom_freqpoly(position = 'identity', binwidth = 1) +
scale_color_discrete('Cylinders')
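Bar plots are drawn with geom_bar().
By default geom_bar() takes one variable only and counts the cases (stat = 'count' in current ggplot2 versions).
To plot precomputed values, use geom_bar(stat = 'identity'), which is the same as geom_col() and takes two variables.
library(dplyr)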
iris_mean <- iris %>%
group_by(Species) %>%
summarize(Avg.Sepal.Length = mean(Sepal.Length))
# Compare geom_bar(stat = 'identity') with geom_col()
ggplot(iris_mean, aes(x = Species, y = Avg.Sepal.Length)) + geom_bar(stat = 'identity')
ggplot(iris_mean, aes(x = Species, y = Avg.Sepal.Length)) + geom_col()
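The relation between the counting default and the two-variable form can be verified directly. In this sketch the counts are computed by hand with table() and then compared with what geom_bar() computes itself; cars, cyl_counts, p_bar and p_col are illustrative names.

```r
library(ggplot2)

cars <- datasets::mtcars

# geom_bar() counts the rows for each x value itself.
p_bar <- ggplot(cars, aes(x = factor(cyl))) + geom_bar()

# geom_col() expects the heights to be computed beforehand.
cyl_counts <- as.data.frame(table(cyl = cars$cyl))
p_col <- ggplot(cyl_counts, aes(x = cyl, y = Freq)) + geom_col()

# Both layers end up drawing bars of the same height.
layer_data(p_bar)$y
layer_data(p_col)$y
```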
Overlapping Bar Plots
# Overlapping Bar Plots
posn_d <- position_dodge(width = 0.2)
plot3 + geom_bar(position = posn_d, alpha = 0.6)
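Color Ramp
Brewed palettes come from RColorBrewer. The default is 'Blues', which has nine colors.
To get more colors than a palette offers, build a ramp function with colorRampPalette(c("#RRGGBB", "#RRGGBB")).
Then pass its output to a manual scale, e.g. scale_fill_manual(values = blue_range(11)), where blue_range is the ramp function.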
library(RColorBrewer)
# Example of how to use colorRampPalette()
new_col <- colorRampPalette(c("#FFFFFF", "#0000FF"))
munsell::plot_hex(new_col(4))
# Default set of 'Blues'
munsell::plot_hex(brewer.pal(9, "Blues"))
# Example of how to use a brewed color palette
ggplot(mtcars, aes(x = cyl, fill = factor(am))) +
geom_bar() +
scale_fill_brewer(palette = "Set1")
# Example of combining a scale function with a custom palette
# Creating your own palette
blue_col <- colorRampPalette(c('#F7FBFF', '#08306B'))
# Take a look at your own palette.
munsell::plot_hex(blue_col(11))
#Plot with it
ggplot(Vocab, aes(x = factor(education), fill = factor(vocabulary))) +
geom_bar(position = "fill") +
scale_fill_manual('Vocabulary', values = blue_col(11)) +
scale_x_discrete('Education') +
scale_y_continuous('Proportion')
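One detail that is easy to miss is that colorRampPalette() returns a function rather than a vector of colours. A short base R sketch, with blue_ramp and blues_11 as illustrative names:

```r
# colorRampPalette() returns a function; calling it with n gives n colours
# interpolated between the endpoints.
blue_ramp <- colorRampPalette(c("#F7FBFF", "#08306B"))
is.function(blue_ramp)

blues_11 <- blue_ramp(11)
length(blues_11)       # 11, one colour per vocabulary level in the plot above
blues_11[c(1, 11)]     # the two endpoint colours
```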
Let’s look at some line plots and see how to use those functions directly.
And here are the codes for the different linetypes.
library(ggplot2)
recess <- get(load('D:/Downloads/recess.RData'))
ggplot(economics, aes(x = date, y = unemploy/pop)) +
geom_rect(data = recess,
aes(xmin = begin, xmax = end, ymin = -Inf, ymax = Inf),
inherit.aes = FALSE, fill = "red", alpha = 0.2) +
scale_y_continuous('Unemployment Rate') +
geom_line()
geom_line() draws line plots.
geom_rect() draws rectangles and needs four aesthetics: xmin, xmax, ymin and ymax.
The geom_rect() layer here shouldn’t inherit aesthetics from the base ggplot() call it belongs to, so specify inherit.aes = FALSE in geom_rect().
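Since recess.RData is a local file, here is a self-contained sketch of the same pattern. The windows data frame is hypothetical: its dates are arbitrary values chosen for illustration, not the contents of recess.RData and not actual recession dates.

```r
library(ggplot2)

# Hypothetical shading windows: arbitrary dates for illustration only.
windows <- data.frame(
  begin = as.Date(c("1980-01-01", "2000-01-01")),
  end   = as.Date(c("1982-01-01", "2002-01-01"))
)

p_shaded <- ggplot(economics, aes(x = date, y = unemploy / pop)) +
  # The rectangles use their own data and aesthetics, so the x and y
  # mappings from ggplot() must not be inherited.
  geom_rect(data = windows,
            aes(xmin = begin, xmax = end, ymin = -Inf, ymax = Inf),
            inherit.aes = FALSE, fill = "red", alpha = 0.2) +
  geom_line()

p_shaded
```

Without inherit.aes = FALSE, the rectangle layer would also look for date, unemploy and pop in windows, which does not contain those columns.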
fish <- get(load('D:/Downloads/fish.RData'))
fish.tidy <- gather(fish.species, Species, Capture, -Year)
#Or use gather(fish.species, Species, Capture, Pink: Atlantic)
ggplot(fish.tidy, aes(x = Year, y = Capture, col = Species)) +
geom_line()
For simple exploratory plots, there are a variety of functions available. ggplot2 offers a powerful and diverse array of functions, but qplot() allows for quick and dirty plots (note that qplot() is deprecated as of ggplot2 3.4.0).
For example, let’s look at how to use different functions to plot the same graph.
# Base
plot(mpg ~ wt, data = mtcars) # formula notation
# Or using x, y notation: with(mtcars, plot(wt, mpg))
# ggplot
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point()
# qplot
qplot(wt, mpg, data = mtcars)
Use geom_dotplot() to make a ‘true’ dot plot.
Unlike geom_point(), geom_dotplot() uses a binning statistic.
# qplot with more attributes
# qplot with geom "dotplot", binaxis = "y" and stackdir = "center"
qplot(
cyl, wt,
data = mtcars,
fill = factor(am),
geom = 'dotplot',
binaxis = 'y',
stackdir = "center"
)
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
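Because qplot() is deprecated in recent ggplot2 releases, the same dot plot can be written with ggplot() directly. In this sketch binwidth = 0.1 is an illustrative choice, not a value from the text; setting it replaces the default of 30 bins that the message above refers to.

```r
library(ggplot2)

p_dot <- ggplot(datasets::mtcars,
                aes(x = factor(cyl), y = wt, fill = factor(am))) +
  # binaxis = "y" bins along wt; stackdir = "center" stacks the dots
  # symmetrically around each cyl position.
  geom_dotplot(binaxis = "y", stackdir = "center", binwidth = 0.1)

p_dot
```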
Example 1: Chicken Weight
ggplot(ChickWeight, aes(x = Time, y = weight, col = Diet)) +
# To draw one line per chick, add group = Chick.
geom_line(aes(group = Chick), alpha = 0.3) +
geom_smooth(lwd = 2, se = FALSE)
Example 2: Titanic
library(titanic)
# Relabel Survived (0/1) as a factor.
titanic_train$Survived <- factor(titanic_train$Survived, levels = c(0, 1), labels = c('Dead', 'Live'))
# Position function: jitter and dodge at the same time.
posn.jd <- position_jitterdodge(jitter.width = 0.5, jitter.height = 0, dodge.width = 0.6)
ggplot(titanic_train, aes(x = Pclass, y = Age, color = Sex)) +
geom_point(position = posn.jd, size = 3, alpha = 0.5) +
facet_grid(.~Survived)