Exploring Data

Table & Frequency

Recoding Variables

Currently, the mpg variable of mtcars is a continuous numeric variable, but it may be more useful if mpg were a categorical variable that immediately told you whether the car had low or high miles per gallon.

# Work on a copy: if you make a mistake, your original dataset stays intact!
mtcars2 <- mtcars
# Assign the label "high" to mpgcategory where mpg is greater than or equal to 20
mtcars2$mpgcategory[mtcars2$mpg >= 20] <- 'high'
# Assign the label "low" to mpgcategory where mpg is less than 20
mtcars2$mpgcategory[mtcars2$mpg < 20] <- 'low'
# Convert mpgcategory to a factor
mtcars2$mpgcategory <- as.factor(mtcars2$mpgcategory)
head(mtcars2$mpgcategory)
## [1] high high high high low  low 
## Levels: high low


Table

Frequency tables show you how often a given value occurs.
table(): The top row of the output is the value, and the bottom row is the frequency of the value.

table(mtcars2$mpgcategory)
## 
## high  low 
##   14   18


Bar Graph

Reference: Data Visualization with ggplot2 (I) (II) (III)

Bar Plot: Categorical Variable

barplot(): Base function for bar plot.

  • ylab = '*': Add a label to the y axis.
  • names.arg = c(...): Labels for the bars on the x axis.
barplot(table(mtcars2$mpgcategory), ylab = 'Number of Cars', names.arg = c('High','Low'))


Histograms: Continuous Variable

hist(): Base function for histogram.

  • main = '*': Title.
  • ylim = c(*, *): Set the range of y axis.
  • xlab = '*': Title for x axis.
hist(mtcars$carb, main = "Carburetors", ylim = c(0,20), xlab = 'Number of Carburetors', col = 'lightblue')


Numeric Summaries

  • Mean: mean()
  • Median: median()
  • Mode: sort(table(), decreasing = T)[1]
  • Quantiles: quantile() e.g. the first quartile = quantile()[2]
  • Outliers: 1.5*IQR below the first quartile or 1.5*IQR above the third quartile
  • Standard Deviation: sd()
  • Variance: var() = sum((x - mean(x))^2)/(n-1)
  • Range: diff(range())
  • Interquartile Range: IQR()
  • Z-scores: (x - mean(x))/sd(x)
    • Distribution & Z-scores
    • 95.46% of observations lie between z = -2 and z = 2.
    • Outliers: z > 3 or z < -3.
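As a quick illustration (a minimal sketch, reusing the mpg variable of mtcars from above), the summaries listed above can be computed like this:

# Numeric summaries of mtcars$mpg
mean(mtcars$mpg)
median(mtcars$mpg)
sort(table(mtcars$mpg), decreasing = TRUE)[1]           # mode (most frequent value)
quantile(mtcars$mpg)[2]                                 # first quartile
sd(mtcars$mpg)
var(mtcars$mpg)
diff(range(mtcars$mpg))
IQR(mtcars$mpg)
head((mtcars$mpg - mean(mtcars$mpg)) / sd(mtcars$mpg))  # z-scores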


Probability

Basis

Sample Space

You have a bag containing three A and two E scrabble letters.
Consider the sample space of drawing three letters from the bag without replacing them (each draw depends on the previous ones).
samplespace <- c('AAA', 'AAE', 'AEA', 'EAA', 'EEA', 'EAE', 'AEE')
The sample space contains each possible sequence of A and E,
except for three E's, because this is not possible with only two E letters.

Independent

In your scrabble letter bag you have three A's and two E's.
You take one letter out at a time, and return each letter to the bag before continuing with the following draw, so the draws are independent.
# What is the probability of AAE?
aae <- 3/5 * 3/5 * 2/5
# What is the probability of EAE?
eae <- 2/5 * 2/5 * 3/5
# What is the probability of AAA or EEA?
aaaeea <- 3/5 * 3/5 * 3/5 + 2/5 * 2/5 * 3/5

Dependent

This time, you take one letter out of the bag at a time,
but you do not return the letters to the bag before continuing with the following draw, so the draws are dependent.
# What is the probability of AAE?
aae <- 3/5 * 2/4 * 2/3
# What is the probability of EAE?
eae <- 2/5 * 3/4 * 1/3
# What is the probability of AAA or EEA?
aaaeea <- 3/5 * 2/4 * 1/3 + 2/5 * 1/4 * 3/3

Probability Terms

  • Complement: P(not A) = 1 - P(A)
  • Independent intersecting events: Two events that do not influence each other and can occur simultaneously.
  • Disjoint exhaustive events: Mutually exclusive, so only one of the events can happen at a time; together they cover the whole sample space.
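A tiny sketch of the complement rule with a made-up probability:

# If P(A) = 0.4, the probability that A does not occur is its complement
1 - 0.4
## [1] 0.6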


Rules

Union

P(A or B) = P(A) + P(B) - P(A and B)

The probability of picking a flower that is at least partly blue is 0.4. 
The probability of picking a flower that is at least partly pink is 0.2.
Q: Calculate the probability of picking a flower that is blue, pink, or both (assuming blue and pink occur independently).
0.2 + 0.4 - (0.2 * 0.4)
## [1] 0.52

Conditional Probability

P(A|B) = P(A and B)/P(B)

'|' means 'given': the probability that A will happen, given that B has occurred.
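For example, a minimal sketch reusing the scrabble bag above (three A's and two E's, drawn without replacement): the probability that the second letter is an E given that the first was an A.

# P(second is E | first is A) = P(A then E) / P(A)
p_a_then_e <- 3/5 * 2/4
p_a <- 3/5
p_a_then_e / p_a
## [1] 0.5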

Independence

P(A|B) = P(A)

The probability of one event is independent of another 
if the probability of the first event occurring is unaffected by whether or not the other event occurs.
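A small check of this idea, reusing the hypothetical flower probabilities from the union example above and assuming blue and pink are independent:

p_blue <- 0.4
p_pink <- 0.2
p_both <- p_blue * p_pink   # P(blue and pink) under independence
p_both / p_pink             # P(blue | pink) equals P(blue), so the events are independent
## [1] 0.4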

Bayesian Probability

P(A|B) = (P(B|A) * P(A))/P(B)

Information about B can help tell us whether A will happen or not,
because any information you have reduces the sample space of possible events that can occur.
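A small numeric sketch of Bayes' rule with made-up probabilities: suppose P(A) = 0.01, P(B|A) = 0.9 and P(B|not A) = 0.1.

p_a <- 0.01
p_b_given_a <- 0.9
p_b_given_not_a <- 0.1
# Total probability of B
p_b <- p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
# Bayes' rule: P(A|B)
p_b_given_a * p_a / p_b
## [1] 0.08333333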


Probability Distributions

Density Function

Example: Simulated Normal Distribution

  • dnorm(): Gives the density.
  • pnorm(): Gives the distribution function.
  • qnorm(): Gives the quantile function.
  • rnorm(): Generates random deviates.

To get a probability,

  • You will need to consider an interval under the curve of the probability density function.
  • Probabilities here are thus considered surface areas.
# Simulate some random normally distributed data
set.seed(11225)
data <- rnorm(10000)

# Calculate the density of data and store it in the variable density
density <- dnorm(data)

# Make a plot with as x variable data and as y variable density
plot(data, density)
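To make the "surface area" point concrete, here is a minimal sketch (not part of the original lab) that computes the probability of an interval under the standard normal density:

# P(-1 < X < 1) for a standard normal variable: the area between -1 and 1
pnorm(1) - pnorm(-1)
## [1] 0.6826895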


Mass Function

data <- data.frame(outcome = 0:5, probs = c(0.1, 0.2, 0.3, 0.2, 0.1, 0.1))
data
##   outcome probs
## 1       0   0.1
## 2       1   0.2
## 3       2   0.3
## 4       3   0.2
## 5       4   0.1
## 6       5   0.1
Q: Make a bar plot of the probability distribution.
barplot(data$probs, names.arg = c(0:5))

Cumulative Probability

Q: The cumulative probability that x is smaller than or equal to 0, 1, 2, and 3.
# Using cumsum() for cumulative sum.
cumsum(data$probs[1:4])
## [1] 0.1 0.3 0.6 0.8


Expected Value - The Mean

  • The mean of a probability distribution is calculated by taking the weighted average of all possible values that a random variable can take.
  • In the case of a discrete variable, you calculate the sum of each possible value times its probability.


Q: Calculate the expected value (the mean) of this distribution.
expected_score <- sum(data$outcome * data$probs)
expected_score
## [1] 2.3


Variance & Standard Deviation

  • Variance
    • The variance is often taken as a measure of spread of a distribution.
    • It is the sum of the squared differences between each possible value and the mean, multiplied by their probabilities.
  • Standard Deviation
    • Take variance’s square root.
Q: Calculate the variance and store it in a variable called variance.     
variance <- sum((data$outcome -expected_score)^2 * data$probs)
variance
## [1] 2.01
Q: Calculate the standard deviation and store it in a variable called std.
std <- sqrt(variance)
std
## [1] 1.417745


Normal Distribution

pnorm() & Cumulative Probability

Hair length is considered to be normally distributed with a mean of 25 cm and a standard deviation of 5. 
Q: The probability that a woman's hair length is less than 30.
pnorm(30, mean = 25, sd = 5)
## [1] 0.8413447
Q: The probability of a woman having a hair length larger than or equal to 30 centimeters.
pnorm(30, mean = 25, sd = 5, lower.tail = FALSE)
## [1] 0.1586553

Let’s visualize this.
Note that the first example is visualized on the left, while the second example is visualized on the right.


qnorm() & Quantile

Sometimes we have a probability that we want to associate with a value.     
This is basically the opposite situation as the situation described in the previous question.     
Q: The value of a woman's hair length that corresponds with the 0.2 quantile.
qnorm(0.2, mean = 25, sd = 5)
## [1] 20.79189

Standard Normal Distribution & Z-scores

  • A special form of the normal probability distribution is the standard normal distribution, also known as the z-distribution.
  • A z distribution has a mean of 0 and a standard deviation of 1.
  • Transform to z-scores.
    • Subtract the mean from each value of the variable.
    • Divide the result by the standard deviation.
A woman has a hair length of 38 centimeters; the average hair length is 25 centimeters and the standard deviation is 5 centimeters.
Q: Convert this value to a z-score.
round(((38 - 25) / 5), 1)
## [1] 2.6


Binomial Distribution

  • The binomial distribution is a discrete distribution that describes counts of binary outcomes.
  • Some conditions need to be met before you can consider a random variable to be binomially distributed:
    • Bernoulli trial: A phenomenon or trial with two possible outcomes and a constant probability of success.
    • All trials are independent.
  • n: Observation of a certain number of trials.
  • x: The number of successes in which we are interested.

  • Mean

n * p

  • Standard Deviation

sqrt(n * p * (1-p))

  • Standard Error (of the sample proportion)

sqrt(p * (1-p) / n)

Consider an exam consisting of 25 multiple-choice questions. 
Each question has 5 possible answers. 
This means that the probability of answering a question correctly by chance is `0.2`.

Q: The mean of this distribution.
25 * 0.2
## [1] 5
Q: The standard deviation.
sqrt(25 * 0.2 * 0.8)
## [1] 2


dbinom(), pbinom() & qbinom()

  • dbinom(): Calculates an exact probability.
  • pbinom(): Calculates a cumulative probability (an interval of probabilities).
  • qbinom(): Calculates the value associated with a certain quantile.


Q: Probability of answering 5 questions correctly.
dbinom(x = 5, size = 25, prob = 0.2)
## [1] 0.1960151
Q: Probability of answering at least 5 questions correctly.
pbinom(q = 4, size = 25, prob = 0.2, lower.tail = F)
## [1] 0.5793257
Q: Calculate the 60th percentile.   
qbinom(p = 0.6, size = 25, prob = 0.2)
## [1] 5


Sampling Distributions

Population & Sampling

In this lab we have access to the entire population.
In real life, this is rarely the case. Often we gather information by taking a sample from a population.


Sampling from Population

In this section you can see that the more samples you take, the closer the mean of your sample means gets to the actual population mean.

library(ggplot2)
# Use sample() to randomly sample 200 diamonds' prices.
sample_price <- sample(diamonds$price, size = 200)
mean(sample_price)
## [1] 3509.43
# Use a for loop to take many samples and compare their means.
# Empty vector sample means
sample_means <- NULL

# Take 500 samples of size 200 from diamonds' prices.
for (i in 1:500){
  samp <- sample(diamonds$price, 200)
  sample_means[i] <- mean(samp)
}

# Calculate the population mean, that is, the mean of all diamonds' prices, and print it.
mean(diamonds$price)
## [1] 3932.8
# Calculate the mean of the sample means, that is, sample_means.
mean(sample_means)
## [1] 3926.979


Summary Statistics of Sampling

So far, you should be familiar with variance & standard deviation.
But the calculation for a population and for a sample is subtly different.

Variance & Standard Deviation of Population

The denominator is n when computing statistics for the whole population.


Variance & Standard Deviation of Sampling

However, the denominator is n - 1 when computing statistics of a sample.
The base R functions var() & sd() calculate sample statistics.
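A minimal sketch contrasting the two denominators, using mtcars$mpg as an arbitrary example vector:

x <- mtcars$mpg
n <- length(x)
# Population variance: denominator n
sum((x - mean(x))^2) / n
# Sample variance: denominator n - 1, identical to var(x)
sum((x - mean(x))^2) / (n - 1)
var(x)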


Standard Error of the Mean: Standard Deviation of Sampling

  • The standard deviation measures the amount of variability, or dispersion, of a set of data around its mean.
  • The standard error of the mean measures how far the sample mean of the data is likely to be from the true population mean.
  • The SEM is always smaller than the SD.
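A minimal sketch of the difference, assuming the diamonds data from ggplot2 is still loaded as above:

samp <- sample(diamonds$price, 200)
sd(samp)              # standard deviation of the sample
sd(samp) / sqrt(200)  # standard error of the mean, always smaller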


The Central Limit Theorem

“Provided that the sample size is sufficiently large, the sampling distribution of the sample mean is approximately normally distributed even if the variable of interest is not normally distributed in the population”

The sampling distribution, just as the central limit theorem states, is normally distributed!

# A Histogram of Diamonds' Prices
hist(diamonds$price)

# A Histogram of sample_means
hist(sample_means)


Interpretation of Z-scores

Z-score: the number of standard errors an observation lies from the mean of the sampling distribution.
We can use pnorm() to convert z-scores into probabilities.
Here’s an example diamond which costs 4400 dollars.

# Z-score calculation.
sampling_sd <- sd(diamonds$price)/sqrt(200)
Z <- (4400 - mean(diamonds$price)) / sampling_sd
Z
## [1] 1.656175
# Calculate the area under the curve left of the observation
pnorm(q = Z, lower.tail = TRUE)
## [1] 0.9511568
# Calculate the area under the curve right of the observation
pnorm(q = Z, lower.tail = FALSE)
## [1] 0.04884321


Sampling Distributions and Proportions

In all the examples that we have seen, we worked with continuous variables such as diamonds’ prices.
However, in practice we often work with proportions, such as the percentage of hipsters in the population of London.
Here’s the formula for the standard deviation of the sampling distribution of a proportion:

sqrt(p * (1-p) / n)

We took a sample of 200 from the population of London and based on this sample we concluded that London's population is 10% hipster. 
The mean of the sampling distribution thus is 10%.
# Sample proportion
proportion_hipsters <- 0.10

# Standard deviation of the sampling distribution
sample_sd <- sqrt(0.1 * 0.9 / 200)
After that, we took a random sample of 200 people from London's general population, and a proportion of 0.13 of these people were hipsters. 
We however know that in the population of London, the proportion of hipsters is 0.10. 
What is the probability of finding a sample of 200 with a proportion of 0.13 or more hipsters?
# Calculate the probability
pnorm(0.13, mean = 0.10, sd = sample_sd, lower.tail = FALSE)
## [1] 0.0786496


Confidence Intervals

Definition of CI

A confidence interval gives a range of plausible values for a population parameter, together with a confidence level that tells you how sure you can be that the interval contains the parameter.
The wider the interval, the higher the confidence; conversely, the narrower the interval, the more precise it is.

Your road rage study suggested that a proportion of 0.375 of your sample felt that they had road rage. 
The 95% confidence interval was from 0.308 to 0.442. 
Q: What does this mean?

We are 95% sure that the road rage proportion in the population will lie between 0.308 and 0.442.   

Z-score, CI & Normal Distribution

We’ve discussed Standard Normal Distribution & Z-scores before,
and now you should notice the relationship between SDs & z-scores.

  • 68.26% of data: One SD from the mean.
  • 95.46% of data: Two SDs from the mean.
  • 99.74% of data: Three SDs from the mean.
    • Observations beyond this threshold are considered outliers.

For more precise z-scores corresponding to the 68%, 95% & 99% confidence levels, see the sketch below.
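A minimal sketch (not in the original notes) using qnorm() to compute these two-sided cut-offs:

# Two-sided cut-offs: 1 - (1 - level)/2 gives the upper-tail quantile
round(qnorm(c(0.84, 0.975, 0.995)), 2)
## [1] 0.99 1.96 2.58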


CI on Proportion

According to a questionnaire sampled from `200` people, `35%` said that they had road rage.
Q: Find the standard error of p.
se <- sqrt(0.35*0.65/200)
se
## [1] 0.03372684
Q: Calculate the upper & lower level of the 95% confidence interval
# Upper level CI
0.35 + 1.96*se
## [1] 0.4161046
# Lower level of CI
0.35 -1.96*se
## [1] 0.2838954


Determination on Sample Size

You're interested in looking at how many days in the week students drink alcohol, 
and need to know what kind of sample size to use.
You expect that about 95% of people will consume an alcoholic drink between 1 and 6 days in the week.

Q: What is the expected mean number of drinking days?
(1 + 6) / 2
## [1] 3.5
Q: What is the standard deviation of the number of drinking days?
s <- (3.5 - 1) / 2
s
## [1] 1.25

If the data is normally distributed between 1 and 6,
you would expect that the midpoint of these values would be the mean.
If 95% of normally distributed values lie between 1 and 6,
then the distance between either of these values and the mean must be two standard deviations.

Q: Calculate the necessary sample size with a confidence interval of 95% & a margin of error of 0.2.

For answering this question, you need the following formula.

Sample Size = (Population SD)^2 * (Z-score)^2 / (Margin of Error)^2

# Assign the standard deviation squared to new object "ss"
ss <- 1.25^2
# Assign the value of the Z-score squared to new object "zs"
zs <- 1.96^2
# Assign the value of the margin of error squared to the new object "ms"
ms <- 0.2^2
# Calculate the necessary sample size
ss * zs / ms
## [1] 150.0625


Sample Size: Safe Approach

This example works with a proportion.
For a safe (conservative) approach, use p = 0.5, the value at which p*(1-p) is largest.

Now you're conducting a study on what proportion of students drink alcohol and want to know what sample size to use for a confidence interval of `95%`, with a margin of error of `0.05`.
# Assign the value of p(1-p) to object "p"
p <- 0.5*0.5
# Assign the value of the Z-score to new object "z"
z <- 1.96
# Assign the value of the margin of error squared to the new object "ms"
ms <-0.0025
# Calculate the necessary sample size
p * z^2 / ms
## [1] 384.16


Hypothesis Testing

Significance Testing

Often in quantitative social science research, you will deal with a null hypothesis (H0) and an alternative hypothesis (H1).
The way we do hypothesis testing in most of social science research is:

  • We assume that the null hypothesis is true and look at the probability of our data under the null hypothesis.
  • If this probability is very small, we reject the null hypothesis and, at least temporarily, accept the alternative hypothesis.
  • We choose a significance level (α = 0.05 or 0.01).
  • We compare the p value that we find to the significance level against which we are testing.


Given that we have tested against a significance level of 0.05 and found a p value of 0.03
Q: Do we accept the null hypothesis or the alternative hypothesis? 
Q: What does a p value of 0.03 mean?

We reject the null hypothesis and accept the alternative hypothesis. A p value of `0.03` means that there is only a 3% probability of obtaining a result equally extreme or more extreme, assuming that the null hypothesis is true.


One-sided vs. Two-sided

As an example, consider the sampling distribution of the mean beard length of samples of `40` Scandinavian hipsters.
The mean here is `25` millimeters and the population standard deviation is `3.5`.
So the standard error is `0.55` = `round(3.5 / sqrt(40), 2)`

One-sided

The red area is considered the rejection region when we are doing a one-sided hypothesis test where the alternative hypothesis checks whether the population mean of the beard length of Scandinavian hipsters is larger than 25 millimeters.
# Calculate the value of cut_off
qnorm(0.95, mean = 25, sd = 0.55)
## [1] 25.90467

25.90 is the starting value of the rejection region.
Consider our example mentioned above with a mean beard length of 25 and a standard error of 0.55.

Imagine that we found a sample mean of 25.95 with a sample size of 40. 
Whether this result is significant depends on the test we do. 
If we would do a one-sided hypothesis test against a 5% significance level, 
we would only have to test for the effect in one direction.
# Calculate the z score and assign it to a variable called z_value
z_value <- round((25.95 - 25) / round(3.5 / sqrt(40), 2), 2)
z_value
## [1] 1.73
# Calculate the corresponding p value and store it in the variable called p_value
# We set the lower tail equal to FALSE 
# because pnorm() calculates the cumulative probability up to the value of 1.73 
# and we want to know the probability of finding a value of 1.73 or larger.
pnorm(z_value, lower.tail = F)
## [1] 0.04181514

See that the p value is smaller than 0.05?
When testing against a significance level of 0.05, we have actually found a significant result.


Two-sided

The red area is considered the rejection region when we are doing a two-sided hypothesis test.
This corresponds to the alternative hypothesis which checks whether the population mean of the beard length of Scandinavian hipsters is not equal to 25 millimeters. As you can see, the 0.05 probability is divided into two chunks of 0.025.
# Calculate the value of the variable lower_cut_off
round(qnorm(0.025, mean = 25, sd = 0.55), 2)
## [1] 23.92
# Calculate the value of the variable upper_cut_off
round(qnorm(0.975, mean = 25, sd = 0.55), 2)
## [1] 26.08

23.92 & 26.08 indicate the start of the rejection region.
Consider our example mentioned above with a mean beard length of 25 and a standard error of 0.55.

If we were to do a two-sided hypothesis test, we should not only look at P(Z > z_value). 
In this case we should test for both P(Z > z_value) & P(Z < -z_value).
Since the Z distribution we are working with is symmetric, we could multiply the outcome by 2.
# Calculate the z score and assign it to a variable called z_value
z_value <- round((25.95 - 25) / round(3.5 / sqrt(40), 2), 2)
z_value
## [1] 1.73
# Calculate the corresponding p value and store it in the variable called p_value
pnorm(z_value, lower.tail = F) * 2
## [1] 0.08363028

This time, the p value is larger than 0.05.
In this case we would reject the null hypothesis if we did a one-sided hypothesis test.
However, we would fail to reject it if we did a two-sided hypothesis test.


Hypothesis Testing and Binomial Distribution

Recap: What is Binomial Distribution?

  • Deals with discrete random variables.
  • Gives probabilities for counts with binary data.

Let’s go back to an example we worked with in earlier labs.

Each question has 5 options, so the probability of guessing 1 question correctly is 0.2. 
Now suppose we have a student who answered 12 out of 25 correctly and we believe that this student did better than merely guessing.

Q: What could be our corresponding hypotheses?

A: H0: p = 0.20, H1: p > 0.20

Q: Calculate the probability of answering 12 or more questions given that the student is merely guessing.
pbinom(11, size = 25, prob = 0.2, lower.tail = F )
## [1] 0.001540051

pbinom(11, size = 25, prob = 0.20) calculates the probability of answering 11 or fewer questions correctly.
However, we need the probability of 12 or MORE correct answers.
We can get that value by setting lower.tail = F, or by subtracting P(X <= 11) from the total area of 1.
The remaining value is the probability of a score that is equal to or larger than 12.

Since p_value = 0.0015 < 0.05, H0 (the student is merely guessing) is rejected.


Here’s another example, applying the summary statistics of the sampling distribution of a proportion:

  • Calculate the mean (the expected probability) of our distribution.
  • Its standard deviation.
  • Verify how many standard deviations the observed probability is removed from the expected probability (the z score).
# Calculate the mean (expected probability) and assign it to a variable called average
average <- 0.2

# Calculate the standard error and assign it to a variable called se
se <- sqrt(0.2 * 0.8 / 25)

# Calculate the z value and assign it to a variable z_value
z_value <- (12/25 - average) / se

# Calculate the p value and store it in a variable p_value
pnorm(z_value, lower.tail = F)
## [1] 0.0002326291


T Distribution

  • When comparing means of continuous variables we use a t distribution instead of the normal distribution.
  • The main reason to use the t distribution here is because we often have to deal with small samples.
  • Functions of the t distribution:
    • dt(): Gives the density.
    • pt(): Gives the distribution function.
    • qt(): Gives the quantile function.
    • rt(): Generates random deviates.
  • Standard error for the t distribution: s / sqrt(n), where s is the sample standard deviation.
  • t value: (sample mean - population mean) / standard error.


They say that Dutch people are among the tallest in the world, with an average male height of 185 centimeters and a standard deviation of 5 centimeters.
We take a sample of 50 males from this population and find an average height of 186.5 centimeters, which is above the population mean.


For the t distribution, we must calculate the degrees of freedom (df), which is equal to the sample size - 1.
Q: We want to do a one-sided hypothesis test to check whether the population mean of Dutch male height is larger than 185, using a significance level of 0.05.
# Calculate the critical cut off value
round(qt(p = 0.95, df = 49), 2)
## [1] 1.68
Q: Calculate the t test statistic and check whether it is larger than the cut-off value.
# Calculate the standard error and store it in the variable se
se <- 5/sqrt(50)
se
## [1] 0.7071068
# Calculate the t value and store it in a variable called t_value
t_value <- (186.5 - 185) / se
t_value
## [1] 2.12132
# Calculate the p value and store it in a variable called p_value
pt(t_value, df = 49, lower.tail = F)
## [1] 0.01948903

Since the t statistic (2.12) is larger than the cut-off value (1.68) and the p value = 0.019 < 0.05, we reject the null hypothesis.


T Distribution & CI

You can calculate a confidence interval by taking the mean and adding and subtracting its standard error multiplied by the given t value or z value.
Usually confidence intervals are expressed as a two-sided range as we will also do in this exercise.

# Calculate the t value and store it in the variable t_value
# Two-tailed test -> 0.975
t_value <- round(qt(p = 0.975, df = 49), 2)
t_value
## [1] 2.01
# Calculate a 95% confidence interval as a vector
conf_interval <- c(186.5 - t_value * 0.71, 186.5 + t_value * 0.71)
conf_interval
## [1] 185.0729 187.9271

We have calculated a confidence interval with a lower bound of 185.07 and an upper bound of 187.93.
This means that we are 95% confident that this interval contains the true population mean.

Conclusion of Example Question

By checking whether the confidence interval contains our population mean,
we are essentially doing the same as testing a hypothesis.
For instance, we started out by assuming the population mean of the height of Dutch males is 185 centimeters.
This could be our null hypothesis.
Our alternative hypothesis could be that the Dutch male height is different from 185 centimeters.
After testing, we would reject the null hypothesis because the confidence interval does not contain the hypothesized population mean of 185.