First, we download the CSV files from Kaggle and join the two datasets by ‘Country’.
After that, we select the columns needed for the analysis and take a closer look at the data.
library(dplyr)
# Read the CSV files, keeping text columns as character vectors instead of factors.
open_data <- read.csv('D:/GitHub/NTU-CS-X/Week3/Data/countries.csv', stringsAsFactors = FALSE)
happiness <- read.csv('D:/GitHub/NTU-CS-X/Week3/Data/2015.csv', stringsAsFactors = FALSE)
# Rename the second column ("Country Name") to "Country" so the two tables can be joined on it.
colnames(open_data)[2] <- 'Country'
# Join the two tables and select the columns needed for the analysis.
data <- open_data %>%
  left_join(happiness, by = 'Country') %>%
  mutate(Country = factor(Country)) %>%
  select(Country, Region, X2015.Score, Happiness.Score, Economy..GDP.per.Capita.,
         Family, Health..Life.Expectancy., Freedom, Trust..Government.Corruption.,
         Generosity, Dystopia.Residual)
# Give the columns user-friendly names.
colnames(data) <- c("Country", "Region", "Openness", "Happiness", "GDP", "Family",
                    "Health", "Freedom", "Trust", "Generosity", "DystopiaResidual")
head(data, 5)
## Country Region Openness Happiness GDP
## 1 Taiwan Eastern Asia 78 6.298 1.29098
## 2 United Kingdom Western Europe 76 6.867 1.26637
## 3 Denmark Western Europe 70 7.527 1.32548
## 4 Colombia Latin America and Caribbean 68 6.477 0.91861
## 5 Finland Western Europe 67 7.406 1.29025
## Family Health Freedom Trust Generosity DystopiaResidual
## 1 1.07617 0.87530 0.39740 0.08129 0.25376 2.32323
## 2 1.28548 0.90943 0.59625 0.32067 0.51912 1.96994
## 3 1.36058 0.87464 0.64938 0.48357 0.34139 2.49204
## 4 1.24018 0.69077 0.53466 0.05120 0.18401 2.85737
## 5 1.31826 0.88911 0.64169 0.41372 0.23351 2.61955
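Because `left_join` keeps every row of `open_data`, a country whose name is spelled differently in the two files ends up with `NA` in the happiness columns. Here is a minimal sketch of how one might list such rows with `anti_join`. The two toy tables, their placeholder country names and values, and the names `open_toy`, `happy_toy`, `unmatched` and `joined` are all made up for illustration and are not taken from the real data.

```r
library(dplyr)

# Toy stand-ins for the two tables (placeholder names, made-up values).
open_toy <- data.frame(
  Country = c("Alpha", "Beta", "Gamma Rep."),
  X2015.Score = c(78, 70, 50),
  stringsAsFactors = FALSE
)
happy_toy <- data.frame(
  Country = c("Alpha", "Beta", "Gamma Republic"),
  Happiness.Score = c(6.3, 7.5, 6.0),
  stringsAsFactors = FALSE
)

# Rows of open_toy that have no partner in happy_toy.
unmatched <- anti_join(open_toy, happy_toy, by = "Country")
print(unmatched$Country)

# After a left join, the same rows show up as NA in the joined columns.
joined <- left_join(open_toy, happy_toy, by = "Country")
print(sum(is.na(joined$Happiness.Score)))
```

Running the same `anti_join` on the real `open_data` and `happiness` tables would show which country names need to be harmonised before the join.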
Now let’s create a table for easier analysis.
Here, I round the values to 2 decimal places and sort the table in descending order of ‘Openness’.
library(formattable)
data %>%
  # Round numeric values to 2 decimal places.
  mutate_if(is.numeric, round, digits = 2) %>%
  # Sort the table in descending order of 'Openness'.
  arrange(desc(Openness)) %>%
  formattable(align = "l") %>%
  head(10)
| Country | Region | Openness | Happiness | GDP | Family | Health | Freedom | Trust | Generosity | DystopiaResidual |
|---|---|---|---|---|---|---|---|---|---|---|
| Taiwan | Eastern Asia | 78 | 6.30 | 1.29 | 1.08 | 0.88 | 0.40 | 0.08 | 0.25 | 2.32 |
| United Kingdom | Western Europe | 76 | 6.87 | 1.27 | 1.29 | 0.91 | 0.60 | 0.32 | 0.52 | 1.97 |
| Denmark | Western Europe | 70 | 7.53 | 1.33 | 1.36 | 0.87 | 0.65 | 0.48 | 0.34 | 2.49 |
| Colombia | Latin America and Caribbean | 68 | 6.48 | 0.92 | 1.24 | 0.69 | 0.53 | 0.05 | 0.18 | 2.86 |
| Finland | Western Europe | 67 | 7.41 | 1.29 | 1.32 | 0.89 | 0.64 | 0.41 | 0.23 | 2.62 |
| Australia | Australia and New Zealand | 67 | 7.28 | 1.33 | 1.31 | 0.93 | 0.65 | 0.36 | 0.44 | 2.27 |
| Uruguay | Latin America and Caribbean | 66 | 6.49 | 1.06 | 1.21 | 0.81 | 0.60 | 0.25 | 0.23 | 2.32 |
| United States | North America | 64 | 7.12 | 1.39 | 1.25 | 0.86 | 0.55 | 0.16 | 0.40 | 2.51 |
| Netherlands | Western Europe | 64 | 7.38 | 1.33 | 1.28 | 0.89 | 0.62 | 0.32 | 0.48 | 2.47 |
| Norway | Western Europe | 63 | 7.52 | 1.46 | 1.33 | 0.89 | 0.67 | 0.37 | 0.35 | 2.47 |
After so much preparation, we can now look at the data and tell some stories!
The first plot is hard to read, so I will split it into separate plots afterwards.
library(ggplot2)
# A messy graph
ggplot(data, aes(x = Openness, y = Happiness, col = Region)) +
  geom_point() +
  # Dashed line: one linear fit over all countries.
  geom_smooth(aes(group = 1), method = 'lm', se = FALSE, linetype = 2) +
  # Solid lines: one linear fit per region.
  geom_smooth(method = 'lm', se = FALSE) +
  labs(x = "Openness Score",
       y = "Happiness Score",
       title = "Are open data friendly countries happy countries?",
       subtitle = "Data openness and happiness by country in 2015")
There appears to be a positive correlation between ‘Openness’ and ‘Happiness’ when we plot all countries together.
However, when we plot the data separately by region, the relationship looks like it might be negative in some regions.
# The correlation between 'Openness' & 'Happiness' worldwide.
ggplot(data, aes(x = Openness, y = Happiness)) +
  geom_point() +
  geom_smooth(aes(group = 1), method = 'lm', se = TRUE, linetype = 2) +
  labs(x = "Openness Score",
       y = "Happiness Score",
       title = "Are open data friendly countries happy countries?",
       subtitle = "Scale:Worldwide Year:2015")
# The correlation between 'Openness' & 'Happiness' in different regions.
ggplot(data, aes(x = Openness, y = Happiness, col = Region)) +
  geom_point() +
  geom_smooth(method = 'lm', se = TRUE) +
  facet_wrap(~ Region) +
  labs(x = "Openness Score",
       y = "Happiness Score",
       title = "Are open data friendly countries happy countries?",
       subtitle = "Scale:Regional Year:2015")
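The impression from the plots can also be checked numerically. Below is a minimal sketch, assuming a data frame with `Region`, `Openness` and `Happiness` columns like `data` above. The toy values are made up so that the block runs on its own, and they are chosen only to show how a pooled trend and the within-group trends can point in opposite directions; they say nothing about the real datasets.

```r
library(dplyr)

# Made-up values: each region has a negative within-region trend,
# but the pooled trend across both regions is positive.
toy <- data.frame(
  Region    = rep(c("Region A", "Region B"), each = 3),
  Openness  = c(10, 20, 30, 60, 70, 80),
  Happiness = c(4.0, 3.8, 3.6, 7.0, 6.8, 6.6)
)

# Pearson correlation across all rows.
overall <- cor(toy$Openness, toy$Happiness, use = "complete.obs")

# Pearson correlation within each region, with the group size.
by_region <- toy %>%
  group_by(Region) %>%
  summarise(r = cor(Openness, Happiness, use = "complete.obs"),
            n = n())

print(overall)
print(by_region)
```

On the real `data`, rows with missing scores are dropped by `use = "complete.obs"`, and regions with only a handful of countries will give unstable correlations, so the `n` column is worth reading alongside `r`.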