Data Collection & Selection

First, we download the CSV files from Kaggle and join the two datasets on ‘Country’.
After that, we select the columns needed for the analysis and take a closer look at the data.

library(dplyr)
# Read the CSV files downloaded from Kaggle, keeping strings as character vectors.
open_data <- read.csv('D:/GitHub/NTU-CS-X/Week3/Data/countries.csv', stringsAsFactors = FALSE)
happiness <- read.csv('D:/GitHub/NTU-CS-X/Week3/Data/2015.csv', stringsAsFactors = FALSE)
# Rename the second column ("Country Name") to "Country" so both tables share a join key.
colnames(open_data)[2] <- 'Country'
# Join the two tables and select the columns needed for the analysis.
data <- open_data %>%
  left_join(happiness, by = 'Country') %>%
  mutate(Country = factor(Country)) %>%
  select(Country, Region, X2015.Score, Happiness.Score, Economy..GDP.per.Capita., 
         Family, Health..Life.Expectancy., Freedom, Trust..Government.Corruption., 
         Generosity, Dystopia.Residual)
# Rename the columns with user-friendly titles.
colnames(data) <- c("Country", "Region", "Openness", "Happiness", "GDP", "Family", "Health", "Freedom", "Trust", "Generosity", "DystopiaResidual")
head(data, 5)
##          Country                      Region Openness Happiness     GDP
## 1         Taiwan                Eastern Asia       78     6.298 1.29098
## 2 United Kingdom              Western Europe       76     6.867 1.26637
## 3        Denmark              Western Europe       70     7.527 1.32548
## 4       Colombia Latin America and Caribbean       68     6.477 0.91861
## 5        Finland              Western Europe       67     7.406 1.29025
##    Family  Health Freedom   Trust Generosity DystopiaResidual
## 1 1.07617 0.87530 0.39740 0.08129    0.25376          2.32323
## 2 1.28548 0.90943 0.59625 0.32067    0.51912          1.96994
## 3 1.36058 0.87464 0.64938 0.48357    0.34139          2.49204
## 4 1.24018 0.69077 0.53466 0.05120    0.18401          2.85737
## 5 1.31826 0.88911 0.64169 0.41372    0.23351          2.61955
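
Because `left_join()` keeps every row of the left table, any country whose name is spelled differently in the two files ends up with `NA` in the happiness columns. The sketch below shows one way to spot such rows with `anti_join()`. The two toy tables and the country "Freedonia" are hypothetical and only illustrate the mismatch; they are not taken from the Kaggle files.

```r
library(dplyr)

# Hypothetical toy tables: "Freedonia" is spelled differently in each one.
open_toy <- data.frame(
  Country  = c("Taiwan", "Denmark", "Freedonia"),
  Openness = c(78, 70, 50),
  stringsAsFactors = FALSE
)
happy_toy <- data.frame(
  Country   = c("Taiwan", "Denmark", "Republic of Freedonia"),
  Happiness = c(6.298, 7.527, 5.000),
  stringsAsFactors = FALSE
)

# Rows of open_toy that have no partner in happy_toy.
unmatched <- anti_join(open_toy, happy_toy, by = "Country")
unmatched$Country

# After the left join, those same rows carry NA in the joined columns.
joined <- left_join(open_toy, happy_toy, by = "Country")
sum(is.na(joined$Happiness))
```

Running the same `anti_join()` on `open_data` and `happiness` would list the countries that need renaming before the join, assuming the mismatch is only in spelling.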

Formattable

And now let’s create a table for easier analysis.
In this case, I round the values to 2 decimal places and sort the table in descending order of ‘Openness’.

library(formattable)
data %>%
  # Round the numeric values to 2 decimal places (funs() is deprecated in recent dplyr).
  mutate_if(is.numeric, ~ round(., 2)) %>%
  # Sort the table in descending order of 'Openness'.
  arrange(desc(Openness)) %>%
  formattable(align = "l") %>%
  head(10)
Country         Region                       Openness  Happiness  GDP   Family  Health  Freedom  Trust  Generosity  DystopiaResidual
Taiwan          Eastern Asia                 78        6.30       1.29  1.08    0.88    0.40     0.08   0.25        2.32
United Kingdom  Western Europe               76        6.87       1.27  1.29    0.91    0.60     0.32   0.52        1.97
Denmark         Western Europe               70        7.53       1.33  1.36    0.87    0.65     0.48   0.34        2.49
Colombia        Latin America and Caribbean  68        6.48       0.92  1.24    0.69    0.53     0.05   0.18        2.86
Finland         Western Europe               67        7.41       1.29  1.32    0.89    0.64     0.41   0.23        2.62
Australia       Australia and New Zealand    67        7.28       1.33  1.31    0.93    0.65     0.36   0.44        2.27
Uruguay         Latin America and Caribbean  66        6.49       1.06  1.21    0.81    0.60     0.25   0.23        2.32
United States   North America                64        7.12       1.39  1.25    0.86    0.55     0.16   0.40        2.51
Netherlands     Western Europe               64        7.38       1.33  1.28    0.89    0.62     0.32   0.48        2.47
Norway          Western Europe               63        7.52       1.46  1.33    0.89    0.67     0.37   0.35        2.47

Plot

After so much preparation, we can now look at these data and tell some stories!

Are open data friendly countries happy countries?

The first plot is hard to read, so I will split it into separate plots afterwards.

library(ggplot2)
# A messy graph
ggplot(data, aes(x = Openness, y = Happiness, col = Region)) +
  geom_point() +
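  # Dashed line: a single linear fit across all countries.
  geom_smooth(aes(group = 1), method = 'lm', se = FALSE, linetype = 2) +
  # Solid lines: one linear fit per region.
  geom_smooth(method = 'lm', se = FALSE) +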
  labs(x = "Openness Score",
       y = "Happiness Score",
       title = "Are open data friendly countries happy countries?",
       subtitle = "Data openness and happiness by country in 2015")

There seems to be a positive correlation between ‘Openness’ and ‘Happiness’ when we plot all countries together.
However, when we plot the data separately by region, the relationship looks as if it might be negative in some regions.

# The correlation between 'Openness' & 'Happiness' in the world.
ggplot(data, aes(x = Openness, y = Happiness)) +
  geom_point() +
  geom_smooth(aes(group = 1), method = 'lm', se = TRUE, linetype = 2) +
  labs(x = "Openness Score",
       y = "Happiness Score",
       title = "Are open data friendly countries happy countries?",
       subtitle = "Scale:Worldwide Year:2015")

# The correlation between 'Openness' & 'Happiness' in different regions.
ggplot(data, aes(x = Openness, y = Happiness, col = Region)) +
  geom_point() +
  geom_smooth(method = 'lm', se = TRUE) +
  facet_wrap(~ Region) +
  labs(x = "Openness Score",
       y = "Happiness Score",
       title = "Are open data friendly countries happy countries?",
       subtitle = "Scale:Regional Year:2015")
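
To put numbers behind the worldwide and regional plots above, we could compare the pooled correlation with the correlation inside each region. The sketch below uses a hypothetical toy table, not the real data, built so that the two views disagree: each region trends downward while the pooled data trend upward. The regions "A" and "B" and all values are made up for illustration.

```r
# Hypothetical toy values, chosen so the pooled and per-region trends disagree.
toy <- data.frame(
  Region    = rep(c("A", "B"), each = 3),
  Openness  = c(1, 2, 3, 11, 12, 13),
  Happiness = c(3, 2, 1, 13, 12, 11)
)

# Pooled correlation across all rows.
overall <- cor(toy$Openness, toy$Happiness)

# Correlation computed separately within each region.
by_region <- sapply(split(toy, toy$Region),
                    function(d) cor(d$Openness, d$Happiness))

overall    # positive: region B sits higher on both axes than region A
by_region  # negative inside each region
```

The same `split()` and `sapply()` pattern should work on `data` with `Region` as the grouping column, provided rows with missing values are dropped first (for example with `use = "complete.obs"`).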

What other measures are correlated with “Openness”?

Again, we can see the strong positive correlation between ‘Openness’ and ‘Happiness’ worldwide.
In addition, the correlations between ‘Happiness’ and most of the other measures are also positive!
At first glance, it might seem odd that ‘Happiness’ is positively related to ‘Dystopia Residual’. However, once we learn that ‘Dystopia Residual’ measures the difference from “a hypothetical country that has values equal to the world’s lowest national averages for each of the six factors”, it is no longer so confusing.
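
A quick arithmetic check makes this more concrete. In the five rows printed by `head(data, 5)` earlier, the six factor columns plus ‘DystopiaResidual’ appear to add up to the ‘Happiness’ score, so a larger residual goes with a higher score when the factors are held fixed. The sketch below retypes those five rows by hand; I am assuming the same relationship holds for the rest of the dataset.

```r
# First five rows, copied from the head(data, 5) output earlier in this post.
top5 <- data.frame(
  Country          = c("Taiwan", "United Kingdom", "Denmark", "Colombia", "Finland"),
  Happiness        = c(6.298, 6.867, 7.527, 6.477, 7.406),
  GDP              = c(1.29098, 1.26637, 1.32548, 0.91861, 1.29025),
  Family           = c(1.07617, 1.28548, 1.36058, 1.24018, 1.31826),
  Health           = c(0.87530, 0.90943, 0.87464, 0.69077, 0.88911),
  Freedom          = c(0.39740, 0.59625, 0.64938, 0.53466, 0.64169),
  Trust            = c(0.08129, 0.32067, 0.48357, 0.05120, 0.41372),
  Generosity       = c(0.25376, 0.51912, 0.34139, 0.18401, 0.23351),
  DystopiaResidual = c(2.32323, 1.96994, 2.49204, 2.85737, 2.61955)
)

# Add up the six factors and the residual for each country.
parts   <- c("GDP", "Family", "Health", "Freedom", "Trust",
             "Generosity", "DystopiaResidual")
rebuilt <- rowSums(top5[, parts])

# Absolute gap between the rebuilt total and the published score.
gap <- abs(rebuilt - top5$Happiness)
max(gap)
```

For these rows the gap stays below 0.001, which is what we would expect from rounding in the published score.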

library(corrplot)
open_data_corr <- data %>%
  select(Openness, Happiness, GDP, Family, Health, 
         Freedom, Trust, Generosity, DystopiaResidual)

# Pearson correlations, using only the rows that have no missing values.
od_corr <- cor(open_data_corr, use = "complete.obs", method = "pearson")
corrplot(od_corr)
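
The plot answers the question visually. To read the numbers directly, one option is to pull the ‘Openness’ row out of the correlation matrix and sort it. The sketch below does this on a small hypothetical table with made-up values and one missing entry, so the printed numbers mean nothing for the real data. The same indexing should work on `od_corr`.

```r
# Hypothetical toy data standing in for the joined `data` frame.
toy <- data.frame(
  Openness  = c(80, 75, 70, 60, 50, NA),
  Happiness = c(6.5, 7.0, 7.4, 6.0, 5.5, 7.2),
  GDP       = c(1.3, 1.2, 1.4, 0.9, 1.0, 1.1),
  Trust     = c(0.1, 0.3, 0.5, 0.1, 0.2, 0.4)
)

# Rows with any NA are dropped before the correlations are computed.
toy_corr <- cor(toy, use = "complete.obs", method = "pearson")

# Correlation of every other measure with Openness.
openness_corr <- toy_corr["Openness", colnames(toy_corr) != "Openness"]

# Order by absolute strength, strongest first.
openness_corr[order(abs(openness_corr), decreasing = TRUE)]
```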