Source: Megan Risdal - Happiness and Open Data
Data Collect & Select
Formattable
Plot

Data Collect & Select

First, we download the CSV files from the Kaggle script and join the two datasets on ‘Country’.
After that, we select the columns we need for the analysis and take a closer look at the data.

library(dplyr)
# Read the CSV files from the Kaggle script, keeping strings as character vectors instead of factors.
open_data <- read.csv('D:/GitHub/NTU-CS-X/Week3/Data/countries.csv', stringsAsFactors = FALSE)
happiness <- read.csv('D:/GitHub/NTU-CS-X/Week3/Data/2015.csv', stringsAsFactors = FALSE)
# Rename from "Country Name" to "Country" for joining csv later.
colnames(open_data)[2] <- 'Country'
# Join the two data frames and select the columns needed for the analysis.
data <- open_data %>%
  left_join(happiness, by = 'Country') %>%
  mutate(Country = factor(Country)) %>%
  select(Country, Region, X2015.Score, Happiness.Score, Economy..GDP.per.Capita., 
         Family, Health..Life.Expectancy., Freedom, Trust..Government.Corruption., 
         Generosity, Dystopia.Residual)
# Give the columns short, user-friendly names.
colnames(data) <- c("Country", "Region", "Openness", "Happiness", "GDP", "Family", "Health", "Freedom", "Trust", "Generosity", "DystopiaResidual")
head(data, 5)

##          Country                      Region Openness Happiness     GDP
## 1         Taiwan                Eastern Asia       78     6.298 1.29098
## 2 United Kingdom              Western Europe       76     6.867 1.26637
## 3        Denmark              Western Europe       70     7.527 1.32548
## 4       Colombia Latin America and Caribbean       68     6.477 0.91861
## 5        Finland              Western Europe       67     7.406 1.29025
##    Family  Health Freedom   Trust Generosity DystopiaResidual
## 1 1.07617 0.87530 0.39740 0.08129    0.25376          2.32323
## 2 1.28548 0.90943 0.59625 0.32067    0.51912          1.96994
## 3 1.36058 0.87464 0.64938 0.48357    0.34139          2.49204
## 4 1.24018 0.69077 0.53466 0.05120    0.18401          2.85737
## 5 1.31826 0.88911 0.64169 0.41372    0.23351          2.61955
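One thing worth checking after a left join is whether every country actually found a match, because countries spelled differently in the two files silently end up with NA in the happiness columns. Below is a minimal sketch of such a check in base R. The two tiny data frames and their country names are made up for illustration, and `merge(..., all.x = TRUE)` stands in for `left_join()`.

```r
# Made-up toy data (not from the real CSV files).
open_toy <- data.frame(
  Country = c("Alpha", "Beta", "Gamma"),
  Openness = c(78, 70, 50),
  stringsAsFactors = FALSE
)
happy_toy <- data.frame(
  Country = c("Alpha", "Beta", "Delta"),
  Happiness = c(6.3, 7.5, 7.4),
  stringsAsFactors = FALSE
)

# Countries in the left table that have no partner in the right table.
unmatched <- setdiff(open_toy$Country, happy_toy$Country)
print(unmatched)

# Base R equivalent of a left join: keep every row of open_toy.
joined <- merge(open_toy, happy_toy, by = "Country", all.x = TRUE)

# Unmatched countries show up as NA in the joined columns.
print(joined[is.na(joined$Happiness), ])
```

If `unmatched` is not empty for the real data, the country names in one file could be recoded before joining.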

Formattable

Now let’s create a table to make the data easier to read.
Here I round the numeric values to 2 decimal places and sort the table in descending order of ‘Openness’.

library(formattable)
data %>%
  # Round the numeric columns to 2 decimal places (funs() is deprecated in newer dplyr versions).
  mutate_if(is.numeric, round, digits = 2) %>%
  # Sort the table in descending order of 'Openness'.
  arrange(desc(Openness)) %>%
  formattable(align = "l") %>%
  head(10)

Country	Region	Openness	Happiness	GDP	Family	Health	Freedom	Trust	Generosity	DystopiaResidual
Taiwan	Eastern Asia	78	6.30	1.29	1.08	0.88	0.40	0.08	0.25	2.32
United Kingdom	Western Europe	76	6.87	1.27	1.29	0.91	0.60	0.32	0.52	1.97
Denmark	Western Europe	70	7.53	1.33	1.36	0.87	0.65	0.48	0.34	2.49
Colombia	Latin America and Caribbean	68	6.48	0.92	1.24	0.69	0.53	0.05	0.18	2.86
Finland	Western Europe	67	7.41	1.29	1.32	0.89	0.64	0.41	0.23	2.62
Australia	Australia and New Zealand	67	7.28	1.33	1.31	0.93	0.65	0.36	0.44	2.27
Uruguay	Latin America and Caribbean	66	6.49	1.06	1.21	0.81	0.60	0.25	0.23	2.32
United States	North America	64	7.12	1.39	1.25	0.86	0.55	0.16	0.40	2.51
Netherlands	Western Europe	64	7.38	1.33	1.28	0.89	0.62	0.32	0.48	2.47
Norway	Western Europe	63	7.52	1.46	1.33	0.89	0.67	0.37	0.35	2.47
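For readers without dplyr, the same round-and-sort step can be sketched in base R. The three rows below are copied from the `head(data, 5)` output above (only three columns are kept to stay short), and the variable names are my own.

```r
# Three rows taken from the head(data, 5) output, deliberately out of order.
rows <- data.frame(
  Country = c("Colombia", "Taiwan", "Finland"),
  Openness = c(68, 78, 67),
  Happiness = c(6.477, 6.298, 7.406),
  stringsAsFactors = FALSE
)

# Round every numeric column to 2 decimal places.
is_num <- vapply(rows, is.numeric, logical(1))
rows[is_num] <- lapply(rows[is_num], round, digits = 2)

# Sort in descending order of Openness.
rows <- rows[order(-rows$Openness), ]
print(rows)
```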

Plot

After so much preparation, we can now look at these data and tell some stories!

Are open data friendly countries happy countries?

The first plot puts everything in one panel and is hard to read, so afterwards I split it into a worldwide plot and a per-region plot.

library(ggplot2)
# A messy graph
ggplot(data, aes(x = Openness, y = Happiness, col = Region)) +
  geom_point() +
  geom_smooth(aes(group = 1), method = 'lm', se = FALSE, linetype = 2) +
  geom_smooth(method = 'lm', se = FALSE) +
  labs(x = "Openness Score",
       y = "Happiness Score",
       title = "Are open data friendly countries happy countries?",
       subtitle = "Data openness and happiness by country in 2015")

When we plot all countries together, there seems to be a positive correlation between ‘Openness’ and ‘Happiness’.
However, when we plot each region separately, the relationship looks like it might be negative in some regions.

# The correlation between 'Openness' & 'Happiness' in the world.
ggplot(data, aes(x = Openness, y = Happiness)) +
  geom_point() +
  geom_smooth(aes(group = 1), method = 'lm', se = TRUE, linetype = 2) +
  labs(x = "Openness Score",
       y = "Happiness Score",
       title = "Are open data friendly countries happy countries?",
       subtitle = "Scale:Worldwide Year:2015")

# The correlation between 'Openness' & 'Happiness' in different regions.
ggplot(data, aes(x = Openness, y = Happiness, col = Region)) +
  geom_point() +
  geom_smooth(method = 'lm', se = TRUE) +
  facet_wrap(~ Region) +
  labs(x = "Openness Score",
       y = "Happiness Score",
       title = "Are open data friendly countries happy countries?",
       subtitle = "Scale:Regional Year:2015")
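The plots suggest that the worldwide trend and the within-region trends can point in different directions. A quick way to check this numerically is to compare the overall correlation with the correlation inside each region. The sketch below uses made-up toy numbers (two hypothetical regions "A" and "B"), not the real data, just to show that both patterns can hold at once.

```r
# Made-up toy data: inside each region, happiness falls as openness rises,
# but region B is both more open and happier than region A.
toy <- data.frame(
  Region = rep(c("A", "B"), each = 3),
  Openness = c(10, 20, 30, 60, 70, 80),
  Happiness = c(4.0, 3.8, 3.6, 7.2, 7.0, 6.8)
)

# Correlation across all rows, ignoring region.
overall <- cor(toy$Openness, toy$Happiness)

# Correlation computed separately within each region.
by_region <- sapply(
  split(toy, toy$Region),
  function(d) cor(d$Openness, d$Happiness)
)

print(overall)    # positive
print(by_region)  # negative in both regions
```

On the real data, the same `split()` and `sapply()` pattern could be applied to `data` after dropping rows with missing values. Regions with only a handful of countries will give very noisy estimates.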

What other measures are correlated with “Openness”?

Again, we can see the strong positive correlation between ‘Openness’ and ‘Happiness’ worldwide.
In addition, the correlations between ‘Happiness’ and most of the other measures are also positive!
At first glance, it might seem weird that ‘Happiness’ and ‘Dystopia Residual’ are positively related. However, after learning that ‘Dystopia Residual’ measures the difference from “a hypothetical country that has values equal to the world’s lowest national averages for each of the six factors”, it’s not that confusing anymore.
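To make the ‘Dystopia Residual’ point more concrete: as far as I understand this dataset, the happiness score is (up to rounding) the sum of the six factor columns plus the Dystopia Residual, so a larger residual directly means a larger score. The sketch below checks that assumption on three rows copied from the `head(data, 5)` output above; the column names `Reconstructed` and `Gap` are my own.

```r
# Three rows copied from the head(data, 5) output.
top3 <- data.frame(
  Country = c("Taiwan", "United Kingdom", "Denmark"),
  Happiness = c(6.298, 6.867, 7.527),
  GDP = c(1.29098, 1.26637, 1.32548),
  Family = c(1.07617, 1.28548, 1.36058),
  Health = c(0.87530, 0.90943, 0.87464),
  Freedom = c(0.39740, 0.59625, 0.64938),
  Trust = c(0.08129, 0.32067, 0.48357),
  Generosity = c(0.25376, 0.51912, 0.34139),
  DystopiaResidual = c(2.32323, 1.96994, 2.49204),
  stringsAsFactors = FALSE
)

# Add up the six factors and the Dystopia Residual for each country.
parts <- c("GDP", "Family", "Health", "Freedom",
           "Trust", "Generosity", "DystopiaResidual")
top3$Reconstructed <- rowSums(top3[, parts])

# The gap to the published score should be tiny (rounding only).
top3$Gap <- top3$Reconstructed - top3$Happiness
print(top3[, c("Country", "Happiness", "Reconstructed", "Gap")])
```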

library(corrplot)
open_data_corr <- data %>%
  select(Openness, Happiness, GDP, Family, Health, 
         Freedom, Trust, Generosity, DystopiaResidual)

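# Use only rows with no missing values ("complete.obs" spelled out instead of the abbreviation "complete").
od_corr <- cor(open_data_corr, use = "complete.obs", method = "pearson")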
corrplot(od_corr)
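The correlation plot is nice to look at, but to answer the question in the heading it can also help to pull out the ‘Openness’ row of the matrix and sort it. Here is a minimal sketch on made-up toy data (the numbers are not from the real dataset), which also shows how `use = "complete.obs"` drops the row containing an NA before the correlations are computed.

```r
# Made-up toy data; the last row has a missing Openness value.
toy <- data.frame(
  Openness = c(10, 20, 30, 40, 50, NA),
  Happiness = c(1, 2, 3, 4, 5, 9),
  Trust = c(5, 4, 3, 2, 1, 0)
)

# Pearson correlations using only the rows without any NA.
m <- cor(toy, use = "complete.obs", method = "pearson")

# Correlations of every other measure with Openness, strongest positive first.
openness_cor <- sort(m["Openness", colnames(m) != "Openness"],
                     decreasing = TRUE)
print(openness_cor)
```

With the real data, replacing `toy` with `open_data_corr` should give the same kind of ranked list for the measures shown in the plot.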

Happiness & Open Data

Bourbon0212

July 18, 2018
