This practice is based on Hans Rosling talks New Insights on Poverty and The Best Stats You’ve Ever Seen.

The assignment uses data to answer specific question about global health and economics. The data contradicts commonly held preconceived notions. For example, Hans Rosling starts his talk by asking: “for each of the six pairs of countries below, which country do you think had the highest child mortality in 2015?”

  1. Sri Lanka or Turkey
  2. Poland or South Korea
  3. Malaysia or Russia
  4. Pakistan or Vietnam
  5. Thailand or South Africa

Most people get them wrong. Why is this? In part it is due to our preconceived notion that the world is divided into two groups: the Western world versus the third world, characterized by “long life,small family” and “short life, large family” respectively. In this homework we will use data visualization to gain insights on this topic.

library(tidyverse)
library(dplyr)
library(ggplot2)
library(gridExtra)
theme_set(theme_bw())

Download and organize the data

The first step in our analysis is to download and organize the data. The necessary data to answer these question is available on the gapminder website.

1. Create five tibble table objects, one for each of the tables

We will use the following datasets:

  1. Childhood mortality
  2. Life expectancy
  3. Fertility
  4. Population
  5. Total GDP

Create five tibble table objects, one for each of the tables provided in the above files. Hints: Use the read_csv function. Because these are only temporary files, give them short names.

childhood <- read_csv("http://docs.google.com/spreadsheet/pub?key=0ArfEDsV3bBwCcGhBd2NOQVZ1eWowNVpSNjl1c3lRSWc&output=csv")
life <- read_csv("http://docs.google.com/spreadsheet/pub?key=phAwcNAVuyj2tPLxKvvnNPA&output=csv")
fert <- read_csv("http://docs.google.com/spreadsheet/pub?key=phAwcNAVuyj0TAlJeCEzcGQ&output=csv")
pop <- read_csv("http://docs.google.com/spreadsheet/pub?key=phAwcNAVuyj0XOoBL_n5tAQ&output=csv")
gdp <- read_csv("http://docs.google.com/spreadsheet/pub?key=pyj6tScZqmEfI4sLVvEQtHw&output=csv")

2. Write a function that takes a table as an argument and returns the column name

Write a function called my_func that takes a table as an argument and returns the column name. For each of the five tables, what is the name of the column containing the country names? Print out the tables or look at them with View to determine the column.

my_func <- function(tab){
  return(names(tab))
}

c(my_func(childhood)[1], my_func(life)[1], my_func(fert)[1], my_func(pop)[1], my_func(gdp)[1])
## [1] "Under five mortality"    "Life expectancy"        
## [3] "Total fertility rate"    "Total population"       
## [5] "GDP (constant 2000 US$)"

3. Fix inconsistency in naming their country column

In the previous problem we noted that gapminder is inconsistent in naming their country column. Fix this by assigning a common name to this column in the various tables.

names(childhood)[1] <- paste("country")
names(life)[1] <- paste("country")
names(fert)[1] <- paste("country")
names(pop)[1] <- paste("country")
names(gdp)[1] <- paste("country")

4. Create tidy data sets

Notice that in these tables, years are represented by columns. We want to create a tidy dataset in which each row is a unit or observation and our 5 values of interest, including the year for that unit, are in the columns. The unit here is a country/year pair.

We call this the long format. Use the gather function from the tidyr package to create a new table for childhood mortality using the long format. Call the new columns year and child_mortality

childhood_long <- gather(childhood, key=year, value=child_mortality, -country, factor_key = TRUE)

Now redefine the remaining tables in this way.

life_long <- gather(life, key=year, value=life_expectancy, -country, factor_key = TRUE)
fert_long <- gather(fert, key=year, value=fertility, -country, factor_key = TRUE)
pop_long <- gather(pop, key=year, value=population, -country, factor_key = TRUE)
gdp_long <- gather(gdp, key=year, value=total_gdp, -country, factor_key = TRUE)

5. Join all tables together

Now we want to join all these files together. Make one consolidated table containing all the columns

# Match all files by country and year
mergeAll <- function(x, y){
  df <- merge(x, y, by = c("country", "year"), all = TRUE)
  return(df)
}

# Outer join 
dat <- Reduce(mergeAll, list(childhood_long, life_long, fert_long, pop_long, gdp_long))

6. Add column: continent for each country

Add a column to the consolidated table containing the continent for each country. Hint: We have created a file that maps countries to continents (http://ghuang.stat.nctu.edu.tw/course/datasci17/files/hwks/continent-info.csv). Hint: Learn to use the left_join function.

# Load continent map
conti_map <- read_csv("http://ghuang.stat.nctu.edu.tw/course/datasci17/files/hwks/continent-info.csv",  col_names = c("country", "continent"))

# Left Join
df <- merge(dat, conti_map, by = c("country"), all.x = TRUE)

Report the child mortalilty rate in 2015

Report the child mortalilty rate in 2015 for these 5 pairs:

  1. Sri Lanka or Turkey
  2. Poland or South Korea
  3. Malaysia or Russia
  4. Pakistan or Vietnam
  5. Thailand or South Africa
reportChildMortality <- function(y, c1, c2){
  v1 <- filter(df, year==y & (country==c1)) %>% select(child_mortality)
  v1 <- v1[!is.na(v1)]
  v2 <- filter(df, year==y & (country==c2)) %>% select(child_mortality)
  v2 <- v2[!is.na(v2)]
  message("Child mortality for ", c1, " and ", c2, " in ", y, " are ", mean(v1), " and ", mean(v2))
}

# Sri Lanka or Turkey
reportChildMortality("2015", "Sri Lanka", "Turkey")

# Poland or South Korea
reportChildMortality("2015", "Poland", "South Korea")

# Malaysia or Russia
reportChildMortality("2015", "Malaysia", "Russia")

# Pakistan or Vietnam
reportChildMortality("2015", "Pakistan", "Vietnam")

# Thailand or South Africa
reportChildMortality("2015", "Thailand", "South Africa")
## Child mortality for Sri Lanka and Turkey in 2015 are 8.7 and 13.5
## Child mortality for Poland and South Korea in 2015 are 5.2 and 3.5
## Child mortality for Malaysia and Russia in 2015 are 8.2 and 9.6
## Child mortality for Pakistan and Vietnam in 2015 are 81.1 and 21.7
## Child mortality for Thailand and South Africa in 2015 are 12.3 and 42.1

long-life-in-a-small-family and short-life-in-a-large-family dichotomy

To examine if in fact there was a long-life-in-a-small-family and short-life-in-a-large-family dichotomy, we will visualize the average number of children per family (fertility) and the life expectancy for each country.

1. life expectancy versus fertiltiy in 1962

Use ggplot2 to create a plot of life expectancy versus fertiltiy in 1962 for Africa, Asia, Europe, and the Americas. Use color to denote continent and point size to denote population size:

# Filter and select data
new_df <- df %>%
  filter(year=="1962") %>%
  select(continent, life_expectancy, fertility, population)

# Remove missing values before plotting
new_df <- na.omit(new_df)

# Plot with ggplot2
g <- ggplot(data=new_df, aes(x=fertility, y=life_expectancy, color=continent, size=population))
g + geom_point(alpha = 0.7)

Do you see a dichotomy? Explain.

It is obvious that nearly all countries in Europe have low fertrility and high life expectancy, however, countries in Africa tend to have high fertility but low life expectancy.

2. Annotate the plot to show different types of countries

Now we will annotate the plot to show different types of countries.

Learn about OECD and OPEC. Add a couple of columns to your consolidated tables containing a logical vector that tells if a country is OECD and OPEC respectively. It is ok to base membership on 2015.


# Create the member list of OECD based on the official website (2017 Oct.)
# http://www.oecd.org/about/membersandpartners/
OECD <- c("Australia", "Austria", "Belgium", "Canada", "Chile", "Czech Republic", "Denmark", "Estonia", "Finland", "France", "Germany", "Greece", "Hungary", "Iceland", "Ireland", "Israel", "Italy", "Japan", "Korea", "Latvia", "Luxembourg", "Mexico", "Netherlands", "New Zealand", "Norway", "Poland", "Portugal", "Slovak Republic", "Slovenia", "Spain", "Sweden", "Switzerland", "Turkey", "United Kingdom", "United States")

# Create the member list of OPEC based on the official website (2017 Oct.)
# http://www.opec.org/opec_web/en/about_us/25.htm
OPEC <- c("Algeria", "Angola", "Ecuador", "Equatorial Guinea", "Gabon", "Iran", "Iraq", "Kuwait", "Libya", "Nigeria", "Qatar", "Saudi Arabia", "United Arab Emirates", "Venezuela")

# Append logical vectors to the original data frame
df$OECD <- with(df, country %in% OECD)
df$OPEC <- with(df, country %in% OPEC)

3. Annotate the OECD countries and OPEC countries

Make the same plot as in Problem 3.1, but this time use color to annotate the OECD countries and OPEC countries. For countries that are not part of these two organization annotate if they are from Africa, Asia, or the Americas.

# Create a new column annotating if the conutry is a member of OECD, OPEC, or others.
df$category <- ifelse(df$OECD == TRUE, "OECD",
                      ifelse(df$OPEC == TRUE, "OPEC",
                             ifelse(df$continent %in% c("Africa", "Asia", "Americas"), df$continent,
                             "others")))

# Filter and select data
new_df <- df %>%
  filter(year=="1962" & category!="others") %>%
  select(category, life_expectancy, fertility, population)

# Remove missing values before plotting
new_df <- na.omit(new_df)

# Plot with ggplot2
g <- ggplot(data=new_df, aes(x=fertility, y=life_expectancy, color=category, size=population))
g + geom_point(alpha=0.7)

How would you describe the dichotomy?

All members of OECD have low fertility and ligh life expectancy while members of OPEC tend to have high fertility but low life expectancy, just like countries in Africa.

4. How different kinds of family change across time

Explore how this figure changes across time. Show us 4 figures that demonstrate how this figure changes through time.

# Split the years into 4 intervals
df$year <- as.numeric(as.character(df$year))
breakPoints <- c(1962, 1982, 2002, 2015)

# Filter and select data
new_df <- df %>%
  filter(year %in% breakPoints & category!="others") %>%
  select(year, category, life_expectancy, fertility, population)

# Remove missing values before plotting
new_df <- na.omit(new_df)

# Plot with ggplot2
g <- ggplot(data=new_df, aes(x=fertility, y=life_expectancy, color=category, size=population))
g + facet_grid(year ~ .) + geom_point(alpha=0.7)
  

Would you say that the same dichotomy exists today? Explain:

For members of OECD, the low-fertility-and-high-life-expectancy phenomenon appears to be clearer over time, however, for members of OPEC, fewer and fewer contries maintain the low-fertility-and-high-life-expectancy phenomenon as time goes by.

long-life-in-a-small-family and short-life-in-a-large-family: Comparison between different countries

Having time as a third dimension made it somewhat difficult to see specific country trends. Let’s now focus on specific countries.

1. Compare France and its former colony Tunisia

Let’s compare France and its former colony Tunisia. Make a plot of fertility versus year with color denoting the country. Do the same for life expecancy. How would you compare Tunisia’s improvement compared to France’s in the past 60 years? Hint: use geom_line

# Filter and select data
new_df <- df %>%
    filter(2016 - year <= 60 & (country == "France" | country == "Tunisia")) %>%
    select(country, fertility, life_expectancy, year)

# Remove missing values before plotting
new_df <- na.omit(new_df)

# Plot with ggplot2
g1 <- ggplot(data=new_df, aes(x=year, y=fertility, color=country)) + geom_line()
g2 <- ggplot(data=new_df, aes(x=year, y=life_expectancy, color=country)) + geom_line()
grid.arrange(g1, g2, ncol=2)

2. Compare Vietnam to the OECD countries

Do the same, but this time compare Vietnam to the OECD countries.

# Filter and select data
new_df <- df %>%
    filter(2016 - year <= 60 & (country == "Vietnam" | category == "OECD")) %>%
    select(country, category, fertility, life_expectancy, year)

# Remove missing values before plotting
new_df <- na.omit(new_df)

# modify label
new_df$category <- with(new_df, ifelse(country == "Vietnam", "Vietnam", category))

# Plot with ggplot2
g1 <- ggplot(data=new_df, aes(x=year, y=fertility, color=category)) + geom_line()
g2 <- ggplot(data=new_df, aes(x=year, y=life_expectancy, color=category)) + geom_line()
grid.arrange(g1, g2, ncol=2)

Examine GDP per capita per day

We are now going to examine GDP per capita per day.

1. Distribution of GDP per capita per day across countries in 1970

Create a smooth density estimate of the distribution of GDP per capita per day across countries in 1970. Include OECD, OPEC, Asia, Africa, and the Americas in the computation. When doing this we want to weigh countries with larger populations more. We can do this using the “weight” argument in geom_density.

# Filter, select, and summarize data
new_df <- df %>%
    filter(year == "1970", category != "others") %>%
    select(country, category, total_gdp, population) %>%
    group_by(category)

# Remove missing values before plotting
new_df <- na.omit(new_df)

# Calculate GDP per capita per day
new_df$gdp_per_capita_per_day <- with(new_df, total_gdp/population/365)

# Plot with ggplot2
g <- ggplot(data=new_df, aes(x=gdp_per_capita_per_day))
g + geom_density(aes(weight=population/sum(population))) + 
  scale_x_log10() + 
  labs(x="GDP per capita per day")

2. Distribution in each group

Now do the same but show each of the five groups separately.

```{r, warning=FALSE} g <- ggplot(data=new_df, aes(x=gdp_per_capita_per_day, fill=category)) g + geom_density(aes(weight=population/sum(population)), alpha=0.3) + scale_x_log10() + labs(x=”GDP per capita per day”)


![](https://i.imgur.com/eREsdKy.png)

### 3. How the distribution has changed through the years

Visualize these densities for several years. Show a couple of of them. Summarize how the distribution has changed through the years.

```{r, warning=FALSE}
# Filter, select, and summarize data
years_selected <- c(1970, 1990, 2010)
new_df <- df %>%
    filter(year %in% years_selected, category != "others") %>%
    select(year, country, category, total_gdp, population)

# Remove missing values before plotting
new_df <- na.omit(new_df)

# Calculate GDP per capita per day
new_df$gdp_per_capita_per_day <- with(new_df, total_gdp/population/365)
g <- ggplot(data=new_df, aes(x=gdp_per_capita_per_day, fill=category))
g + facet_grid(year ~ .) + 
  geom_density(aes(weight=population/sum(population)), alpha=0.3) + 
  scale_x_log10() + 
  labs(x="GDP per capita per day")

References