This is a practice exercise in R covering Monte Carlo simulation, the Central Limit Theorem (CLT), normal approximation, and related topics.

Game of Roulette

In the game of roulette you can bet on several things, including black or red. On this bet, if you win, you double your money; if you lose, you lose what you bet. For example, suppose you bet 1 dollar on black. When the result is black, you earn 1 dollar; when the result is red or green, you lose 1 dollar. How does the casino make money on this, then? If you look at the possibilities, you realize that the chances of red and black are both slightly less than 1/2. There are two green spots, so the probability of landing on black (or red) is actually 18/38, or 9/19.
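To see the house edge in numbers, here is a quick check of the per-bet expectation:

## Probability of landing on black (18 black pockets out of 38)
p <- 18/38

## Expected earnings of a 1-dollar bet on black
1 * p + (-1) * (1 - p)
## [1] -0.05263158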

Sampling Function

Let’s make a quick sampling model for this simple version of roulette. You are going to bet 1 dollar each time you play and always bet on black. Make a model for this process using the sample function. Write a function get_outcome that takes as an argument the number of times you play $N$ and returns your earnings $S_N$.

get_outcome <- function(N){
  ## Create samples
  roulette <- rep(c("red", "black", "green"), times = c(18,18,2))
  
  ## Do sampling: draw N spins with replacement
  events <- sample(roulette, N, replace = TRUE)
  
  ## When the result is black, you earn 1 dollar; otherwise, you lose 1 dollar
  earnings <- ifelse(events == "black", 1, -1)
  
  ## Sum up the earnings for N times
  SN <- sum(earnings)
  return(SN)
}
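A quick sanity check of the function (the seed is arbitrary, chosen only for reproducibility):

set.seed(2017)
get_outcome(100)   # total earnings after betting 1 dollar on black 100 times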

Monte Carlo Simulation for the distribution of total earnings

Use Monte Carlo simulation to study the distribution of total earnings $S_N$ for $N=10,25,100,1000$. That is, study the distribution of earnings for different numbers of plays. Are they similar to the normal distribution? How do the expected values and standard errors change with $N$?

## Define the standard error function
## (the standard deviation of the simulated values, computed with the 1/n formula)
stderr <- function(x){
  sqrt(mean((x-mean(x))^2))
}

## Do Monte Carlo simulation with different values of N
B <- 10^5
par(mfrow=c(4,2))
E_SN <- c()
for(N in c(10, 25, 100, 1000)){
  
  ## Play the game for B runs; in each run we bet N times and record the total earnings.
  events <- replicate(B, get_outcome(N))
  E_SN <- c(E_SN, mean(events))
  cat("N=",N,"\n------\n")
  cat("Expected Value: ", mean(events), "\n")
  cat("Standard Error: ", stderr(events), "\n\n")
  
  ## Count the occurrence of different values of total earnings
  tab <- table(events)
  
  ## X-Y Plot of earnings frequencies
  plot(as.numeric(names(tab)), as.vector(tab), main=paste("Distribution of Earnings ( N =", N, ")"), xlab="total earnings", ylab="frequency")
  
  ## Q-Q Plot
  z <- (events - mean(events)) / stderr(events)
  qqnorm(z, main=paste("Normal Q-Q Plot ( N =", N, ")"))
  abline(0,1)
}
## N= 10 
## ------
## Expected Value:  -0.53592 
## Standard Error:  3.145993

## N= 25 
## ------
## Expected Value:  -1.33416 
## Standard Error:  4.991162

## N= 100 
## ------
## Expected Value:  -5.22196 
## Standard Error:  9.991617

## N= 1000 
## ------
## Expected Value:  -52.67244 
## Standard Error:  31.60298

Monte Carlo Simulation for the distribution of average winnings

Repeat the previous simulation but for the average winnings $S_N/N$.

## Do Monte Carlo simulation with different values of N
B <- 10^5
par(mfrow=c(4,2))
for(N in c(10, 25, 100, 1000)){
  
  ## Play the game for B runs; in each run we bet N times and record the average earnings.
  events <- replicate(B, get_outcome(N))/N
  cat("N=",N,"\n------\n")
  cat("Expected Value: ", mean(events), "\n")
  cat("Standard Error: ", stderr(events), "\n")
  cat("\n")
  
  ## Count the occurrence of different values of average earnings
  tab <- table(events)
  
  ## X-Y Plot of average-earnings frequencies
  plot(as.numeric(names(tab)), as.vector(tab), main=paste("Distribution of Average Earnings ( N =", N, ")"), xlab="average earnings", ylab="frequency")
  
  ## Q-Q Plot
  z <- (events - mean(events)) / stderr(events)
  qqnorm(z, main=paste("Normal Q-Q Plot ( N =", N, ")"))
  abline(0,1)
}
## N= 10 
## ------
## Expected Value:  -0.05146 
## Standard Error:  0.3141412

## N= 25 
## ------
## Expected Value:  -0.0523152 
## Standard Error:  0.1987781

## N= 100 
## ------
## Expected Value:  -0.0529774 
## Standard Error:  0.09981486

## N= 1000 
## ------
## Expected Value:  -0.0526206 
## Standard Error:  0.03166803
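These simulated standard errors match the CLT prediction $\mathrm{SE}(S_N/N)=\sigma/\sqrt{N}$, where $\sigma=2\sqrt{p(1-p)}$ is the per-bet standard deviation (a quick check against the values above):

## Per-bet standard deviation
p <- 9/19
sigma <- 2 * sqrt(p * (1 - p))

## Theoretical standard errors of the average S_N / N
N <- c(10, 25, 100, 1000)
round(sigma / sqrt(N), 4)
## [1] 0.3158 0.1997 0.0999 0.0316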

Comparison between theoretical values and simulation results

What is the expected value of $S_N$ in the previous simulation theoretically? What is the standard error of $S_N$? Are they close to your simulation results?

## Per-bet expected earnings: win 1 dollar with probability 9/19, lose 1 dollar with probability 10/19
mu <- 1*(9/19) + (-1)*(10/19)
N <- c(10, 25, 100, 1000)
x <- data.frame("N"=N, "theoretical values"=mu*N, "simulation results"=E_SN)
x
##      N theoretical.values simulation.results
## 1   10         -0.5263158           -0.53592
## 2   25         -1.3157895           -1.33416
## 3  100         -5.2631579           -5.22196
## 4 1000        -52.6315789          -52.67244

For every bet, the expected earnings can be calculated as $1 \times p+(-1) \times (1-p)$, where $p$ is the probability that the result is black. So, as in the table above, we can calculate the theoretical expected earnings when you bet 10 times or more. Comparing our Monte Carlo simulation results with these theoretical values, we see that the differences between them are no more than 0.06. That is, our simulation results are very close to the theoretical values for the expected earnings when we bet 10, 25, 100, or 1000 times.
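The question also asks about the standard error of $S_N$. Theoretically, $\mathrm{SE}(S_N)=\sqrt{N}\,\sigma$, where $\sigma=2\sqrt{p(1-p)}$ is the standard deviation of a single bet; a quick check shows these are also close to the simulated standard errors above (3.146, 4.991, 9.992, 31.603):

p <- 9/19
sigma <- 2 * sqrt(p * (1 - p))
N <- c(10, 25, 100, 1000)
round(sqrt(N) * sigma, 4)
## [1]  3.1579  4.9931  9.9861 31.5789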

Approximate the probability that the casino loses money

Use the Central Limit Theorem (CLT) to approximate the probability that the casino loses money when you play 25 times. Then use a Monte Carlo simulation to confirm.

N <- 25

## Use the Central Limit Theorem (CLT) to approximate
mu <- 1*(9/19) + (-1)*(10/19)
std.dev <- (1-(-1))*sqrt((9/19)*(10/19))
mean <- N*mu
s <- sqrt(N)*std.dev
expected <- 1 - pnorm(0, mean = mean, sd = s)
cat("With CLT approximation, \nthe probability that the casino loses money (N = 25) is", expected, ".\n\n")
## With CLT approximation, 
## the probability that the casino loses money (N = 25) is 0.3960737 .

## Use a Monte Carlo simulation to confirm
B <- 10^5
events <- replicate(B, get_outcome(N))
simulated <- mean(events > 0)  
cat("With Monte Carlo simulation, \nthe probability that the casino loses money (N = 25) is", simulated, ".\n")
## With Monte Carlo simulation, 
## the probability that the casino loses money (N = 25) is 0.39489 .

To use the Central Limit Theorem (CLT) to approximate the earnings per bet, we need the expectation and the standard deviation of the earnings for a single bet, which are $\mu = 1 \times p+(-1) \times (1-p)$ and $\sigma = \vert 1-(-1)\vert \times \sqrt{p \times (1-p)}$ respectively. Then we use $25 \times \mu$ and $\sqrt{25} \times \sigma$ as the mean and sd arguments of pnorm to calculate the approximated probability.

Comparing the approximated probability with the probability calculated from the Monte Carlo simulation, we find that they are very close. That is, the probability of the casino's earnings/losses can indeed be calculated from the CLT approximation.

The probability that the casino loses money as a function of the number of plays

In general, what is the probability that the casino loses money as a function of $N$? Make a plot for values ranging from 25 to 1,000. Why does the casino give you free drinks if you keep playing?

## Define the function that calculates the probability 
## that the casino loses money if you play N times
CLT_approximation <- function(N){
  mu <- 1*(9/19) + (-1)*(10/19)
  std.dev <- (1-(-1))*sqrt((9/19)*(10/19))
  mean <- N*mu
  s <- sqrt(N)*std.dev
  expected <- 1 - pnorm(0, mean = mean, sd = s)
  return(expected)
}

## Calculate the probability that the casino loses money
## for different values of N (from N=25 to N=1000)
x <- seq(25, 1000)
y <- sapply(x, CLT_approximation)

## X-Y Plot
plot(x, y, cex=0.3, xlab="N", ylab="Probability", main="Probability that the casino loses money")

Using the CLT approximation, we can see that the probability that the casino loses money decreases as $N$ (the number of times a player plays the game) increases. This is why the casino hopes the player keeps playing at the same table as long as possible.
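As an illustration (the 10% threshold here is arbitrary), we can reuse the x and y computed above to find how many plays the casino needs before its chance of losing money drops below 10%:

## Smallest N for which the CLT-approximated probability is below 10%
min(x[y < 0.10])
## [1] 592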

Predict the outcomes of the USA presidential election in each state

In 2012, Nate Silver and other data scientists correctly predicted the outcome of the USA presidential election in each state. They did this by aggregating data from many polls to create more precise estimates than a single poll can provide.

In this problem, we will try to predict the result of the 2016 USA presidential election by studying the performance of polls in elections that have already occurred and then aggregating their results.

Central limit theorem (CLT) on poll results

Let $X_{i}=1$ if the $i$th voter is a Democrat and $X_{i}=0$ if a Republican, for $i=1, \cdots, N$. The central limit theorem (CLT) says that if $N$ is large,

\[\bar{X}=\frac{1}{N}\sum_{i=1}^{N}X_{i}=\hat{p} \sim N(p, \frac{\hat{p}(1-\hat{p})}{N})\]

where $p$ is the proportion of Democrats in the population. Based on the above CLT result, what is the 95% CI for $p$?

The Central Limit Theorem tells us that the sample mean $\bar{X}$ of $N$ i.i.d. random variables is approximately normal with mean $\mu$ (the population mean) and standard deviation $\sigma/\sqrt{N}$, where $\sigma$ is the population standard deviation and $N$ is the sample size. As a result, the random interval from $\bar{X} - Z_{0.975} \frac{\sigma}{\sqrt{N}}$ to $\bar{X} + Z_{0.975}\frac{\sigma}{\sqrt{N}}$ (where $Z_{0.975}$ is approximately 1.96) has a 95% probability of covering the true value of $\mu$.

Since we know the approximate distribution of $\hat{p}$, the 95% confidence interval for $p$ (the proportion of Democrats in the population) can be formulated as below:

\[\bar{X}-Z_{0.975}\frac{S_{X}}{\sqrt{N}} \leq p \leq \bar{X}+Z_{0.975}\frac{S_{X}}{\sqrt{N}}\]

where $N$ is the sample size, $\bar{X}$ is actually $\hat{p}$, and $S_{X}$ is the sample standard deviation $\sqrt{\hat{p}(1-\hat{p})}$.
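As a small sketch, this interval can be wrapped into a helper function (prop_ci is a hypothetical name, not from any package):

## 95% CI for p given an observed proportion p_hat from a sample of size N
prop_ci <- function(p_hat, N, conf = 0.95){
  Q <- qnorm(1 - (1 - conf)/2)
  se <- sqrt(p_hat * (1 - p_hat) / N)
  c(lower = p_hat - Q * se, upper = p_hat + Q * se)
}

## Example: 47 Democrats observed among 100 sampled voters
round(prop_ci(0.47, 100), 3)
## lower upper 
## 0.372 0.568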

Confidence interval (CI)

Suppose that the true population proportion $p=0.47$. Perform a simulation for $N=30$ to show that the CI you create in the previous section is actually a 95% CI.

p <- 0.47
N <- 30
B <- 10^5

## Generate a population corresponding to the true proportion of Democrats (p = 0.47)
voters <- rep(c(0,1), times = c(53,47))

## Do Monte Carlo simulation B times
covered <- 0
for(i in 1:B) {
  
  ## Sample 30 voters from the population
  X <- sample(voters, N, replace=TRUE)
  
  ## Calculate confidence interval for this sampling
  p_hat <- mean(X)
  se <- sqrt(p_hat*(1-p_hat)/N)   #se <- sd(X)/sqrt(N)
  Q <- qnorm(0.975)               #Q <- qt(0.975, df=(N-1))
  interval <- c(p_hat-Q*se, p_hat+Q*se)
  
  ## If the true parameter falls in the interval, increment the count
  if(p <= interval[2] & p >= interval[1]){
    covered <- covered + 1
  }
}

cat("Simulation shows that for our random confidence interval, the probability of falling on the parameter we are estimating is", coverd/B, 
    "\nSince The central limit theorem applies poorly to binomial distribution with a sample size less than 30, we should increase the value of N (e.g., N = 100), then the probability would be close to 95%.")
## Simulation shows that for our random confidence interval, the probability of falling on the parameter we are estimating is 0.93203 
## Since The central limit theorem applies poorly to binomial distribution with a sample size less than 30, we should increase the value of N (e.g., N = 100), then the probability would be close to 95%.
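To check that claim, the whole simulation can be wrapped into a function of N (a sketch; per the argument above, the coverage should move toward 0.95 as N grows):

coverage <- function(N, p = 0.47, B = 10^4){
  Q <- qnorm(0.975)
  hits <- replicate(B, {
    X <- sample(c(0, 1), N, replace = TRUE, prob = c(1 - p, p))
    p_hat <- mean(X)
    se <- sqrt(p_hat * (1 - p_hat) / N)
    p >= p_hat - Q * se & p <= p_hat + Q * se
  })
  mean(hits)
}

coverage(30)    # roughly 0.93, as found above
coverage(100)   # should be closer to 0.95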

Test statistic

We are interested in testing $H_{0}: p=0.5$ versus $H_{a}: p \neq 0.5$. The test statistic we will use is

\[t=\frac{\hat{p}-0.5}{\sqrt{\frac{\hat{p}(1-\hat{p})}{N}}}\]

If a poll drew $N=100$ voters and obtained a proportion of Democrats of $\hat{p}=0.49$, what is the p-value for this poll result? Please calculate the p-value using the population data and the normal approximation.

## Define t-statistics function
tstat <- function(p, N){ return((p-0.5)/sqrt(p*(1-p)/N)) }

## Calculate t-statistics for p hat
p <- 0.49
N <- 100
obststat <- tstat(p, N)

## Calculate the distribution of the t-statistic using population data corresponding to the null hypothesis
B <- 10^5
null_distribution <- function(N){
    p <- mean(sample(c(0, 1), N, replace=TRUE, prob=c(0.5, 0.5)))
    return(tstat(p, N))
}
null <- replicate(B, null_distribution(N))
pval <- mean(null > abs(obststat)) + mean(null < -abs(obststat))

## Because of the definition of p and N, the t-statistics generated from this simulation
## fall on a coarse grid (roughly multiples of 0.2), so I fit a smooth curve to approximate
## the real distribution and then calculate the p-value from the curve.
idx <- 1:100
tab <- sapply(idx/100, function(e) quantile(null, probs = e))
null_smoothed <- predict(loess(as.vector(tab) ~ idx))
pval_smoothed <- mean(null_smoothed > abs(obststat)) + mean(null_smoothed < -abs(obststat))

## Calculate the p-value using normal approximation
righttail <- 1 - pnorm(abs(obststat))
lefttail <- pnorm(-abs(obststat))
npval <- lefttail + righttail

## Show Result
x <- data.frame("method"=c("using population data",
                           "using population data (fit a smooth curve)",  "using normal approximation"), 
                "p-value"=c(pval, pval_smoothed, npval))
x
##                                       method   p.value
## 1                      using population data 0.7647800
## 2 using population data (fit a smooth curve) 0.8400000
## 3                 using normal approximation 0.8414493

Data wrangling

Now returning to our analysis of the 2016 USA presidential election. The first step in our analysis is to wrangle the data into a form that will simplify the analysis. Ultimately, we want a table of results with each poll represented by a row, including results for each candidate as well as information about the poll such as its name and date.

Install and load the pollstR package. This package provides functions to access data in the Huffington Post’s database. Read the help file for the pollster_charts_polls() function and write a function that reads the poll data related to 2016 General Election: Trump vs. Clinton in this chart. Name the object race2016. Hint: select the right slug for this chart.

library(pollstR)

slug <- "2016-general-election-trump-vs-clinton"
race2016 <- pollster_charts_polls(slug)[["content"]]

Create a tidy data set

Examine and familiarize yourself with the race2016 object. Note that the content component has a table with election results. Look at the content component and create a new table with only the variables Trump, Clinton, poll_slug, start_date, observations, margin_of_error, and mode.

library("dplyr")

df <- race2016 %>% select("Trump", "Clinton", "poll_slug", "start_date", "observations", "margin_of_error", "mode")

Visualization of poll data

To explore these poll data, write code to create a plot similar to this chart. Hint: follow the example code provided on this webpage.

library(ggplot2)
theme_set(theme_bw())

ggplot() + geom_point(data = df, aes(x = start_date, y = Trump, color = "Trump"), alpha = 0.5) +
          geom_smooth(data = df, method = "loess", se = FALSE, aes(x = start_date, y = Trump, color = "Trump")) +
          geom_point(data = df, aes(x = start_date, y = Clinton, color = "Clinton"), alpha = 0.5) +
          geom_smooth(data = df, method = "loess", se = FALSE, aes(x = start_date, y = Clinton, color = "Clinton")) + 
          ggtitle("2016 General Election: Trump vs. Clinton") + ylab("Support Ratio") + xlab("Date")

Aggregation of poll result

Now we want to aggregate these polls’ results. We can assume that all these polls were based on independent data. So, in section “Create a tidy data set”, how many observations does the created table actually show data for? Hint: the column observations indicates the number of voters picked in each poll. By combining these independent polls, how many votes have we observed?

votes <- df %>% select("observations") %>% sum()
cat("The number of votes we have observed is", votes)
## The number of votes we have observed is 969747

Estimated proportion of voting for Trump/Clinton

Furthermore, we can deduce how many voters favor Trump and how many favor Clinton in each poll. By adding these all up, what are the estimated proportions voting for Trump and voting for Clinton from these polls?

## Deduce how many voters favor Trump and how many voters favor Clinton in each poll, and add them up
df$Trump.pop <- df$Trump/100 * df$observations
df$Clinton.pop <- df$Clinton/100 * df$observations
votes.Trump <- df$Trump.pop %>% sum()
votes.Clinton <- df$Clinton.pop %>% sum()

## Calculate the estimated proportions voting for Trump and for Clinton
tab <- matrix(c(votes.Trump/votes, votes.Clinton/votes), ncol=1, byrow=TRUE)

## Show Result
colnames(tab) <- c("estimated proportion")
rownames(tab) <- c("Trump", "Clinton")
tab
##         estimated proportion
## Trump              0.4196854
## Clinton            0.4750039

Confidence interval for the true proportion voting for Trump/Clinton

Using the result in the section “Central limit theorem (CLT) on poll results”, what is the 95% confidence interval for the true proportion voting for Trump based on the aggregated data? What is the 95% confidence interval for the proportion for Clinton?

N <- votes

## Calculate the 95% confidence interval for the proportion for Trump
## (note: rep() truncates the fractional vote counts toward zero)
X <- rep(c(0,1), times = c(votes-votes.Trump, votes.Trump))
se <- sd(X)/sqrt(N)
Q <- qnorm(0.975)
interval.Trump <- c(mean(X)-Q*se, mean(X)+Q*se)

## Calculate the 95% confidence interval for the proportion for Clinton
X <- rep(c(0,1), times = c(votes-votes.Clinton, votes.Clinton))
se <- sd(X)/sqrt(N)
Q <- qnorm(0.975)
interval.Clinton <- c(mean(X)-Q*se, mean(X)+Q*se)

## Show Result
tab <- matrix(c(paste(interval.Trump, collapse=" ~ "), paste(interval.Clinton, collapse=" ~ ")), ncol=1, byrow=TRUE)
colnames(tab) <- c("95% confidence interval for the true proportion")
rownames(tab) <- c("Trump", "Clinton")
tab
##         95% confidence interval for the true proportion
## Trump   "0.418702925705153 ~ 0.420667383633582"        
## Clinton "0.474009856955411 ~ 0.475997670788963"

Spread of the estimated proportion voting for Trump/Clinton

For illustrative purposes, we will assume that there are only two parties and call $p$ the proportion voting for Clinton and $1-p$ the proportion voting for Trump. We are interested in the spread between Trump and Clinton, $d=(1-p)-p=1-2p$, estimated by $\hat{d}=1-2\hat{p}$, where $\hat{p}$ is the estimated proportion based on the aggregated data. Since $\hat{d}$ is a linear function of $\hat{p}$ with slope $-2$, the standard error for the spread is twice as large as that for $\hat{p}$. So our confidence interval for the spread $d$ is:

## Exclude votes that are neither for Clinton nor for Trump
N <- votes.Clinton + votes.Trump
p <- votes.Clinton/N

## Calculate the confidence interval of spread d
d <- 1-2*p
se <- sqrt(p*(1-p)/N) * 2
Q <- qnorm(0.975)
interval.spread <- c(d-Q*se, d+Q*se)

## Print out Result
cat("95% Confidence interval for the spread (d) is", paste(interval.spread, collapse = " ~ "))
## 95% Confidence interval for the spread (d) is -0.0639300252436611 ~ -0.0597297183214559

Null hypothesis

We are interested in testing $H_{0}: d=0$ versus $H_{a}: d \neq 0$. Create your test statistic for this test, and obtain the p-value based on the aggregated data.

The test statistic for $d$ can be formulated as a function of $\hat{p}$, since $d = 1-2p$:

\[t=\frac{1-2\hat{p}}{2\sqrt{\frac{\hat{p}(1-\hat{p})}{N}}}\]

where $\hat{p}$ is the estimated proportion voting for Clinton and $N$ is the sample size.

N <- votes.Clinton + votes.Trump
p <- votes.Clinton/N

## Define t-statistics function for spread d
tstat <- function(p){ return((1-2*p)/(2*sqrt(p*(1-p)/N))) }

## Calculate the t-statistics for spread
obststat <- tstat(p)

## Calculate p-value using normal approximation
righttail <- 1 - pnorm(abs(obststat))
lefttail <- pnorm(-abs(obststat))
npval <- lefttail + righttail

## Show Result
cat("The p-value based on the aggregrated data is", npval, 
    ".\nThat is, the evidence is strong to reject null hypothesis, and thus spread (d) is not zero.")
## The p-value based on the aggregrated data is 0 .
## That is, the evidence is strong to reject null hypothesis, and thus spread (d) is not zero.
