Bootstrapping

<br>
<br>
.pull-right[

# Bootstrapping
## Dr. Mine Dogucu
]

---

## Data

```r
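library(dplyr)  # provides %>%, filter(), select(), and glimpse()

# Assumes `lapd` already holds the raw payroll data
lapd <- lapd %>% 
  janitor::clean_names() %>%   # standardize column names to snake_case
  filter(year == 2018) %>%     # keep 2018 records only
  select(base_pay)             # keep the one variable we need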
```

--
We will be using 2018 payroll data from the Los Angeles Police Department (LAPD).

```r
glimpse(lapd)   
```

```
## Rows: 14,824
## Columns: 1
## $ base_pay <dbl> 119321.60, 113270.70, 148116.00, 78676.87, 109373.63, 95001.7…
```

---

## Population Distribution

---

## True Median

```r
median(lapd$base_pay)
```

```
## [1] 97600.66
```
This is a **population parameter**. We usually do not know population parameters, but we can **estimate** them. Estimation requires sample data.

---

### Sample 1

```
##  [1]      0.00 101248.80 109378.40 132957.60  90956.57 132743.97 104091.10
##  [8] 104100.80  48696.00 125958.40
```

Median of sample 1 is 104095.95.

### Sample 2

```
##  [1] 95971.20 96193.81     0.00 34479.44 90005.56 66881.94 75342.80 68034.18
##  [9] 54612.80     0.00
```

Median of sample 2 is 67458.06.

---

### Sample 3

```
##  [1] 143967.89 109386.56 119321.60 106724.41  65343.46  90583.56  96848.28
##  [8] 103892.80  67380.10  85136.00
```

Median of sample 3 is 100370.54.

### Sample 4

```
##  [1]      0.00 101248.80 109378.40 132957.60  90956.57 132743.97 104091.10
##  [8] 104100.80  48696.00 125958.40
```

Median of sample 4 is 104095.95.

---

## Sampling Variability

- Note that the median varies from sample to sample. A sample's median is not necessarily equal to the population median we are trying to estimate. This variation in sample medians is called sampling variability.

- In real life, taking samples from the population is costly. We often have only one sample that we can use to estimate the population parameter.

- How can we take sampling variability into account when we only have one sample?

- There are different ways to do this. We will use **bootstrapping** in this class.
    - You have done this using the Central Limit Theorem in your introductory statistics courses.
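
A small simulation can make the first bullet concrete. This is only a sketch on a made-up, right-skewed population (`fake_pop`, drawn from a log-normal distribution), not the LAPD data. The sample size of 10 mirrors the samples shown earlier.

```r
set.seed(1)

# Hypothetical population: 10,000 made-up salaries (NOT the LAPD data)
fake_pop <- rlnorm(10000, meanlog = 11.4, sdlog = 0.4)

# Draw 1,000 samples of size 10 and record each sample's median
sample_medians <- replicate(1000, median(sample(fake_pop, size = 10)))

median(fake_pop)        # the population parameter
range(sample_medians)   # sample medians fall on both sides of it
sd(sample_medians)      # one way to quantify sampling variability
```

In practice we cannot run this simulation, because we do not have access to the whole population.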

---


## Random Sample (`$n$` = 20)

```r
library(infer) # for bootstrap related functions
set.seed(12345)
lapd_sample <- sample_n(lapd, 20)

lapd_sample$base_pay
```

```
##  [1]      0.00 101248.80 109378.40 132957.60  90956.57 132743.97 104091.10
##  [8] 104100.80  48696.00 125958.40  95971.20  96193.81      0.00  34479.44
## [15]  90005.56  66881.94  75342.80  68034.18  54612.80      0.00
```

---

## Bootstrapping

```r
boot <- lapd_sample %>% 
  specify(response = base_pay) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "median")
```

---

<video width="80%" align = "center" controls>
  <source src="screencast/7-infer-bootstrap.mp4" type="video/mp4">
</video>
---

```r
visualize(boot) +
  scale_x_continuous(labels = scales::comma_format()) +
  theme_bw() +
  theme(text = element_text(size = 20)) 
```

---

## 95% Confidence Interval

We can construct the 95% confidence interval by calculating the 2.5th and 97.5th percentiles of the bootstrap distribution.

```r
boot %>% 
  summarize(lower_bound = quantile(stat, 0.025),
            upper_bound = quantile(stat, 0.975))
```

```
## # A tibble: 1 x 2
##   lower_bound upper_bound
##         <dbl>       <dbl>
## 1      60747.     104091.
```

This confidence interval captures the true median (97600.66).

---

## Interpretation of Confidence Intervals

.font50[<svg aria-hidden="true" role="img" viewBox="0 0 576 512" style="height:1em;width:1.12em;vertical-align:-0.125em;margin-right:0.2em;font-size:inherit;fill:black;overflow:visible;position:relative;"><path d="M569.517 440.013C587.975 472.007 564.806 512 527.94 512H48.054c-36.937 0-59.999-40.055-41.577-71.987L246.423 23.985c18.467-32.009 64.72-31.951 83.154 0l239.94 416.028zM288 354c-25.405 0-46 20.595-46 46s20.595 46 46 46 46-20.595 46-46-20.595-46-46-46zm-43.673-165.346l7.418 136c.347 6.364 5.609 11.346 11.982 11.346h48.546c6.373 0 11.635-4.982 11.982-11.346l7.418-136c.375-6.874-5.098-12.654-11.982-12.654h-63.383c-6.884 0-12.356 5.78-11.981 12.654z"/></svg>] Calculating a confidence interval does not guarantee that the interval will capture the true value of the population parameter.

If we were to take a large number of samples (we only had one) and construct a 95% confidence interval from each of them, we would expect about 95% of those intervals to capture the true value of the population parameter.
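
This repeated-sampling idea can be checked with a simulation. The sketch below again uses a made-up population (`fake_pop`, not the LAPD data) so that the true median is known. `boot_ci()` is a hypothetical helper written for this illustration. With small samples and a limited number of repetitions, the observed proportion will be near 0.95 but not exactly equal to it.

```r
set.seed(1)

# Hypothetical population (NOT the LAPD data) with a known median
fake_pop <- rlnorm(10000, meanlog = 11.4, sdlog = 0.4)
true_median <- median(fake_pop)

# Percentile bootstrap interval for the median of one sample
boot_ci <- function(x, reps = 500) {
  boot_medians <- replicate(
    reps,
    median(sample(x, size = length(x), replace = TRUE))
  )
  quantile(boot_medians, c(0.025, 0.975), names = FALSE)
}

# Take 200 samples of size 20 and check whether each
# interval captures the true median
captured <- replicate(200, {
  ci <- boot_ci(sample(fake_pop, size = 20))
  ci[1] <= true_median && true_median <= ci[2]
})

mean(captured)  # proportion of intervals that captured the true median
```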

---

## Reminders

- A sample statistic `$\neq$` the population parameter.

- Different samples can have different statistics; this is sampling variability.

- We have constructed a confidence interval to infer about a median, but we could do the same for a mean, a proportion, a difference between two group means, etc.
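
As a sketch of the last point, the same `infer` pipeline can target the mean by changing the `stat` argument. The data frame is rebuilt here from the 20 values printed earlier so that the example runs on its own. The name `boot_mean` is made up for this illustration, and no output is shown because it depends on the random draws.

```r
library(dplyr)
library(infer)

set.seed(12345)

# The same 20 sampled base pay values, rebuilt as a data frame
lapd_sample <- data.frame(
  base_pay = c(0.00, 101248.80, 109378.40, 132957.60, 90956.57, 132743.97,
               104091.10, 104100.80, 48696.00, 125958.40, 95971.20, 96193.81,
               0.00, 34479.44, 90005.56, 66881.94, 75342.80, 68034.18,
               54612.80, 0.00)
)

# Bootstrap distribution of the sample mean instead of the median
boot_mean <- lapd_sample %>% 
  specify(response = base_pay) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "mean")

# 95% percentile interval, computed the same way as before
boot_mean %>% 
  summarize(lower_bound = quantile(stat, 0.025),
            upper_bound = quantile(stat, 0.975))
```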