Simple Linear Regression

<br>
<br>
.pull-right[

# Simple Linear Regression
## Dr. Mine Dogucu
]

---

#### Data `babies` in `openintro` package

```
## Rows: 1,236
## Columns: 8
## $ case      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ bwt       <int> 120, 113, 128, 123, 108, 136, 138, 132, 120, 143, 140, 144, …
## $ gestation <int> 284, 282, 279, NA, 282, 286, 244, 245, 289, 299, 351, 282, 2…
## $ parity    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ age       <int> 27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, …
## $ height    <int> 62, 64, 64, 69, 67, 62, 62, 65, 62, 66, 68, 64, 63, 61, 63, …
## $ weight    <int> 100, 135, 115, 190, 125, 93, 178, 140, 125, 136, 120, 124, 1…
## $ smoke     <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, …
```

---

##  Baby Weights

```r
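library(openintro)  # provides the babies data
library(ggplot2)    # provides ggplot() and geom_point()

ggplot(babies,
       aes(x = gestation, y = bwt)) +
  geom_point()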
```

---

##  Baby Weights

```r
ggplot(babies,
       aes(x = gestation, y = bwt)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

`lm` stands for linear model  
`se` stands for standard error; `se = FALSE` hides the confidence band around the line

---

| Symbol | Role        | Variable     | Type    |
|--------|-------------|--------------|---------|
| y      | Response    | Birth weight | Numeric |
| x      | Explanatory | Gestation    | Numeric |

---

## Linear Equations Review

Recall from your previous math classes

`$y = mx + b$`

where `$m$` is the slope and `$b$` is the y-intercept

e.g. `$y = 2x -1$`

Notice anything different between the baby weights plot and this one?

---

**Math** class

`$y = b + mx$`

`$b$` is y-intercept  
`$m$` is slope  

**Stats** class

`$y_i = \beta_0 +\beta_1x_i + \epsilon_i$`

`$\beta_0$` is y-intercept  
`$\beta_1$` is slope  
`$\epsilon_i$` is error/residual  
`$i = 1, 2, \ldots, n$` is the identifier for each point

---

```r
model_g <- lm(bwt ~ gestation, data = babies)
```

`lm` stands for linear model. Here we fit a linear regression model. Note that the variables are entered in `y ~ x` order: response first, then explanatory.

---

```r
broom::tidy(model_g)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  -10.1      8.32       -1.21 2.27e- 1
## 2 gestation      0.464    0.0297     15.6  3.22e-50
```

--
`$\hat {y}_i = b_0 + b_1 x_i$`

`$\hat {\text{bwt}_i} = b_0 + b_1 \text{ gestation}_i$`

`$\hat {\text{bwt}_i} = -10.1 + 0.464\text{ gestation}_i$`

---

## Expected bwt for a baby with 300 days of gestation

`$\hat {\text{bwt}_i} = -10.1 + 0.464\text{ gestation}_i$`

`$\hat {\text{bwt}} = -10.1 + 0.464 \times 300$`

`$\hat {\text{bwt}} =$` 129.1

For a baby with 300 days of gestation the expected birth weight is 129.1 ounces.
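
The arithmetic above can be reproduced in R. This is a minimal sketch that uses the rounded estimates from the `tidy()` output; the names `b0`, `b1`, and `gestation_days` are chosen here for illustration.

```r
# Rounded estimates taken from the tidy() output
b0 <- -10.1   # intercept
b1 <- 0.464   # slope (ounces per day of gestation)

# Gestation length we want a prediction for
gestation_days <- 300

# Expected birth weight in ounces
bwt_hat <- b0 + b1 * gestation_days
bwt_hat
```

With the fitted model available, `predict(model_g, newdata = data.frame(gestation = 300))` should give a similar value. It uses the unrounded coefficients, so the decimals may differ slightly.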

---

## Interpretation of estimates

.pull-left[
<img src="09a-slr_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

`$b_1 = 0.464$`, which means that for a one-unit (one-day) increase in gestation period, the expected increase in birth weight is 0.464 ounces.

]

.pull-right[
<img src="09a-slr_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

`$b_0 = -10.1$`, which means that for a gestation period of 0 days the expected birth weight is -10.1 ounces.
This does NOT make sense.
]

---

## Extrapolation

- There is no such thing as 0 days of gestation.

- Birth weight cannot possibly be -10.1 ounces.

- Extrapolation happens when we use a model outside the range of the observed x-values. We cannot know how the relationship behaves outside the scope of what we have observed (e.g. it may be non-linear).
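
One way to guard against extrapolation is to compare a new x-value with the observed range before predicting. The sketch below uses only the first 12 gestation values visible in the data preview, not the full dataset, and `is_extrapolation()` is a hypothetical helper written for this illustration.

```r
# First 12 gestation values shown in the data preview (the full data has 1,236 rows)
gestation_preview <- c(284, 282, 279, NA, 282, 286, 244, 245, 289, 299, 351, 282)

# Observed range, ignoring the missing value
obs_range <- range(gestation_preview, na.rm = TRUE)

# Hypothetical helper: TRUE when a new x-value lies outside the observed range
is_extrapolation <- function(x_new, x_range) {
  x_new < x_range[1] | x_new > x_range[2]
}

is_extrapolation(0, obs_range)    # gestation of 0 days
is_extrapolation(300, obs_range)  # gestation of 300 days
```

On the full data, `range(babies$gestation, na.rm = TRUE)` would give the range to check against.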

---

## Baby number 148

```r
babies %>% 
  filter(case == 148) %>% 
  select(bwt, gestation)
```

```
## # A tibble: 1 x 2
##     bwt gestation
##   <int>     <int>
## 1   160       300
```

![](09a-slr_files/figure-html/unnamed-chunk-13-1.png)


---

## Baby #148

**Expected**

`$\hat y_{148} = b_0 +b_1x_{148}$`

`$\hat y_{148} = -10.1 + 0.464\times300$`

`$\hat y_{148}$` = 129.1

**Observed**

`$y_{148} =$` 160

---

## Residual for `i = 148`

`$y_{148} = 160$`

<hr>

`$\hat y_{148}$` = 129.1

<hr>

`$e_{148} = y_{148} - \hat y_{148}$`

`$e_{148} =$` 30.9
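
The same residual can be computed in R. This is a minimal sketch using the observed value for baby 148 and the rounded estimates from the `tidy()` output; the variable names are chosen here for illustration.

```r
# Observed birth weight (ounces) and gestation (days) for baby 148
y_148 <- 160
x_148 <- 300

# Fitted value from the rounded estimates
y_hat_148 <- -10.1 + 0.464 * x_148

# Residual: observed minus expected
e_148 <- y_148 - y_hat_148
e_148
```

A positive residual means the observed point lies above the regression line.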

---

## Least Squares Regression

The goal is to minimize

`$$e_1^2 + e_2^2 + \dots + e_n^2$$`

which can be rewritten as

`$$\sum_{i = 1}^n e_i^2$$`
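
To see the minimization at work, the sketch below uses a small made-up dataset (not the `babies` data). It computes the standard closed-form estimates for simple linear regression, which are not shown on these slides, and checks them against `lm()`. The helper `sse()` is a hypothetical name.

```r
# Made-up toy data for illustration only (not from the babies dataset)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared residuals for a candidate intercept b0 and slope b1
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# Closed-form least squares estimates for simple linear regression
b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)

# lm() should return the same estimates
fit <- lm(y ~ x)
coef(fit)
c(b0, b1)

# Nudging either estimate away from the solution increases the sum of squares
sse(b0, b1)
sse(b0 + 0.1, b1)
sse(b0, b1 + 0.1)
```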

---

## Conditions for Least Squares Regression

- Linearity

- Normality of Residuals

- Constant Variance

- Independence
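
The first three conditions are usually checked with residual plots. The sketch below is one possible way to draw them with base R graphics on simulated data (not the `babies` data); the seed, sample size, and simulation parameters are arbitrary choices made for this illustration.

```r
set.seed(1)                                # arbitrary seed for reproducibility
x <- runif(100, min = 240, max = 300)      # simulated explanatory values
y <- -10 + 0.5 * x + rnorm(100, sd = 15)   # simulated linear response with noise
fit <- lm(y ~ x)

res <- resid(fit)        # residuals
fit_vals <- fitted(fit)  # fitted values

# Linearity and constant variance: residuals vs. fitted values
# (look for no curve and a roughly even vertical spread)
plot(fit_vals, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality of residuals: points should follow the reference line
qqnorm(res)
qqline(res)
```

The same calls should work on `model_g` by replacing `fit` with it.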

---


![](09a-slr_files/figure-html/unnamed-chunk-17-1.png)

![](09a-slr_files/figure-html/unnamed-chunk-18-1.png)

---

![](09a-slr_files/figure-html/unnamed-chunk-19-1.png)

![](09a-slr_files/figure-html/unnamed-chunk-20-1.png)

---

## Independence

Harder to check because we need to know how the data were collected.

In the description of the dataset it says: _[a study] considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area._

It is possible that babies born in the same hospital have similar birth weights.

Correlated data examples: patients within hospitals, students within schools, people within neighborhoods, time-series data.

---

### Inference: Confidence Interval

```r
confint(model_g)
```

```
##                   2.5 %    97.5 %
## (Intercept) -26.3915884 6.2632199
## gestation     0.4059083 0.5226169
```

Note that the 95% confidence interval for the slope does not contain zero, and all the values in the interval are positive, indicating a possible positive relationship between gestation and birth weight.
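
The slope interval can be roughly reconstructed from the estimate and standard error in the `tidy()` output. This sketch uses the rounded values and a normal quantile as a large-sample approximation; `confint()` uses a t quantile with the residual degrees of freedom, so the bounds agree only to about three decimals. The variable names are chosen here for illustration.

```r
# Rounded slope estimate and standard error from the tidy() output
b1 <- 0.464
se_b1 <- 0.0297

# Large-sample approximation: normal quantile instead of the t quantile
z <- qnorm(0.975)

# Approximate 95% confidence interval for the slope
ci_slope <- c(lower = b1 - z * se_b1, upper = b1 + z * se_b1)
ci_slope
```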