class: title-slide

<br>
<br>

.pull-right[

# Logistic Regression

## Dr. Mine Dogucu

]

---

class: middle

In all of our models so far, we have worked with numeric response variables. We will now consider a categorical response (with two levels) using logistic regression.

---

[Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination](https://www.nber.org/papers/w9873)

- Researchers respond to help-wanted ads in Boston and Chicago newspapers with fictitious resumes.

--

- They randomly assign White sounding names to half the resumes and African American sounding names to the other half.

--

- They create high quality resumes (more experience, likely to have an email address, etc.) and low quality resumes.

- For each job ad they send four resumes (two high quality and two low quality).

---

## Data

```r
resume <- resume %>% 
  select(received_callback, race, years_experience, job_city)
glimpse(resume)
```

```
## Rows: 4,870
## Columns: 4
## $ received_callback <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ race              <chr> "white", "white", "black", "black", "white", "white"…
## $ years_experience  <int> 6, 6, 6, 6, 22, 6, 5, 21, 3, 6, 8, 8, 4, 4, 5, 4, 5,…
## $ job_city          <chr> "Chicago", "Chicago", "Chicago", "Chicago", "Chicago…
```

---

Response variable: `received_callback`

.pull-left[

```r
count(resume, received_callback) %>% 
  mutate(prop = n / sum(n))
```

```
## # A tibble: 2 x 3
##   received_callback     n   prop
##   <lgl>             <int>  <dbl>
## 1 FALSE              4478 0.920 
## 2 TRUE                392 0.0805
```

]

.pull-right[

![](09e-logistic_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

]

---

## Notation

`\(y_i\)` = whether a (fictitious) job candidate receives a callback.

`\(\pi_i\)` = probability that the `\(i\)`th job candidate will receive a callback.

`\(1-\pi_i\)` = probability that the `\(i\)`th job candidate will **not** receive a callback.

---

### Where is the line?

```r
ggplot(resume, aes(x = race, y = received_callback)) +
  geom_point()
```

<img src="09e-logistic_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## The Linear Model

We can model the probability of receiving a callback with a linear model.

`\(\text{transformation}(\pi_i) = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \cdots + \beta_kx_{ki}\)`

--

`\(logit(\pi_i) = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \cdots + \beta_kx_{ki}\)`

--

`\(logit(\pi_i) = log(\frac{\pi_i}{1-\pi_i})\)`

--

Note that log here is the natural log (base `\(e\)`), not base 10. This is also the case for the `log()` function in R.

---

class: middle

**Probability**

`\(\pi_i\)`

Probability of receiving a callback.

--

**Odds**

`\(\frac{\pi_i}{1-\pi_i}\)`

Odds of receiving a callback.

--

**Logit**

`\(log(\frac{\pi_i}{1-\pi_i})\)`

Logit of receiving a callback.

---

class: middle

<img src="09e-logistic_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

### When race is black (0)

```r
resume %>% 
  filter(race == "black") %>% 
  count(received_callback) %>% 
  mutate(prop = n / sum(n))
```

```
## # A tibble: 2 x 3
##   received_callback     n   prop
##   <lgl>             <int>  <dbl>
## 1 FALSE              2278 0.936 
## 2 TRUE                157 0.0645
```

Note that R assigns 0 and 1 to the levels of categorical variables in alphabetical order. In this case, black (0) and white (1).
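A quick way to confirm this ordering (a minimal sketch; `factor()` just makes the default coding explicit):

```r
# "black" comes first alphabetically, so it is the reference level (0);
# "white" is the indicator level (1)
levels(factor(resume$race))

# To choose a different reference level yourself:
# relevel(factor(resume$race), ref = "white")
```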
---

### When race is black (0)

```r
p_b <- resume %>% 
  filter(race == "black") %>% 
  count(received_callback) %>% 
  mutate(prop = n / sum(n)) %>% 
  filter(received_callback == TRUE) %>% 
  select(prop) %>% 
  pull()
```

The probability of receiving a callback when the candidate has a Black sounding name is 0.0644764.

---

### When race is white (1)

```r
p_w <- resume %>% 
  filter(race == "white") %>% 
  count(received_callback) %>% 
  mutate(prop = n / sum(n)) %>% 
  filter(received_callback == TRUE) %>% 
  select(prop) %>% 
  pull()
```

The probability of receiving a callback when the candidate has a White sounding name is 0.0965092.

---

.pull-left[

```r
p_b
```

```
## [1] 0.06447639
```

```r
## Odds
odds_b <- p_b / (1 - p_b)
odds_b
```

```
## [1] 0.06892011
```

```r
## Logit
logit_b <- log(odds_b)
logit_b
```

```
## [1] -2.674807
```

]

--

.pull-right[

```r
p_w
```

```
## [1] 0.09650924
```

```r
## Odds
odds_w <- p_w / (1 - p_w)
odds_w
```

```
## [1] 0.1068182
```

```r
## Logit
logit_w <- log(odds_w)
logit_w
```

```
## [1] -2.236627
```

]

---

.pull-left[

<img src="09e-logistic_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

This is THE LINE of the linear model. As x increases by 1 unit, the expected change in the logit of receiving a callback is 0.4381802. In this case, this is just the difference between the logit for the White group and the logit for the Black group.

]

.pull-right[

The slope of the line is:

```r
logit_w - logit_b
```

```
## [1] 0.4381802
```

The intercept is:

```r
logit_b
```

```
## [1] -2.674807
```

]

---

```r
model_r <- glm(received_callback ~ race,
               data = resume,
               family = binomial)
```

```r
tidy(model_r)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   -2.67     0.0825    -32.4  1.59e-230
## 2 racewhite      0.438    0.107       4.08 4.45e-  5
```

`\(log(\frac{\hat \pi_i}{1-\hat \pi_i}) = -2.67 + 0.438 \times racewhite_i\)`

---

class: middle center

| Scale       | Range                         |
|-------------|-------------------------------|
| Probability | 0 to 1                        |
| Odds        | 0 to `\(\infty\)`             |
| Logit       | `\(-\infty\)` to `\(\infty\)` |

---

We will consider years of experience as an explanatory variable. Normally, we would also include race in the model and have multiple explanatory variables; however, for learning purposes, we will keep the model simple.

```r
model_y <- glm(received_callback ~ years_experience,
               data = resume,
               family = binomial)
```

```r
tidy(model_y)
```

```
## # A tibble: 2 x 5
##   term             estimate std.error statistic   p.value
##   <chr>               <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)       -2.76     0.0962     -28.7  5.58e-181
## 2 years_experience   0.0391   0.00918      4.26 2.07e-  5
```

---

```r
model_y_summary <- tidy(model_y)

intercept <- model_y_summary %>% 
  filter(term == "(Intercept)") %>% 
  select(estimate) %>% 
  pull()

slope <- model_y_summary %>% 
  filter(term == "years_experience") %>% 
  select(estimate) %>% 
  pull()
```

---

### From logit to odds

Logit for a candidate with 1 year of experience (rounded equation):

`\(-2.76 + 0.0391 \times 1\)`

--

Odds for a candidate with 1 year of experience:

`\(odds = e^{logit}\)`

`\(\frac{\pi_i}{1-\pi_i} = e^{log(\frac{\pi_i}{1-\pi_i})}\)`

`\(\frac{\hat\pi_i}{1-\hat\pi_i} = e^{-2.76 + 0.0391 \times 1}\)`

---

### From odds to probability

`\(\pi_i = \frac{odds}{1+odds}\)`

`\(\pi_i = \frac{\frac{\pi_i}{1-\pi_i}}{1+\frac{\pi_i}{1-\pi_i}}\)`

`\(\hat\pi_i = \frac{e^{-2.76 + 0.0391 \times 1}}{1+e^{-2.76 + 0.0391 \times 1}} = 0.0618\)`
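---

### Checking the conversion in R

A minimal sketch, assuming the `model_y`, `intercept`, and `slope` objects from the earlier slides; both approaches should return approximately 0.0618.

```r
# Inverse logit by hand: odds = exp(logit), probability = odds / (1 + odds)
logit_1yr <- intercept + slope * 1
exp(logit_1yr) / (1 + exp(logit_1yr))

# Equivalently, predict() with type = "response" returns the
# estimated probability directly
predict(model_y,
        newdata = data.frame(years_experience = 1),
        type = "response")
```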
Note that you can use the `exp()` function in R to exponentiate the number `\(e\)`:

```r
exp(1)
```

```
## [1] 2.718282
```

---

## Logistic Regression Model

**Logit form:**

`\(log(\frac{\pi_i}{1-\pi_i}) = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \cdots + \beta_kx_{ki}\)`

**Probability form:**

`\(\large{\pi_i = \frac{e^{\beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \cdots + \beta_kx_{ki}}}{1+e^{\beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \cdots + \beta_kx_{ki}}}}\)`

---

.left-panel[

Estimated probability of a candidate with 0 years of experience receiving a callback:

`\(\hat\pi_i = \frac{e^{-2.76 + 0.0391 \times 0}}{1+e^{-2.76 + 0.0391 \times 0}} = 0.0595\)`

]

.right-panel[

Estimated probability of a candidate with 1 year of experience receiving a callback:

`\(\hat\pi_i = \frac{e^{-2.76 + 0.0391 \times 1}}{1+e^{-2.76 + 0.0391 \times 1}} = 0.0618\)`

]

---

```r
model_ryc <- glm(received_callback ~ race + years_experience + job_city,
                 data = resume,
                 family = binomial)
```

---

```r
tidy(model_ryc)
```

```
## # A tibble: 4 x 5
##   term             estimate std.error statistic  p.value
##   <chr>               <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)       -2.78     0.134      -20.8  6.18e-96
## 2 racewhite          0.440    0.108        4.09 4.39e- 5
## 3 years_experience   0.0332   0.00940      3.53 4.11e- 4
## 4 job_cityChicago   -0.329    0.108       -3.04 2.33e- 3
```

---

The estimated probability that a Black candidate with 10 years of experience, applying for a job in Boston, would receive a callback:

`\(\large{\hat\pi_i = \frac{e^{-2.78 + (0.440 \times 0) + (0.0332 \times 10) + (-0.329 \times 0)}}{1+e^{-2.78 + (0.440 \times 0) + (0.0332 \times 10) + (-0.329 \times 0)}} = 0.0796}\)`

A short `predict()` sketch that reproduces this estimate appears on the final slide.

---

We have used the data for educational purposes. The original study considers many other variables that may influence whether someone receives a callback. Read the original study for other considerations: [Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination](https://www.nber.org/papers/w9873).
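---

### Checking the calculation in R

As a final check, a minimal sketch (assuming `model_ryc` as fit earlier; `new_candidate` is a made-up data frame for illustration) that reproduces the 0.0796 estimate.

```r
new_candidate <- data.frame(race = "black",
                            years_experience = 10,
                            job_city = "Boston")

# type = "response" returns the estimated probability of a callback
predict(model_ryc, newdata = new_candidate, type = "response")
```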