class: title-slide <br> <br> .right-panel[ # Subsetting Data ## Dr. Mine Dogucu ] --- ```r glimpse(lapd) ``` ``` ## Rows: 68,564 ## Columns: 35 ## $ row_id <chr> "3-1000027830ctFu", "3-1000155488ctFu", "3-… ## $ year <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2… ## $ department_title <chr> "Police (LAPD)", "Police (LAPD)", "Police (… ## $ payroll_department <dbl> 4301, 4302, 4301, 4301, 4302, 4302, 4301, 4… ## $ record_number <dbl> 1000027830, 1000155488, 1000194958, 1000232… ## $ job_class_title <chr> "Police Detective II", "Clerk Typist", "Pol… ## $ employment_type <chr> "Full Time", "Full Time", "Full Time", "Ful… ## $ hourly_or_event_rate <dbl> 53.16, 23.77, 60.80, 60.98, 45.06, 34.42, 4… ## $ projected_annual_salary <dbl> 110998.08, 49623.67, 126950.40, 127326.24, … ## $ q1_payments <dbl> 24931.20, 11343.96, 24184.00, 29391.20, 208… ## $ q2_payments <dbl> 29181.61, 13212.37, 28327.20, 36591.20, 241… ## $ q3_payments <dbl> 26545.80, 11508.36, 28744.20, 32904.81, 215… ## $ q4_payments <dbl> 29605.30, 13442.53, 33224.88, 37234.03, 252… ## $ payments_over_base_pay <dbl> 4499.12, 1844.82, 13192.43, 18034.53, 1376.… ## $ percent_over_base_pay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ total_payments <dbl> 110263.91, 49507.22, 114480.28, 136121.24, … ## $ base_pay <dbl> 105764.79, 47662.40, 101287.85, 118086.71, … ## $ permanent_bonus_pay <dbl> 3174.12, 0.00, 7363.95, 7086.67, 0.00, 0.00… ## $ longevity_bonus_pay <dbl> 0.00, 1310.82, 0.00, 0.00, 1251.19, 1726.16… ## $ temporary_bonus_pay <dbl> 1325.00, 0.00, 1205.00, 1325.00, 125.00, 68… ## $ lump_sum_pay <dbl> 0.00, 0.00, 2133.18, 0.00, 2068.80, 0.00, 0… ## $ overtime_pay <dbl> 0.00, 0.00, 4424.32, 9839.33, 0.00, 0.00, 4… ## $ other_pay_adjustments <dbl> 0.00, 534.00, -1934.02, -216.47, -2068.80, … ## $ other_pay_payroll_explorer <dbl> 4499.12, 1844.82, 8768.11, 8195.20, 1376.19… ## $ mou <dbl> 24, 3, 24, 24, 12, 3, 24, 24, 24, 24, 24, 1… ## $ mou_title <chr> "POLICE OFFICERS UNIT", "CLERICAL UNIT", "P… ## $ fms_department <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,… ## $ job_class <dbl> 2223, 1358, 2227, 2232, 1839, 2207, 2214, 2… ## $ pay_grade <chr> "2", "0", "1", "1", "0", "2", "3", "1", "B"… ## $ average_health_cost <dbl> 11651.40, 10710.24, 11651.40, 11651.40, 107… ## $ average_dental_cost <dbl> 898.08, 405.24, 898.08, 898.08, 405.24, 405… ## $ average_basic_life <dbl> 191.04, 11.40, 191.04, 191.04, 11.40, 11.40… ## $ average_benefit_cost <dbl> 12740.52, 11126.88, 12740.52, 12740.52, 111… ## $ benefits_plan <chr> "Police", "City", "Police", "Police", "City… ## $ job_class_link <chr> "http://per.lacity.org/perspecs/2223.pdf", … ``` --- ```r colnames(lapd) ``` ``` ## [1] "row_id" "year" ## [3] "department_title" "payroll_department" ## [5] "record_number" "job_class_title" ## [7] "employment_type" "hourly_or_event_rate" ## [9] "projected_annual_salary" "q1_payments" ## [11] "q2_payments" "q3_payments" ## [13] "q4_payments" "payments_over_base_pay" ## [15] "percent_over_base_pay" "total_payments" ## [17] "base_pay" "permanent_bonus_pay" ## [19] "longevity_bonus_pay" "temporary_bonus_pay" ## [21] "lump_sum_pay" "overtime_pay" ## [23] "other_pay_adjustments" "other_pay_payroll_explorer" ## [25] "mou" "mou_title" ## [27] "fms_department" "job_class" ## [29] "pay_grade" "average_health_cost" ## [31] "average_dental_cost" "average_basic_life" ## [33] "average_benefit_cost" "benefits_plan" ## [35] "job_class_link" ``` --- ## subsetting variables/columns <img src="img/data-wrangle.001.jpeg" width="80%" style="display: block; margin: auto;" /> -- `select()` --- ## subsetting observations/rows <img src="img/data-wrangle.002.jpeg" width="80%" style="display: block; margin: auto;" /> `slice()` and `filter()` --- `select` is used to select certain variables in the data frame. .left-panel[ ```r select(lapd, year, base_pay) ``` ``` ## # A tibble: 68,564 x 2 ## year base_pay ## <dbl> <dbl> ## 1 2013 105765. ## 2 2013 47662. ## 3 2013 101288. ## 4 2013 118087. ## 5 2013 90322. ## 6 2013 62770. ## # … with 68,558 more rows ``` ] -- .right-panel[ ```r lapd %>% select(year, base_pay) ``` ``` ## # A tibble: 68,564 x 2 ## year base_pay ## <dbl> <dbl> ## 1 2013 105765. ## 2 2013 47662. ## 3 2013 101288. ## 4 2013 118087. ## 5 2013 90322. ## 6 2013 62770. ## # … with 68,558 more rows ``` ] --- `select` can also be used to drop certain variables if used with a negative sign. ```r select(lapd, -row_id, -department_title) ``` ``` ## # A tibble: 68,564 x 33 ## year payroll_department record_number job_class_title employment_type ## <dbl> <dbl> <dbl> <chr> <chr> ## 1 2013 4301 1000027830 Police Detective II Full Time ## 2 2013 4302 1000155488 Clerk Typist Full Time ## 3 2013 4301 1000194958 Police Sergeant I Full Time ## 4 2013 4301 1000232317 Police Lieutenant I Full Time ## 5 2013 4302 1000329284 Principal Storekeeper Full Time ## 6 2013 4302 1001124320 Police Service Represe… Full Time ## # … with 68,558 more rows, and 28 more variables: hourly_or_event_rate <dbl>, ## # projected_annual_salary <dbl>, q1_payments <dbl>, q2_payments <dbl>, ## # q3_payments <dbl>, q4_payments <dbl>, payments_over_base_pay <dbl>, ## # percent_over_base_pay <dbl>, total_payments <dbl>, base_pay <dbl>, ## # permanent_bonus_pay <dbl>, longevity_bonus_pay <dbl>, ## # temporary_bonus_pay <dbl>, lump_sum_pay <dbl>, overtime_pay <dbl>, ## # other_pay_adjustments <dbl>, other_pay_payroll_explorer <dbl>, mou <dbl>, ## # mou_title <chr>, fms_department <dbl>, job_class <dbl>, pay_grade <chr>, ## # average_health_cost <dbl>, average_dental_cost <dbl>, ## # average_basic_life <dbl>, average_benefit_cost <dbl>, benefits_plan <chr>, ## # job_class_link <chr> ``` --- ## Selection helpers `starts_with()` `ends_with()` `contains()` --- ```r select(lapd, starts_with("q")) ``` ``` ## # A tibble: 68,564 x 4 ## q1_payments q2_payments q3_payments q4_payments ## <dbl> <dbl> <dbl> <dbl> ## 1 24931. 29182. 26546. 29605. ## 2 11344. 13212. 11508. 13443. ## 3 24184 28327. 28744. 33225. ## 4 29391. 36591. 32905. 37234. ## 5 20813 24136 21518. 25231. ## 6 16057. 17927. 14150. 17052. ## # … with 68,558 more rows ``` --- ```r select(lapd, ends_with("pay")) ``` ``` ## # A tibble: 68,564 x 8 ## payments_over_ba… percent_over_bas… base_pay permanent_bonus… longevity_bonus… ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 4499. 0 105765. 3174. 0 ## 2 1845. 0 47662. 0 1311. ## 3 13192. 0 101288. 7364. 0 ## 4 18035. 0 118087. 7087. 0 ## 5 1376. 0 90322. 0 1251. ## 6 2415. 0 62770. 0 1726. ## # … with 68,558 more rows, and 3 more variables: temporary_bonus_pay <dbl>, ## # lump_sum_pay <dbl>, overtime_pay <dbl> ``` --- ```r select(lapd, contains("pay")) ``` ``` ## # A tibble: 68,564 x 17 ## payroll_department q1_payments q2_payments q3_payments q4_payments ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 4301 24931. 29182. 26546. 29605. ## 2 4302 11344. 13212. 11508. 13443. ## 3 4301 24184 28327. 28744. 33225. ## 4 4301 29391. 36591. 32905. 37234. ## 5 4302 20813 24136 21518. 25231. ## 6 4302 16057. 17927. 14150. 17052. ## # … with 68,558 more rows, and 12 more variables: payments_over_base_pay <dbl>, ## # percent_over_base_pay <dbl>, total_payments <dbl>, base_pay <dbl>, ## # permanent_bonus_pay <dbl>, longevity_bonus_pay <dbl>, ## # temporary_bonus_pay <dbl>, lump_sum_pay <dbl>, overtime_pay <dbl>, ## # other_pay_adjustments <dbl>, other_pay_payroll_explorer <dbl>, ## # pay_grade <chr> ``` --- ## subsetting variables/columns <img src="img/data-wrangle.001.jpeg" width="80%" style="display: block; margin: auto;" /> -- `select()` --- ## subsetting observations/rows <img src="img/data-wrangle.002.jpeg" width="80%" style="display: block; margin: auto;" /> `slice()` and `filter()` --- .pull-left[ `slice()` subsetting rows based on a row number. The data below include all the rows from third to seventh. Including third and seventh. ```r slice(lapd, 3:7) ``` ``` ## # A tibble: 5 x 35 ## row_id year department_title payroll_departm… record_number job_class_title ## <chr> <dbl> <chr> <dbl> <dbl> <chr> ## 1 3-1000… 2013 Police (LAPD) 4301 1000194958 Police Sergeant… ## 2 3-1000… 2013 Police (LAPD) 4301 1000232317 Police Lieutena… ## 3 3-1000… 2013 Police (LAPD) 4302 1000329284 Principal Store… ## 4 3-1001… 2013 Police (LAPD) 4302 1001124320 Police Service … ## 5 3-1001… 2013 Police (LAPD) 4301 1001221822 Police Officer … ## # … with 29 more variables: employment_type <chr>, hourly_or_event_rate <dbl>, ## # projected_annual_salary <dbl>, q1_payments <dbl>, q2_payments <dbl>, ## # q3_payments <dbl>, q4_payments <dbl>, payments_over_base_pay <dbl>, ## # percent_over_base_pay <dbl>, total_payments <dbl>, base_pay <dbl>, ## # permanent_bonus_pay <dbl>, longevity_bonus_pay <dbl>, ## # temporary_bonus_pay <dbl>, lump_sum_pay <dbl>, overtime_pay <dbl>, ## # other_pay_adjustments <dbl>, other_pay_payroll_explorer <dbl>, mou <dbl>, ## # mou_title <chr>, fms_department <dbl>, job_class <dbl>, pay_grade <chr>, ## # average_health_cost <dbl>, average_dental_cost <dbl>, ## # average_basic_life <dbl>, average_benefit_cost <dbl>, benefits_plan <chr>, ## # job_class_link <chr> ``` ] -- .pull-right[ `filter()` subsetting rows based on a condition. The data below includes rows when the recorded year is 2018. ```r filter(lapd, year == 2018) ``` ``` ## # A tibble: 14,824 x 35 ## row_id year department_title payroll_departm… record_number job_class_title ## <chr> <dbl> <chr> <dbl> <dbl> <chr> ## 1 8-1000… 2018 Police (LAPD) 4301 1000027830 Police Detectiv… ## 2 8-1000… 2018 Police (LAPD) 4301 1000194958 Police Sergeant… ## 3 8-1000… 2018 Police (LAPD) 4301 1000232317 Police Lieutena… ## 4 8-1001… 2018 Police (LAPD) 4302 1001124320 Police Service … ## 5 8-1001… 2018 Police (LAPD) 4301 1001221822 Police Officer … ## 6 8-1001… 2018 Police (LAPD) 4301 1001317832 Police Officer … ## # … with 14,818 more rows, and 29 more variables: employment_type <chr>, ## # hourly_or_event_rate <dbl>, projected_annual_salary <dbl>, ## # q1_payments <dbl>, q2_payments <dbl>, q3_payments <dbl>, q4_payments <dbl>, ## # payments_over_base_pay <dbl>, percent_over_base_pay <dbl>, ## # total_payments <dbl>, base_pay <dbl>, permanent_bonus_pay <dbl>, ## # longevity_bonus_pay <dbl>, temporary_bonus_pay <dbl>, lump_sum_pay <dbl>, ## # overtime_pay <dbl>, other_pay_adjustments <dbl>, ## # other_pay_payroll_explorer <dbl>, mou <dbl>, mou_title <chr>, ## # fms_department <dbl>, job_class <dbl>, pay_grade <chr>, ## # average_health_cost <dbl>, average_dental_cost <dbl>, ## # average_basic_life <dbl>, average_benefit_cost <dbl>, benefits_plan <chr>, ## # job_class_link <chr> ``` ] --- .pull-left[ ### Relational Operators in R | Operator | Description | |----------|--------------------------| | < | Less than | | > | Greater than | | <= | Less than or equal to | | >= | Greater than or equal to | | == | Equal to | | != | Not equal to | ] .pull-right[ ### Logical Operators in R | Operator | Description | |----------|-------------| | & | and | | | | or | ] --- class: middle Q. According to [datausa.io](https://datausa.io/profile/geo/los-angeles-ca#:~:text=In%202018%2C%20Los%20Angeles%2C%20CA,%2462%2C474%2C%20a%203.78%25%20increase.) Los Angeles had a median household income of $62474 in 2018. How many LAPD staff members had a base pay higher than $62474 in year 2018 according to this data? .footnote[Median household income is **not** the same thing as median employee income. Our aim is data wrangling and not necessarily statistical analysis for now.] --- ```r lapd %>% filter(year == 2018 & base_pay > 62474) ``` ``` ## # A tibble: 11,690 x 35 ## row_id year department_title payroll_departm… record_number job_class_title ## <chr> <dbl> <chr> <dbl> <dbl> <chr> ## 1 8-1000… 2018 Police (LAPD) 4301 1000027830 Police Detectiv… ## 2 8-1000… 2018 Police (LAPD) 4301 1000194958 Police Sergeant… ## 3 8-1000… 2018 Police (LAPD) 4301 1000232317 Police Lieutena… ## 4 8-1001… 2018 Police (LAPD) 4302 1001124320 Police Service … ## 5 8-1001… 2018 Police (LAPD) 4301 1001221822 Police Officer … ## 6 8-1001… 2018 Police (LAPD) 4301 1001317832 Police Officer … ## # … with 11,684 more rows, and 29 more variables: employment_type <chr>, ## # hourly_or_event_rate <dbl>, projected_annual_salary <dbl>, ## # q1_payments <dbl>, q2_payments <dbl>, q3_payments <dbl>, q4_payments <dbl>, ## # payments_over_base_pay <dbl>, percent_over_base_pay <dbl>, ## # total_payments <dbl>, base_pay <dbl>, permanent_bonus_pay <dbl>, ## # longevity_bonus_pay <dbl>, temporary_bonus_pay <dbl>, lump_sum_pay <dbl>, ## # overtime_pay <dbl>, other_pay_adjustments <dbl>, ## # other_pay_payroll_explorer <dbl>, mou <dbl>, mou_title <chr>, ## # fms_department <dbl>, job_class <dbl>, pay_grade <chr>, ## # average_health_cost <dbl>, average_dental_cost <dbl>, ## # average_basic_life <dbl>, average_benefit_cost <dbl>, benefits_plan <chr>, ## # job_class_link <chr> ``` --- ```r lapd %>% filter(year == 2018 & base_pay > 62474) %>% nrow() ``` ``` ## [1] 11690 ``` --- Q. How many observations are available between 2013 and 2015 including 2013 and 2015? -- ```r lapd %>% filter(year >= 2013 & year <= 2015) ``` ``` ## # A tibble: 40,227 x 35 ## row_id year department_title payroll_departm… record_number job_class_title ## <chr> <dbl> <chr> <dbl> <dbl> <chr> ## 1 3-1000… 2013 Police (LAPD) 4301 1000027830 Police Detectiv… ## 2 3-1000… 2013 Police (LAPD) 4302 1000155488 Clerk Typist ## 3 3-1000… 2013 Police (LAPD) 4301 1000194958 Police Sergeant… ## 4 3-1000… 2013 Police (LAPD) 4301 1000232317 Police Lieutena… ## 5 3-1000… 2013 Police (LAPD) 4302 1000329284 Principal Store… ## 6 3-1001… 2013 Police (LAPD) 4302 1001124320 Police Service … ## # … with 40,221 more rows, and 29 more variables: employment_type <chr>, ## # hourly_or_event_rate <dbl>, projected_annual_salary <dbl>, ## # q1_payments <dbl>, q2_payments <dbl>, q3_payments <dbl>, q4_payments <dbl>, ## # payments_over_base_pay <dbl>, percent_over_base_pay <dbl>, ## # total_payments <dbl>, base_pay <dbl>, permanent_bonus_pay <dbl>, ## # longevity_bonus_pay <dbl>, temporary_bonus_pay <dbl>, lump_sum_pay <dbl>, ## # overtime_pay <dbl>, other_pay_adjustments <dbl>, ## # other_pay_payroll_explorer <dbl>, mou <dbl>, mou_title <chr>, ## # fms_department <dbl>, job_class <dbl>, pay_grade <chr>, ## # average_health_cost <dbl>, average_dental_cost <dbl>, ## # average_basic_life <dbl>, average_benefit_cost <dbl>, benefits_plan <chr>, ## # job_class_link <chr> ``` --- Q. How many observations are available between 2013 and 2015 including 2013 and 2015? ```r lapd %>% filter(year >= 2013 & year <= 2015) %>% nrow() ``` ``` ## [1] 40227 ``` --- Q. How many LAPD staff were employed full time in 2018? ```r lapd %>% filter(employment_type == "Full Time" & year == 2018) %>% nrow() ``` ``` ## [1] 14664 ``` --- We have done all sorts of selections, slicing, filtering on `lapd` but it has not changed at all. Why do you think so? ```r glimpse(lapd) ``` ``` ## Rows: 68,564 ## Columns: 35 ## $ row_id <chr> "3-1000027830ctFu", "3-1000155488ctFu", "3-… ## $ year <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2… ## $ department_title <chr> "Police (LAPD)", "Police (LAPD)", "Police (… ## $ payroll_department <dbl> 4301, 4302, 4301, 4301, 4302, 4302, 4301, 4… ## $ record_number <dbl> 1000027830, 1000155488, 1000194958, 1000232… ## $ job_class_title <chr> "Police Detective II", "Clerk Typist", "Pol… ## $ employment_type <chr> "Full Time", "Full Time", "Full Time", "Ful… ## $ hourly_or_event_rate <dbl> 53.16, 23.77, 60.80, 60.98, 45.06, 34.42, 4… ## $ projected_annual_salary <dbl> 110998.08, 49623.67, 126950.40, 127326.24, … ## $ q1_payments <dbl> 24931.20, 11343.96, 24184.00, 29391.20, 208… ## $ q2_payments <dbl> 29181.61, 13212.37, 28327.20, 36591.20, 241… ## $ q3_payments <dbl> 26545.80, 11508.36, 28744.20, 32904.81, 215… ## $ q4_payments <dbl> 29605.30, 13442.53, 33224.88, 37234.03, 252… ## $ payments_over_base_pay <dbl> 4499.12, 1844.82, 13192.43, 18034.53, 1376.… ## $ percent_over_base_pay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ total_payments <dbl> 110263.91, 49507.22, 114480.28, 136121.24, … ## $ base_pay <dbl> 105764.79, 47662.40, 101287.85, 118086.71, … ## $ permanent_bonus_pay <dbl> 3174.12, 0.00, 7363.95, 7086.67, 0.00, 0.00… ## $ longevity_bonus_pay <dbl> 0.00, 1310.82, 0.00, 0.00, 1251.19, 1726.16… ## $ temporary_bonus_pay <dbl> 1325.00, 0.00, 1205.00, 1325.00, 125.00, 68… ## $ lump_sum_pay <dbl> 0.00, 0.00, 2133.18, 0.00, 2068.80, 0.00, 0… ## $ overtime_pay <dbl> 0.00, 0.00, 4424.32, 9839.33, 0.00, 0.00, 4… ## $ other_pay_adjustments <dbl> 0.00, 534.00, -1934.02, -216.47, -2068.80, … ## $ other_pay_payroll_explorer <dbl> 4499.12, 1844.82, 8768.11, 8195.20, 1376.19… ## $ mou <dbl> 24, 3, 24, 24, 12, 3, 24, 24, 24, 24, 24, 1… ## $ mou_title <chr> "POLICE OFFICERS UNIT", "CLERICAL UNIT", "P… ## $ fms_department <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,… ## $ job_class <dbl> 2223, 1358, 2227, 2232, 1839, 2207, 2214, 2… ## $ pay_grade <chr> "2", "0", "1", "1", "0", "2", "3", "1", "B"… ## $ average_health_cost <dbl> 11651.40, 10710.24, 11651.40, 11651.40, 107… ## $ average_dental_cost <dbl> 898.08, 405.24, 898.08, 898.08, 405.24, 405… ## $ average_basic_life <dbl> 191.04, 11.40, 191.04, 191.04, 11.40, 11.40… ## $ average_benefit_cost <dbl> 12740.52, 11126.88, 12740.52, 12740.52, 111… ## $ benefits_plan <chr> "Police", "City", "Police", "Police", "City… ## $ job_class_link <chr> "http://per.lacity.org/perspecs/2223.pdf", … ``` --- Moving forward we are only going to focus on year 2018, and use `job_class_title`, `employment_type`, and `base_pay`. Let's clean our data accordingly and move on with the smaller `lapd` data that we need. --- ```r lapd %>% filter(year == 2018) %>% select(job_class_title, employment_type, base_pay) ``` ``` ## # A tibble: 14,824 x 3 ## job_class_title employment_type base_pay ## <chr> <chr> <dbl> ## 1 Police Detective II Full Time 119322. ## 2 Police Sergeant I Full Time 113271. ## 3 Police Lieutenant II Full Time 148116 ## 4 Police Service Representative II Full Time 78677. ## 5 Police Officer III Full Time 109374. ## 6 Police Officer II Full Time 95002. ## # … with 14,818 more rows ``` --- ```r lapd <- lapd %>% filter(year == 2018) %>% select(job_class_title, employment_type, base_pay) ```