class: title-slide

<br>
<br>
.right-panel[

# Aggregating Data

## Dr. Mine Dogucu

]

---

class: middle

.pull-left[

## Data

Observations

]

.pull-left[

## Aggregate Data

Summaries of observations

]

---

class: inverse middle

.font75[Aggregating Categorical Data]

---

class: middle

Categorical data are summarized with **counts** or **proportions**

---

class: middle

```r
lapd %>% 
  count(employment_type)
```

```
## # A tibble: 3 x 2
##   employment_type     n
##   <fct>           <int>
## 1 Full Time       14664
## 2 Part Time         132
## 3 Per Event          28
```

---

```r
lapd %>% 
  count(employment_type) %>% 
  mutate(prop = n/sum(n))
```

```
## # A tibble: 3 x 3
##   employment_type     n    prop
##   <fct>           <int>   <dbl>
## 1 Full Time       14664 0.989  
## 2 Part Time         132 0.00890
## 3 Per Event          28 0.00189
```

---

class: inverse middle

.font75[Aggregating Numerical Data]

---

class: middle

## Review

**mean** average

**median** the middle value when the data are ordered

**mode** value that appears the most often (local maxima for continuous variables)

---

## Mean

.pull-left[

```r
summarize(lapd, mean(base_pay))
```

```
## # A tibble: 1 x 1
##   `mean(base_pay)`
##              <dbl>
## 1           85149.
```

]

--

.pull-right[

```r
mean(lapd$base_pay)
```

```
## [1] 85149.05
```

]

---

### Mean is not a good measure when the data are skewed

<img src="03d-aggregate-data_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

## Median

.pull-left[

```r
summarize(lapd, median(base_pay))
```

```
## # A tibble: 1 x 1
##   `median(base_pay)`
##                <dbl>
## 1             97601.
```

]

--

.pull-right[

```r
median(lapd$base_pay)
```

```
## [1] 97600.66
```

]

---

## Mode (?)

```r
count(lapd, base_pay, sort = TRUE)
```

```
## # A tibble: 8,441 x 2
##   base_pay     n
##      <dbl> <int>
## 1       0    985
## 2  119322.   277
## 3  109378.   204
## 4  112976    167
## 5   98615.   139
## 6   95971.   127
## # … with 8,435 more rows
```

.font15[For continuous variables, a mode is a local maximum of the distribution. This is not exactly the mode calculation for base_pay, but we are learning sorting here.]
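---

## Mode of a continuous variable

The note on the previous slide describes the mode of a continuous variable as a local maximum. Here is a minimal sketch of that idea. It assumes a kernel density estimate from base R's `density()` is an acceptable way to locate the peak. The `pay` vector is simulated and hypothetical, not the `lapd` data.

```r
set.seed(1)

# Hypothetical pay values (NOT the lapd data): a large group near 100,000
# and a smaller group near 20,000
pay <- c(rnorm(500, mean = 100000, sd = 8000),
         rnorm(100, mean = 20000, sd = 5000))

# Estimate the density with the default kernel and bandwidth
dens <- density(pay)

# The estimated mode is the x value where the density is highest
mode_est <- dens$x[which.max(dens$y)]
mode_est
```

The estimate depends on the bandwidth that `density()` chooses, so treat it as approximate.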
---

## Quantiles

```r
summarize(lapd, quantile(base_pay, c(0.25, 0.50, 0.75)))
```

```
## # A tibble: 3 x 1
##   `quantile(base_pay, c(0.25, 0.5, 0.75))`
##                                      <dbl>
## 1                                   67266.
## 2                                   97601.
## 3                                  109368.
```

We would expect 25% of the data to be less than 67265.5475.

---

```r
summarize(lapd, 
          mean(base_pay),
          median(base_pay))
```

```
## # A tibble: 1 x 2
##   `mean(base_pay)` `median(base_pay)`
##              <dbl>              <dbl>
## 1           85149.             97601.
```

Note how the variable names in this table are not easy to read.

---

```r
summarize(lapd, 
          mean_base_pay = mean(base_pay),
          med_base_pay = median(base_pay))
```

```
## # A tibble: 1 x 2
##   mean_base_pay med_base_pay
##           <dbl>        <dbl>
## 1        85149.       97601.
```

---

class: inverse middle

.font75[Aggregating Data by Groups]

---

`group_by()`

<img src="img/data-wrangle.003.jpeg" width="80%" style="display: block; margin: auto;" />

---

Q. What is the median salary for each employment type?

---

```r
lapd %>% 
  group_by(employment_type)
```

```
## # A tibble: 14,824 x 4
## # Groups:   employment_type [3]
##   job_class_title                  employment_type base_pay base_pay_level     
##   <fct>                            <fct>              <dbl> <chr>              
## 1 Police Detective II              Full Time        119322. Greater than Median
## 2 Police Sergeant I                Full Time        113271. Greater than Median
## 3 Police Lieutenant II             Full Time        148116  Greater than Median
## 4 Police Service Representative II Full Time         78677. Greater than Median
## 5 Police Officer III               Full Time        109374. Greater than Median
## 6 Police Officer II                Full Time         95002. Greater than Median
## # … with 14,818 more rows
```

---

```r
lapd %>% 
  group_by(employment_type) %>% 
  summarize(med_base_pay = median(base_pay))
```

```
## # A tibble: 3 x 2
##   employment_type med_base_pay
##   <fct>                  <dbl>
## 1 Full Time             97996.
## 2 Part Time             14474.
## 3 Per Event              4275 
```

---

We can also remind ourselves how many staff members there were in each group.

```r
lapd %>% 
  group_by(employment_type) %>% 
  summarize(med_base_pay = median(base_pay),
            n = n())
```

```
## # A tibble: 3 x 3
##   employment_type med_base_pay     n
##   <fct>                  <dbl> <int>
## 1 Full Time             97996. 14664
## 2 Part Time             14474.   132
## 3 Per Event              4275     28
```
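
---

## Try it on toy data

The `group_by()` and `summarize()` pattern can be tried without the `lapd` data. This is a minimal sketch on a small made-up data frame. The name `toy_pay` and all of its values are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for lapd; the values are made up
toy_pay <- data.frame(
  employment_type = c("Full Time", "Full Time", "Full Time",
                      "Part Time", "Part Time"),
  base_pay = c(90000, 100000, 110000, 10000, 20000)
)

# One row per employment type, with named summary columns
toy_summary <- toy_pay %>% 
  group_by(employment_type) %>% 
  summarize(med_base_pay = median(base_pay),
            mean_base_pay = mean(base_pay),
            n = n())

toy_summary
```

Naming each summary inside `summarize()` keeps the column names readable, and `n()` reports how many rows fall in each group.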