Aggregating Data

Dr. Mine Dogucu


Data: observations

Aggregate data: summaries of observations


Aggregating Categorical Data


Categorical data are summarized with counts or proportions

lapd %>%
  count(employment_type)
## # A tibble: 3 x 2
##   employment_type     n
##   <fct>           <int>
## 1 Full Time       14664
## 2 Part Time         132
## 3 Per Event          28

lapd %>%
  count(employment_type) %>%
  mutate(prop = n / sum(n))
## # A tibble: 3 x 3
##   employment_type     n    prop
##   <fct>           <int>   <dbl>
## 1 Full Time       14664 0.989
## 2 Part Time         132 0.00890
## 3 Per Event          28 0.00189

Aggregating Numerical Data


Review

mean: the average (the sum of the values divided by the number of values)

median: the middle value when the data are ordered

mode: the value that appears most often (for continuous variables, a local maximum of the distribution)
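
As a quick check of these three definitions, here is a minimal sketch on a small made-up vector. The vector `pay` and the helper `stat_mode()` are illustrative assumptions, not part of the lapd data. Base R's own `mode()` reports how an object is stored rather than the statistical mode, so the helper tabulates the values and returns the most frequent one(s).

```r
# Toy data (made up for illustration; not from lapd)
pay <- c(2, 3, 3, 5, 7, 10)

# Hypothetical helper: tabulate the values and keep the most frequent one(s)
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

pay_mean   <- mean(pay)       # 30 / 6 = 5
pay_median <- median(pay)     # middle of 3 and 5 = 4
pay_mode   <- stat_mode(pay)  # 3 appears twice, more than any other value
```

If several values tie for the highest count, `stat_mode()` returns all of them.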


Mean

summarize(lapd,
          mean(base_pay))
## # A tibble: 1 x 1
##   `mean(base_pay)`
##              <dbl>
## 1           85149.

mean(lapd$base_pay)
## [1] 85149.05

The mean is not a good measure of center when the data are skewed, because extreme values pull it toward the tail.
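
To see why, consider a minimal sketch with made-up numbers (the vector `salaries` is an assumption for illustration, not lapd data): five typical values and one very large one.

```r
# Toy right-skewed data (made up): five typical salaries and one very large one
salaries <- c(40, 45, 50, 55, 60, 1000)

salaries_mean   <- mean(salaries)    # 1250 / 6, about 208.3: pulled up by the outlier
salaries_median <- median(salaries)  # (50 + 55) / 2 = 52.5: close to the typical values
```

In the lapd data the pull goes the other way: the mean base pay (85149) is below the median (97601), which is consistent with the many zero and low values dragging the mean down.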


Median

summarize(lapd,
          median(base_pay))
## # A tibble: 1 x 1
##   `median(base_pay)`
##                <dbl>
## 1             97601.

median(lapd$base_pay)
## [1] 97600.66

Mode (?)

count(lapd, base_pay, sort = TRUE)
## # A tibble: 8,441 x 2
##   base_pay     n
##      <dbl> <int>
## 1       0    985
## 2  119322.   277
## 3  109378.   204
## 4  112976    167
## 5   98615.   139
## 6   95971.   127
## # … with 8,435 more rows

For continuous variables, the mode is a local maximum of the distribution. Counting exact values, as above, is not exactly how the mode of base_pay would be computed, but it lets us practice sorting.
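
One way to look for local maxima of a continuous variable is to estimate its density and find where the curve turns from rising to falling. The sketch below is only an illustration under stated assumptions: the data are simulated (not lapd), the default bandwidth of `density()` is used, and the peak locations depend on that bandwidth choice.

```r
set.seed(1)
# Simulated bimodal data (made up): two clusters centered near 0 and 10
x <- c(rnorm(500, mean = 0), rnorm(500, mean = 10))

# Kernel density estimate with the default bandwidth
d <- density(x)

# A grid point is a local maximum when the slope changes from positive to negative
is_peak <- diff(sign(diff(d$y))) == -2
peaks <- d$x[which(is_peak) + 1]
```

Here `peaks` holds the grid locations of the local maxima, which should fall close to the two cluster centers.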


Quantiles

summarize(lapd, quantile(base_pay, c(0.25, 0.50, 0.75)))
## # A tibble: 3 x 1
##   `quantile(base_pay, c(0.25, 0.5, 0.75))`
##                                      <dbl>
## 1                                   67266.
## 2                                   97601.
## 3                                  109368.

We would expect 25% of the observations to have a base pay below 67265.5475, the first quartile.
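
This reading of a quantile can be checked directly on a small example. The sketch below uses a made-up vector (the integers 1 to 100, not lapd data) and R's default quantile method.

```r
# Toy data (made up): the integers 1 to 100
x <- 1:100

q1 <- quantile(x, 0.25)      # first quartile with R's default method (type 7): 25.75
share_below <- mean(x < q1)  # proportion of values strictly below q1: 0.25
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, which is why `mean(x < q1)` returns the share of observations below the quartile.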

summarize(lapd,
          mean(base_pay),
          median(base_pay))
## # A tibble: 1 x 2
##   `mean(base_pay)` `median(base_pay)`
##              <dbl>              <dbl>
## 1           85149.             97601.

Note that the variable names in this table are not easy to read.

summarize(lapd,
          mean_base_pay = mean(base_pay),
          med_base_pay = median(base_pay))
## # A tibble: 1 x 2
##   mean_base_pay med_base_pay
##           <dbl>        <dbl>
## 1        85149.       97601.

Aggregating Data by Groups


group_by()


Q. What is the median salary for each employment type?

lapd %>%
  group_by(employment_type)
## # A tibble: 14,824 x 4
## # Groups:   employment_type [3]
##   job_class_title                  employment_type base_pay base_pay_level
##   <fct>                            <fct>              <dbl> <chr>
## 1 Police Detective II              Full Time        119322. Greater than Median
## 2 Police Sergeant I                Full Time        113271. Greater than Median
## 3 Police Lieutenant II             Full Time        148116  Greater than Median
## 4 Police Service Representative II Full Time         78677. Greater than Median
## 5 Police Officer III               Full Time        109374. Greater than Median
## 6 Police Officer II                Full Time         95002. Greater than Median
## # … with 14,818 more rows

lapd %>%
  group_by(employment_type) %>%
  summarize(med_base_pay = median(base_pay))
## # A tibble: 3 x 2
##   employment_type med_base_pay
##   <fct>                  <dbl>
## 1 Full Time             97996.
## 2 Part Time             14474.
## 3 Per Event              4275

We can also remind ourselves how many staff members there were in each group.

lapd %>%
  group_by(employment_type) %>%
  summarize(med_base_pay = median(base_pay),
            n = n())
## # A tibble: 3 x 3
##   employment_type med_base_pay     n
##   <fct>                  <dbl> <int>
## 1 Full Time             97996. 14664
## 2 Part Time             14474.   132
## 3 Per Event              4275     28
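
Just as `mean(lapd$base_pay)` gave a base R counterpart to `summarize()`, a grouped summary can be cross-checked in base R. The sketch below is a minimal illustration on a made-up data frame: `toy` reuses the column names from the slides, but its values are assumptions and do not come from lapd.

```r
# Toy data (made up; column names mirror the slides, values are not from lapd)
toy <- data.frame(
  employment_type = c("Full Time", "Full Time", "Full Time", "Part Time", "Part Time"),
  base_pay        = c(90, 100, 110, 10, 20)
)

# Median of base_pay within each employment type
med_by_type <- tapply(toy$base_pay, toy$employment_type, median)

# Number of rows in each employment type
n_by_type <- table(toy$employment_type)
```

Here `med_by_type` is 100 for Full Time and 15 for Part Time, and `n_by_type` counts 3 and 2 rows. The dplyr version keeps both summaries side by side in one table, which is easier to read when there are several statistics per group.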