Iteration

Dr. Mine Dogucu


Movie Specific Information

library(rvest)      # read_html(), html_nodes(), html_text(), html_attr()
library(tidyverse)  # stringr, purrr, and tibble are used later

read_html("https://www.imdb.com/title/tt1375666/?ref_=adv_li_tt") %>%
  html_nodes("time") %>%
  html_text()
## [1] "\n 2h 28min\n "
## [2] "148 min"

time <- read_html("https://www.imdb.com/title/tt1375666/?ref_=adv_li_tt") %>%
  html_nodes("time") %>%
  html_text()
time[2]
## [1] "148 min"

time <- read_html("https://www.imdb.com/title/tt1375666/?ref_=adv_li_tt") %>%
  html_nodes("time") %>%
  html_text()
time[2] %>%
  str_remove(" min") %>%
  as.numeric()
## [1] 148

scrape_time <- function(url) {
  time <- read_html(url) %>%
    html_nodes("time") %>%
    html_text()
  time[2] %>%
    str_remove(" min") %>%
    as.numeric()
}

scrape_time("https://www.imdb.com/title/tt1375666/?ref_=adv_li_tt")
## [1] 148
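
The last step of scrape_time() turns a string such as "148 min" into a number. The sketch below isolates that step on hard-coded strings, with no web request, assuming the stringr package is installed. parse_minutes is a hypothetical helper name and is not part of the slides. It also illustrates that a missing value (which is what time[2] would be on a page with fewer than two time nodes) passes through as NA instead of raising an error.

```r
library(stringr)

# Hypothetical helper: strip the " min" suffix, then convert to a number
parse_minutes <- function(text) {
  as.numeric(str_remove(text, " min"))
}

# Two well-formed strings and one missing value
runtimes <- parse_minutes(c("148 min", "103 min", NA))
runtimes
```

Checking the result for NA values before analysis is one way to spot pages where the expected node was not found.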

Goal

What if we wanted to use this function on the 50 most popular movies of 2010?


Scraping First Movie

scrape_time("https://www.imdb.com/title/tt1375666/?ref_=adv_li_tt")
## [1] 148

Scraping Second Movie

scrape_time("https://www.imdb.com/title/tt1591095/?ref_=adv_li_tt")
## [1] 103

Scraping Third Movie...

scrape_time("https://www.imdb.com/title/tt1285016/?ref_=adv_li_tt")
## [1] 120

In the previous lecture

scrape_movie_title <- function(year) {
  page <- paste0("https://www.imdb.com/search/title/?title_type=feature&year=", year, "-01-01,", year, "-12-31&sort=user_rating,desc")
  read_html(page) %>%
    html_nodes(".lister-item-header a") %>%
    html_text()
}

In this lecture

scrape_movie_urls <- function(year) {
  page <- paste0("https://www.imdb.com/search/title/?title_type=feature&year=", year, "-01-01,", year, "-12-31")
  read_html(page) %>%
    html_nodes(".lister-item-header a") %>%
    html_attr("href")
}

html_attr() extracts the value of an attribute (in this case href) from each selected node, whereas html_text() extracts the text inside the node.
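
To see the difference between the text of a link and its href attribute without making a web request, here is a minimal sketch on a hand-written HTML fragment. The fragment is an assumption that only imitates one search result; the real page markup may differ.

```r
library(rvest)

# Hand-written fragment imitating a single search result (assumed markup)
html <- '<h3 class="lister-item-header"><a href="/title/tt1375666/?ref_=adv_li_tt">Inception</a></h3>'

link <- read_html(html) %>%
  html_nodes(".lister-item-header a")

title <- html_text(link)            # text between <a> and </a>
href  <- html_attr(link, "href")    # value of the href attribute

# The href is relative, so the site address has to be prepended
full_url <- paste0("https://www.imdb.com", href)
full_url
```

The relative href is why scrape_time() later needs paste0("https://www.imdb.com", movie_url) before it can read the page.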

scrape_movie_urls(2010)
## [1] "/title/tt1375666/?ref_=adv_li_tt" "/title/tt1591095/?ref_=adv_li_tt"
## [3] "/title/tt1285016/?ref_=adv_li_tt" "/title/tt1179056/?ref_=adv_li_tt"
## [5] "/title/tt0446029/?ref_=adv_li_tt" "/title/tt1130884/?ref_=adv_li_tt"
## [7] "/title/tt1375670/?ref_=adv_li_tt" "/title/tt0926084/?ref_=adv_li_tt"
## [9] "/title/tt1250777/?ref_=adv_li_tt" "/title/tt0947798/?ref_=adv_li_tt"
## [11] "/title/tt1458175/?ref_=adv_li_tt" "/title/tt1245526/?ref_=adv_li_tt"
## [13] "/title/tt0840361/?ref_=adv_li_tt" "/title/tt1242432/?ref_=adv_li_tt"
## [15] "/title/tt0758752/?ref_=adv_li_tt" "/title/tt1014759/?ref_=adv_li_tt"
## [17] "/title/tt1588170/?ref_=adv_li_tt" "/title/tt1273235/?ref_=adv_li_tt"
## [19] "/title/tt0398286/?ref_=adv_li_tt" "/title/tt1386588/?ref_=adv_li_tt"
## [21] "/title/tt0814255/?ref_=adv_li_tt" "/title/tt1403865/?ref_=adv_li_tt"
## [23] "/title/tt0938283/?ref_=adv_li_tt" "/title/tt1228705/?ref_=adv_li_tt"
## [25] "/title/tt0800320/?ref_=adv_li_tt" "/title/tt1104001/?ref_=adv_li_tt"
## [27] "/title/tt1255953/?ref_=adv_li_tt" "/title/tt1314655/?ref_=adv_li_tt"
## [29] "/title/tt1325004/?ref_=adv_li_tt" "/title/tt1282140/?ref_=adv_li_tt"
## [31] "/title/tt0980970/?ref_=adv_li_tt" "/title/tt0892769/?ref_=adv_li_tt"
## [33] "/title/tt0964517/?ref_=adv_li_tt" "/title/tt1504320/?ref_=adv_li_tt"
## [35] "/title/tt1231587/?ref_=adv_li_tt" "/title/tt0944835/?ref_=adv_li_tt"
## [37] "/title/tt1126591/?ref_=adv_li_tt" "/title/tt1001526/?ref_=adv_li_tt"
## [39] "/title/tt1323594/?ref_=adv_li_tt" "/title/tt0455407/?ref_=adv_li_tt"
## [41] "/title/tt1120985/?ref_=adv_li_tt" "/title/tt0435761/?ref_=adv_li_tt"
## [43] "/title/tt1263750/?ref_=adv_li_tt" "/title/tt1020558/?ref_=adv_li_tt"
## [45] "/title/tt1228987/?ref_=adv_li_tt" "/title/tt1403981/?ref_=adv_li_tt"
## [47] "/title/tt1465522/?ref_=adv_li_tt" "/title/tt0955308/?ref_=adv_li_tt"
## [49] "/title/tt0429493/?ref_=adv_li_tt" "/title/tt1155076/?ref_=adv_li_tt"

movie_urls <- scrape_movie_urls(2010)

We want to use the scrape_time() function with each movie URL in movie_urls. These URLs are relative, so the function now prepends the site address.

scrape_time <- function(movie_url) {
  time <- paste0("https://www.imdb.com", movie_url) %>%
    read_html() %>%
    html_nodes("time") %>%
    html_text()
  time[2] %>%
    str_remove(" min") %>%
    as.numeric()
}

Mapping (A simple example)

double <- function(number) {
  number * 2
}
double(2)
## [1] 4
double(4)
## [1] 8
double(6)
## [1] 12

even <- seq(from = 2, to = 10, by = 2)
even
## [1] 2 4 6 8 10

Mapping (A simple example)

Mapping applies a function to each element of a vector (or a list) and returns the results as a vector, a list, or a data frame.

map_dbl(even, double)
## [1] 4 8 12 16 20
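
For comparison, the same computation can be written as an explicit loop. This is a minimal sketch, assuming purrr is installed, that reuses the double() function and the even vector from the slides; the names doubled_loop and doubled_map are introduced here for illustration.

```r
library(purrr)

double <- function(number) {
  number * 2
}
even <- seq(from = 2, to = 10, by = 2)

# Loop version: pre-allocate the result, then fill it one element at a time
doubled_loop <- numeric(length(even))
for (i in seq_along(even)) {
  doubled_loop[i] <- double(even[i])
}

# map_dbl() version: the same computation in a single call
doubled_map <- map_dbl(even, double)

# Both approaches give the same double vector
identical(doubled_loop, doubled_map)
```

The map version avoids the bookkeeping (pre-allocation and indexing) and states the expected output type in the function name.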

English: Take each element of the movie_urls vector and use it in (map it to) the scrape_time() function.

map_dbl(movie_urls, scrape_time)
## [1] 148 103 120 95 112 138 102 146 117 108 133 111 125 108 112 108 144 104 100
## [20] 107 118 110 103 124 106 125 131 80 124 92 113 98 116 118 101 100 119 95
## [39] 95 101 112 103 107 97 116 113 89 140 117 140
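
When a function is mapped over many URLs, a single page that fails to load stops the whole call with an error. One possible safeguard, sketched here under assumptions, is to wrap the function with purrr's possibly() so that a failure yields NA. fake_scrape and fake_runtimes are hypothetical stand-ins for scrape_time() and the web pages, so the example runs offline.

```r
library(purrr)

# Hypothetical lookup table standing in for the web pages
fake_runtimes <- c("/title/a/" = 148, "/title/b/" = 103)

# Stand-in for scrape_time(): errors when the "page" does not exist
fake_scrape <- function(movie_url) {
  if (!movie_url %in% names(fake_runtimes)) {
    stop("page could not be read")
  }
  fake_runtimes[[movie_url]]
}

# possibly() returns a version of the function that yields `otherwise` on error
safe_scrape <- possibly(fake_scrape, otherwise = NA_real_)

urls <- c("/title/a/", "/title/missing/", "/title/b/")
times <- map_dbl(urls, safe_scrape)
times
```

The failed element shows up as NA in the result, so the successful values are kept and the problem URLs can be retried later.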

Map functions

Function     Output
map()        a list
map_lgl()    a logical vector
map_int()    an integer vector
map_dbl()    a double vector
map_chr()    a character vector
map_df()     a data frame

scrape_movie <- function(movie_url) {
  # Download the movie page once and reuse it for both fields
  page <- paste0("https://www.imdb.com", movie_url) %>%
    read_html()
  time <- page %>%
    html_nodes("time") %>%
    html_text()
  time_min <- time[2] %>%
    str_remove(" min") %>%
    as.numeric()
  rating <- page %>%
    html_nodes("strong span") %>%
    html_text()
  tibble(time = time_min,
         rating = rating)
}

map_df(movie_urls, scrape_movie)
## # A tibble: 50 x 2
## time rating
## <dbl> <chr>
## 1 148 8.8
## 2 103 6.8
## 3 120 7.7
## 4 95 5.2
## 5 112 7.5
## 6 138 8.2
## 7 102 5.9
## 8 146 7.7
## 9 117 7.6
## 10 108 8.0
## # … with 40 more rows
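
The row-binding behavior of map_df() can be tried offline with a function that returns a one-row tibble. This is a minimal sketch, assuming purrr, tibble, and dplyr are installed (map_df() relies on dplyr to bind rows). describe_movie is a hypothetical stand-in for scrape_movie(), and the input numbers are the first three runtimes shown earlier. In purrr 1.0.0 and later, map_df() is marked as superseded in favor of map() followed by list_rbind(); the sketch keeps map_df() to match the slides.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for scrape_movie(): one row per input value
describe_movie <- function(minutes) {
  tibble(time = minutes,
         long = minutes > 120)
}

# Each call returns a one-row tibble; map_df() stacks them into one data frame
movies <- map_df(c(148, 103, 120), describe_movie)
movies

# The same inputs with a variant that returns a logical vector instead
is_long <- map_lgl(c(148, 103, 120), function(minutes) minutes > 120)
is_long
```

Choosing the variant by output type means a mismatch, such as a function returning text inside map_dbl(), is reported as an error instead of being silently coerced.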