class: title-slide

<br>
<br>

.pull-right[

# Iteration

## Dr. Mine Dogucu

]

---
class: center middle

### Movie Specific Information

<video width="80%" height="45%" align = "center" controls>
  <source src="screencast/5-scrape-movie.mp4" type="video/mp4">
</video>

---

```r
read_html("https://www.imdb.com/title/tt1375666/?ref_=adv_li_tt") %>% 
  html_nodes("time") %>% 
  html_text()
```

```
## [1] "\n 2h 28min\n "
## [2] "148 min"
```

---

```r
time <- read_html("https://www.imdb.com/title/tt1375666/?ref_=adv_li_tt") %>% 
  html_nodes("time") %>% 
  html_text()

time[2]
```

```
## [1] "148 min"
```

---

```r
time <- read_html("https://www.imdb.com/title/tt1375666/?ref_=adv_li_tt") %>% 
  html_nodes("time") %>% 
  html_text()

time[2] %>% 
  str_remove(" min") %>% 
  as.numeric()
```

```
## [1] 148
```

---

```r
scrape_time <- function(url) {
  
  time <- read_html(url) %>% 
    html_nodes("time") %>% 
    html_text()
  
  time[2] %>% 
    str_remove(" min") %>% 
    as.numeric()
}
```

---

```r
scrape_time("https://www.imdb.com/title/tt1375666/?ref_=adv_li_tt")
```

```
## [1] 148
```

---

## Goal

What if we wanted to use this function on the [50 most popular movies of 2010](https://www.imdb.com/search/title/?title_type=feature&year=2010-01-01,2010-12-31)?

<img src="img/featured-popularity.png" width="80%" style="display: block; margin: auto;" />

---

Scraping First Movie

```r
scrape_time("https://www.imdb.com/title/tt1375666/?ref_=adv_li_tt")
```

```
## [1] 148
```

Scraping Second Movie

```r
scrape_time("https://www.imdb.com/title/tt1591095/?ref_=adv_li_tt")
```

```
## [1] 103
```

Scraping Third Movie...
```r
scrape_time("https://www.imdb.com/title/tt1285016/?ref_=adv_li_tt")
```

```
## [1] 120
```

---

.pull-left[

In the previous lecture

```r
scrape_movie_title <- function(year) {
  
  page <- paste0("https://www.imdb.com/search/title/?title_type=feature&year=",
                 year, "-01-01,", year, "-12-31&sort=user_rating,desc")
  
  read_html(page) %>% 
    html_nodes(".lister-item-header a") %>% 
    html_text()
}
```

]

.pull-right[

In this lecture

```r
scrape_movie_urls <- function(year) {
  
  page <- paste0("https://www.imdb.com/search/title/?title_type=feature&year=",
                 year, "-01-01,", year, "-12-31")
  
  read_html(page) %>% 
    html_nodes(".lister-item-header a") %>% 
    html_attr("href")
}
```

]

`html_attr()` lets us scrape an attribute of each selected node (in this case, `href`) instead of its text.

---

```r
scrape_movie_urls(2010)
```

```
##  [1] "/title/tt1375666/?ref_=adv_li_tt" "/title/tt1591095/?ref_=adv_li_tt"
##  [3] "/title/tt1285016/?ref_=adv_li_tt" "/title/tt1179056/?ref_=adv_li_tt"
##  [5] "/title/tt0446029/?ref_=adv_li_tt" "/title/tt1130884/?ref_=adv_li_tt"
##  [7] "/title/tt1375670/?ref_=adv_li_tt" "/title/tt0926084/?ref_=adv_li_tt"
##  [9] "/title/tt1250777/?ref_=adv_li_tt" "/title/tt0947798/?ref_=adv_li_tt"
## [11] "/title/tt1458175/?ref_=adv_li_tt" "/title/tt1245526/?ref_=adv_li_tt"
## [13] "/title/tt0840361/?ref_=adv_li_tt" "/title/tt1242432/?ref_=adv_li_tt"
## [15] "/title/tt0758752/?ref_=adv_li_tt" "/title/tt1014759/?ref_=adv_li_tt"
## [17] "/title/tt1588170/?ref_=adv_li_tt" "/title/tt1273235/?ref_=adv_li_tt"
## [19] "/title/tt0398286/?ref_=adv_li_tt" "/title/tt1386588/?ref_=adv_li_tt"
## [21] "/title/tt0814255/?ref_=adv_li_tt" "/title/tt1403865/?ref_=adv_li_tt"
## [23] "/title/tt0938283/?ref_=adv_li_tt" "/title/tt1228705/?ref_=adv_li_tt"
## [25] "/title/tt0800320/?ref_=adv_li_tt" "/title/tt1104001/?ref_=adv_li_tt"
## [27] "/title/tt1255953/?ref_=adv_li_tt" "/title/tt1314655/?ref_=adv_li_tt"
## [29] "/title/tt1325004/?ref_=adv_li_tt" "/title/tt1282140/?ref_=adv_li_tt"
## [31] "/title/tt0980970/?ref_=adv_li_tt"
"/title/tt0892769/?ref_=adv_li_tt"
## [33] "/title/tt0964517/?ref_=adv_li_tt" "/title/tt1504320/?ref_=adv_li_tt"
## [35] "/title/tt1231587/?ref_=adv_li_tt" "/title/tt0944835/?ref_=adv_li_tt"
## [37] "/title/tt1126591/?ref_=adv_li_tt" "/title/tt1001526/?ref_=adv_li_tt"
## [39] "/title/tt1323594/?ref_=adv_li_tt" "/title/tt0455407/?ref_=adv_li_tt"
## [41] "/title/tt1120985/?ref_=adv_li_tt" "/title/tt0435761/?ref_=adv_li_tt"
## [43] "/title/tt1263750/?ref_=adv_li_tt" "/title/tt1020558/?ref_=adv_li_tt"
## [45] "/title/tt1228987/?ref_=adv_li_tt" "/title/tt1403981/?ref_=adv_li_tt"
## [47] "/title/tt1465522/?ref_=adv_li_tt" "/title/tt0955308/?ref_=adv_li_tt"
## [49] "/title/tt0429493/?ref_=adv_li_tt" "/title/tt1155076/?ref_=adv_li_tt"
```

---

```r
movie_urls <- scrape_movie_urls(2010)
```

---

We want to use the `scrape_time()` function with each movie URL in `movie_urls`. Because these URLs are relative, the function now prepends `https://www.imdb.com` before reading the page.

```r
scrape_time <- function(movie_url) {
  
  time <- paste0("https://www.imdb.com", movie_url) %>% 
    read_html() %>% 
    html_nodes("time") %>% 
    html_text()
  
  time[2] %>% 
    str_remove(" min") %>% 
    as.numeric()
}
```

---

## Mapping (A simple example)

```r
double <- function(number) {
  number * 2
}
```

--

```r
double(2)
```

```
## [1] 4
```

```r
double(4)
```

```
## [1] 8
```

```r
double(6)
```

```
## [1] 12
```

---

```r
even <- seq(from = 2, to = 10, by = 2)
even
```

```
## [1]  2  4  6  8 10
```

---

## Mapping (A simple example)

Mapping allows us to apply a function to each element of a vector (or a list) and get a vector (or a list, or a data frame) as output.

```r
map_dbl(even, double)
```

```
## [1]  4  8 12 16 20
```

---

**English**: Take each element of the `movie_urls` vector and use it in (map it to) the `scrape_time()` function.
```r
map_dbl(movie_urls, scrape_time)
```

```
##  [1] 148 103 120  95 112 138 102 146 117 108 133 111 125 108 112 108 144 104 100
## [20] 107 118 110 103 124 106 125 131  80 124  92 113  98 116 118 101 100 119  95
## [39]  95 101 112 103 107  97 116 113  89 140 117 140
```

---

## Map functions

| Function    | Output             |
|-------------|--------------------|
| `map()`     | a list             |
| `map_lgl()` | a logical vector   |
| `map_int()` | an integer vector  |
| `map_dbl()` | a double vector    |
| `map_chr()` | a character vector |
| `map_df()`  | a data frame       |

---

```r
scrape_movie <- function(movie_url) {
  
  # Read the movie page once and reuse it for both fields
  page <- paste0("https://www.imdb.com", movie_url) %>% 
    read_html()
  
  time <- page %>% 
    html_nodes("time") %>% 
    html_text()
  
  time_min <- time[2] %>% 
    str_remove(" min") %>% 
    as.numeric()
  
  rating <- page %>% 
    html_nodes("strong span") %>% 
    html_text()
  
  tibble(time = time_min, rating = rating)
}
```

---

```r
map_df(movie_urls, scrape_movie)
```

```
## # A tibble: 50 x 2
##     time rating
##    <dbl> <chr> 
##  1   148 8.8   
##  2   103 6.8   
##  3   120 7.7   
##  4    95 5.2   
##  5   112 7.5   
##  6   138 8.2   
##  7   102 5.9   
##  8   146 7.7   
##  9   117 7.6   
## 10   108 8.0   
## # … with 40 more rows
```
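---

## Map functions (an offline sketch)

The map variants from the table can be tried without scraping anything. The sketch below reuses `even` and `double()` from the simple example. The anonymous helper functions and the result names (`as_list`, `doubled`, `is_big`, `halves`, `labels`, `as_table`) are made up for illustration, and `map_df()` is assumed to have dplyr installed, since it binds rows through it.

```r
library(purrr)
library(tibble)

even <- seq(from = 2, to = 10, by = 2)

double <- function(number) {
  number * 2
}

# Same input vector, different output types
as_list  <- map(even, double)                                 # a list
doubled  <- map_dbl(even, double)                             # a double vector
is_big   <- map_lgl(even, function(x) x > 5)                  # a logical vector
halves   <- map_int(even, function(x) as.integer(x / 2))      # an integer vector
labels   <- map_chr(even, function(x) paste0(x, " min"))      # a character vector

# Each call returns a one-row tibble; map_df() stacks them into one data frame
as_table <- map_df(even, function(x) tibble(number = x, doubled = double(x)))
```

The typed variants raise an error if the function returns something of the wrong type, which is one reason to prefer `map_dbl()` over `map()` when a numeric result such as a runtime is expected.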