


Web Scraping

Dr. Mine Dogucu


read_html() - reads an HTML page.
html_nodes() - extracts the HTML nodes that match a selector.
html_text() - extracts the text of a node.
html_attr() - extracts the value of an attribute of a node.
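
The four functions above can be tried without a network connection by parsing an HTML string. This is a minimal sketch: the HTML snippet, the class name `movie`, and the variable names are made up for illustration and are not part of the IMDb page used later.

```r
library(rvest)

# Hypothetical HTML snippet, used here instead of a live web page
snippet <- '<html><body>
  <a class="movie" href="/title/a">First Movie</a>
  <a class="movie" href="/title/b">Second Movie</a>
</body></html>'

doc <- read_html(snippet)              # parse the HTML
links <- html_nodes(doc, ".movie")     # select nodes with a CSS selector
link_text <- html_text(links)          # text inside each selected node
link_href <- html_attr(links, "href")  # value of each node's href attribute
```

Here `link_text` holds the visible link labels and `link_href` holds the link targets, one element per matched node.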


Load packages

library(rvest)
library(tidyverse)

Check if a bot has permission to access a page

robotstxt::paths_allowed("http://www.imdb.com")
##
www.imdb.com
## [1] TRUE
robotstxt::paths_allowed("http://www.facebook.com")
##
www.facebook.com
## [1] FALSE
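
To see what such a check is based on, the rules of a robots.txt file can also be supplied as text and queried offline. A rough sketch, assuming the `robotstxt()` constructor accepts a `text` argument and that the resulting object offers a `check()` method; the rules and paths below are hypothetical and do not come from either site above.

```r
library(robotstxt)

# Hypothetical robots.txt rules, supplied as text so nothing is downloaded
rules <- "User-agent: *\nDisallow: /private/\n"

rt <- robotstxt(text = rules)

# Ask about one path at a time, for any bot ("*")
private_ok <- rt$check(paths = "/private/report.html", bot = "*")
public_ok <- rt$check(paths = "/public/index.html", bot = "*")
```

A path under a `Disallow` rule should come back as not permitted, which is the same kind of answer `paths_allowed()` gives for a live site.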

Read the entire page

page <- read_html("http://www.imdb.com/chart/top")
page
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n <img height="1" widt ...

Scrape titles

page %>%
  html_nodes(".titleColumn a")
## {xml_nodeset (250)}
## [1] <a href="/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [2] <a href="/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [3] <a href="/title/tt0071562/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [4] <a href="/title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [5] <a href="/title/tt0050083/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [6] <a href="/title/tt0108052/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [7] <a href="/title/tt0167260/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [8] <a href="/title/tt0110912/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [9] <a href="/title/tt0060196/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [10] <a href="/title/tt0120737/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [11] <a href="/title/tt0137523/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [12] <a href="/title/tt0109830/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [13] <a href="/title/tt1375666/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [14] <a href="/title/tt0167261/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [15] <a href="/title/tt0080684/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [16] <a href="/title/tt0133093/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [17] <a href="/title/tt0099685/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [18] <a href="/title/tt0073486/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [19] <a href="/title/tt0047478/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [20] <a href="/title/tt0114369/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## ...
page %>%
  html_nodes(".titleColumn a") %>%
  html_text()
## [1] "The Shawshank Redemption"
## [2] "The Godfather"
## [3] "The Godfather: Part II"
## [4] "The Dark Knight"
## [5] "12 Angry Men"
## [6] "Schindler's List"
## [7] "The Lord of the Rings: The Return of the King"
## [8] "Pulp Fiction"
## [9] "The Good, the Bad and the Ugly"
## [10] "The Lord of the Rings: The Fellowship of the Ring"
## [11] "Fight Club"
## [12] "Forrest Gump"
## [13] "Inception"
## [14] "The Lord of the Rings: The Two Towers"
## [15] "Star Wars: Episode V - The Empire Strikes Back"
## [16] "The Matrix"
## [17] "Goodfellas"
## [18] "One Flew Over the Cuckoo's Nest"
## [19] "Seven Samurai"
## [20] "Se7en"
## [21] "Life Is Beautiful"
## [22] "City of God"
## [23] "The Silence of the Lambs"
## [24] "It's a Wonderful Life"
## [25] "Saving Private Ryan"
## [26] "Star Wars: Episode IV - A New Hope"
## [27] "The Green Mile"
## [28] "Spirited Away"
## [29] "Interstellar"
## [30] "Parasite"
## [31] "Léon: The Professional"
## [32] "Hara-Kiri"
## [33] "The Lion King"
## [34] "The Usual Suspects"
## [35] "The Pianist"
## [36] "Terminator 2: Judgment Day"
## [37] "Back to the Future"
## [38] "American History X"
## [39] "Modern Times"
## [40] "Gladiator"
## [41] "Psycho"
## [42] "The Departed"
## [43] "City Lights"
## [44] "Whiplash"
## [45] "The Intouchables"
## [46] "Grave of the Fireflies"
## [47] "The Prestige"
## [48] "Once Upon a Time in the West"
## [49] "Casablanca"
## [50] "Cinema Paradiso"
## [51] "Rear Window"
## [52] "Alien"
## [53] "Apocalypse Now"
## [54] "Memento"
## [55] "The Great Dictator"
## [56] "Indiana Jones and the Raiders of the Lost Ark"
## [57] "Django Unchained"
## [58] "The Lives of Others"
## [59] "Paths of Glory"
## [60] "Hamilton"
## [61] "WALL·E"
## [62] "Joker"
## [63] "The Shining"
## [64] "Avengers: Infinity War"
## [65] "Sunset Blvd."
## [66] "Witness for the Prosecution"
## [67] "Spider-Man: Into the Spider-Verse"
## [68] "Oldboy"
## [69] "Princess Mononoke"
## [70] "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb"
## [71] "The Dark Knight Rises"
## [72] "Once Upon a Time in America"
## [73] "Your Name."
## [74] "Aliens"
## [75] "Coco"
## [76] "Avengers: Endgame"
## [77] "Capharnaüm"
## [78] "American Beauty"
## [79] "Braveheart"
## [80] "High and Low"
## [81] "Toy Story"
## [82] "Das Boot"
## [83] "3 Idiots"
## [84] "Amadeus"
## [85] "Inglourious Basterds"
## [86] "Good Will Hunting"
## [87] "Star Wars: Episode VI - Return of the Jedi"
## [88] "Like Stars on Earth"
## [89] "Reservoir Dogs"
## [90] "2001: A Space Odyssey"
## [91] "Requiem for a Dream"
## [92] "The Hunt"
## [93] "Vertigo"
## [94] "M"
## [95] "Eternal Sunshine of the Spotless Mind"
## [96] "Citizen Kane"
## [97] "Dangal"
## [98] "Come and See"
## [99] "Singin' in the Rain"
## [100] "The Kid"
## [101] "Bicycle Thieves"
## [102] "Full Metal Jacket"
## [103] "Ikiru"
## [104] "Snatch"
## [105] "North by Northwest"
## [106] "Scarface"
## [107] "A Clockwork Orange"
## [108] "1917"
## [109] "Incendies"
## [110] "Taxi Driver"
## [111] "A Separation"
## [112] "Toy Story 3"
## [113] "The Sting"
## [114] "Lawrence of Arabia"
## [115] "Amélie"
## [116] "Metropolis"
## [117] "The Apartment"
## [118] "For a Few Dollars More"
## [119] "Double Indemnity"
## [120] "To Kill a Mockingbird"
## [121] "Up"
## [122] "Indiana Jones and the Last Crusade"
## [123] "Heat"
## [124] "L.A. Confidential"
## [125] "Green Book"
## [126] "Die Hard"
## [127] "Batman Begins"
## [128] "Yojimbo"
## [129] "Monty Python and the Holy Grail"
## [130] "Rashomon"
## [131] "Downfall"
## [132] "The Father"
## [133] "Children of Heaven"
## [134] "Ran"
## [135] "Unforgiven"
## [136] "Some Like It Hot"
## [137] "Howl's Moving Castle"
## [138] "All About Eve"
## [139] "Casino"
## [140] "The Wolf of Wall Street"
## [141] "A Beautiful Mind"
## [142] "The Great Escape"
## [143] "Pan's Labyrinth"
## [144] "There Will Be Blood"
## [145] "The Secret in Their Eyes"
## [146] "Judgment at Nuremberg"
## [147] "Lock, Stock and Two Smoking Barrels"
## [148] "Raging Bull"
## [149] "My Neighbor Totoro"
## [150] "The Treasure of the Sierra Madre"
## [151] "Dial M for Murder"
## [152] "Shutter Island"
## [153] "Three Billboards Outside Ebbing, Missouri"
## [154] "The Gold Rush"
## [155] "My Father and My Son"
## [156] "Chinatown"
## [157] "No Country for Old Men"
## [158] "V for Vendetta"
## [159] "Inside Out"
## [160] "The Thing"
## [161] "The Elephant Man"
## [162] "The Seventh Seal"
## [163] "The Sixth Sense"
## [164] "Warrior"
## [165] "Jurassic Park"
## [166] "Klaus"
## [167] "Trainspotting"
## [168] "The Truman Show"
## [169] "Gone with the Wind"
## [170] "Finding Nemo"
## [171] "Stalker"
## [172] "Memories of Murder"
## [173] "Kill Bill: Vol. 1"
## [174] "Wild Strawberries"
## [175] "Blade Runner"
## [176] "Fargo"
## [177] "The Bridge on the River Kwai"
## [178] "Wild Tales"
## [179] "Tokyo Story"
## [180] "Gran Torino"
## [181] "Room"
## [182] "The Third Man"
## [183] "On the Waterfront"
## [184] "The Deer Hunter"
## [185] "In the Name of the Father"
## [186] "Before Sunrise"
## [187] "Mary and Max"
## [188] "The Grand Budapest Hotel"
## [189] "Catch Me If You Can"
## [190] "Gone Girl"
## [191] "Hacksaw Ridge"
## [192] "Prisoners"
## [193] "Persona"
## [194] "Sherlock Jr."
## [195] "Andhadhun"
## [196] "The Big Lebowski"
## [197] "Barry Lyndon"
## [198] "The General"
## [199] "To Be or Not to Be"
## [200] "Ford v Ferrari"
## [201] "How to Train Your Dragon"
## [202] "12 Years a Slave"
## [203] "The Bandit"
## [204] "Mr. Smith Goes to Washington"
## [205] "Autumn Sonata"
## [206] "Mad Max: Fury Road"
## [207] "Dead Poets Society"
## [208] "Million Dollar Baby"
## [209] "Harry Potter and the Deathly Hallows: Part 2"
## [210] "Stand by Me"
## [211] "Ben-Hur"
## [212] "Network"
## [213] "Hachi: A Dog's Tale"
## [214] "The Handmaiden"
## [215] "Anand"
## [216] "Cool Hand Luke"
## [217] "Logan"
## [218] "Platoon"
## [219] "The Wages of Fear"
## [220] "Rush"
## [221] "Into the Wild"
## [222] "La Haine"
## [223] "The Passion of Joan of Arc"
## [224] "The 400 Blows"
## [225] "Monty Python's Life of Brian"
## [226] "Spotlight"
## [227] "Gangs of Wasseypur"
## [228] "Hotel Rwanda"
## [229] "Amores Perros"
## [230] "Monsters, Inc."
## [231] "Andrei Rublev"
## [232] "Rocky"
## [233] "Soul"
## [234] "Nausicaä of the Valley of the Wind"
## [235] "Rebecca"
## [236] "Before Sunset"
## [237] "In the Mood for Love"
## [238] "Time of the Gypsies"
## [239] "Raatchasan"
## [240] "Rang De Basanti"
## [241] "Rififi"
## [242] "Paris, Texas"
## [243] "Drishyam"
## [244] "Portrait of a Lady on Fire"
## [245] "Zack Snyder's Justice League"
## [246] "It Happened One Night"
## [247] "Tangerines"
## [248] "The Battle of Algiers"
## [249] "Drishyam"
## [250] "A Silent Voice: The Movie"
titles <- page %>%
  html_nodes(".titleColumn a") %>%
  html_text()
str(titles)
## chr [1:250] "The Shawshank Redemption" "The Godfather" ...

Scrape years

page %>%
  html_nodes(".secondaryInfo") %>%
  html_text()
## [1] "(1994)" "(1972)" "(1974)" "(2008)" "(1957)" "(1993)" "(2003)" "(1994)"
## [9] "(1966)" "(2001)" "(1999)" "(1994)" "(2010)" "(2002)" "(1980)" "(1999)"
## [17] "(1990)" "(1975)" "(1954)" "(1995)" "(1997)" "(2002)" "(1991)" "(1946)"
## [25] "(1998)" "(1977)" "(1999)" "(2001)" "(2014)" "(2019)" "(1994)" "(1962)"
## [33] "(1994)" "(1995)" "(2002)" "(1991)" "(1985)" "(1998)" "(1936)" "(2000)"
## [41] "(1960)" "(2006)" "(1931)" "(2014)" "(2011)" "(1988)" "(2006)" "(1968)"
## [49] "(1942)" "(1988)" "(1954)" "(1979)" "(1979)" "(2000)" "(1940)" "(1981)"
## [57] "(2012)" "(2006)" "(1957)" "(2020)" "(2008)" "(2019)" "(1980)" "(2018)"
## [65] "(1950)" "(1957)" "(2018)" "(2003)" "(1997)" "(1964)" "(2012)" "(1984)"
## [73] "(2016)" "(1986)" "(2017)" "(2019)" "(2018)" "(1999)" "(1995)" "(1963)"
## [81] "(1995)" "(1981)" "(2009)" "(1984)" "(2009)" "(1997)" "(1983)" "(2007)"
## [89] "(1992)" "(1968)" "(2000)" "(2012)" "(1958)" "(1931)" "(2004)" "(1941)"
## [97] "(2016)" "(1985)" "(1952)" "(1921)" "(1948)" "(1987)" "(1952)" "(2000)"
## [105] "(1959)" "(1983)" "(1971)" "(2019)" "(2010)" "(1976)" "(2011)" "(2010)"
## [113] "(1973)" "(1962)" "(2001)" "(1927)" "(1960)" "(1965)" "(1944)" "(1962)"
## [121] "(2009)" "(1989)" "(1995)" "(1997)" "(2018)" "(1988)" "(2005)" "(1961)"
## [129] "(1975)" "(1950)" "(2004)" "(2020)" "(1997)" "(1985)" "(1992)" "(1959)"
## [137] "(2004)" "(1950)" "(1995)" "(2013)" "(2001)" "(1963)" "(2006)" "(2007)"
## [145] "(2009)" "(1961)" "(1998)" "(1980)" "(1988)" "(1948)" "(1954)" "(2010)"
## [153] "(2017)" "(1925)" "(2005)" "(1974)" "(2007)" "(2005)" "(2015)" "(1982)"
## [161] "(1980)" "(1957)" "(1999)" "(2011)" "(1993)" "(2019)" "(1996)" "(1998)"
## [169] "(1939)" "(2003)" "(1979)" "(2003)" "(2003)" "(1957)" "(1982)" "(1996)"
## [177] "(1957)" "(2014)" "(1953)" "(2008)" "(2015)" "(1949)" "(1954)" "(1978)"
## [185] "(1993)" "(1995)" "(2009)" "(2014)" "(2002)" "(2014)" "(2016)" "(2013)"
## [193] "(1966)" "(1924)" "(2018)" "(1998)" "(1975)" "(1926)" "(1942)" "(2019)"
## [201] "(2010)" "(2013)" "(1996)" "(1939)" "(1978)" "(2015)" "(1989)" "(2004)"
## [209] "(2011)" "(1986)" "(1959)" "(1976)" "(2009)" "(2016)" "(1971)" "(1967)"
## [217] "(2017)" "(1986)" "(1953)" "(2013)" "(2007)" "(1995)" "(1928)" "(1959)"
## [225] "(1979)" "(2015)" "(2012)" "(2004)" "(2000)" "(2001)" "(1966)" "(1976)"
## [233] "(2020)" "(1984)" "(1940)" "(2004)" "(2000)" "(1988)" "(2018)" "(2006)"
## [241] "(1955)" "(1984)" "(2013)" "(2019)" "(2021)" "(1934)" "(2013)" "(1966)"
## [249] "(2015)" "(2016)"
page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") %>%
  str_remove("\\)") %>%
  as.numeric()
## [1] 1994 1972 1974 2008 1957 1993 2003 1994 1966 2001 1999 1994 2010 2002 1980
## [16] 1999 1990 1975 1954 1995 1997 2002 1991 1946 1998 1977 1999 2001 2014 2019
## [31] 1994 1962 1994 1995 2002 1991 1985 1998 1936 2000 1960 2006 1931 2014 2011
## [46] 1988 2006 1968 1942 1988 1954 1979 1979 2000 1940 1981 2012 2006 1957 2020
## [61] 2008 2019 1980 2018 1950 1957 2018 2003 1997 1964 2012 1984 2016 1986 2017
## [76] 2019 2018 1999 1995 1963 1995 1981 2009 1984 2009 1997 1983 2007 1992 1968
## [91] 2000 2012 1958 1931 2004 1941 2016 1985 1952 1921 1948 1987 1952 2000 1959
## [106] 1983 1971 2019 2010 1976 2011 2010 1973 1962 2001 1927 1960 1965 1944 1962
## [121] 2009 1989 1995 1997 2018 1988 2005 1961 1975 1950 2004 2020 1997 1985 1992
## [136] 1959 2004 1950 1995 2013 2001 1963 2006 2007 2009 1961 1998 1980 1988 1948
## [151] 1954 2010 2017 1925 2005 1974 2007 2005 2015 1982 1980 1957 1999 2011 1993
## [166] 2019 1996 1998 1939 2003 1979 2003 2003 1957 1982 1996 1957 2014 1953 2008
## [181] 2015 1949 1954 1978 1993 1995 2009 2014 2002 2014 2016 2013 1966 1924 2018
## [196] 1998 1975 1926 1942 2019 2010 2013 1996 1939 1978 2015 1989 2004 2011 1986
## [211] 1959 1976 2009 2016 1971 1967 2017 1986 1953 2013 2007 1995 1928 1959 1979
## [226] 2015 2012 2004 2000 2001 1966 1976 2020 1984 1940 2004 2000 1988 2018 2006
## [241] 1955 1984 2013 2019 2021 1934 2013 1966 2015 2016
years <- page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") %>%
  str_remove("\\)") %>%
  as.numeric()
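
The two `str_remove()` calls strip the opening and closing parentheses one at a time. An alternative sketch, not the approach used on the slides, pulls out the four digits directly with `str_extract()`; the input vector below is a small stand-in for the scraped text.

```r
library(stringr)

# Example strings in the same "(YYYY)" form as the scraped output
raw_years <- c("(1994)", "(1972)", "(2008)")

# Extract the first run of four digits, then convert to numeric
years_alt <- as.numeric(str_extract(raw_years, "\\d{4}"))
```

This variant would also tolerate extra characters around the year, whereas removing the parentheses assumes nothing else surrounds the digits.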

Scrape ratings

ratings <- page %>%
  html_nodes("strong") %>%
  html_text() %>%
  as.numeric()
imdb_top_250 <- tibble(
  title = titles,
  year = years,
  rating = ratings
)
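
Because each column is scraped with a different selector, it can help to confirm that the vectors line up before combining them. A small sketch with toy vectors; the names ending in `_demo` and all values are made up for illustration.

```r
library(tibble)

# Toy vectors standing in for the scraped titles, years, and ratings
titles_demo <- c("Movie A", "Movie B")
years_demo <- c(2001, 2002)
ratings_demo <- c(8.5, 8.4)

# Stop early if the selectors returned different numbers of items
stopifnot(
  length(titles_demo) == length(years_demo),
  length(years_demo) == length(ratings_demo)
)

demo_tbl <- tibble(
  title = titles_demo,
  year = years_demo,
  rating = ratings_demo
)
```

`tibble()` only recycles length-one vectors, so mismatched lengths would generally raise an error anyway; the explicit check just makes the cause easier to spot.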
imdb_top_250 %>%
  group_by(year) %>%
  summarize(avg_rating = mean(rating)) %>%
  arrange(desc(avg_rating))
## # A tibble: 85 x 2
##     year avg_rating
##    <dbl>      <dbl>
##  1  1972       9.1
##  2  1994       8.76
##  3  1946       8.6
##  4  1977       8.6
##  5  1990       8.6
##  6  1974       8.55
##  7  1991       8.55
##  8  1936       8.5
##  9  2008       8.5
## 10  2002       8.48
## # … with 75 more rows
imdb_top_250 %>%
  filter(year == 1972)
## # A tibble: 1 x 3
##   title          year rating
##   <chr>         <dbl>  <dbl>
## 1 The Godfather  1972    9.1
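
The top-ranked year rests on a single film, so an average per year is easier to read next to a count of films. A sketch of that idea on a toy table; the table `demo`, the column `n_films`, and all values are hypothetical.

```r
library(dplyr)
library(tibble)

# Toy data: one film in 1972, two films in 1994
demo <- tibble(
  title = c("A", "B", "C"),
  year = c(1972, 1994, 1994),
  rating = c(9.1, 9.2, 8.8)
)

by_year <- demo %>%
  group_by(year) %>%
  summarize(
    avg_rating = mean(rating),  # average rating within the year
    n_films = n()               # number of films behind that average
  ) %>%
  arrange(desc(avg_rating))
```

Applied to `imdb_top_250`, the same extra `n()` column would show at a glance which yearly averages come from only one or two films.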