class: title-slide

<br>
<br>
.pull-right[
# Web Scraping

## Dr. Mine Dogucu
]

---
class: center middle

[SelectorGadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb?hl=en)

<hr>

[IMDB 250 Top Rated Movies](https://www.imdb.com/chart/top)

---
class:middle

<img src="img/web-scrape.png" width="80%" style="display: block; margin: auto;" />

---

<img src="img/imdb-top250.png" width="80%" style="display: block; margin: auto;" />

---

<img src="img/imdb-csv.png" width="80%" style="display: block; margin: auto;" />

---

<img src="img/browser.png" width="80%" style="display: block; margin: auto;" />

---

<img src="img/imdb-getsource.png" width="80%" style="display: block; margin: auto;" />

---

<img src="img/imdb-source.png" width="80%" style="display: block; margin: auto;" />

---
class: center middle

<video width="80%" height="45%" align = "center" controls>
  <source src="screencast/5-selector-gadget.mp4" type="video/mp4">
</video>

---

<img src="img/imdb-source.png" width="80%" style="display: block; margin: auto;" />

---

<img src="img/rvest-logo.png" width="20%" style="display: block; margin: auto;" />

`read_html()` - reads an HTML page.

`html_nodes()` - extracts the HTML nodes that match a selector.

`html_text()` - extracts the text of a node.

`html_attr()` - extracts an attribute of a node.

---

## Load packages

```r
library(rvest)
library(tidyverse)
```

---

### Check if a bot has permission to access the page

```r
robotstxt::paths_allowed("http://www.imdb.com")
```

```
##  www.imdb.com
```

```
## [1] TRUE
```

```r
robotstxt::paths_allowed("http://www.facebook.com")
```

```
##  www.facebook.com
```

```
## [1] FALSE
```

---

# Read the entire page

```r
page <- read_html("http://www.imdb.com/chart/top")
page
```

```
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n <img height="1" widt ...
```

---
class: inverse middle

.font50[Scrape titles]

---

```r
page %>%
  html_nodes(".titleColumn a")
```

```
## {xml_nodeset (250)}
##  [1] <a href="/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
##  [2] <a href="/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
##  [3] <a href="/title/tt0071562/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
##  [4] <a href="/title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
##  [5] <a href="/title/tt0050083/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
##  [6] <a href="/title/tt0108052/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
##  [7] <a href="/title/tt0167260/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
##  [8] <a href="/title/tt0110912/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
##  [9] <a href="/title/tt0060196/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
## [10] <a href="/title/tt0120737/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
## [11] <a href="/title/tt0137523/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
## [12] <a href="/title/tt0109830/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
## [13] <a href="/title/tt1375666/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
## [14] <a href="/title/tt0167261/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
## [15] <a href="/title/tt0080684/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
## [16] <a href="/title/tt0133093/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
## [17] <a href="/title/tt0099685/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
## [18] <a href="/title/tt0073486/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
## [19] <a href="/title/tt0047478/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
## [20] <a href="/title/tt0114369/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-3 ...
## ...
```

---

```r
page %>%
  html_nodes(".titleColumn a") %>%
  html_text()
```

```
##   [1] "The Shawshank Redemption"
##   [2] "The Godfather"
##   [3] "The Godfather: Part II"
##   [4] "The Dark Knight"
##   [5] "12 Angry Men"
##   [6] "Schindler's List"
##   [7] "The Lord of the Rings: The Return of the King"
##   [8] "Pulp Fiction"
##   [9] "The Good, the Bad and the Ugly"
##  [10] "The Lord of the Rings: The Fellowship of the Ring"
##  [11] "Fight Club"
##  [12] "Forrest Gump"
##  [13] "Inception"
##  [14] "The Lord of the Rings: The Two Towers"
##  [15] "Star Wars: Episode V - The Empire Strikes Back"
##  [16] "The Matrix"
##  [17] "Goodfellas"
##  [18] "One Flew Over the Cuckoo's Nest"
##  [19] "Seven Samurai"
##  [20] "Se7en"
##  [21] "Life Is Beautiful"
##  [22] "City of God"
##  [23] "The Silence of the Lambs"
##  [24] "It's a Wonderful Life"
##  [25] "Saving Private Ryan"
##  [26] "Star Wars: Episode IV - A New Hope"
##  [27] "The Green Mile"
##  [28] "Spirited Away"
##  [29] "Interstellar"
##  [30] "Parasite"
##  [31] "Léon: The Professional"
##  [32] "Hara-Kiri"
##  [33] "The Lion King"
##  [34] "The Usual Suspects"
##  [35] "The Pianist"
##  [36] "Terminator 2: Judgment Day"
##  [37] "Back to the Future"
##  [38] "American History X"
##  [39] "Modern Times"
##  [40] "Gladiator"
##  [41] "Psycho"
##  [42] "The Departed"
##  [43] "City Lights"
##  [44] "Whiplash"
##  [45] "The Intouchables"
##  [46] "Grave of the Fireflies"
##  [47] "The Prestige"
##  [48] "Once Upon a Time in the West"
##  [49] "Casablanca"
##  [50] "Cinema Paradiso"
##  [51] "Rear Window"
##  [52] "Alien"
##  [53] "Apocalypse Now"
##  [54] "Memento"
##  [55] "The Great Dictator"
##  [56] "Indiana Jones and the Raiders of the Lost Ark"
##  [57] "Django Unchained"
##  [58] "The Lives of Others"
##  [59] "Paths of Glory"
##  [60] "Hamilton"
##  [61] "WALL·E"
##  [62] "Joker"
##  [63] "The Shining"
##  [64] "Avengers: Infinity War"
##  [65] "Sunset Blvd."
##  [66] "Witness for the Prosecution"
##  [67] "Spider-Man: Into the Spider-Verse"
##  [68] "Oldboy"
##  [69] "Princess Mononoke"
##  [70] "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb"
##  [71] "The Dark Knight Rises"
##  [72] "Once Upon a Time in America"
##  [73] "Your Name."
##  [74] "Aliens"
##  [75] "Coco"
##  [76] "Avengers: Endgame"
##  [77] "Capharnaüm"
##  [78] "American Beauty"
##  [79] "Braveheart"
##  [80] "High and Low"
##  [81] "Toy Story"
##  [82] "Das Boot"
##  [83] "3 Idiots"
##  [84] "Amadeus"
##  [85] "Inglourious Basterds"
##  [86] "Good Will Hunting"
##  [87] "Star Wars: Episode VI - Return of the Jedi"
##  [88] "Like Stars on Earth"
##  [89] "Reservoir Dogs"
##  [90] "2001: A Space Odyssey"
##  [91] "Requiem for a Dream"
##  [92] "The Hunt"
##  [93] "Vertigo"
##  [94] "M"
##  [95] "Eternal Sunshine of the Spotless Mind"
##  [96] "Citizen Kane"
##  [97] "Dangal"
##  [98] "Come and See"
##  [99] "Singin' in the Rain"
## [100] "The Kid"
## [101] "Bicycle Thieves"
## [102] "Full Metal Jacket"
## [103] "Ikiru"
## [104] "Snatch"
## [105] "North by Northwest"
## [106] "Scarface"
## [107] "A Clockwork Orange"
## [108] "1917"
## [109] "Incendies"
## [110] "Taxi Driver"
## [111] "A Separation"
## [112] "Toy Story 3"
## [113] "The Sting"
## [114] "Lawrence of Arabia"
## [115] "Amélie"
## [116] "Metropolis"
## [117] "The Apartment"
## [118] "For a Few Dollars More"
## [119] "Double Indemnity"
## [120] "To Kill a Mockingbird"
## [121] "Up"
## [122] "Indiana Jones and the Last Crusade"
## [123] "Heat"
## [124] "L.A. Confidential"
## [125] "Green Book"
## [126] "Die Hard"
## [127] "Batman Begins"
## [128] "Yojimbo"
## [129] "Monty Python and the Holy Grail"
## [130] "Rashomon"
## [131] "Downfall"
## [132] "The Father"
## [133] "Children of Heaven"
## [134] "Ran"
## [135] "Unforgiven"
## [136] "Some Like It Hot"
## [137] "Howl's Moving Castle"
## [138] "All About Eve"
## [139] "Casino"
## [140] "The Wolf of Wall Street"
## [141] "A Beautiful Mind"
## [142] "The Great Escape"
## [143] "Pan's Labyrinth"
## [144] "There Will Be Blood"
## [145] "The Secret in Their Eyes"
## [146] "Judgment at Nuremberg"
## [147] "Lock, Stock and Two Smoking Barrels"
## [148] "Raging Bull"
## [149] "My Neighbor Totoro"
## [150] "The Treasure of the Sierra Madre"
## [151] "Dial M for Murder"
## [152] "Shutter Island"
## [153] "Three Billboards Outside Ebbing, Missouri"
## [154] "The Gold Rush"
## [155] "My Father and My Son"
## [156] "Chinatown"
## [157] "No Country for Old Men"
## [158] "V for Vendetta"
## [159] "Inside Out"
## [160] "The Thing"
## [161] "The Elephant Man"
## [162] "The Seventh Seal"
## [163] "The Sixth Sense"
## [164] "Warrior"
## [165] "Jurassic Park"
## [166] "Klaus"
## [167] "Trainspotting"
## [168] "The Truman Show"
## [169] "Gone with the Wind"
## [170] "Finding Nemo"
## [171] "Stalker"
## [172] "Memories of Murder"
## [173] "Kill Bill: Vol. 1"
## [174] "Wild Strawberries"
## [175] "Blade Runner"
## [176] "Fargo"
## [177] "The Bridge on the River Kwai"
## [178] "Wild Tales"
## [179] "Tokyo Story"
## [180] "Gran Torino"
## [181] "Room"
## [182] "The Third Man"
## [183] "On the Waterfront"
## [184] "The Deer Hunter"
## [185] "In the Name of the Father"
## [186] "Before Sunrise"
## [187] "Mary and Max"
## [188] "The Grand Budapest Hotel"
## [189] "Catch Me If You Can"
## [190] "Gone Girl"
## [191] "Hacksaw Ridge"
## [192] "Prisoners"
## [193] "Persona"
## [194] "Sherlock Jr."
## [195] "Andhadhun"
## [196] "The Big Lebowski"
## [197] "Barry Lyndon"
## [198] "The General"
## [199] "To Be or Not to Be"
## [200] "Ford v Ferrari"
## [201] "How to Train Your Dragon"
## [202] "12 Years a Slave"
## [203] "The Bandit"
## [204] "Mr. Smith Goes to Washington"
## [205] "Autumn Sonata"
## [206] "Mad Max: Fury Road"
## [207] "Dead Poets Society"
## [208] "Million Dollar Baby"
## [209] "Harry Potter and the Deathly Hallows: Part 2"
## [210] "Stand by Me"
## [211] "Ben-Hur"
## [212] "Network"
## [213] "Hachi: A Dog's Tale"
## [214] "The Handmaiden"
## [215] "Anand"
## [216] "Cool Hand Luke"
## [217] "Logan"
## [218] "Platoon"
## [219] "The Wages of Fear"
## [220] "Rush"
## [221] "Into the Wild"
## [222] "La Haine"
## [223] "The Passion of Joan of Arc"
## [224] "The 400 Blows"
## [225] "Monty Python's Life of Brian"
## [226] "Spotlight"
## [227] "Gangs of Wasseypur"
## [228] "Hotel Rwanda"
## [229] "Amores Perros"
## [230] "Monsters, Inc."
## [231] "Andrei Rublev"
## [232] "Rocky"
## [233] "Soul"
## [234] "Nausicaä of the Valley of the Wind"
## [235] "Rebecca"
## [236] "Before Sunset"
## [237] "In the Mood for Love"
## [238] "Time of the Gypsies"
## [239] "Raatchasan"
## [240] "Rang De Basanti"
## [241] "Rififi"
## [242] "Paris, Texas"
## [243] "Drishyam"
## [244] "Portrait of a Lady on Fire"
## [245] "Zack Snyder's Justice League"
## [246] "It Happened One Night"
## [247] "Tangerines"
## [248] "The Battle of Algiers"
## [249] "Drishyam"
## [250] "A Silent Voice: The Movie"
```

---

```r
titles <- page %>%
  html_nodes(".titleColumn a") %>%
  html_text()
```

---

```r
str(titles)
```

```
##  chr [1:250] "The Shawshank Redemption" "The Godfather" ...
```

---
class: inverse middle

.font50[Scrape years]

---

```r
page %>%
  html_nodes(".secondaryInfo") %>%
  html_text()
```

```
##   [1] "(1994)" "(1972)" "(1974)" "(2008)" "(1957)" "(1993)" "(2003)" "(1994)"
##   [9] "(1966)" "(2001)" "(1999)" "(1994)" "(2010)" "(2002)" "(1980)" "(1999)"
##  [17] "(1990)" "(1975)" "(1954)" "(1995)" "(1997)" "(2002)" "(1991)" "(1946)"
##  [25] "(1998)" "(1977)" "(1999)" "(2001)" "(2014)" "(2019)" "(1994)" "(1962)"
##  [33] "(1994)" "(1995)" "(2002)" "(1991)" "(1985)" "(1998)" "(1936)" "(2000)"
##  [41] "(1960)" "(2006)" "(1931)" "(2014)" "(2011)" "(1988)" "(2006)" "(1968)"
##  [49] "(1942)" "(1988)" "(1954)" "(1979)" "(1979)" "(2000)" "(1940)" "(1981)"
##  [57] "(2012)" "(2006)" "(1957)" "(2020)" "(2008)" "(2019)" "(1980)" "(2018)"
##  [65] "(1950)" "(1957)" "(2018)" "(2003)" "(1997)" "(1964)" "(2012)" "(1984)"
##  [73] "(2016)" "(1986)" "(2017)" "(2019)" "(2018)" "(1999)" "(1995)" "(1963)"
##  [81] "(1995)" "(1981)" "(2009)" "(1984)" "(2009)" "(1997)" "(1983)" "(2007)"
##  [89] "(1992)" "(1968)" "(2000)" "(2012)" "(1958)" "(1931)" "(2004)" "(1941)"
##  [97] "(2016)" "(1985)" "(1952)" "(1921)" "(1948)" "(1987)" "(1952)" "(2000)"
## [105] "(1959)" "(1983)" "(1971)" "(2019)" "(2010)" "(1976)" "(2011)" "(2010)"
## [113] "(1973)" "(1962)" "(2001)" "(1927)" "(1960)" "(1965)" "(1944)" "(1962)"
## [121] "(2009)" "(1989)" "(1995)" "(1997)" "(2018)" "(1988)" "(2005)" "(1961)"
## [129] "(1975)" "(1950)" "(2004)" "(2020)" "(1997)" "(1985)" "(1992)" "(1959)"
## [137] "(2004)" "(1950)" "(1995)" "(2013)" "(2001)" "(1963)" "(2006)" "(2007)"
## [145] "(2009)" "(1961)" "(1998)" "(1980)" "(1988)" "(1948)" "(1954)" "(2010)"
## [153] "(2017)" "(1925)" "(2005)" "(1974)" "(2007)" "(2005)" "(2015)" "(1982)"
## [161] "(1980)" "(1957)" "(1999)" "(2011)" "(1993)" "(2019)" "(1996)" "(1998)"
## [169] "(1939)" "(2003)" "(1979)" "(2003)" "(2003)" "(1957)" "(1982)" "(1996)"
## [177] "(1957)" "(2014)" "(1953)" "(2008)" "(2015)" "(1949)" "(1954)" "(1978)"
## [185] "(1993)" "(1995)" "(2009)" "(2014)" "(2002)" "(2014)" "(2016)" "(2013)"
## [193] "(1966)" "(1924)" "(2018)" "(1998)" "(1975)" "(1926)" "(1942)" "(2019)"
## [201] "(2010)" "(2013)" "(1996)" "(1939)" "(1978)" "(2015)" "(1989)" "(2004)"
## [209] "(2011)" "(1986)" "(1959)" "(1976)" "(2009)" "(2016)" "(1971)" "(1967)"
## [217] "(2017)" "(1986)" "(1953)" "(2013)" "(2007)" "(1995)" "(1928)" "(1959)"
## [225] "(1979)" "(2015)" "(2012)" "(2004)" "(2000)" "(2001)" "(1966)" "(1976)"
## [233] "(2020)" "(1984)" "(1940)" "(2004)" "(2000)" "(1988)" "(2018)" "(2006)"
## [241] "(1955)" "(1984)" "(2013)" "(2019)" "(2021)" "(1934)" "(2013)" "(1966)"
## [249] "(2015)" "(2016)"
```

---

```r
page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") %>%
  str_remove("\\)") %>%
  as.numeric()
```

```
##   [1] 1994 1972 1974 2008 1957 1993 2003 1994 1966 2001 1999 1994 2010 2002 1980
##  [16] 1999 1990 1975 1954 1995 1997 2002 1991 1946 1998 1977 1999 2001 2014 2019
##  [31] 1994 1962 1994 1995 2002 1991 1985 1998 1936 2000 1960 2006 1931 2014 2011
##  [46] 1988 2006 1968 1942 1988 1954 1979 1979 2000 1940 1981 2012 2006 1957 2020
##  [61] 2008 2019 1980 2018 1950 1957 2018 2003 1997 1964 2012 1984 2016 1986 2017
##  [76] 2019 2018 1999 1995 1963 1995 1981 2009 1984 2009 1997 1983 2007 1992 1968
##  [91] 2000 2012 1958 1931 2004 1941 2016 1985 1952 1921 1948 1987 1952 2000 1959
## [106] 1983 1971 2019 2010 1976 2011 2010 1973 1962 2001 1927 1960 1965 1944 1962
## [121] 2009 1989 1995 1997 2018 1988 2005 1961 1975 1950 2004 2020 1997 1985 1992
## [136] 1959 2004 1950 1995 2013 2001 1963 2006 2007 2009 1961 1998 1980 1988 1948
## [151] 1954 2010 2017 1925 2005 1974 2007 2005 2015 1982 1980 1957 1999 2011 1993
## [166] 2019 1996 1998 1939 2003 1979 2003 2003 1957 1982 1996 1957 2014 1953 2008
## [181] 2015 1949 1954 1978 1993 1995 2009 2014 2002 2014 2016 2013 1966 1924 2018
## [196] 1998 1975 1926 1942 2019 2010 2013 1996 1939 1978 2015 1989 2004 2011 1986
## [211] 1959 1976 2009 2016 1971 1967 2017 1986 1953 2013 2007 1995 1928 1959 1979
## [226] 2015 2012 2004 2000 2001 1966 1976 2020 1984 1940 2004 2000 1988 2018 2006
## [241] 1955 1984 2013 2019 2021 1934 2013 1966 2015 2016
```

---

```r
years <- page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") %>%
  str_remove("\\)") %>%
  as.numeric()
```

---
class: middle inverse

.font50[Scrape ratings]

---

```r
ratings <- page %>%
  html_nodes("strong") %>%
  html_text() %>%
  as.numeric()
```

---

```r
imdb_top_250 <- tibble(
  title = titles,
  year = years,
  rating = ratings
)
```

---

```r
imdb_top_250 %>%
  group_by(year) %>%
  summarize(avg_rating = mean(rating)) %>%
  arrange(desc(avg_rating))
```

```
## # A tibble: 85 x 2
##     year avg_rating
##    <dbl>      <dbl>
##  1  1972       9.1
##  2  1994       8.76
##  3  1946       8.6
##  4  1977       8.6
##  5  1990       8.6
##  6  1974       8.55
##  7  1991       8.55
##  8  1936       8.5
##  9  2008       8.5
## 10  2002       8.48
## # … with 75 more rows
```

---

```r
imdb_top_250 %>%
  filter(year == 1972)
```

```
## # A tibble: 1 x 3
##   title          year rating
##   <chr>         <dbl>  <dbl>
## 1 The Godfather  1972    9.1
```
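
---

`html_attr()` is listed with the other rvest functions but is not used in the scrape. Here is a minimal sketch of how it could pull each movie's link, run on a small hand-written HTML fragment instead of the live page. The fragment, the names `demo_html`, `demo_page`, `title_nodes`, and `demo_movies`, and the 9.2 rating are assumptions for illustration. The live page's markup may differ or change over time.

```r
library(rvest)
library(stringr)
library(tibble)

# Hand-written HTML that imitates two rows of the chart.
# This is an assumption for illustration, not a copy of the IMDb source.
demo_html <- '
<table>
  <tr>
    <td class="titleColumn">
      <a href="/title/tt0111161/">The Shawshank Redemption</a>
      <span class="secondaryInfo">(1994)</span>
    </td>
    <td><strong>9.2</strong></td>
  </tr>
  <tr>
    <td class="titleColumn">
      <a href="/title/tt0068646/">The Godfather</a>
      <span class="secondaryInfo">(1972)</span>
    </td>
    <td><strong>9.1</strong></td>
  </tr>
</table>'

# read_html() also accepts a literal HTML string, so no network access is needed
demo_page <- read_html(demo_html)

# Select the title links once, then reuse them for the text and the href
title_nodes <- demo_page %>%
  html_nodes(".titleColumn a")

demo_movies <- tibble(
  title = title_nodes %>% html_text(),
  link = title_nodes %>% html_attr("href"),
  year = demo_page %>%
    html_nodes(".secondaryInfo") %>%
    html_text() %>%
    str_remove_all("[()]") %>%  # drop both parentheses in one step
    as.numeric(),
  rating = demo_page %>%
    html_nodes("strong") %>%
    html_text() %>%
    as.numeric()
)

demo_movies
```

The same `html_attr("href")` call could be added to the `titles` pipeline to keep a link column next to each title, provided the selector still matches the page.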