rvest
- `robotstxt::paths_allowed(url)` checks whether the site's robots.txt permits bots to access the page (not whether the page is reachable)
- `read_html(url)` fetches and parses the HTML page, returns an XML document object
- `html_table()` gets the tables out of a fetched HTML page, returns a list of tibbles (data frames)
  - Process a table: `mytable %>% filter(X1 == "Version:") %>% pull(X2)`
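- `html_elements()` gets specific elements by CSS selector (or XPath), returns a list-like node set
  - `html_elements(x, "h2")` (by tag name)
  - `html_elements(x, "#current_visitors")` (by id)
  - `html_elements(x, ".data")` (by class)
- `html_element()` works like `html_elements()`, except that it returns a single node (the first match)
- `html_text()` gets the text out of nodes
- rvest doesn't work with dynamically loaded content (it only sees the HTML the server returns, not what JavaScript adds afterwards)
  - Save the fully loaded page from the browser as a local file, then read it with `read_html()`
  - Use another package: `RSelenium`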
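
The functions above can be tied together in a minimal sketch. Everything about the page is an assumption made for illustration: the HTML content, the temporary file standing in for a page saved from the browser, and the variable names. The `robotstxt::paths_allowed()` check is left out because it needs network access. `filter()` and `pull()` come from dplyr, which the table-processing line in these notes appears to rely on.

```r
library(rvest)  # read_html(), html_table(), html_elements(), html_element(), html_text(), %>%
library(dplyr)  # filter(), pull()

# Hypothetical page content. The temp file stands in for a page that was
# saved from the browser after its dynamic content finished loading.
saved_page <- tempfile(fileext = ".html")
writeLines(c(
  "<html><body>",
  "<h2>Release</h2>",
  "<table>",
  "<tr><td>Version:</td><td>1.2.3</td></tr>",
  "<tr><td>License:</td><td>MIT</td></tr>",
  "</table>",
  "<h2>Traffic</h2>",
  "<p id='current_visitors'>42</p>",
  "<p class='data'>first</p>",
  "<p class='data'>second</p>",
  "</body></html>"
), saved_page)

# Parse the local file; read_html() accepts a path as well as a URL.
page <- read_html(saved_page)

# html_table() on a whole document returns a list with one tibble per <table>.
tables  <- html_table(page)
mytable <- tables[[1]]  # the table has no header row, so columns are named X1, X2
version <- mytable %>% filter(X1 == "Version:") %>% pull(X2)

# html_elements() returns every match for the selector.
headings   <- html_elements(page, "h2") %>% html_text()                 # by tag name
visitors   <- html_elements(page, "#current_visitors") %>% html_text()  # by id
data_nodes <- html_elements(page, ".data")                              # by class

# html_element() returns only the first match.
first_data <- html_element(page, ".data") %>% html_text()

print(version)
print(headings)
print(first_data)
```

`html_text()` returns the values as character vectors, so a number such as the visitor count still needs `as.integer()` or similar before it can be used in arithmetic.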