For similar tasks, I usually write a while loop operating on a queue. Conceptually:
    initialize queue with first page
    add first url to harvested urls
    while queue not empty (2)
        unshift url from queue
        collect valid child pages that are not already in harvested list (1)
            add to harvested list
            add to queue
    process all harvested pages

(1) - grep for the base url so you don't leave the site
    - use %in% to ensure you are not caught in a cycle
(2) Restrict the while condition with a maximum number of cycles. More often
    than not, assumptions about how rational the world is turn out to be too
    optimistic.

A minimal sketch in R is appended after the quoted message.

Hope this helps,
B.

> On 2019-04-10, at 04:35, Ilio Fornasero <iliofornas...@hotmail.com> wrote:
>
> Hello.
>
> I am trying to scrape a FAO webpage that includes multiple links, from any
> of which I would like to collect the "News" part.
>
> So far I have done this:
>
> fao_base = 'http://www.fao.org'
> fao_second_level = paste0(stem, '/countryprofiles/en/')
>
> all_children = read_html(fao_second_level) %>%
>     html_nodes(xpath = '//a[contains(@href, "?iso3=")]/@href') %>%
>     html_text %>% paste0(fao_base, .)
>
> Any suggestion on how to go on? I guess with a loop, but I haven't had any
> success yet.
> Thanks
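Below is a minimal sketch of that loop in R, building on your snippet. It uses
rvest, keeps your "?iso3=" XPath, and assumes that 'stem' in your second line
was meant to be fao_base. It is untested against the live site, and
harvest_news at the end is only a placeholder for whatever extraction of the
"News" part you end up writing.

library(rvest)

fao_base         <- "http://www.fao.org"
fao_second_level <- paste0(fao_base, "/countryprofiles/en/")

queue     <- fao_second_level      # pages still to visit
harvested <- fao_second_level      # pages already seen (cycle guard)
max_iter  <- 500                   # hard cap on cycles, see note (2)
iter      <- 0

while (length(queue) > 0 && iter < max_iter) {
    iter <- iter + 1

    # unshift: take the first url off the queue
    url   <- queue[1]
    queue <- queue[-1]

    # collect the hrefs of the country-profile links on this page
    children <- read_html(url) %>%
        html_nodes(xpath = '//a[contains(@href, "?iso3=")]') %>%
        html_attr("href")

    # make relative links absolute, then keep only links on the site (note (1))
    children <- ifelse(grepl("^http", children), children,
                       paste0(fao_base, children))
    children <- unique(children[grepl(fao_base, children, fixed = TRUE)])

    # drop anything already harvested so we are not caught in a cycle (note (1))
    children <- children[!children %in% harvested]

    harvested <- c(harvested, children)
    queue     <- c(queue, children)

    Sys.sleep(1)                   # be polite to the server
}

# afterwards, process all harvested pages, e.g.
# news <- lapply(harvested, harvest_news)   # harvest_news is your own extractor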