Re: [R] Using Rvest to scrape pages

David Winsemius Sun, 12 Jul 2020 16:30:14 -0700


On 7/12/20 10:42 AM, Tiffany Adekola wrote:

Dear All,

I am just learning how to use R programming. I want to extract reviews
from a page and loop till I extract for all pages:

#specify the first page URL
fpURL <- 'https://wordpress.org/support/plugin/easyrecipe/reviews/'

#read the HTML contents in the first page URL
contentfpURL <- read_html(fpURL)

#identify the anchor tags in the first page URL
fpAnchors <- html_nodes(contentfpURL, css='a.bbp-topic-permalink')

#extract the HREF attribute value of each anchor tag
fpHREF <- html_attr(fpAnchors, 'href')

#create empty lists to store titles & contents found in the HREF
#attribute value of each anchor tag
titles = c()
contents = c()

#loop the following actions for each HREF found on the first page
for (u in fpHREF) {

    #read the HTML content of the review page
    fpURL = read_html(u)

   #identify the title anchor and read the title text
   fpreviewT = html_text(html_nodes(fpURL, css='h1.page-title'))

   #identify the content anchor and read the content text
   fpreviewC = html_text(html_nodes(fpURL, css='div.bbp-topic-content'))

   #store the review titles and contents in the previous lists
   titles = c(titles, fpreviewT)
   contents = c(contents, fpreviewC)
}
#identify the anchor tag pointing to the next summary page
npAnchor <- html_text(html_node(contentfpURL, css='a.next page-numbers'))

#extract the HREF attribute value of the anchor tag pointing to the
#next summary page
npHREF <- html_attr(npAnchor, 'href')

The error is raised by the line above, but if you look at the argument passed to `html_attr` you can see that the problem arises higher up:


str(npAnchor)
# chr NA

Perhaps the problem occurs here:


html_node(contentfpURL, css='a.next page-numbers')
#{xml_missing}
#<NA>
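
A minimal sketch of one possible fix, assuming the "next" link is a single <a> element carrying both classes `next` and `page-numbers` (the selector in the original code suggests this, but the page markup was not checked here). In a CSS selector a space means "descendant", so 'a.next page-numbers' looks for a <page-numbers> element inside a.next and finds nothing; chaining the classes with dots selects one element that has both. The second change is to pass the node itself to `html_attr`, not the result of `html_text`, which is a character vector. The HTML fragment and the example.org URLs below are made up for illustration.

```r
library(rvest)

# Hypothetical stand-in for the pagination markup; the real page may differ.
page <- read_html('<html><body>
  <a class="page-numbers" href="https://example.org/reviews/page/1/">1</a>
  <a class="next page-numbers" href="https://example.org/reviews/page/2/">Next</a>
</body></html>')

# Descendant selector: searches for a <page-numbers> element inside a.next,
# so no node is found and a missing node is returned.
missing <- html_node(page, css = 'a.next page-numbers')

# Compound selector: one <a> element that carries both classes.
npAnchor <- html_node(page, css = 'a.next.page-numbers')

# html_attr() needs the node, not its text, so html_text() is not used here.
npHREF <- html_attr(npAnchor, 'href')
print(npHREF)
```
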

--

David.


#loop the following actions for every next summary page HREF attribute
for (u in npHREF) {

   #specify the URL of the summary page
   spURL <- read_html('npHREF')

   #identify all the anchor tags on that summary page
   spAnchors <- html_nodes(spURL, css='a.bbp-topic-permalink')

   #extract the HREF attribute value of each anchor tag
   spHREF <- html_attr(spAnchors, 'href')

   #loop the following actions for each HREF found on that summary page

    for (u in fpHREF) {
      #read the HTML contents of the review page
      spURL = read_html(u)

       #identify the title anchor and read the title text
       spreviewT = html_text(html_nodes(spURL, css='h1.page-title'))

       #identify the content anchor and read the content text
       spreviewC = html_text(html_nodes(spURL, css='div.bbp-topic-content'))

       #store the review titles and contents in the previous lists
       titles = c(titles, spreviewT)
       contents = c(contents, spreviewC)
       }
}
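
Three further things in the loop above would likely cause trouble once the selector is fixed: `read_html('npHREF')` passes the literal string 'npHREF' and not the URL stored in the variable, the inner loop iterates over `fpHREF` where `spHREF` was probably intended, and `npHREF` holds at most one URL, so the outer `for` loop can visit only one further page. One way to walk all summary pages is a `while` loop that follows the "next" link until none is found. This is only a sketch: the function name `collect_review_links`, its `read_page` and `max_pages` arguments, and the selector 'a.next.page-numbers' are assumptions for illustration, not something confirmed against the site.

```r
library(rvest)

# Hypothetical helper: start at start_url, collect the review links on each
# summary page, and follow the "next" link until there is none.
# read_page defaults to read_html but can be swapped out to try the loop
# on saved HTML; max_pages is a safety cap against endless loops.
collect_review_links <- function(start_url, read_page = read_html,
                                 max_pages = 100) {
  links <- character(0)
  url <- start_url
  pages_read <- 0
  while (!is.na(url) && pages_read < max_pages) {
    page <- read_page(url)
    pages_read <- pages_read + 1

    # review links on this summary page
    anchors <- html_nodes(page, css = 'a.bbp-topic-permalink')
    links <- c(links, html_attr(anchors, 'href'))

    # html_attr() returns NA for a missing node, which ends the loop
    url <- html_attr(html_node(page, css = 'a.next.page-numbers'), 'href')
  }
  links
}
```

The returned vector could then be fed to the review-page loop from the first part of the script, for example `for (u in collect_review_links(fpURL)) { ... }`, provided `fpURL` still holds the URL string at that point.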

I got stuck at the step to extract the HREF attribute value of the
anchor tag pointing to the next summary page with the error: Error in
UseMethod("xml_attr") :
   no applicable method for 'xml_attr' applied to an object of class "character"

  I will appreciate any help with this task.
Thanks in advance.

---Tiffany

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

