Hi Michael

If you just want all of the text that is displayed in the
HTML document, then you can use an XPath expression to get
all the text() nodes and extract their values.

An example is

  library(XML)
  doc = htmlParse("http://www.omegahat.org/")
  txt = xpathSApply(doc, "//body//text()", xmlValue)

The result is a character vector that contains all the text.

By limiting the nodes to the body, we avoid the content in <head>
such as inlined JavaScript or CSS.
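
As a small self-contained illustration of the difference, here is a sketch
that parses an HTML string rather than a URL, so it can be run without
network access. The sample markup and the variable names (html, all_txt,
body_txt) are made up for this example.

```r
library(XML)

# Made-up sample page; the <title> and <style> live in <head>.
html <- paste0(
  "<html><head><title>Page title</title>",
  "<style>p { color: red; }</style></head>",
  "<body><h1>Heading</h1><p>First <b>bold</b> paragraph.</p></body></html>")

doc <- htmlParse(html, asText = TRUE)

all_txt  <- xpathSApply(doc, "//text()", xmlValue)        # includes <head> content
body_txt <- xpathSApply(doc, "//body//text()", xmlValue)  # only text under <body>

# Drop whitespace-only pieces, which can show up between elements.
body_txt <- body_txt[nzchar(trimws(body_txt))]
print(body_txt)
```

Note that the text comes back as one element per text node, so a paragraph
containing inline markup such as <b> is split into several pieces.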

It is also possible that the body of a document contains <script>
elements with JavaScript that you don't want.
You can omit these with

  txt = xpathSApply(doc, "//body//text()[not(ancestor::script)]", xmlValue)

And if there are other elements you want to ignore, then you can use

  txt = xpathSApply(doc,
                    "//body//text()[not(ancestor::script) and not(ancestor::otherElement)]",
                    xmlValue)
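
Since you have a list of pages, the same idea can be wrapped in a function
and applied to each one. The following is only a sketch: pageText and its
skip argument are hypothetical names of my own (not part of the XML
package), and the pages are made-up strings so that it runs offline.

```r
library(XML)

# Hypothetical helper (not part of the XML package): return the text of one
# page, ignoring text inside the elements named in 'skip'.
pageText <- function(src, skip = c("script", "style"), ...) {
  doc <- htmlParse(src, ...)
  # Build e.g. "not(ancestor::script) and not(ancestor::style)"
  cond <- paste(sprintf("not(ancestor::%s)", skip), collapse = " and ")
  txt <- xpathSApply(doc, sprintf("//body//text()[%s]", cond), xmlValue)
  txt <- trimws(as.character(txt))
  txt[nzchar(txt)]          # drop whitespace-only pieces
}

# Made-up pages given as strings so the example runs offline; for real pages,
# pass the URLs instead and drop asText = TRUE.
pages <- c(
  a = "<html><body><p>Hello</p><script>alert('x');</script></body></html>",
  b = "<html><body><div>One</div><div>Two</div></body></html>")

texts <- lapply(pages, function(p)
  tryCatch(pageText(p, asText = TRUE),
           error = function(e) NA_character_))  # keep going if a page fails
```

The result is a list with one character vector per page. The tryCatch() is
there so that a single page that cannot be read or parsed does not stop the
whole run.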


HTH,

 D.


Michael Conklin wrote:
> I would like to be able to submit a list of URLs of various webpages and 
> extract the "content" i.e. not the mark-up of those pages. I can find plenty 
> of examples in the XML library of extracting links from pages but I cannot 
> seem to find a way to extract the text.  Any help would be greatly 
> appreciated - I will not know the structure of the URLs I would submit in 
> advance.  Any suggestions on where to look would be greatly appreciated.
> 
> Mike
> 
> W. Michael Conklin
> Chief Methodologist
> 
> MarketTools, Inc. | www.markettools.com<http://www.markettools.com>
> 6465 Wayzata Blvd | Suite 170 |  St. Louis Park, MN 55426.  PHONE: 
> 952.417.4719 | CELL: 612.201.8978
> This email and attachment(s) may contain confidential and/or proprietary 
> information and is intended only for the intended addressee(s) or its 
> authorized agent(s). Any disclosure, printing, copying or use of such 
> information is strictly prohibited. If this email and/or attachment(s) were 
> received in error, please immediately notify the sender and delete all copies
> 
> 
>       [[alternative HTML version deleted]]
> 
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
