Peng Yu wrote:
> On Wed, Nov 25, 2009 at 12:19 AM, cls59 <ch...@sharpsteen.net> wrote:
>>
>> Peng Yu wrote:
>>> I'm interested in parsing an html page. I should use XML, right? Could
>>> somebody show me some example code? Is there a tutorial for this
>>> package?
>>>
>> Did you try looking through the help pages for the XML package or browsing
>> the Omegahat website?
>>
>> Look at:
>>
>> library(XML)
>> ?htmlTreeParse
>>
>> And the relevant web page for documentation and examples is:
>>
>> http://www.omegahat.org/RSXML/
>
>
> http://www.omegahat.org/RSXML/shortIntro.html
>
> I'm trying the example on the above webpage. But I'm not sure why I
> got the following error. Would you help take a look?
>
>
> $ Rscript main.R
> > library(XML)
> >
> > download.file('http://www.omegahat.org/RSXML/index.html','index.html')
> trying URL 'http://www.omegahat.org/RSXML/index.html'
> Content type 'text/html; charset=ISO-8859-1' length 3021 bytes
> opened URL
> ==================================================
> downloaded 3021 bytes
>
> > doc = xmlInternalTreeParse("index.html")

You are trying to parse an HTML document as if it were XML, but HTML
is often not well-formed, so the strict XML parser rejects it. Use
htmlParse() instead, which is a more forgiving parser. Alternatively,
use the RTidyHTML package (www.omegahat.org/RTidyHTML), an interface
to libtidy, to make the HTML well-formed before passing it to
xmlTreeParse() (aka xmlInternalTreeParse()).

  D.

> Opening and ending tag mismatch: dd line 68 and dl
> Opening and ending tag mismatch: li line 67 and body
> Opening and ending tag mismatch: dt line 66 and html
> Premature end of data in tag dd line 64
> Premature end of data in tag li line 63
> Premature end of data in tag dt line 62
> Premature end of data in tag dl line 61
> Premature end of data in tag body line 5
> Premature end of data in tag html line 1
> Error: 1: Opening and ending tag mismatch: dd line 68 and dl
> 2: Opening and ending tag mismatch: li line 67 and body
> 3: Opening and ending tag mismatch: dt line 66 and html
> 4: Premature end of data in tag dd line 64
> 5: Premature end of data in tag li line 63
> 6: Premature end of data in tag dt line 62
> 7: Premature end of data in tag dl line 61
> 8: Premature end of data in tag body line 5
> 9: Premature end of data in tag html line 1
> Execution halted
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
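
For illustration, here is a minimal sketch of the htmlParse() suggestion above. It assumes the XML package is installed. The HTML string is a made-up stand-in for index.html (not the real page), with unclosed <dt>, <dd> and <li> tags of the kind the error messages complain about, so the example runs without a download.

```r
library(XML)

# Hypothetical stand-in for index.html: <dt>, <dd> and <li> are never closed.
html <- "<html><body><dl><dt>Docs<dd><li>Intro<li>FAQ</dl></body></html>"

# The strict XML parser refuses the document; capture the error
# instead of letting it halt the script.
strict <- tryCatch(xmlParse(html, asText = TRUE), error = function(e) e)
if (inherits(strict, "error")) cat("xmlParse() failed on the malformed input\n")

# The HTML parser repairs the tree while reading it.
doc <- htmlParse(html, asText = TRUE)

# Query the repaired tree, e.g. the text of every <li> element.
items <- xpathSApply(doc, "//li", xmlValue)
print(items)
```

For the downloaded file, the corresponding call would be doc = htmlParse("index.html") in place of xmlInternalTreeParse("index.html").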