On Tue, May 13, 2008 at 6:06 AM, Per Jessen <[EMAIL PROTECTED]> wrote:
> Shelley wrote:
>
>> I want to know whether there are some good HTML parsers written in
>> PHP.
>>
>> That is,
>> the parser checks whether html tags like table, tr, td, div, dt, dl,
>> dd, script, ul, li, span, h1, h2, etc. are nested correctly.
>> If any tags not matched, just remove them.
>
> Except for the last part, any XML parser will do. Sablotron, xalan,
> libxsl etc.
>
>
> /Per Jessen, Zürich

... except when the HTML is not well-formed XML, which I find is often
the case when accepting input from users. That "last part," as you say,
is kind of essential.

The problem could be as simple as tags that are never closed in HTML
(e.g. <img>, <br>, <hr>), or it could be something much trickier to
clean up: mismatched tags, improper nesting, missing closing tags (some
browsers are too forgiving of an unclosed <td>, <li> or <option>), HTML
entities that are not valid in XML, etc.

In these cases the DOM-type parsers will usually choke on the whole
document. You might be able to salvage something with a stream-based
parser like SAX. (I've never tried it.)

Andrew
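
P.S. A minimal sketch of the failure mode described above, assuming PHP's bundled DOM extension (libxml2) is available. The helper name xml_errors() and both sample strings are made up for illustration and are not from the thread. It parses markup as XML and collects whatever libxml complains about.

```php
<?php
// Hypothetical helper (not a library function): parse $markup as XML and
// return the libxml error messages. An empty array means well-formed.
function xml_errors($markup)
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadXML($markup);

    $messages = array();
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message);
    }

    // Restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Well-formed XML: every tag is closed, no HTML-only entities.
$wellFormed = '<div><p>Hello<br/>world</p></div>';

// Made-up "user input": unclosed <br>, unclosed <b>, and &nbsp;,
// which is an HTML entity that plain XML does not define.
$userInput = '<div><p>Hello<br>world &nbsp; <b>bold</p></div>';

foreach (array($wellFormed, $userInput) as $markup) {
    $errors = xml_errors($markup);
    echo $markup, "\n";
    echo count($errors) === 0 ? "  parses as XML\n" : "  rejected as XML:\n";
    foreach ($errors as $message) {
        echo "    - ", $message, "\n";
    }
}
```

Typical user-submitted HTML fails this check on the first unclosed <br> or undefined entity, which is why an XML parser on its own does not cover the "remove unmatched tags" part of the question.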