On Tue, May 13, 2008 at 6:06 AM, Per Jessen <[EMAIL PROTECTED]> wrote:
> Shelley wrote:
>
>> I want to know whether there are some good HTML parsers written in
>> PHP.
>>
>> That is,
>> the parser checks whether html tags like table, tr, td, div, dt, dl,
>> dd, script, ul, li, span, h1, h2, etc. are nested correctly.
>> If any tags not matched, just remove them.
>
> Except for the last part, any XML parser will do.  Sablotron, xalan,
> libxsl etc.
>
>
> /Per Jessen, Zürich

... except when the HTML is not well-formed XML, which I find is often
the case when accepting input from users. That "last part," as you
say, is kind of essential. The problem could be as simple as elements
that have no closing tag in HTML (e.g. <img>, <br>, <hr>), or it could
be something much trickier to clean up: mismatched tags, improper
nesting, missing closing tags (HTML allows the end tag of <td>, <li>
or <option> to be omitted, and browsers accept it), HTML entities that
are not defined in XML (e.g. &nbsp;), etc. In these cases, a strict
DOM-type XML parser will usually fail with a parse error. You might be
able to salvage something with an event-based (stream) parser like
SAX. (I've never tried it.)
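
To make the stream-based idea concrete, here is a rough sketch. It is in
Python rather than PHP, purely as an illustration, and uses only the
standard library's html.parser (a lenient, event-driven parser, similar
in spirit to SAX) and xml.etree.ElementTree (a strict XML parser). The
names TagBalancer, VOID_TAGS and clean are made up for this example, and
the repair rule (drop stray closing tags, close whatever is still open)
is one simple policy, not a complete HTML cleaner.

```python
import xml.etree.ElementTree as ET
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML (illustrative subset).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagBalancer(HTMLParser):
    """Re-emit markup with balanced tags; comments and doctypes are dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments, joined at the end
        self.stack = []  # names of the non-void tags currently open

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
        )
        if tag in VOID_TAGS:
            # Write void elements in self-closing form so XML accepts them.
            self.out.append("<%s%s />" % (tag, attr_text))
        else:
            self.stack.append(tag)
            self.out.append("<%s%s>" % (tag, attr_text))

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag with no matching opener: drop it
        # Close every tag opened after the matching one, then the match.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Entities arrive already decoded; re-escape only &, < and >.
        self.out.append(escape(data, quote=False))

    def close(self):
        super().close()  # flush any buffered text first
        while self.stack:
            self.out.append("</%s>" % self.stack.pop())


def clean(html_text):
    """Return html_text with tags balanced (hypothetical helper)."""
    parser = TagBalancer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


sample = '<div><b>bold<br><i>text</div></span> &amp; &nbsp;more'

try:
    ET.fromstring("<root>" + sample + "</root>")
except ET.ParseError as err:
    print("strict XML parser gives up:", err)

# After cleaning, the same strict parser accepts the markup.
tree = ET.fromstring("<root>" + clean(sample) + "</root>")
print([el.tag for el in tree.iter()])
```

Note that this does not know HTML's implied-close rules, so a second
<li> opened before the first is closed ends up nested inside it instead
of becoming a sibling. A real cleaner would need those rules as well.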

Andrew
