[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2011-07-26 Thread Matt Basta

Matt Basta added the comment:

The number of problems produced by this bug can be greatly reduced by adding a 
relatively small check to the parser. Currently, 

2011-07-27 Thread Matt Basta

Matt Basta added the comment:

> So I think the example is invalid (should escape the <), and that HTMLParser 
> is not buggy.

On the other hand, the HTML5 spec clearly dictates otherwise:

http://www.w3.org/TR/html5/syntax.html#cdata-rcdata-restrictions
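The text in raw text and RCDATA elements must not contain any occurrences of 
the string "</" (U+003C LESS-THAN SIGN, U+002F SOLIDUS) followed by characters 
that case-insensitively match the tag name of the element followed by one of 
U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED 
(FF), U+000D CARRIAGE RETURN (CR), U+0020 SPACE, U+003E GREATER-THAN SIGN (>), 
or U+002F SOLIDUS (/).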
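
To make the difference between the two rules concrete, here is a minimal sketch 
in Python. It is not the patch attached to this issue, and the function names 
are hypothetical. The HTML4 variant reflects my reading of that spec (content 
ends at the first "</" followed by a letter); the HTML5 variant follows the 
wording quoted above.

```python
import re


def raw_text_end_html4(text):
    """Index where script content ends under the HTML4 reading, or -1.

    Assumption: content ends at the first "</" followed by a letter,
    whatever tag name comes after it.
    """
    match = re.search(r"</[a-zA-Z]", text)
    return match.start() if match else -1


def raw_text_end_html5(text, tag):
    """Index where raw text content ends under the HTML5 wording, or -1.

    Content ends only at "</" + tag name (case-insensitive) followed by
    tab, LF, FF, CR, space, ">" or "/".
    """
    pattern = re.compile(r"</" + re.escape(tag) + r"[\t\n\f\r />]", re.IGNORECASE)
    match = pattern.search(text)
    return match.start() if match else -1


# Hypothetical script body containing a closing tag inside a string literal.
body = 'var s = "<span></span>";</script>trailing text'
html4_end = raw_text_end_html4(body)
html5_end = raw_text_end_html5(body, "script")
```

Under the HTML4 reading the content is cut short at the "</span>" inside the 
string literal, while under the HTML5 wording it runs up to the real closing 
script tag.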


Additionally, no browsers (except perhaps in quirks mode) currently obey the 
HTML4 variant of the rule. This is due in large part to the need to include 
strings such as "" within a script tag itself. This behavior can be observed 
firsthand by loading this snippet in a browser:

<span></span>This should not be visible.

--

___
Python tracker 
<http://bugs.python.org/issue670664>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



2011-07-27 Thread Matt Basta

Matt Basta added the comment:

> Yes, but we don't claim to support HTML5 yet.

There's no claim in the docs or the source that HTMLParser specifically 
adheres to HTML4, either.

Ideally, the parser should strive for parity with the functionality of major 
web browsers, as they are the de facto standard for HTML parser behavior. All 
of the browsers on my machine, for instance, will even parse the following 
snippet with the behavior described in the HTML5 spec:


http://www.w3.org/TR/html4/strict.dtd";>
<span></span>This should not be visible.


Even in pre-HTML5 browsers, this is the way that HTML gets parsed. For the heck 
of it, I downloaded an old copy of Firefox 2.0 and ran the above snippet. The 
behavior is consistent.
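
For anyone who wants to compare the standard-library parser against this 
browser behavior, here is a rough probe. It assumes Python 3, where the module 
is named html.parser. The ScriptProbe class, the probe() helper and the SAMPLE 
markup are all hypothetical names made up for this sketch, not part of any 
patch on this issue. The probe records which start tags the parser reports and 
which text it sees inside and outside the script element.

```python
from html.parser import HTMLParser


class ScriptProbe(HTMLParser):
    """Records start tags and separates script text from other text."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.script_chunks = []
        self.other_chunks = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Data may arrive in several chunks, so collect and join later.
        target = self.script_chunks if self._in_script else self.other_chunks
        target.append(data)


def probe(markup):
    parser = ScriptProbe()
    parser.feed(markup)
    parser.close()
    return (parser.start_tags,
            "".join(parser.script_chunks),
            "".join(parser.other_chunks))


# Hypothetical sample: a closing tag inside a string literal in a script.
SAMPLE = '<script>var s = "<span></span>";</script>Text after the script.'
start_tags, script_text, other_text = probe(SAMPLE)
```

A parser that follows the HTML5 rule should report only the script start tag 
and hand the whole string literal back as script text. A parser that follows 
the HTML4 rule would be expected to end the script content early at the 
"</span>".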

While I would otherwise agree that keeping to the HTML4 spec is the right thing 
to do, this is a quirk of the spec that is not only ignored by browsers (as can 
be seen in FX2) and changed in a later version of the spec, but is also causing 
problems for a good number of developers.

It could be argued that the patch is a far more elegant solution for Beautiful 
Soup developers than the workaround in msg88864.




2011-08-01 Thread Matt Basta

Matt Basta added the comment:

Seeing as everyone seems pretty satisfied with the 2.7 version, I'd be happy to 
put together a patch for 3 as well.

To confirm, though, this fix is NOT going behind the strict parameter, correct?
