[issue7311] Bug on regexp of HTMLParser

Ezio Melotti Tue, 05 Apr 2011 11:52:01 -0700

Ezio Melotti <[email protected]> added the comment:

With 3.2 the situation is more complicated because there is a strict and a 
non-strict (tolerant) mode.
The strict mode uses:
attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')


and the tolerant mode uses:
attrfind_tolerant = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')

This means that the strict mode doesn't allow valid non-ASCII chars, and that 
tolerant mode is a little too permissive.
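
A minimal sketch of both problems, using the two regexes exactly as quoted 
above. The sample attribute strings are made up for illustration and do not 
come from the tracker issue or the test suite:

```python
import re

# Both patterns are copied verbatim from the message above.
attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')
attrfind_tolerant = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')

# Hypothetical unquoted attribute value containing a non-ASCII character.
non_ascii = ' title=café'
strict_value = attrfind.match(non_ascii).group(3)
tolerant_value = attrfind_tolerant.match(non_ascii).group(3)
print(repr(strict_value))    # the strict class stops before the 'é'
print(repr(tolerant_value))  # the tolerant class keeps the whole value

# Hypothetical unquoted value with a stray quote and '<' in it.
odd = ' href=a"b<c'
permissive_value = attrfind_tolerant.match(odd).group(3)
print(repr(permissive_value))  # everything up to whitespace or '>' is accepted
```

In the first case the strict regex silently truncates the value, while in the 
second the tolerant regex accepts characters that should not appear in an 
unquoted value.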

The attached patch changes the strict regex to be more permissive and leaves 
the tolerant regex unchanged. The differences between the two are now so small 
that the tolerant version could be removed, except that re.search is used 
instead of re.match when the tolerant regex is used.
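
The practical effect of search versus match can be sketched as follows. This 
is only an illustration on a made-up input string, not the parser's actual 
calling code:

```python
import re

# Tolerant pattern copied verbatim from the message above.
attrfind_tolerant = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')

# Hypothetical input with a junk character before the attribute.
text = '/ class=x'

# match() is anchored at the start position, so the leading '/' makes it fail.
anchored = attrfind_tolerant.match(text)

# search() scans forward and finds the attribute after the junk.
scanned = attrfind_tolerant.search(text)

print(anchored)
print(scanned.group(1), scanned.group(3))
```

With match the junk character stops attribute parsing at that point, whereas 
search skips over it and picks up the next thing that looks like an attribute.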

----------
nosy: +r.david.murray
Added file: http://bugs.python.org/file21545/issue7311-3.diff

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue7311>
_______________________________________