[Python-Dev] Is this a bug of the HTMLParser?

2009-11-11 Thread Zhang Chiyuan
Hi all,

I'm using BeautifulSoup to parse an HTML page and found that it refuses to
parse the page. Looking at the backtrace, I found the problem is in
Python's built-in HTMLParser.py. The web page I'm parsing contains some
Chinese characters. There is a tag like <img src=/foo/bar.png alt=中文>;
note that this is a legacy HTML page where the attributes are not quoted.
However, the regexp defined in HTMLParser.py is:

 attrfind = re.compile(
 r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
 r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_...@]*))?')

Note that the character class for unquoted values does not include
Chinese characters (or any other non-ASCII characters), so the parser
raises an error on this tag. I'm not sure whether the HTML standard
allows unquoted non-ASCII characters in attribute values. If it does,
this seems to be a bug, and the regexp would better be [^>\s] IMHO.
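
A minimal sketch of the mismatch described above, for readers who want to reproduce it outside HTMLParser. It is not the library's code: the helper names (make_attrfind, parse_attrs) are hypothetical, and because part of the original character class is elided in the regexp as shown here ("_...@"), ASCII_UNQUOTED below is an assumed stand-in that simply leaves the elided characters out. The permissive class is the [^>\s] suggested above, with a * added for repetition.

```python
import re

# Assumed stand-in for the unquoted-value alternative of attrfind; the
# elided part of the original character class is deliberately left out.
ASCII_UNQUOTED = r'[-a-zA-Z0-9./,:;+*%?!&$\(\)_@]*'
# The replacement suggested in the message, with repetition added.
PERMISSIVE_UNQUOTED = r'[^>\s]*'


def make_attrfind(unquoted):
    """Build an attrfind-style pattern around the given unquoted-value class."""
    return re.compile(
        r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
        r'(\'[^\']*\'|"[^"]*"|' + unquoted + r'))?')


def parse_attrs(pattern, text):
    """Consume attributes from text; return (attrs, unparsed remainder)."""
    attrs, pos = [], 0
    while pos < len(text):
        m = pattern.match(text, pos)
        if not m or m.end() == pos:
            break  # nothing more can be consumed
        attrs.append((m.group(1), m.group(3)))
        pos = m.end()
    return attrs, text[pos:]


# The part of the tag between the tag name and the closing '>'.
tag_body = ' src=/foo/bar.png alt=中文'

strict = parse_attrs(make_attrfind(ASCII_UNQUOTED), tag_body)
loose = parse_attrs(make_attrfind(PERMISSIVE_UNQUOTED), tag_body)

print(strict)  # alt gets an empty value and '中文' is left unparsed
print(loose)   # alt gets '中文' and nothing is left over
```

With the ASCII-only class the value of alt comes back empty and the Chinese text is left over after the attributes; as far as I can tell, that leftover is what the parser then rejects. With [^>\s]* the whole value is consumed.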

BTW: It seems something like:

var st = "<a></";

cannot be parsed. :-/

Re: [Python-Dev] Is this a bug of the HTMLParser?

2009-11-12 Thread Zhang Chiyuan
filed: http://bugs.python.org/issue7311

On Thu, Nov 12, 2009 at 12:24 AM, Michael Foord wrote:

> Hello Zhang Chiyuan,
>
> Can you file a bug on the Python issue tracker please:
>
>   http://bugs.python.org
>
> Thanks
>
> Michael Foord
>
> Zhang Chiyuan wrote:
>
>> Hi all,
>>
>> I'm using BeautifulSoup to parse an HTML page and found that it refuses to
>> parse the page. Looking at the backtrace, I found the problem is in
>> Python's built-in HTMLParser.py. The web page I'm parsing contains some
>> Chinese characters. There is a tag like <img src=/foo/bar.png alt=中文>;
>> note that this is a legacy HTML page where the attributes are not quoted.
>> However, the regexp defined in HTMLParser.py is:
>>
>>  attrfind = re.compile(
>> r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
>> r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_...@]*))?')
>>
>> Note that the character class for unquoted values does not include
>> Chinese characters (or any other non-ASCII characters), so the parser
>> raises an error on this tag. I'm not sure whether the HTML standard
>> allows unquoted non-ASCII characters in attribute values. If it does,
>> this seems to be a bug, and the regexp would better be [^>\s] IMHO.
>>
>> BTW: It seems something like :
>>
>> 
>> var st = "<a></";
>> 
>>
>> can not be parsed. :-/
>>
>> --
>> pluskid
>> http://blog.pluskid.org
>> ___
>> Python-Dev mailing list
>> Python-Dev@python.org
>> http://mail.python.org/mailman/listinfo/python-dev
>> Unsubscribe:
>> http://mail.python.org/mailman/options/python-dev/fuzzyman%40voidspace.org.uk
>>
>>
>
>
> --
> http://www.ironpythoninaction.com/
>
>


-- 
pluskid
http://blog.pluskid.org