[Python-Dev] sgmllib Comments

2006-06-11 Thread Sam Ruby
Planet is a feed aggregator written in Python.  It depends heavily on 
SGMLLib.  A recent bug report turned out to be a deficiency in sgmllib, 
and I've submitted a test case and a patch[1] (use or discard the patch, 
it is the test that I care about).

While looking around, a few things surfaced.  For starters, it would 
seem that the version of sgmllib in SVN HEAD will selectively unescape 
certain character references that might appear in an attribute.  I say 
selectively, as:

  * it will unescape  &
  * it won't unescape ©
  * it will unescape  &
  * it won't unescape &
  * it will unescape  ’
  * it won't unescape ’

There are a number of issues here.  While not unescaping anything is 
suboptimal, at least the recipient is aware of exactly which characters 
have been unescaped (i.e., none of them).  The proposed solution makes 
it impossible for the recipient to know which characters are unescaped, 
and which are original.  (Note: feeds often contain such abominations as 
© which the new code will treat indistinguishably from ©)

Additionally, there is a unicode issue here - one that is shared by 
handle_charref, but at least that method is overrideable.  If unescaping 
remains, do it for hex character references and for values greather than 
8-bits, i.e., use unichr instead of chr if the value is greater than 127.

- Sam Ruby

[1] http://tinyurl.com/j4a6n
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] sgmllib Comments

2006-06-11 Thread Sam Ruby
Fred L. Drake, Jr. wrote:
> On Sunday 11 June 2006 16:26, Sam Ruby wrote:
>  > Planet is a feed aggregator written in Python.  It depends heavily on
>  > SGMLLib.  A recent bug report turned out to be a deficiency in sgmllib,
>  > and I've submitted a test case and a patch[1] (use or discard the patch,
>  > it is the test that I care about).
> 
> And it's a nice aggregator to use, indeed!
> 
>  > While looking around, a few things surfaced.  For starters, it would
>  > seem that the version of sgmllib in SVN HEAD will selectively unescape
>  > certain character references that might appear in an attribute.  I say
>  > selectively, as:
>  >
>  >   * it will unescape  &
>  >   * it won't unescape ©
>  >   * it will unescape  &
>  >   * it won't unescape &
>  >   * it will unescape  ’
>  >   * it won't unescape ’
> 
> And just why would you use sgmllib to handle RSS or ATOM feeds?  Neither is 
> defined in terms of SGML.  The sgmllib documentation also notes that it isn't 
> really a fully general SGML parser (it isn't), but that it exists primarily 
> as a foundation for htmllib.

The feed itself is read first with SAX (then with a fallback using 
sgmllib if the feed is not well formed, but that's beside the point). 
Then the embedded HTML portions are then processed with subclasses of 
sgmllib.

>  > There are a number of issues here.  While not unescaping anything is
>  > suboptimal, at least the recipient is aware of exactly which characters
>  > have been unescaped (i.e., none of them).  The proposed solution makes
>  > it impossible for the recipient to know which characters are unescaped,
>  > and which are original.  (Note: feeds often contain such abominations as
>  > © which the new code will treat indistinguishably from ©)
> 
> My suspicion is that the "right" thing to do at the sgmllib level is to 
> categorize the markup and call a method depending on what the entity 
> reference is, and let that handle whatever it is.  For SGML, that means we 
> have things like &name; (entity references), { (character references), 
> and that's it.  ģ isn't legal SGML under any circumstance; 
> the "&#x;" syntax was introduced with XML.

... but it effectively is valid HTML.  And as you point out below 
sgmllib's raison d’être is to support htmllib.

>  > Additionally, there is a unicode issue here - one that is shared by
>  > handle_charref, but at least that method is overrideable.  If unescaping
>  > remains, do it for hex character references and for values greather than
>  > 8-bits, i.e., use unichr instead of chr if the value is greater than 127.
> 
> For SGML, it's worse than that, since the document character set is defined 
> in 
> the SGML declaration, which is a far hairier beast than an XML 
> declaration.  :-)

understood

> It really sounds like sgmllib is the wrong foundation for this.  While the 
> module has some questionable behaviors, none of them are signifcant in the 
> context it's intended context (support for htmllib).  Now, I understand that 
> RSS has historical issues, with HTML-as-practiced getting embedded as payload 
> data with various flavors of escaping applied, and I'm not an expert in the 
> details of that.  Have you looked at HTMLParser as an alternate to sgmllib?  
> It has better support for XHTML constructs.

HTMLParser is less forgiving, and generally less suitable for consuming 
HTML as practiced.

- Sam Ruby

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] sgmllib Comments

2006-06-11 Thread Sam Ruby
Terry Reedy wrote:
> "Fred L. Drake, Jr." <[EMAIL PROTECTED]> wrote in message 
> news:[EMAIL PROTECTED]
>> On Sunday 11 June 2006 16:26, Sam Ruby wrote:
>>> Planet is a feed aggregator written in Python.  It depends heavily on
>>> SGMLLib.  A recent bug report turned out to be a deficiency in sgmllib,
>>> and I've submitted a test case and a patch[1] (use or discard the 
>>> patch,
>>> it is the test that I care about).
> ...
>>> and which are original.  (Note: feeds often contain such abominations 
>>> as
>>> &copy; which the new code will treat indistinguishably from ©)
> 
>> It really sounds like sgmllib is the wrong foundation for this.
> ...
>> Have you looked at HTMLParser as an alternate to sgmllib?
>> It has better support for XHTML constructs.
> 
> Have you (the OP), checked how related Python projects, such as Mark 
> Pilgrim's feed parser,
> http://www.feedparser.org/
> handle the same sort of input (I have only looked at docs and tests, not 
> code).

Just to be clear: Planet uses Mark's feed parser, which uses SGMLlib.

I'm a committer on that project:

http://sourceforge.net/project/memberlist.php?group_id=112328

I was investigating a bug in sgmllib which affected the feed parser (and 
therefore Planet), and noticed that there were changes in the SVN head 
of Python which broke three feed parser unit tests.

It is my belief that these changes will break other existing users of 
sgmllib.

- Sam Ruby
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] sgmllib Comments

2006-06-11 Thread Sam Ruby
Fred L. Drake, Jr. wrote:
> On Monday 12 June 2006 00:05, Sam Ruby wrote:
>  > Just to be clear: Planet uses Mark's feed parser, which uses SGMLlib.
> 
> Cool.
> 
>  > I was investigating a bug in sgmllib which affected the feed parser (and
>  > therefore Planet), and noticed that there were changes in the SVN head
>  > of Python which broke three feed parser unit tests.
>  >
>  > It is my belief that these changes will break other existing users of
>  > sgmllib.
> 
> This is good to know; thanks for pointing it out.
> 
> If you can summarize the specific changes to sgmllib that cause problems for 
> the feed parser, and identify the tests there that rely on the old behavior, 
> I'll be glad to look at the problems.  I expect to have some time in the next 
> few evenings, so I should be able to look at these soon.
> 
> Is the SourceForge CVS the definitive development source for the feed parser?

Yes: but if you check out the CVS HEAD, you won't see any failures as 
I've committed changes that mitigate the problems I've found.

However, if you get the latest release instead, you will see that feeds 
that contain < & or > in attribute values will get these 
converted to <, &, and > characters instead.  In some cases, this can 
cause problems.  Particularly if the output is reparsed by sgmllib.

Additionally, entity references in the range of  to ÿ will 
cause the released Feed Parser to die with a UnicodeDecodeError.

My workarounds are to re-escape < and > characters, and to escape bare 
ampersands - beyond that I can't really tell for sure which ampersands 
need to be re-escaped, and which ones I should leave as is.

And I first try decoding attributes in the original declared encoding 
and then fall back to iso-8859-1.  If a single attribute value contains 
both non-ASCII utf-8 characters and a numeric character reference above 
€ then this will produce incorrect results.

I also have committed a workaround to the incorrect parsing of 
attributes with quoted markup that I originally reported.

- Sam Ruby
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] sgmllib Comments

2006-06-11 Thread Sam Ruby
Martin v. Löwis wrote:
> 
> Alternatively, a callback function could be provided for character
> references. Unfortunately, the existing callback is unsuitable,
> as it is supposed to do the full processing; this callback should
> return the replacement text. Generally assuming Unicode would be
> wrong, though.
> 
> Would you like to contribute a patch?

If we can agree on the behavior, I would be glad to write up a patch.

It seems to me that the simplest way to proceed would be for the code 
that attempts to resolve character references (both named and numeric) 
in attributes to be isolated in a single method.  Subclasses that desire 
different behavior (including the existing Python 2.4 and prior 
behaviour) could simply override this method.

- Sam Ruby
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] sgmllib Comments

2006-06-12 Thread Sam Ruby
Martin v. Löwis wrote:
> Sam Ruby wrote:
>> If we can agree on the behavior, I would be glad to write up a patch.
>>
>> It seems to me that the simplest way to proceed would be for the code
>> that attempts to resolve character references (both named and numeric)
>> in attributes to be isolated in a single method.  Subclasses that desire
>> different behavior (including the existing Python 2.4 and prior
>> behaviour) could simply override this method.
> 
> In SGML, this is problematic: The named things are not character
> references, they are entity references, and it isn't necessarily
> the case that they expand to a character. For example, &author;
> might expand to "Martin v. Löwis", and &logo; might refer to a
> bitmap image which is unparsed.
> 
> That said, providing a overridable replacement function sounds
> like the right approach. To keep with tradition, I would still
> distinguish between character references and entity references,
> i.e. providing two overridable functions instead. Returning
> None could mean that no replacement is available.
> 
> As for default implementations, I think they should do what
> currently happens: entity references are replaced according to
> entitydefs, character references are replaced to bytes if
> they are smaller than 256.
> 
> Contrary to what others said, it appears that SGML *does*
> support hexadecimal character references, provided that
> the SGML declaraction contains the HCRO definition (which,
> for HTML and XML, is defined as HCRO "&#x"). So it seems
> safe to process hex character references by default (although
> it isn't safe to assume Unicode, IMO).

I don't see why expanding to multiple characters is a problem.

Just so that we have a tracking number and real code to anchor this 
discussion, I've opened the following and attached a patch:

http://python.org/sf/1504676

This implementation does handle multiple character expansions.  It does 
default to exactly what the current code does.  It does *not* currently 
handle hexadecimal character references.

It also does pass all the current sgmllib tests, though I did not 
include any additional tests in this initial patch.

- Sam Ruby
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com