New submission from Larry Trammell:
== The Problem ==
I have observed a "loss of data" problem using the Python SAX parser when
processing an oversize but very simple machine-generated XHTML file. The file
represents a single N x 11 data table. W3C "tidy" reports
Larry Trammell added the comment:
Not a bug, strictly speaking... more like user abuse.
The parsers (expat as well as SAX) must be able to return content text as a
sequence of pieces when necessary. For example, as a text sequence interrupted
by grouping or styling tags (like <b> or <i>). Or
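A minimal runnable sketch of the behavior described above (the input document and handler here are illustrative, not from the original report): even in a tiny input, inline markup splits one visible run of text into several characters() calls.

import xml.sax
from xml.sax.handler import ContentHandler

class ChunkLogger(ContentHandler):
    """Print each piece of character data exactly as the parser delivers it."""
    def characters(self, content):
        print(repr(content))      # one run of text, several calls

xml.sax.parseString(b"<p>one <b>two</b> three</p>", ChunkLogger())
# Typical output: 'one ', 'two', ' three' -- three pieces, not one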
Larry Trammell added the comment:
I can't find any real errors in documentation. There are subtle design and
implementation decisions that result in unexpected rare side effects. After
processing hundreds of thousands of lines one way, why would the parser
suddenly decide to process text differently?
Larry Trammell added the comment:
Assuming that my understanding is completely correct, the situation is that the
XML parser has an unspecified behavior. This is true in any text content
handler, at any time, and applies to the expat parser as well as SAX. In some
rare cases, the behavior
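The same point can be made with the low-level expat parser; a sketch (here an entity reference stands in for one of the several things that can interrupt character data):

from xml.parsers import expat

pieces = []
parser = expat.ParserCreate()
parser.buffer_text = False                    # the default: chunks are not coalesced
parser.CharacterDataHandler = pieces.append   # collect raw chunks as delivered
parser.Parse(b"<r>A&amp;B</r>", True)
print(pieces)   # typically ['A', '&', 'B'] -- one logical text node, three chunks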
Larry Trammell added the comment:
Sure... I'll cut and paste some of the text I was organizing to go into a
possible new issue page.
The only relevant documentation I could find was in the "xml.sax.handler" page
in the Python 3.9.2 Documentation for the Python Standard Library.
Larry Trammell added the comment:
Great minds think alike I guess...
I was thinking of a much smaller carryover size... maybe 1K. With individual
text blocks longer than that, the user will almost certainly be dealing with
collecting and aggregating content text anyway, and in that case
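For reference, the aggregation pattern being alluded to might look like this sketch (the element name "td" and the sample row are hypothetical, chosen to echo the table scenario; none of this is from the original report):

import xml.sax
from xml.sax.handler import ContentHandler

class CellCollector(ContentHandler):
    """Accumulate characters() pieces and join them when a cell ends."""
    def __init__(self):
        super().__init__()
        self._buf = []
        self.cells = []

    def startElement(self, name, attrs):
        if name == "td":           # start of a cell: begin a fresh buffer
            self._buf = []

    def characters(self, content):
        self._buf.append(content)  # may be called many times per cell

    def endElement(self, name):
        if name == "td":           # end of a cell: join the pieces
            self.cells.append("".join(self._buf))

handler = CellCollector()
xml.sax.parseString(b"<tr><td>one <b>two</b></td><td>three</td></tr>", handler)
print(handler.cells)               # ['one two', 'three']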
Larry Trammell added the comment:
Oh, and whether this affects only content text...
I would presume so, but I don't know how to tell for sure. Unspecified
behaviors can be very mysterious!
Larry Trammell added the comment:
I think the existing ContentHandler.characters(content) documentation DOES say
that the text can come back in chunks... but it is subtle. It might be
possible to say more explicitly that any content, no matter how small, is
allowed to be returned as any number of pieces.
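One way to make that more explicit in the docs would be to show the failure mode directly; a sketch (handler and input are illustrative): a handler that treats each characters() call as a complete text node silently drops the earlier pieces.

import xml.sax
from xml.sax.handler import ContentHandler

class NaiveHandler(ContentHandler):
    """WRONG: assumes characters() always delivers a whole text node."""
    def __init__(self):
        super().__init__()
        self.last_text = None

    def characters(self, content):
        self.last_text = content   # overwrites any earlier pieces

naive = NaiveHandler()
xml.sax.parseString(b"<p>one <i>two</i> three</p>", naive)
print(repr(naive.last_text))       # ' three' -- 'one ' and 'two' were lost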
Larry Trammell added the comment:
If there were a decision NOT TO FIX... maybe then it would make sense to
consider documentation patches at a higher priority. That way, SAX-Python (and
expat-Python) tutorials across the Web could start patching their presentations
accordingly.
Larry Trammell added the comment:
Eric, now that you know as much as I do about the nature and scope of the
peculiar parsing behavior, do you have any suggestions about how to proceed
from here?
New submission from Larry Trammell:
Issue 43483 was posted as a "bug" but retracted. Though the problem is real,
it is tricky to declare an UNSPECIFIED behavior to be a bug. See that issue
page for more discussion and a test case. A brief overview is repeated here.
SCENARIO - X
New submission from Larry Trammell:
With reference to improvement issue 43560:
If those improvements remain unimplemented, or are demoted to "don't fix",
users are left in the tricky situation where XML parsing applications can fail,
apparently "losing content" in rare cases.
Larry Trammell added the comment:
Check out issues:
43560 (an enhancement issue to improve handling of small XML content chunks)
43561 (a documentation issue to warn users about the hazard in the interim
before the changes are implemented)