[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-12 Thread Larry Trammell
New submission from Larry Trammell: == The Problem == I have observed a "loss of data" problem using the Python SAX parser when processing an oversize but very simple machine-generated XHTML file. The file represents a single N x 11 data table. W3C "tidy" reports

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-13 Thread Larry Trammell
Larry Trammell added the comment: Not a bug, strictly speaking... more like user abuse. The parsers (expat as well as SAX) must be able to return content text as a sequence of pieces when necessary. For example, as a text sequence interrupted by grouping or styling tags (like or ). Or

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-15 Thread Larry Trammell
Larry Trammell added the comment: I can't find any real errors in documentation. There are subtle design and implementation decisions that result in unexpected rare side effects. After processing hundreds of thousands of lines one way, why would the parser suddenly decide to proces

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-17 Thread Larry Trammell
Larry Trammell added the comment: Assuming that my understanding is completely correct, the situation is that the xml parser has an unspecified behavior. This is true in any text content handler, at any time, and applies to the expat parser as well as SAX. In some rare cases, the behavior
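
The same chunked delivery can be worked around at the expat level. The following is a minimal sketch, not taken from the thread: the helper name `collect_text` and the `chunk_size` parameter are hypothetical. It uses pyexpat's documented `buffer_text` attribute, which asks the parser to merge adjacent text where possible, and it still joins the pieces afterwards, since the number of callbacks is not guaranteed.

```python
import xml.parsers.expat


def collect_text(xml_bytes, chunk_size=16):
    """Feed a document to expat in small pieces; return the text pieces received.

    Hypothetical helper for illustration only.
    """
    parser = xml.parsers.expat.ParserCreate()
    # Ask pyexpat to buffer character data so the handler is called less often.
    # This reduces fragmentation but does not promise a single callback.
    parser.buffer_text = True

    pieces = []
    parser.CharacterDataHandler = pieces.append

    # Feed the input in small slices to mimic a large file read block by block.
    for start in range(0, len(xml_bytes), chunk_size):
        parser.Parse(xml_bytes[start:start + chunk_size], False)
    parser.Parse(b"", True)  # signal end of input
    return pieces


text = "".join(collect_text(b"<doc>one &amp; two, three</doc>"))
```

Joining the pieces, rather than trusting any single callback, is the part that stays correct whether or not buffering is enabled.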

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-17 Thread Larry Trammell
Larry Trammell added the comment: Sure... I'll cut and paste some of the text I was organizing to go into a possible new issue page. The only relevant documentation I could find was in the "xml.sax.handler" page in the Python 3.9.2 Documentation for the Python Standard Lib

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-17 Thread Larry Trammell
Larry Trammell added the comment: Great minds think alike I guess... I was thinking of a much smaller carryover size... maybe 1K. With individual text blocks longer than that, the user will almost certainly be dealing with collecting and aggregating content text anyway, and in that case

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-17 Thread Larry Trammell
Larry Trammell added the comment: Oh, and whether this affects only content text... I would presume so, but I don't know how to tell for sure. Unspecified behaviors can be very mysterious!

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-17 Thread Larry Trammell
Larry Trammell added the comment: I think the existing ContentHandler.characters(content) documentation DOES say that the text can come back in chunks... but it is subtle. It might be possible to say more explicitly that any content no matter how small is allowed to be returned as any
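
A minimal sketch of the defensive pattern this implies, assuming a table-like document with `td` cells as in the original report (the class name `CellCollector` and the helper `parse_cells` are hypothetical): `characters()` only accumulates chunks, and the text is joined once the element closes.

```python
import xml.sax


class CellCollector(xml.sax.ContentHandler):
    """Collect the full text of each <td> element, however it is chunked."""

    def __init__(self):
        super().__init__()
        self.cells = []      # completed cell texts
        self._parts = None   # chunks of the currently open cell, else None

    def startElement(self, name, attrs):
        if name == "td":
            self._parts = []

    def characters(self, content):
        # May be called several times for one run of text, so only accumulate.
        if self._parts is not None:
            self._parts.append(content)

    def endElement(self, name):
        if name == "td":
            self.cells.append("".join(self._parts))
            self._parts = None


def parse_cells(xml_bytes):
    """Hypothetical helper: parse a document and return the text of every cell."""
    handler = CellCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.cells


cells = parse_cells(b"<table><tr><td>alpha</td><td>beta &amp; gamma</td></tr></table>")
```

A handler that instead stores only the most recent `characters()` argument works until a cell happens to be split, which matches the apparent "loss of content" described in this thread.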

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-17 Thread Larry Trammell
Larry Trammell added the comment: If there were a decision NOT TO FIX... maybe then it would make sense to consider documentation patches at a higher priority. That way, SAX-Python (and expat-Python) tutorials across the Web could start patching their presentations accordingly

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-18 Thread Larry Trammell
Larry Trammell added the comment: Eric, now that you know as much as I do about the nature and scope of the peculiar parsing behavior, do you have any suggestions about how to proceed from here?

[issue43560] Modify SAX/expat parsing to avoid fragmentation of already-tiny content chunks

2021-03-19 Thread Larry Trammell
New submission from Larry Trammell: Issue 43483 was posted as a "bug" but retracted. Though the problem is real, it is tricky to declare an UNSPECIFIED behavior to be a bug. See that issue page for more discussion and a test case. A brief overview is repeated here. SCENARIO - X

[issue43561] Modify XML parsing library descriptions to forewarn of content loss hazard

2021-03-19 Thread Larry Trammell
New submission from Larry Trammell: With reference to improvement issue 43560: If those improvements remain unimplemented, or are demoted to "don't fix", users are left in the tricky situation where XML parsing applications can fail, apparently "losing content" in

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-19 Thread Larry Trammell
Larry Trammell added the comment: Check out issues 43560 (an enhancement issue to improve handling of small XML content chunks) and 43561 (a documentation issue to give users warning about the hazard in the interim before the changes are implemented