[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-12 Thread Larry Trammell


New submission from Larry Trammell:

== The Problem ==

I have observed a "loss of data" problem using the Python SAX parser, when 
processing an oversize but very simple machine-generated xhtml file. The file 
represents a single N x 11 data table.  W3C "tidy" reports no xml errors.  The 
table is constructed in an entirely plausible manner, using table, tr, and td 
tags to define the table structure, and p tags to bracket content, which 
consists of small chunks of quoted text.  There is nothing pathological, no 
extraneous whitespace characters, no empty data fields. 

Everything works perfectly in small test cases.  But when a very large number 
of rows are present, a few characters of content strings are occasionally lost. 
I have observed 2 or 6 characters dropped.  But here's the strange part.  The 
pathological behavior disappears (or moves to another location) when one or 
more non-significant whitespace characters are inserted at an arbitrary 
location early in the file... e.g. an extra linefeed before the first tr tag. 

== Context ==

I have observed identical behavior on desktop systems using an Intel Xeon 
E5-1607 or a Core-2 processor, running 32-bit or 64-bit Linux operating 
systems, variously using Python 3.8.5, 3.8, 3.7.3, and 3.5.1.

== Observing the Problem == 

Sorry that the test data is so bulky (even at 0.5% of original size), but bulk 
appears to be a necessary condition to observe the problem. Run the following 
command line.  

python3 EnchXMLTest.py EnchTestData.html

The test script invokes the SAX parser and generates messages on stdout. Using 
the original test data as provided, the test should run correctly to 
completion.  Now modify the test data file, deleting the extraneous comment 
line (there is only one) found near the top of the file.  Repeat the test run, 
and this time look for missing content characters in parsed content fields of 
the last record.  
 
== Any guesses? ==

Beyond "user is oblivious," possibly something abnormal can occur at seams 
between large blocks of buffered text.  The presence or absence of an extra 
character early in the data stream results in a corresponding shift in content 
location at the end of the buffer.  Other clues: is it relevant that the 
problem appears in a string field that contains slash characters?

--
components: XML
files: EnchSAXTest.zip
messages: 388582
nosy: ridgerat1611
priority: normal
severity: normal
status: open
title: Loss of content in simple (but oversize) SAX parsing
type: behavior
versions: Python 3.7, Python 3.8
Added file: https://bugs.python.org/file49872/EnchSAXTest.zip

___
Python tracker <https://bugs.python.org/issue43483>



[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-13 Thread Larry Trammell


Larry Trammell added the comment:

Not a bug, strictly speaking... more like user abuse.

The parsers (expat as well as SAX) must be able to return content text as a 
sequence of pieces when necessary -- for example, a text run interrupted by 
grouping or styling tags (like <b> or <i>), or an extensive text block 
subdivided for efficient processing.  Users would expect hazards like these 
and be wary.  But how many users would suspect that a quoted string only 8 
characters long could be returned in multiple pieces?  Or that an entity 
notation could be split down the middle?  Virtually all existing tutorial 
examples showing content extraction are WRONG -- because the ONLY content that 
can be trusted is content filtered through some kind of aggregator object.  
How many users will know this instinctively?

It would be very useful for the parser systems to provide some kind of support 
for a text-aggregation function.  A guarantee that "small contiguous" text 
items will not be chopped might also be helpful.
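
To make the aggregator idea concrete, here is a minimal sketch (the class and 
element names are mine, purely illustrative -- not an existing or proposed 
API).  It buffers every characters() call and only trusts the joined result 
once the enclosing element ends:

    import xml.sax

    class AggregatingHandler(xml.sax.ContentHandler):
        def __init__(self):
            super().__init__()
            self._parts = []              # fragments of the current text run

        def startElement(self, name, attrs):
            self._parts = []              # reset at each new element

        def characters(self, content):
            self._parts.append(content)   # never assume one call per text run

        def endElement(self, name):
            text = "".join(self._parts)   # the only trustworthy full string
            self._parts = []
            if name == "p":
                print("content:", text)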

--
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed

___
Python tracker <https://bugs.python.org/issue43483>



[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-15 Thread Larry Trammell


Larry Trammell added the comment:

I can't find any real errors in the documentation.  There are subtle design and 
implementation decisions that result in unexpected rare side effects.  After 
processing hundreds of thousands of lines one way, why would the parser 
suddenly decide to process the next line differently?  Well, because it can, 
and it happens to be convenient.  And that can catch users off-guard.

I'm considering whether posting an "enhancement" issue would be more 
appropriate... maybe there is a way to make the parser systems work more nearly 
the way people currently expect, without breaking things.

--

___
Python tracker <https://bugs.python.org/issue43483>



[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-17 Thread Larry Trammell


Larry Trammell added the comment:

Assuming that my understanding is correct, the situation is that the XML parser 
has an unspecified behavior.  This applies to any text content handler, at any 
time, and to the expat parser as well as SAX.  In some rare cases the behavior 
of the current implementation (and also of many past ones) seems inconsistent 
and can catch users by surprise -- even some who are relatively knowledgeable 
(which does not include me).

This is a little abstract, but two things could be done to improve this:

1. Modify the implementation so that the behavior remains unspecified but falls 
more in line with plausible expectations of the users.  This makes things a 
little more complicated for the implementer, but does not invalidate the 
documentation of present or past versions. 

2. The documentation could be updated to expose the new constraints on the 
previously unspecified behavior, giving users a better chance to recognize and 
prepare for any remaining difficulties.  However, the implementation changes 
could be made even without these documentation changes.

So I remain confused about whether this is really a "bug" -- it is an "easy but 
unfortunate implementation choice" that is technically not wrong, even if 
sometimes baffling.  Established applications that already use older parser 
versions are relatively unlikely to start failing given the kind of documents 
they process, so backport changes might be helpful but do not seem urgent. 

Eric, with this clarification, what is your opinion about how to properly post 
a new issue -- improvement or bug fix?  I can provide a more detailed technical 
explanation where a new issue is posted.

--

___
Python tracker <https://bugs.python.org/issue43483>



[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-17 Thread Larry Trammell


Larry Trammell added the comment:

Sure...  I'll cut and paste some of the text I was organizing to go into a 
possible new issue page.

The only relevant documentation I could find was in the "xml.sax.handler" page 
in the Python 3.9.2 Documentation for the Python Standard Library (as it has 
been through many versions):

---
ContentHandler.characters(content) -- The Parser will call this method to 
report each chunk of character data.  SAX parsers may return all contiguous 
character data in a single chunk, or they may split it into several chunks...
---

As an example, here is a typical snippet taken from the Web page

   https://www.tutorialspoint.com/parsing-xml-with-sax-apis-in-python

The application example records the tag name "type" in its "CurrentData" 
member, and shortly thereafter the "type" tag's content is received:

    # Call when a character is read
    def characters(self, content):
        if self.CurrentData == "type":
            self.type = content

Suppose that the parser receives the following text line from the input file:

    <type>SciFi</type>

Though there seems no reason for it, the parser could decide to deliver the 
content text as "Sc" followed by "iFi".  In that case, a second invocation of 
the "characters" method would overwrite the characters received in the first 
invocation, and some of the content text seems "lost."  

Given how rarely it happens, I suspect that when internal processing reaches 
the end of a block of buffered text from the input file, the easiest thing to 
do is to report any fragments of text that happen to remain at the end, no 
matter how tiny, and start fresh with the next internal buffer. Easy for the 
implementer, but baffling to the application developer.  And rare enough to 
elude application testing.
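
For what it is worth, that seam effect can be forced deliberately through the 
incremental feed() interface, since each feed() call ends a parse buffer.  A 
minimal sketch (the document and handler here are mine, purely for 
demonstration):

    import xml.sax

    class ShowChunks(xml.sax.ContentHandler):
        def characters(self, content):
            print("chunk:", repr(content))

    parser = xml.sax.make_parser()
    parser.setContentHandler(ShowChunks())
    # Splitting the input across two feed() calls puts a buffer seam in the
    # middle of the content string.
    parser.feed("<catalog><type>Sc")
    parser.feed("iFi</type></catalog>")
    parser.close()

With the seam placed this way, the handler typically reports two chunks, 'Sc' 
followed by 'iFi' -- exactly the overwrite scenario described above.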

--

___
Python tracker <https://bugs.python.org/issue43483>



[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-17 Thread Larry Trammell


Larry Trammell added the comment:

Great minds think alike I guess... 

I was thinking of a much smaller carryover size... maybe 1K. With individual 
text blocks longer than that, the user will almost certainly be dealing with 
collecting and aggregating content text anyway, and in that case, the problem 
is solved before it happens. 

Here is a documentation change I was experimenting with...

---
ContentHandler.characters(content) -- The Parser will call this method to 
report chunks of character data.  In general, character data may be reported 
as a single chunk or as a sequence of chunks; but character data sequences 
with fewer than xml.sax.handler.ContiguousChunkLength characters, when 
uninterrupted by any other xml.sax.handler.ContentHandler event, are 
guaranteed to be delivered as a single chunk...
---

That puts users on notice, "...wait, are my chunks of text smaller than that?" 
and they are less likely to be caught unaware.  But of course, the 
implementation change would be helpful even without this extra warning.

--

___
Python tracker <https://bugs.python.org/issue43483>



[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-17 Thread Larry Trammell


Larry Trammell added the comment:

Oh, and whether this affects only content text...

I would presume so, but I don't know how to tell for sure.  Unspecified 
behaviors can be very mysterious!

--

___
Python tracker <https://bugs.python.org/issue43483>



[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-17 Thread Larry Trammell


Larry Trammell added the comment:

I think the existing ContentHandler.characters(content) documentation DOES say 
that the text can come back in chunks... but it is subtle.  It might be 
possible to state more explicitly that any content, no matter how small, is 
allowed to be returned as any number of chunks at any time.  Though true, that 
is harsh, overstating considerably what actually happens.  Concentrating on a 
better implementation would be more effective than worrying about the existing 
documentation, given how long the existing conditions have prevailed.  My 
opinion, as one who has been bitten.

--

___
Python tracker <https://bugs.python.org/issue43483>



[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-17 Thread Larry Trammell


Larry Trammell added the comment:

If there were a decision NOT TO FIX... maybe then it would make sense to 
consider documentation patches at a higher priority.  That way, SAX-Python (and 
expat-Python) tutorials across the Web could start patching their presentations 
accordingly.

--

___
Python tracker <https://bugs.python.org/issue43483>



[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-18 Thread Larry Trammell


Larry Trammell added the comment:

Eric, now that you know as much as I do about the nature and scope of the 
peculiar parsing behavior, do you have any suggestions about how to proceed 
from here?

--

___
Python tracker <https://bugs.python.org/issue43483>



[issue43560] Modify SAX/expat parsing to avoid fragmentation of already-tiny content chunks

2021-03-19 Thread Larry Trammell


New submission from Larry Trammell:

Issue 43483 was posted as a "bug" but retracted.  Though the problem is real, 
it is tricky to declare an UNSPECIFIED behavior to be a bug.  See that issue 
page for more discussion and a test case.  A brief overview is repeated here.

SCENARIO - XML PARSING LOSES DATA (or not)

The parsing attempts to capture text consisting of very tiny quoted strings. A 
typical content line reads something like this: 

   <p>"Colchuck"</p>

The content handler implements a scheme presented at various tutorial Web 
sites, using two member functions.

    # Note the name attribute of the current tag group
    def element_handler(self, tagname, attrs):
        self.CurrentTag = tagname

    # Record the content from each "p" tag when encountered
    def characters(self, content):
        if self.CurrentTag == "p":
            self.name = content

    ...

    > print(parser.name)
    "Colchuck"

But then, after successfully extracting content from perhaps hundreds of 
thousands of XML tag sets in this way, the parsing suddenly "drops" a few 
characters of content. 

   > print(parser.name)
   "lchuck" 

While this problem was observed with a SAX parser, it can affect expat parsers 
as well.  It affects 32-bit and 64-bit implementations the same, over several 
major releases of the Python 3 system.  

SPECIFIED BEHAVIOR (or not) 

The "xml.sax.handler" page in the Python 3.9.2 Documentation for the Python 
Standard Library (and many prior versions) states:

---
ContentHandler.characters(content) -- The Parser will call this method to 
report each chunk of character data.  SAX parsers may return all contiguous 
character data in a single chunk, or they may split it into several chunks...
---

If it happens that the content is delivered in two chunks instead of one, the 
characters() method shown above overwrites the first part of the text with the 
second part, and some content seems lost.  This completely explains the 
observed behavior.  

EXPECTED BEHAVIOR (or not)

Even though the behavior is unspecified, users can have certain expectations 
about what a reasonable parser should do.  Among these:

  -- EFFICIENCY: the parser should do simple things simply, and complicated 
things as simply as possible
  -- CONSISTENCY: the parser behavior should be repeatable and dependable

The design can be considered "poor" if thorough testing cannot identify what 
the actual behaviors are going to be, because those behaviors are rare and 
unpredictable.

The obvious "simple thing," from the user perspective, is that the parser 
should return each tiny text string as one tiny text chunk.  In fact, this is 
precisely what it does... 99.999% of the time.  But then, suddenly, it doesn't. 
 

One hypothesis is that when the parsing scan of raw input text reaches the end 
of a large internal text buffer, it is easier from the implementer's 
perspective to flush any text remaining in the old buffer prior to fetching a 
new one, even if that produces a fragmented chunk with only a couple of 
characters.  

IMPROVEMENTS REQUIRED

Review the code to determine whether the text buffer scenario is in fact the 
primary cause of inconsistent behavior. Modify the data handling to defer 
delivery of content fragments that are small, carrying over a small amount of 
previously scanned text so that small contiguous text chunks are recombined 
rather than reported as multiple fragments. If the length of the content text 
to carry over is greater than some configurable 
xml.sax.handler.ContiguousChunkLength, the parser can go ahead and deliver it 
as a fragment.  
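
Pending such a change inside the parsers, the proposed carryover can be 
approximated in user space.  The following is only a sketch (all names are 
mine; the local constant stands in for the hypothetical configurable 
xml.sax.handler.ContiguousChunkLength):

    import xml.sax

    CONTIGUOUS_CHUNK_LENGTH = 1024    # stand-in for the proposed setting

    # Proxy handler that merges adjacent characters() calls, flushing the
    # merged text whenever any other forwarded event arrives.
    class CoalescingHandler(xml.sax.ContentHandler):
        def __init__(self, inner):
            super().__init__()
            self._inner = inner       # the application's real handler
            self._pending = []
            self._size = 0

        def _flush(self):
            if self._pending:
                self._inner.characters("".join(self._pending))
                self._pending = []
                self._size = 0

        def characters(self, content):
            self._pending.append(content)
            self._size += len(content)
            if self._size >= CONTIGUOUS_CHUNK_LENGTH:
                self._flush()         # long runs may still arrive in fragments

        def startElement(self, name, attrs):
            self._flush()
            self._inner.startElement(name, attrs)

        def endElement(self, name):
            self._flush()
            self._inner.endElement(name)

        def endDocument(self):
            self._flush()
            self._inner.endDocument()

Only the events used here are forwarded; a real wrapper would forward the rest 
of the ContentHandler interface in the same way.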

DOCUMENTING THE IMPROVEMENTS 

Strictly speaking:  none required.  Undefined behaviors are undefined, whether 
consistent or otherwise.  But after the improvements are implemented, it would 
be helpful to modify documentation to expose the new performance guarantees, 
making users more aware of the possible hazards.  For example, a new 
description in the "xml.sax.handler" page might read as follows: 

---
ContentHandler.characters(content) -- The Parser will call this method to 
report chunks of character data.  In general, character data may be reported 
as a single chunk or as a sequence of chunks; but character data sequences 
with fewer than xml.sax.handler.ContiguousChunkLength characters, when 
uninterrupted by any other xml.sax.handler.ContentHandler event, are 
guaranteed to be delivered as a single chunk...
---

--
components: XML
messages: 389108
nosy: ridgerat1611
priority: normal
severity: normal
status: open
title: Modify SAX/expat parsing to avoid fragmentation of already-tiny content 
chunks
type: enhancement
versions: Python 3.7, Python 3.8, Python 3.9

___
Python tracker <https://bugs.python.org/issue43560>

[issue43561] Modify XML parsing library descriptions to forewarn of content loss hazard

2021-03-19 Thread Larry Trammell


New submission from Larry Trammell:

With reference to improvement issue 43560:

If those improvements remain unimplemented, or are demoted to "don't fix", 
users are left in the tricky situation where XML parsing applications can fail, 
apparently "losing content" in a rare and unpredictable manner.  It would be 
useful to patch the documentation to give users fair warning of this hazard. 

For example: the "xml.sax.handler" page in the Python 3.9.2 Documentation for 
the Python Standard Library (and many prior versions) currently states:

---
ContentHandler.characters(content) -- The Parser will call this method to 
report each chunk of character data.  SAX parsers may return all contiguous 
character data in a single chunk, or they may split it into several chunks...
---
 
The modified documentation would read something like the following:

---
ContentHandler.characters(content) -- The Parser will call this method to 
report each chunk of character data.  SAX parsers may return all contiguous 
character data in a single chunk, or they may split it into several chunks... 
To avoid a situation in which one small content fragment unexpectedly 
overwrites another one, it is essential for the characters() method to collect 
content by appending, rather than by assignment.
---

To give a concrete example, suppose that a Python programming site recommends 
the following coding to preserve a small text chunk bracketed by <p> tags:

    # Note the name attribute of the current tag group
    def element_handler(self, tagname, attrs):
        self.CurrentTag = tagname

    # Record the content from each "p" tag when encountered
    def characters(self, content):
        if self.CurrentTag == "p":
            self.name = content

Even though that coding could be expected to work most of the time, it is 
exposed to the hazard that an unanticipated sequence of calls to the 
characters() function would overwrite data.

Instead, the coding should look something like this (note that the content is 
accumulated with +=, since a plain assignment would discard earlier fragments):

    # Note the name attribute of the current tag group
    def element_handler(self, tagname, attrs):
        self.CurrentTag = tagname
        self.name = ""

    # Accumulate the content from each "p" tag when encountered
    def characters(self, content):
        if self.CurrentTag == "p":
            self.name += content
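
An equivalent sketch (again with illustrative names) collects the fragments in 
a list and joins them once when the element closes, which avoids repeated 
string reallocation on very long text runs:

    import xml.sax

    class PContentHandler(xml.sax.ContentHandler):
        def __init__(self):
            super().__init__()
            self.CurrentTag = None
            self.parts = []
            self.name = ""

        # Note the name of the current tag group and reset the buffer
        def startElement(self, tagname, attrs):
            self.CurrentTag = tagname
            self.parts = []

        # Accumulate the content from each "p" tag when encountered
        def characters(self, content):
            if self.CurrentTag == "p":
                self.parts.append(content)

        # Join the fragments once when the "p" tag closes
        def endElement(self, tagname):
            if tagname == "p":
                self.name = "".join(self.parts)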

--
assignee: docs@python
components: Documentation
messages: 389111
nosy: docs@python, ridgerat1611
priority: normal
severity: normal
status: open
title: Modify XML parsing library descriptions to forewarn of content loss 
hazard
versions: Python 3.7, Python 3.8, Python 3.9

___
Python tracker <https://bugs.python.org/issue43561>



[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-19 Thread Larry Trammell


Larry Trammell added the comment:

Check out issues 

43560 (an enhancement issue to improve handling of small XML content chunks)

43561 (a documentation issue to give users warning about the hazard in the 
interim before the changes are implemented)

--

___
Python tracker <https://bugs.python.org/issue43483>