[issue43483] Loss of content in simple (but oversize) SAX parsing
New submission from Larry Trammell :

== The Problem ==

I have observed a "loss of data" problem using the Python SAX parser when processing an oversize but very simple machine-generated xhtml file. The file represents a single N x 11 data table. W3C "tidy" reports no xml errors. The table is constructed in an entirely plausible manner, using table, tr, and td tags to define the table structure, and p tags to bracket content, which consists of small chunks of quoted text. There is nothing pathological: no extraneous whitespace characters, no empty data fields. Everything works perfectly in small test cases. But when a very large number of rows are present, a few characters of content strings are occasionally lost. I have observed 2 or 6 characters dropped.

But here's the strange part. The pathological behavior disappears (or moves to another location) when one or more non-significant whitespace characters are inserted at an arbitrary location early in the file... e.g. an extra linefeed before the first tr tag.

== Context ==

I have observed identical behavior on desktop systems using an Intel Xeon E5-1607 or a Core 2 processor, running 32-bit or 64-bit Linux operating systems, variously using Python 3.8.5, 3.8, 3.7.3, and 3.5.1.

== Observing the Problem ==

Sorry that the test data is so bulky (even at 0.5% of original size), but bulk appears to be a necessary condition to observe the problem. Run the following command line:

    python3 EnchXMLTest.py EnchTestData.html

The test script invokes the SAX parser and generates messages on stdout. Using the original test data as provided, the test should run correctly to completion. Now modify the test data file, deleting the extraneous comment line (there is only one) found near the top of the file. Repeat the test run, and this time look for missing content characters in parsed content fields of the last record.

== Any guesses? ==

Beyond "user is oblivious," possibly something abnormal can occur at seams between large blocks of buffered text. The presence or absence of an extra character early in the data stream results in a corresponding shift in content location at the end of the buffer. Another clue: is it relevant that the problem appears in a string field that contains slash characters?

--
components: XML
files: EnchSAXTest.zip
messages: 388582
nosy: ridgerat1611
priority: normal
severity: normal
status: open
title: Loss of content in simple (but oversize) SAX parsing
type: behavior
versions: Python 3.7, Python 3.8
Added file: https://bugs.python.org/file49872/EnchSAXTest.zip
[issue43483] Loss of content in simple (but oversize) SAX parsing
Larry Trammell added the comment:

Not a bug, strictly speaking... more like user abuse. The parsers (expat as well as SAX) must be able to return content text as a sequence of pieces when necessary -- for example, a text run interrupted by grouping or styling tags (like <b> or <i>), or an extensive text block subdivided for efficient processing. Users would expect hazards like these and be wary. But how many users would suspect that a quoted string of length 8 characters would be returned in multiple pieces? Or that an entity notation would be split down the middle?

Virtually all existing tutorial examples showing content extraction are WRONG -- because the ONLY content that can be trusted must be filtered through some kind of aggregator object. How many users will know this instinctively?

It would be very useful for the parser systems to provide some kind of support for a text aggregation function. A guarantee that "small contiguous" text items will not be chopped might also be helpful.

--
resolution: -> not a bug
stage: -> resolved
status: open -> closed
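Something like the following, perhaps -- a minimal sketch of such an aggregator using only the documented xml.sax API (the class name and the choice to join the pieces at endElement() are illustrative, not anything the stdlib provides):

    import xml.sax

    class AggregatingHandler(xml.sax.ContentHandler):
        # Collect characters() chunks; use the joined text only once the
        # enclosing element ends, so chunking no longer matters.
        def __init__(self):
            super().__init__()
            self._chunks = []

        def startElement(self, name, attrs):
            self._chunks = []                 # fresh accumulation per element

        def characters(self, content):
            self._chunks.append(content)      # may fire several times per text run

        def endElement(self, name):
            if name == "p":
                print("content:", "".join(self._chunks))

    xml.sax.parseString(b"<td><p>Colchuck</p></td>", AggregatingHandler())

However the parser decides to slice the text, the joined result at endElement() is always complete.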
[issue43483] Loss of content in simple (but oversize) SAX parsing
Larry Trammell added the comment:

I can't find any real errors in the documentation. There are subtle design and implementation decisions that result in unexpected rare side effects. After processing hundreds of thousands of lines one way, why would the parser suddenly decide to process the next line differently? Well, because it can, and it happens to be convenient. And that can catch users off-guard.

I'm considering whether posting an "enhancement" issue would be more appropriate... maybe there is a way to make the parser systems work more nearly the way people currently expect, without breaking things.
[issue43483] Loss of content in simple (but oversize) SAX parsing
Larry Trammell added the comment:

Assuming that my understanding is correct, the situation is that the xml parser has an unspecified behavior. This is true in any text content handler, at any time, and applies to the expat parser as well as SAX. In some rare cases, the behavior of the current implementation (and also many past ones) seems inconsistent and can catch users by surprise -- even some who are relatively knowledgeable (which does not include me).

This is a little abstract, but two things could be done to improve it:

1. Modify the implementation so that the behavior remains unspecified but falls more in line with plausible expectations of the users. This makes things a little more complicated for the implementer, but does not invalidate the documentation of present or past versions.

2. Update the documentation to expose the new constraints on the previously unspecified behavior, giving users a better chance to recognize and prepare for any remaining difficulties. However, the implementation changes could be made even without these documentation changes.

So I remain confused about whether this is really a "bug" -- it is an "easy but unfortunate implementation choice" that is technically not wrong, even if sometimes baffling. Established applications that already use older parser versions are relatively unlikely to start failing given the kind of documents they process, so backport changes might be helpful but do not seem urgent.

Eric, with this clarification, what is your opinion about how to properly post a new issue -- improvement or bug fix? I can provide a more detailed technical explanation where a new issue is posted.
[issue43483] Loss of content in simple (but oversize) SAX parsing
Larry Trammell added the comment:

Sure... I'll cut and paste some of the text I was organizing to go into a possible new issue page.

The only relevant documentation I could find was in the "xml.sax.handler" page in the Python 3.9.2 Documentation for the Python Standard Library (as it has been through many versions):

---
ContentHandler.characters(content) -- The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks...
---

As an example, here is a typical snippet taken from the Web page https://www.tutorialspoint.com/parsing-xml-with-sax-apis-in-python . The application example records the tag name "type" in the "CurrentData" member, and shortly thereafter the "type" tag's content is received:

    # Call when a character is read
    def characters(self, content):
        if self.CurrentData == "type":
            self.type = content

Suppose that the parser receives the following text line from the input file:

    <type>SciFi</type>

Though there seems no reason for it, the parser could decide to deliver the content text as "Sc" followed by "iFi". In that case, a second invocation of the "characters" method would overwrite the characters received in the first invocation, and some of the content text seems "lost."

Given how rarely it happens, I suspect that when internal processing reaches the end of a block of buffered text from the input file, the easiest thing to do is to report any fragments of text that happen to remain at the end, no matter how tiny, and start fresh with the next internal buffer. Easy for the implementer, but baffling to the application developer. And rare enough to elude application testing.
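The splitting is easy to provoke deliberately with the lower-level xml.parsers.expat interface by feeding the document in two pieces, split inside the text -- roughly what happens at an internal buffer seam. (A minimal sketch; the split point here is chosen by hand, and buffer_text is disabled to make the chunk boundaries visible.)

    import xml.parsers.expat

    chunks = []

    p = xml.parsers.expat.ParserCreate()
    p.buffer_text = False              # report raw chunks, no coalescing
    p.CharacterDataHandler = chunks.append

    # Feed the document in two arbitrary pieces, split inside the text --
    # roughly what happens at a buffer seam when parsing a large file.
    p.Parse('<type>Sc', False)
    p.Parse('iFi</type>', True)

    print(chunks)                      # ['Sc', 'iFi'] -- one 5-char string, two chunks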
[issue43483] Loss of content in simple (but oversize) SAX parsing
Larry Trammell added the comment:

Great minds think alike, I guess... I was thinking of a much smaller carryover size... maybe 1K. With individual text blocks longer than that, the user will almost certainly be dealing with collecting and aggregating content text anyway, and in that case the problem is solved before it happens.

Here is a documentation change I was experimenting with:

---
ContentHandler.characters(content) -- The Parser will call this method to report chunks of character data. In general, character data may be reported as a single chunk or as a sequence of chunks; but character data sequences with fewer than xml.sax.handler.ContiguousChunkLength characters, when uninterrupted by any other xml.sax.handler.ContentHandler event, are guaranteed to be delivered as a single chunk...
---

That puts users on notice -- "...wait, are my chunks of text smaller than that?" -- and they are less likely to be caught unaware. But of course, the implementation change would be helpful even without this extra warning.
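In the meantime, the guarantee can be approximated in user space: buffer every characters() call and re-emit the joined run just before any other event. A sketch only -- xml.sax.handler.ContiguousChunkLength does not exist in the stdlib, and the coalesced_characters() hook name here is made up:

    import xml.sax

    class CoalescingHandler(xml.sax.ContentHandler):
        # Buffer characters() chunks; deliver each contiguous text run
        # through a single coalesced_characters() call when any
        # structural event arrives.
        def __init__(self):
            super().__init__()
            self._buf = []

        def characters(self, content):
            self._buf.append(content)

        def _flush(self):
            if self._buf:
                text = "".join(self._buf)
                self._buf = []
                self.coalesced_characters(text)

        def startElement(self, name, attrs):
            self._flush()

        def endElement(self, name):
            self._flush()

        def coalesced_characters(self, text):
            pass    # override: receives each contiguous text run whole

A subclass that overrides startElement() or endElement() would need to call super() so the flush still happens.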
[issue43483] Loss of content in simple (but oversize) SAX parsing
Larry Trammell added the comment:

Oh, and whether this affects only content text... I would presume so, but I don't know how to tell for sure. Unspecified behaviors can be very mysterious!
[issue43483] Loss of content in simple (but oversize) SAX parsing
Larry Trammell added the comment:

I think the existing ContentHandler.characters(content) documentation DOES say that the text can come back in chunks... but it is subtle. It might be possible to say more explicitly that any content, no matter how small, is allowed to be returned as any number of chunks at any time... Though true, that is harsh, overstating considerably what actually happens.

Concentrating on a better implementation would be more effective than worrying about the existing documentation, given how long the existing conditions have prevailed. My opinion, as one who has been bitten.
[issue43483] Loss of content in simple (but oversize) SAX parsing
Larry Trammell added the comment:

If there were a decision NOT TO FIX... maybe then it would make sense to consider documentation patches at a higher priority. That way, SAX-Python (and expat-Python) tutorials across the Web could start patching their presentations accordingly.
[issue43483] Loss of content in simple (but oversize) SAX parsing
Larry Trammell added the comment:

Eric, now that you know as much as I do about the nature and scope of the peculiar parsing behavior, do you have any suggestions about how to proceed from here?
[issue43560] Modify SAX/expat parsing to avoid fragmentation of already-tiny content chunks
New submission from Larry Trammell :

Issue 43483 was posted as a "bug" but retracted. Though the problem is real, it is tricky to declare an UNSPECIFIED behavior to be a bug. See that issue page for more discussion and a test case. A brief overview is repeated here.

SCENARIO - XML PARSING LOSES DATA (or not)

The parsing attempts to capture text consisting of very tiny quoted strings. A typical content line reads something like this:

    <td><p>Colchuck</p></td>

The parser implements a scheme presented at various tutorial Web sites, using two member functions:

    # Note the name attribute of the current tag group
    def element_handler(self, tagname, attrs):
        self.CurrentTag = tagname

    # Record the content from each "p" tag when encountered
    def characters(self, content):
        if self.CurrentTag == "p":
            self.name = content

    ...
    > print(parser.name)
    "Colchuck"

But then, after successfully extracting content from perhaps hundreds of thousands of XML tag sets in this way, the parsing suddenly "drops" a few characters of content:

    > print(parser.name)
    "lchuck"

While this problem was observed with a SAX parser, it can affect expat parsers as well. It affects 32-bit and 64-bit implementations the same, over several major releases of the Python 3 system.

SPECIFIED BEHAVIOR (or not)

The "xml.sax.handler" page in the Python 3.9.2 Documentation for the Python Standard Library (and many prior versions) states:

---
ContentHandler.characters(content) -- The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks...
---

If it happens that the content is delivered in two chunks instead of one, the characters() method shown above overwrites the first part of the text with the second part, and some content seems lost. This completely explains the observed behavior.

EXPECTED BEHAVIOR (or not)

Even though the behavior is unspecified, users can have certain expectations about what a reasonable parser should do. Among these:

-- EFFICIENCY: the parser should do simple things simply, and complicated things as simply as possible
-- CONSISTENCY: the parser behavior should be repeatable and dependable

The design can be considered "poor" if thorough testing cannot identify what the actual behaviors are going to be, because those behaviors are rare and unpredictable. The obvious "simple thing," from the user perspective, is that the parser should return each tiny text string as one tiny text chunk. In fact, this is precisely what it does... 99.999% of the time. But then, suddenly, it doesn't. One hypothesis is that when the parsing scan of raw input text reaches the end of a large internal text buffer, it is easier from the implementer's perspective to flush any text remaining in the old buffer prior to fetching a new one, even if that produces a fragmented chunk with only a couple of characters.

IMPROVEMENTS REQUIRED

Review the code to determine whether the text buffer scenario is in fact the primary cause of inconsistent behavior. Modify the data handling to defer delivery of content fragments that are small, carrying over a small amount of previously scanned text so that small contiguous text chunks are recombined rather than reported as multiple fragments. If the length of the content text to carry over is greater than some configurable xml.sax.handler.ContiguousChunkLength, the parser can go ahead and deliver it as a fragment.

DOCUMENTING THE IMPROVEMENTS

Strictly speaking: none required.
Undefined behaviors are undefined, whether consistent or otherwise. But after the improvements are implemented, it would be helpful to modify the documentation to expose the new performance guarantees, making users more aware of the possible hazards. For example, a new description in the "xml.sax.handler" page might read as follows:

---
ContentHandler.characters(content) -- The Parser will call this method to report chunks of character data. In general, character data may be reported as a single chunk or as a sequence of chunks; but character data sequences with fewer than xml.sax.handler.ContiguousChunkLength characters, when uninterrupted by any other xml.sax.handler.ContentHandler event, are guaranteed to be delivered as a single chunk...
---

--
components: XML
messages: 389108
nosy: ridgerat1611
priority: normal
severity: normal
status: open
title: Modify SAX/expat parsing to avoid fragmentation of already-tiny content chunks
type: enhancement
versions: Python 3.7, Python 3.8, Python 3.9
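Worth noting: the lower-level xml.parsers.expat interface already exposes a knob much like this carryover -- its documented buffer_text and buffer_size attributes coalesce contiguous character data until a structural event or a full buffer forces a flush. A minimal sketch (whether xml.sax's expat reader enables this internally may vary by version, so this is shown on the expat API directly):

    import xml.parsers.expat

    texts = []

    p = xml.parsers.expat.ParserCreate()
    p.buffer_text = True                      # coalesce contiguous character data
    p.buffer_size = 65536                     # flush only when this much accumulates
    p.CharacterDataHandler = texts.append
    p.EndElementHandler = lambda name: None   # any structural event flushes the buffer

    p.Parse('<p>Col', False)                  # split mid-text, as at a buffer seam
    p.Parse('chuck</p>', True)

    print(texts)                              # ['Colchuck'] -- delivered whole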
[issue43561] Modify XML parsing library descriptions to forewarn of content loss hazard
New submission from Larry Trammell :

With reference to improvement issue 43560: if those improvements remain unimplemented, or are demoted to "don't fix", users are left in the tricky situation where XML parsing applications can fail, apparently "losing content" in a rare and unpredictable manner. It would be useful to patch the documentation to give users fair warning of this hazard.

For example, the "xml.sax.handler" page in the Python 3.9.2 Documentation for the Python Standard Library (and many prior versions) currently states:

---
ContentHandler.characters(content) -- The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks...
---

The modified documentation would read something like the following:

---
ContentHandler.characters(content) -- The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks... To avoid a situation in which one small content fragment unexpectedly overwrites another one, it is essential for the characters() method to collect content by appending, rather than by assignment.
---

To give a concrete example, suppose that a Python programming site recommends the following coding to preserve a small text chunk bracketed by "p" tags:

    # Note the name attribute of the current tag group
    def element_handler(self, tagname, attrs):
        self.CurrentTag = tagname

    # Record the content from each "p" tag when encountered
    def characters(self, content):
        if self.CurrentTag == "p":
            self.name = content

Even though that coding could be expected to work most of the time, it is exposed to the hazard that an unanticipated sequence of calls to the characters() function would overwrite data. Instead, the coding should look something like this:

    # Note the name attribute of the current tag group
    def element_handler(self, tagname, attrs):
        self.CurrentTag = tagname
        self.name = ""

    # Accumulate the content from each "p" tag when encountered
    def characters(self, content):
        if self.CurrentTag == "p":
            self.name += content

--
assignee: docs@python
components: Documentation
messages: 389111
nosy: docs@python, ridgerat1611
priority: normal
severity: normal
status: open
title: Modify XML parsing library descriptions to forewarn of content loss hazard
versions: Python 3.7, Python 3.8, Python 3.9
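Assembled into a complete, runnable handler for reference (the class name and scaffolding are illustrative; note that xml.sax names the start-of-element callback startElement(), where the fragments above use element_handler):

    import xml.sax

    class NameHandler(xml.sax.ContentHandler):
        def __init__(self):
            super().__init__()
            self.CurrentTag = ""
            self.name = ""

        def startElement(self, tagname, attrs):
            self.CurrentTag = tagname
            if tagname == "p":
                self.name = ""        # reset only when a new "p" begins

        def characters(self, content):
            if self.CurrentTag == "p":
                self.name += content  # append: safe even when chunks split

    handler = NameHandler()
    xml.sax.parseString(b"<td><p>Colchuck</p></td>", handler)
    print(handler.name)               # "Colchuck", however the chunks arrive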
[issue43483] Loss of content in simple (but oversize) SAX parsing
Larry Trammell added the comment:

Check out issues:

43560 (an enhancement issue to improve handling of small XML content chunks)
43561 (a documentation issue to give users warning about the hazard in the interim before the changes are implemented)