[Python-Dev] Improving the Python Memory Allocator
This message is a follow-up to a thread I started on python-dev back in October, archived here:

http://mail.python.org/pipermail/python-dev/2004-October/049480.html

Basically, the problem I am trying to solve is that the Python memory allocator never frees memory back to the operating system. I have attached a patch against obmalloc.c for discussion. The patch still has some rough edges and possibly some bugs, so I don't think it should be merged as is. However, I would appreciate any feedback on the chances of getting this implementation into the core. The rest of this message lists some disadvantages of this implementation, a description of the important changes, a benchmark, and my future plans if this change gets accepted. The patch works for any version of Python that uses obmalloc.c (which includes Python 2.3 and 2.4), but I did my testing with Python 2.5 from CVS under Linux and Mac OS X.

This version of the allocator will actually free memory. It has two disadvantages:

First, there is slightly more overhead for programs that allocate a lot of memory, release it, then reallocate it. The original allocator simply holds on to all the memory, allowing it to be efficiently reused. This allocator will call free(), so it must also call malloc() again when the memory is needed. I have a "worst case" benchmark which shows that this cost isn't too significant, but it could be a problem for some workloads. If it is, I have an idea for how to work around it.

Second, the previous allocator went out of its way to permit a module to call PyObject_Free while another thread is executing PyObject_Malloc. Apparently, this was a backwards compatibility hack for old Python modules which erroneously call these functions without holding the GIL. These modules will have to be fixed if this implementation is accepted into the core.

Summary of the changes:

- Add an "arena_object" structure for tracking the pages that belong to each 256 kB arena.
- Change the "arenas" array from an array of pointers to an array of arena_object structures.
- When freeing a page (a pool), place it on a free pool list for the arena it belongs to, instead of a global free pool list.
- When freeing a page, if the arena is completely unused, deallocate the arena.
- When allocating a page, take it from the arena that is the most full. This gives arenas that are almost completely unused a chance to be freed.

Benchmark:

The only benchmark I have performed so far is the worst case for this allocator: a program that allocates 1,000,000 Python objects which occupy nearly 200 MB, frees them, reallocates them, then quits. I ran the program four times and discarded the initial time. Here is the object:

    class Obj:
        def __init__( self ):
            self.dumb = "hello"
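A sketch of what the rest of such a driver might look like (the exact benchmark script was not posted, so everything other than the Obj class above is an assumption):

    # Hypothetical worst-case driver: allocate ~1,000,000 objects,
    # release them all, allocate them all again, then quit.
    def cycle( n=1000000 ):
        objs = [ Obj() for i in range( n ) ]
        del objs                              # the patched allocator can return arenas here
        objs = [ Obj() for i in range( n ) ]  # ...and must call malloc() again here

    cycle()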
And here are the average execution times for this program:

    Python 2.5:          real time: 16.304  user time: 16.016  system: 0.257
    Python 2.5 + patch:  real time: 16.062  user time: 15.593  system: 0.450

As expected, the patched version spends nearly twice as much system time as the original version. This is because it calls free() and malloc() twice as many times. However, this difference is offset by the fact that the user-space execution time is actually *less* than the original version's. How is this possible? The likely cause is that the original version declared the arenas pointer to be "volatile" in order to work when Free and Malloc were called simultaneously. Since this version breaks that, the pointer no longer needs to be volatile, which allows the value to be kept in a register instead of being read from memory on each operation.

Here are some graphs of the memory allocator's behaviour when running this benchmark:

Original: http://www.eng.uwaterloo.ca/~ejones/original.png
New: http://www.eng.uwaterloo.ca/~ejones/new.png

Future plans:

- More detailed benchmarking.
- The "specialized" allocators for the basic types, such as ints, also need to free memory back to the system.
- Potentially, the allocator should keep some amount of free memory around to improve the performance of programs that cyclically allocate and free large amounts of memory. This amount should be "self-tuned" to the application.

Thank you for your feedback,

Evan Jones
Re: [Python-Dev] Improving the Python Memory Allocator
On Jan 24, 2005, at 7:00, Rodrigo Dias Arruda Senra wrote:

> Depending on the cost of arena allocation, it might help to define a lower threshold keeping a minimum of empty arena_objects permanently available. Do you think this can bring any speedup?

Yes, I think it might. I have to do some more benchmarking first, to try to figure out how expensive the allocations are. This is one of my "future work" items to pursue if this change gets accepted. I have not implemented it yet because I don't want to have to merge one *massive* patch. My rough idea is to do something like this (a sketch follows at the end of this message):

1. Keep track of the largest number of pages in use at one time.

2. Every N memory operations (or some other measurement of "time"), reset this value and fold it into a moving average of the number of pages. This estimates the current memory requirements of the application.

3. If (used + free) > average, free arenas until freeing one more arena would make (used + free) < average.

This is better than a static scheme which says "keep X MB of free memory around" because it will self-tune to the application's requirements. If you have an application that needs lots of RAM, it will keep lots of RAM. If it has very low RAM usage, it will be more aggressive in reclaiming free space. The challenge is how to determine a good measurement of "time." Ideally, if the application was idle for a while, you would perform some housekeeping like this. Does Python's cyclic garbage collector currently do this? If so, I could hook this "management" stuff onto its calls to gc.collect().

Evan Jones
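Here is that sketch, in Python for readability (all names are invented for illustration; the real policy would be implemented in C inside obmalloc.c):

    class ArenaPolicy:
        """Toy model of the self-tuning free-arena policy above."""
        def __init__( self, interval=10000, alpha=0.25 ):
            self.interval = interval  # "N memory operations"
            self.alpha = alpha        # moving-average weight (assumed value)
            self.ops = 0
            self.peak = 0             # largest number of pages in use (step 1)
            self.average = 0.0        # moving average of the peaks

        def note_op( self, pages_in_use ):
            self.ops += 1
            self.peak = max( self.peak, pages_in_use )
            if self.ops >= self.interval:
                # Step 2: fold the peak into the moving average, then reset.
                self.average = ( self.alpha * self.peak
                                 + ( 1 - self.alpha ) * self.average )
                self.ops = 0
                self.peak = 0

        def may_free_arena( self, used, free, pages_per_arena ):
            # Step 3: free an arena only while doing so still leaves
            # (used + free) at or above the average.
            return ( used + free ) - pages_per_arena >= self.average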
Re: [Python-Dev] Improving the Python Memory Allocator
On Jan 24, 2005, at 18:16, Martin v. Löwis wrote:

> - please don't post patches here; post them to SF. You may ask for comments here after you posted them to SF.

Sure. Should this be done even for patches which should absolutely not be committed?

> - please follow Python coding style. In particular, don't write
>
>     if ( available_arenas == NULL ) {
>
> but write
>
>     if (available_arenas == NULL) {

Yikes! This is a "bad" habit of mine that puts me in the minority on coding style. Thank you for catching it.

>> Second, the previous allocator went out of its way to permit a module to call PyObject_Free while another thread is executing PyObject_Malloc. Apparently, this was a backwards compatibility hack for old Python modules which erroneously call these functions without holding the GIL. These modules will have to be fixed if this implementation is accepted into the core.
>
> I'm not certain it is acceptable to make this assumption. Why is it not possible to use the same approach that was previously used (i.e. leak the arenas array)?

This is definitely a very important point of discussion. The main problem is that leaking the "arenas" array is not sufficient to make the memory allocator thread safe. Back in October, Tim Peters suggested that it might be possible to make the breakage easily detectable:

http://mail.python.org/pipermail/python-dev/2004-October/049502.html

> If we changed PyMem_{Free, FREE, Del, DEL} to map to the system free(), all would be golden (except for broken old code mixing PyObject_ with PyMem_ calls). If any such broken code still exists, that remapping would lead to dramatic failures, easy to reproduce; and old code broken in the other, infinitely more subtle way (calling PyMem_{Free, FREE, Del, DEL} when not holding the GIL) would continue to work fine.

I'll be honest: I have only a theoretical understanding of why this support is necessary, or why it is currently correct. For example, is it possible to call PyMem_Free from two threads simultaneously? Since the problem is that threads could call PyMem_Free without holding the GIL, it seems to me that it is possible. Shouldn't that also be supported? In the current memory allocator, I believe that situation can lead to inconsistent state. For example, see obmalloc.c:746, where it has been determined that a block needs to be put on the list of free blocks:

    *(block **)p = lastfree = pool->freeblock;
    pool->freeblock = (block *)p;

Imagine two threads simultaneously freeing blocks that belong to the same pool. They both read the same value for pool->freeblock and store it into their respective blocks; the two writes to pool->freeblock then happen in some arbitrary order, so one overwrites the other. The result? You have just leaked a block of memory. (A toy model of this race appears at the end of this message.) Basically, if a concurrent memory allocator is the requirement, then I think some other approach is necessary.

>> - When allocating a page, it is taken from the arena that is the most full. This gives arenas that are almost completely unused a chance to be freed.
>
> It would be helpful if that was documented in the data structures somewhere. The fact that the nextarena list is sorted by nfreepools is only mentioned in the place where this property is preserved; it should be mentioned in the introductory comments as well.

This is one of those rough edges I mentioned before. If there is some consensus that these changes should be accepted, then I will need to heavily edit the comments at the beginning of obmalloc.c.
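Here is that toy model, in Python (illustrative only: the real race is in C, and the GIL makes the Python version hard to trigger reliably, but the interleaving is the same):

    import threading

    freelist = None  # plays the role of pool->freeblock

    def free_block( block ):
        global freelist
        block[ "next" ] = freelist  # both threads can read the same head here...
        freelist = block            # ...and then one write clobbers the other

    def free_all( some_blocks ):
        for b in some_blocks:
            free_block( b )

    blocks = [ { "id": i, "next": None } for i in range( 100000 ) ]
    threads = [ threading.Thread( target=free_all, args=( half, ) )
                for half in ( blocks[ :50000 ], blocks[ 50000: ] ) ]
    for t in threads: t.start()
    for t in threads: t.join()

    # Count the blocks reachable from the head. If the two statements in
    # free_block interleaved badly, some blocks were "leaked" off the list.
    count, node = 0, freelist
    while node is not None:
        count, node = count + 1, node[ "next" ]
    print( "%d of %d blocks reachable" % ( count, len( blocks ) ) )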
Thanks for your feedback,

Evan Jones
[Python-Dev] Python Interpreter Thread Safety?
Due to the issue of thread safety in the Python memory allocator, I have been wondering about thread safety in the rest of the Python interpreter. I understand that the interpreter is not thread safe, but I'm not sure that I have seen a discussion of all the areas where this is an issue. Here are the areas I know of:

1. The memory allocator.
2. Reference counts.
3. The cyclic garbage collector.
4. The current interpreter state, which is pointed to by a single shared pointer.
5. Many modules may not be thread safe (?).

Ignoring issue #5 for the moment, are there any other areas where this is a problem? I'm curious about how much work it would be to allow concurrent execution of Python code.

Evan Jones

Note: One of the reasons I am asking is that my memory allocator patch changes the current allocator from "sort of" thread safe to obviously unsafe. One way to eliminate this issue is to make the allocator completely thread safe, but that would require some fairly significant changes to avoid a major performance penalty. However, if it were one of the components that permitted the interpreter to go multi-threaded, then it would be worth it.
Re: [Python-Dev] Python Interpreter Thread Safety?
On Jan 28, 2005, at 18:24, Martin v. Löwis wrote:

>> 5. Many modules may not be thread safe (?).
>
> Modules often release the GIL through BEGIN_ALLOW_THREADS, if they know it would be safe for another thread to enter the Python interpreter.

Right. I guess the modules already have to deal with being reentrant and thread safe, since Python threads could already cause issues.

>> Ignoring issue #5 for the moment, are there any other areas where this is a problem? I'm curious about how much work it would be to allow concurrent execution of Python code.
>
> Define "concurrent". Webster's offers

Sorry, I really meant *parallel* execution of Python code: multiple threads simultaneously executing a Python program, potentially on different CPUs.

> There have been attempts to implement free threading, and they have failed.

What I was trying to ask with my last email was: what are the trouble areas? There are probably many that I am unaware of, due to my unfamiliarity with the Python internals.

> Due to some unfortunate historical reasons, there is code which enters free() without holding the GIL - and that is what the allocator specifically deals with.

Right, but as I said in a previous post, I'm not convinced that the current implementation is completely correct anyway.

> Again, the interpreter supports multi-threading today. Removing the GIL is more difficult, though - nearly any container object (list, dictionary, etc) would have to change, plus the reference counting (which would have to grow atomic increment/decrement).

Wouldn't it be up to the programmer to ensure that accesses to shared objects, like containers, are serialized? For example, with Java's collections, there are both synchronized and unsynchronized versions.

Evan Jones
Re: [Python-Dev] Python Interpreter Thread Safety?
On Jan 28, 2005, at 20:27, Tim Peters wrote:

> The hacks in PyObject_Free (== PyMem_Free) are there solely so that question can be answered correctly in the absence of holding the GIL. "That question" == "does pymalloc control the pointer passed to me, or does the system malloc?".

Ah! *Now* I get it. And yes, it will be possible to still support this in my patched version of the allocator. It just means that I have to leak the "arenas" array just like the code did before, and then do some hard thinking about memory models and consistency to decide whether the "arenas" pointer needs to be volatile.

> When pymalloc was new, we went to insane lengths to avoid breaking that stuff, but enough is enough.

So you don't think we need to bother supporting that any more?

> Back to the present:
>
>> Wouldn't it be up to the programmer to ensure that accesses to shared objects, like containers, are serialized? For example, with Java's collections, there are both synchronized and unsynchronized versions.
>
> Enormous mounds of existing threaded Python code freely manipulate lists and dicts without explicit locking now. We can't break that -- and wouldn't want to. Writing threaded code is especially easy (a relative stmt, not absolute) in Python because of it.

Right, because currently Python switches threads at the granularity of opcodes, which gives you this serialization at the cost of never having parallel execution.

Evan Jones
Re: [Python-Dev] Python Interpreter Thread Safety?
On Jan 28, 2005, at 19:44, Martin v. Löwis wrote:

> Python threads can run truly parallel, as long as one of them invokes BEGIN_ALLOW_THREADS.

Except that they are then really executing C code, not Python code.

> I think nobody really remembers - ask Google for "Python free threading". Greg Stein did the patch, and the main problem apparently was that the performance became unacceptable - apparently primarily because of dictionary locking.

Thanks, I found the threads discussing it.

>> Right, but as said in a previous post, I'm not convinced that the current implementation is completely correct anyway.
>
> Why do you think so? (I see in your previous post that you claim it is not completely correct, but I don't see any proof.)

There are a number of issues, actually, but as Tim points out, they only matter if the blocks are managed by PyMalloc. I had written a description of three of them here, but they are not relevant. If the issue is calling PyMem_Free with a pointer that was allocated with malloc() while PyMalloc is doing other stuff, then there is no problem: that is possible to support, though I'll have to think rather hard about some of the issues.

> For example, for lists, the C API allows direct access to the pointers in the list. If the elements of the list could change in-between, an object in the list might go away after you got the pointer, but before you had a chance to INCREF it. This would cause a crash shortly afterwards. Even if that was changed to always return a new reference, lots of code would break, as it would create large memory leaks (code would have needed to decref the list items, but currently doesn't - nor is it currently necessary).

Ah! Right. In Java, the collections are all actually written in Java and run on the VM. Thus, when some concurrent weirdness happens, it just corrupts the application, not the VM. However, in Python this could actually corrupt the interpreter itself, crashing the entire thing with a very ungraceful segmentation fault or something similar.

Evan Jones
Re: Moving towards Python 3.0 (was Re: [Python-Dev] Speed up function calls)
On Jan 31, 2005, at 0:17, Guido van Rossum wrote:

> The "just kidding" applies to the whole list, right? None of these strike me as good ideas, except for improvements to function argument passing.

Really? You see no advantage to moving to garbage collection, nor to allowing Python to leverage multiple-processor environments? I'd be curious to hear your reasons why not.

My knowledge about garbage collection is weak, but I have read a little bit of Hans Boehm's work on garbage collection. For example, his "Memory Allocation Myths and Half Truths" presentation (http://www.hpl.hp.com/personal/Hans_Boehm/gc/myths.ps) is quite interesting. On page 25 he examines reference counting. The biggest disadvantage mentioned is that simple pointer assignments end up becoming "increment ref count" operations as well, which can "involve at least 4 potential memory references." The next page has a micro-benchmark that shows reference counting performing very poorly. (A tiny illustration of the assignment cost appears at the end of this message.) Not to mention that Python has a garbage collector *anyway*, so wouldn't it make sense to get rid of the reference counting?

My only argument for making Python capable of leveraging multiple-processor environments is that multithreading seems to be where the big performance increases will be in the next few years. I am currently using Python for some relatively large simulations, so performance is important to me.

Evan Jones
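The illustration (sys.getrefcount is standard; the exact counts vary by context, so treat the numbers as examples):

    import sys

    x = []
    print( sys.getrefcount( x ) )  # e.g. 2: the name x, plus the call's argument
    y = x                          # a "simple pointer assignment"...
    print( sys.getrefcount( x ) )  # ...e.g. 3: the count had to be updated as well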
[Python-Dev] Memory Allocator Part 2: Did I get it right?
After I finally understood what thread-safety guarantees the Python memory allocator needs to provide, I went and did some hard thinking about the code this afternoon. I believe that my modifications provide the same guarantees that the original version did. I do need to declare the arenas array to be volatile, and leak the array when resizing it. Please correct me if I am wrong, but the situation that needs to be supported is this:

While one thread holds the GIL, any other thread can call PyObject_Free with a pointer that was returned by the system malloc.

The following situation is *not* supported:

While one thread holds the GIL, another thread calls PyObject_Free with a pointer that was returned by PyObject_Malloc.

I'm hoping that I got things a little better this time around. I've submitted my updated patch to the patch tracker. For reference, I've included links to SourceForge and the previous thread.

Thank you,

Evan Jones

Previous thread: http://mail.python.org/pipermail/python-dev/2005-January/051255.html
Patch location: http://sourceforge.net/tracker/index.php?func=detail&aid=1123430&group_id=5470&atid=305470
Re: [Python-Dev] Memory Allocator Part 2: Did I get it right?
On Feb 15, 2005, at 17:52, Tim Peters wrote:

> As I said before, I don't think we need to support this any more. More, I think we should not -- the support code is excruciatingly subtle, it wasted plenty of your time trying to keep it working, and if we keep it in it's going to continue to waste time over the coming years (for example, in the short term, it will waste my time reviewing it).

I do not have nearly enough experience in the Python world to evaluate this decision. I've only been programming in Python for about two years now, and as I am sure you are aware, this is the first patch I have submitted to Python. I don't really know my way around the Python internals, beyond writing basic extensions in C. Martin's opinion is clearly the opposite of yours.

Basically, the debate seems to boil down to maintaining backwards compatibility at the cost of making the code in obmalloc.c harder to understand. The particular case that is being supported could definitely be viewed as a "bug" in the code that uses obmalloc. It is also likely quite rare. However, until now it has been supported, so it is hard to judge exactly how much code would be affected. It would definitely be a minor barrier to moving to Python 2.5. Is there some sort of consensus that is possible on this issue?

>> While one thread holds the GIL, any other thread can call PyObject_Free with a pointer that was returned by the system malloc.
>
> What _was_ supported was more generally that any number of threads could call PyObject_Free with pointers that were returned by the system malloc/realloc at the same time as a single thread, holding the GIL, was doing anything whatsoever (including executing any code inside obmalloc.c).

Okay, good, that is what I have assumed.

> Although that's a misleading way of expressing the actual intent; more on that below.

That's fine. It may be a misleading description of the intent, but it is an accurate description of the required behaviour. At least I hope it is.

> I expect it would be easier if you ripped out the horrid support for PyObject_Free abuse; in a sane world, the release-build PyMem_FREE, PyMem_Del, and PyMem_DEL would expand to "free" instead of to "PyObject_FREE" (via changes to pymem.h).

It turns out that basically the only thing that would change would be removing the "volatile" specifiers from two of the global variables, plus it would remove about 100 lines of comments. :) The "work" was basically just hurting my brain trying to reason about the concurrency issues, not changing code.

> It was never legit to do #a without holding the GIL. It was clear as mud whether it was legit to do #b without holding the GIL. If PyMem_Del (etc) change to expand to "free" in a release build, then #b can remain clear as mud without harming anyone. Nobody should be doing #a anymore. If someone still is, "tough luck -- fix it, you've had years of warning" is easy for me to live with at this stage.

Hmm... The issue is that case #a may not be an easy problem to diagnose: some implementations of free() will happily do nothing if they are passed a pointer they know nothing about, which would just result in a memory leak. Other implementations of free() will print a warning or crash in this case, which would make the problem trivial to locate.

> I suppose the other consideration is that already-compiled extension modules on non-Windows(*) systems will, if they're not recompiled, continue to call PyObject_Free everywhere they had a PyMem_Del/DEL/FREE call.

Is it guaranteed that extension modules will be binary compatible with future Python releases? I didn't think this was the case.

Thanks for the feedback,

Evan Jones
Re: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%
On Feb 16, 2005, at 18:42, Martin v. Löwis wrote:

> I must admit that I'm surprised. I would have expected that most allocations in Python go through obmalloc, so the heap would only see "large" allocations. It would be interesting to find out, in your application, why it is still an improvement to use the low-fragmentation heaps.

Hmm... This is an excellent point. A grep through the Python source code shows that the following files call the native system malloc (I've excluded a few obviously platform-specific files). A quick visual inspection shows that most of these are using it to allocate some sort of array or string, so those allocations likely *should* go through the system malloc.

    Demo/pysvr/pysvr.c
    Modules/_bsddb.c
    Modules/_curses_panel.c
    Modules/_cursesmodule.c
    Modules/_hotshot.c
    Modules/_sre.c
    Modules/audioop.c
    Modules/bsddbmodule.c
    Modules/cPickle.c
    Modules/cStringIO.c
    Modules/getaddrinfo.c
    Modules/main.c
    Modules/pyexpat.c
    Modules/readline.c
    Modules/regexpr.c
    Modules/rgbimgmodule.c
    Modules/svmodule.c
    Modules/timemodule.c
    Modules/zlibmodule.c
    PC/getpathp.c
    Python/strdup.c
    Python/thread.c

Gfeller, any idea if you are using any of the modules on this list? If so, it would be pretty easy to try converting them to call the obmalloc functions instead, and see how that affects the performance. (A rough sketch of the kind of search I used appears below.)

Evan Jones
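Something along these lines approximates the grep (a hypothetical reconstruction, not the exact command used; a crude textual match like this also picks up calls in comments and strings):

    import os, re

    pattern = re.compile( r'\b(malloc|realloc|calloc)\s*\(' )

    # Run from the root of a Python source checkout.
    for dirpath, dirnames, filenames in os.walk( "." ):
        for name in filenames:
            if name.endswith( ".c" ) or name.endswith( ".h" ):
                path = os.path.join( dirpath, name )
                if pattern.search( open( path ).read() ):
                    print( path )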
Re: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%
On Thu, 2005-02-17 at 22:38, Tim Peters wrote:

> Then you allocate a small object, marked 's': bbbsfff

Isn't the whole point of obmalloc that we don't want to allocate "s" on the heap, since it is small? I guess "s" could be an object that might potentially grow.

> One thing to take from that is that LFH can't be helping list-growing in a direct way either, if LFH (as seems likely) also needs to copy objects that grow in order to keep its internal memory segregated by size. The indirect benefit is still available, though: LFH may be helping simply by keeping smaller objects out of the general heap's hair.

So wouldn't this mean that there has to be some sort of small object being allocated via the system malloc that is causing the poor behaviour? As you mention, I wouldn't think it would be list objects, since resizing lists under LFH should be *worse*. That would actually be worth verifying, however. It could be that the Windows LFH is extra clever?

> I'm afraid the only way you can know for sure is by obtaining detailed memory maps and analyzing them.

Well, it would also be useful to find out what code is calling the system malloc. That would make it easy to examine the code and see whether it should be calling obmalloc or the system malloc. Any good ideas for easily obtaining this information? I imagine that some profilers must be able to produce a complete call graph?

Evan Jones
Re: [Python-Dev] Memory Allocator Part 2: Did I get it right?
Sorry for taking so long to get back to this thread; it has been one of those weeks for me.

On Feb 16, 2005, at 2:50, Martin v. Löwis wrote:

> Evan then understood the feature, and made it possible.

This is very true: it was a very useful exercise.

> I can personally accept breaking the code that still relies on the invalid APIs. The only problem is that it is really hard to determine whether some code *does* violate the API usage.

Great. Please ignore the patch on SourceForge for a little while; I'll produce a "revision 3" this weekend, without the compatibility hack.

Evan Jones
Re: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%
On Feb 18, 2005, at 17:51, Tim Peters wrote:

> Grow the list to its final size once, at the start (overestimating if you don't know for sure). Then instead of appending, keep an index to the next free slot, same as you'd do in C. Then the list guts never move, so if that doesn't yield the same kind of speedup without using LFH, list copying wasn't actually the culprit to begin with.

If this *does* improve the performance of his application by 15%, that would strongly argue for an addition to the list API similar to Java's ArrayList.ensureCapacity or the STL's vector::reserve. (A sketch of the idiom Tim describes appears below.) Since the list implementation already maintains separate ints for the size of the allocated array and the occupied size, this would really just expose an existing implementation detail to Python. I don't like revealing the implementation in this fashion, but if it does make a significant performance difference, it could be worth it.

http://java.sun.com/j2se/1.5.0/docs/api/java/util/ArrayList.html#ensureCapacity(int)
http://www.sgi.com/tech/stl/Vector.html#4

Evan Jones
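The preallocation idiom in plain Python (the names are mine, and the squaring is just a stand-in for the real per-item work):

    def build( n ):
        result = [ None ] * n   # grow the list to its final size once
        next_free = 0           # index of the next free slot, as in C
        for i in range( n ):
            result[ next_free ] = i * i
            next_free += 1
        return result           # the list guts never moved

With an ensureCapacity-style API, this would instead be a single reserve call followed by ordinary append() calls.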
Re: [Python-Dev] Memory Allocator Part 2: Did I get it right?
On Mar 4, 2005, at 23:55, Brett C. wrote:

> Evan uploaded the newest version (@ http://www.python.org/sf/1123430) and he says it is "complete".

I should point out (very late! It's been a busy couple of weeks) that I fully expect some issues with the patch to arise when it is reviewed. I am still lurking on python-dev, and I will strive to correct any defects ASAP. A few users have contacted me privately and have tested the patch, so it works for at least a few people.

Evan Jones
[Python-Dev] Unicode byte order mark decoding
I recently rediscovered this strange behaviour in Python's Unicode handling. I *think* it is a bug, but before I go and try to hack together a patch, I figure I should run it by the experts here on python-dev. If you understand Unicode, please let me know if there are problems with making these minor changes.

    >>> import codecs
    >>> codecs.BOM_UTF8.decode( "utf8" )
    u'\ufeff'
    >>> codecs.BOM_UTF16.decode( "utf16" )
    u''

Why does the UTF-16 decoder discard the BOM, while the UTF-8 decoder turns it into a character? The UTF-16 decoder contains logic to correctly handle the BOM. It even handles byte swapping, if necessary. I propose that the UTF-8 decoder should have the same logic: it should remove the BOM if it is detected at the beginning of a string. This would remove a bit of manual work for Python programs that deal with UTF-8 files created on Windows, which frequently have the BOM at the beginning. The Unicode standard is unclear about how it should be handled (version 4, section 15.9):

> Although there are never any questions of byte order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked. [...] Systems that use the byte order mark must recognize when an initial U+FEFF signals the byte order. In those cases, it is not part of the textual content and should be removed before processing, because otherwise it may be mistaken for a legitimate zero width no-break space.

At the very least, it would be nice to add a note about this to the documentation, and possibly add this example function that implements the "UTF-8 or ASCII?" logic:

    def autodecode( s ):
        if s.startswith( codecs.BOM_UTF8 ):
            # The byte string s is UTF-8
            out = s.decode( "utf8" )
            return out[1:]
        else:
            return s.decode( "ascii" )

As a second issue, the UTF-16LE and UTF-16BE decoders almost do the right thing: they turn the BOM into a character, just like the Unicode specification says they should.

    >>> codecs.BOM_UTF16_LE.decode( "utf-16le" )
    u'\ufeff'
    >>> codecs.BOM_UTF16_BE.decode( "utf-16be" )
    u'\ufeff'

However, they also *incorrectly* handle the reversed byte order mark:

    >>> codecs.BOM_UTF16_BE.decode( "utf-16le" )
    u'\ufffe'

This is *not* a valid Unicode character. The Unicode specification (version 4, section 15.8) says the following about non-characters:

> Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them. If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as removing it from the text. Note that Unicode conformance freely allows the removal of these characters. (See C10 in Section 3.2, Conformance Requirements.)

My interpretation of the specification is that Python should silently remove the character, resulting in a zero-length Unicode string. Similarly, both of the following lines should also result in a zero-length Unicode string:

    >>> '\xff\xfe\xfe\xff'.decode( "utf16" )
    u'\ufffe'
    >>> '\xff\xfe\xff\xff'.decode( "utf16" )
    u'\uffff'

Thanks for your feedback,

Evan Jones
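A quick interactive check of the proposed helper (a hypothetical session, with the startswith spelling as fixed above):

    >>> autodecode( codecs.BOM_UTF8 + "hello" )
    u'hello'
    >>> autodecode( "hello" )
    u'hello'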
Re: [Python-Dev] Unicode byte order mark decoding
On Apr 1, 2005, at 15:19, M.-A. Lemburg wrote:

> The BOM (byte order mark) was a non-standard Microsoft invention to detect Unicode text data as such (MS always uses UTF-16-LE for Unicode text files).

Well, its origins do not really matter, since at this point the BOM is firmly encoded in the Unicode standard. It seems to me that it is in everyone's best interest to support it.

> It is not needed for UTF-8 because that format doesn't rely on the byte order, and the BOM character at the beginning of a stream is a legitimate ZWNBSP (zero width non breakable space) code point.

You are correct: it is a legitimate character. However, its use as a ZWNBSP character has been deprecated:

> The overloading of semantics for this code point has caused problems for programs and protocols. The new character U+2060 WORD JOINER has the same semantics in all cases as U+FEFF, except that it cannot be used as a signature. Implementers are strongly encouraged to use word joiner in those circumstances whenever word joining semantics is intended.

Also, the Unicode specification is ambiguous about what an implementation should do with a leading ZWNBSP that is encoded in UTF-16. Like I mentioned, if you look at the Unicode standard, version 4, section 15.9, it says:

> 2. Unmarked Character Set. In some circumstances, the character set information for a stream of coded characters (such as a file) is not available. The only information available is that the stream contains text, but the precise character set is not known.

This seems to indicate that it is permitted to strip the BOM from the beginning of UTF-8 text.

> -1; there's no standard for UTF-8 BOMs - adding it to the codecs module was probably a mistake to begin with. You usually only get UTF-8 files with BOM marks as the result of recoding UTF-16 files into UTF-8.

This is clearly incorrect. The UTF-8 BOM is specified in the Unicode standard, version 4, section 15.9:

> In UTF-8, the BOM corresponds to the byte sequence EF BB BF.

I normally find files with UTF-8 BOMs from many Windows applications when you save a text file as UTF-8. I think that Notepad or WordPad does this, for example. I think UltraEdit also does the same thing. I know that Scintilla definitely does.

>> At the very least, it would be nice to add a note about this to the documentation, and possibly add this example function that implements the "UTF-8 or ASCII?" logic.
>
> Well, I'd say that's a very English way of dealing with encoded text ;-)

Please note I am only saying that something like this may want to be considered for addition to the documentation, not to the Python standard library. This example function more closely replicates the logic that is used by those Windows applications when opening ".txt" files. It uses the default locale if there is no BOM:

    def autodecode( s ):
        if s.startswith( codecs.BOM_UTF8 ):
            # The byte string s is UTF-8
            out = s.decode( "utf8" )
            return out[1:]
        else:
            return s.decode()

> BTW, how do you know that s came from the start of a file and not from slicing some already loaded file somewhere in the middle ?

Well, the same argument could be applied to the UTF-16 decoder: how does it know that the string came from the start of a file, and not from slicing some already loaded file? The standard states that:

> In the UTF-16 encoding scheme, U+FEFF at the very beginning of a file or stream explicitly signals the byte order.

So it is perfectly permissible to perform this type of processing if you consider a string to be equivalent to a stream.

>> My interpretation of the specification means that Python should silently remove the character, resulting in a zero length Unicode string.
>
> Hmm, wouldn't it be better to raise an error ? After all, a reversed BOM mark in the stream looks a lot like you're trying to decode a UTF-16 stream assuming the wrong byte order ?!

Well, either one is possible; however, the Unicode standard suggests, but does not require, silently removing them:

> It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as removing it from the text. Note that Unicode conformance freely allows the removal of these characters.

I would prefer silently ignoring them in the str.decode() function, since I believe in "be strict in what you emit, but liberal in what you accept." I think that this should apply only to str.decode(). Any other attempt to create non-characters, such as unichr( 0xFFFF ), *should* raise an exception, because clearly the programmer is making a mistake.

> Other than that: +1 on fixing this case.

Cool!

Evan Jones
Re: [Python-Dev] Unicode byte order mark decoding
On Apr 5, 2005, at 15:33, Walter Dörwald wrote:

> The stateful decoder has a little problem: at least three bytes have to be available from the stream until the StreamReader decides whether these bytes are a BOM that has to be skipped. This means that if the file only contains "ab", the user will never see these two characters.

Shouldn't the decoder be capable of doing a partial match and quitting early? After all, "ab" is encoded in UTF-8 as <61> <62>, but the BOM is <EF> <BB> <BF>. If it did this type of partial matching, this issue would be avoided except in rare situations. (A sketch of this idea follows at the end of this message.)

> A solution for this would be to add an argument named final to the decode and read methods that tells the decoder that the stream has ended and the remaining buffered bytes have to be handled now.

This functionality is provided by a flush() method on similar objects, such as the zlib compression objects.

Evan Jones
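The partial-match idea, as a sketch (BOMSkipper is an invented name, codecs.getincrementaldecoder only exists in later Python versions, and a flush() method could replace the final argument):

    import codecs

    class BOMSkipper:
        """Buffer at most the bytes that could still be a UTF-8 BOM
        prefix; emit everything else immediately."""
        def __init__( self ):
            self.buffer = ""
            self.decided = False
            self.decoder = codecs.getincrementaldecoder( "utf-8" )()

        def decode( self, data, final=False ):
            if not self.decided:
                self.buffer += data
                if self.buffer.startswith( codecs.BOM_UTF8 ):
                    # Complete BOM seen: skip it and stop sniffing.
                    data, self.buffer = self.buffer[ len( codecs.BOM_UTF8 ): ], ""
                    self.decided = True
                elif codecs.BOM_UTF8.startswith( self.buffer ) and not final:
                    # Still a proper prefix of the BOM (e.g. just \xef):
                    # keep waiting for more bytes.
                    return u""
                else:
                    # The partial match failed (e.g. "ab"): this input
                    # cannot be a BOM, so emit everything right away.
                    data, self.buffer = self.buffer, ""
                    self.decided = True
            return self.decoder.decode( data, final )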
[Python-Dev] [PATCH][BUG] Segmentation Fault in xml.dom.minidom.parse
The following Python script causes Python 2.3, 2.4 and the latest CVS to crash with a segmentation fault:

    import xml.dom.minidom
    # Note: the archive stripped the XML markup from this line, so the
    # root tag name here is a reconstruction.
    x = u'<doc>\nComment \xe7a va ? Tr\xe8s bien ?</doc>'
    dom = xml.dom.minidom.parseString( x.encode( 'latin_1' ) )
    print repr( dom.childNodes[0].localName )

The problem is that this XML document does not specify an encoding. In this case, minidom assumes that it is encoded in UTF-8; in fact, it is encoded in Latin-1. My two-line patch, in the SourceForge tracker at the URL below, causes this to raise a UnicodeDecodeError instead:

http://sourceforge.net/tracker/index.php?func=detail&aid=1309009&group_id=5470&atid=305470

Any chance that someone wants to commit this tiny two-line fix? This might be the kind of fix that is eligible to be backported to Python 2.4 as well. It passes "make test" on both my Linux system and my Mac. I've also attached a patch that adds this test case to test_minidom.py.

Thanks,

Evan Jones

--
Evan Jones
http://evanjones.ca/
[Python-Dev] Parser and Runtime: Divorced!
After a few hours of tedious and frustrating hacking, I've managed to separate the Python abstract syntax tree parser from the rest of Python itself. This could be useful for people who wish to build Python tools without Python, or tools in C/C++. In the process of doing this, I came across a comment mentioning that it would be desirable to separate the parser. Is there any interest in doing this? I now have a vague idea of how it could be done. Of course, there is no point in making changes like this unless there is some tangible benefit.

I will make my ugly hack available once I have polished it a little bit more. It involved hacking header files to provide a "re-implementation" of the pieces of Python that the parser needs (PyObject, PyString, and PyInt). It is likely a bit buggy, and it doesn't support all the types (notably, it is missing support for Unicode, Longs, Floats, and Complex), but it works well enough to get the AST from a few simple strings, which is what I wanted.

Evan Jones

--
Evan Jones
http://evanjones.ca/
Re: [Python-Dev] Parser and Runtime: Divorced!
On Oct 26, 2005, at 20:02, Evan Jones wrote:

> In the process of doing this, I came across a comment mentioning that it would be desirable to separate the parser. Is there any interest in doing this? I now have a vague idea about how to do this. Of course, there is no point in making changes like this unless there is some tangible benefit.

I am going to assume, since no one was excited about my post, that the answer is: no, there is no interest in separating the parser from the rest of the Python run time.

At any rate, if anyone is looking for a standalone C Python parser library, you can get it at the following URL. It includes a "print the tree" example that displays the AST for a specified file. It only supports a subset of the parse tree (assignment, functions, print, return), but it should be obvious how it could be extended.

http://evanjones.ca/software/pyparser.html

Evan Jones

--
Evan Jones
http://evanjones.ca/