[Python-Dev] Making the GIL faster & lighter on Windows
Hi everyone, I'm new to the list, but I've been embedding Python and working very closely with the core sources for many years now. I discovered Python a long time ago when I needed to embed a scripting language and found the PHP sources... unreadable ;)

Anyway, I'd like to ask something that may have been asked already, so I apologize if this has been covered. Instead of removing the GIL, has anyone thought of making it more lightweight? The current situation on Windows is that the single-thread case is decently fast (via interlocked operations), but it falls back to an event object under contention (see thread_nt.h). Now, I don't have any specific evidence aside from my experience with Windows multithreaded programming, but event objects are often considered the slowest synchronization mechanism available.

So, what are the alternatives? Mutexes or critical sections. Semaphores too, if you want to get fancy, but I digress. Because mutexes offer inter-process locking, which we don't need, critical sections fit the bill as a lightweight locking mechanism. They work much like the Python GIL is handled now: first attempt an interlocked operation, and if another thread owns the lock, wait on a kernel object. They are known to be extremely fast.

There are some catches with using a critical section instead of the current method:

1. It is recursive, while the current GIL setup is not. Would it break Python to support (or deal with) recursive behavior at the GIL level? Note that we can still disallow recursion and fail, because we know whether the current thread is the lock owner, but the return value of the lock function is usually only checked when the wait parameter is zero (meaning "don't block, just try to acquire").
The biggest problem I see here is how mixing the PyGILState_* API with multiple interpreters will behave: when PyGILState_Ensure() is called while the GIL is held for a thread state under an interpreter other than the main interpreter, it tries to re-lock the GIL. This would normally cause a deadlock, but the best we could do with a critical section is have the call fail and/or increment a recursion counter. If maintaining the current behavior is absolutely necessary, I guess it would be pretty easy to force a deadlock. Personally, I would prefer a Py_FatalError or something like it.

2. Backwards incompatibility: TryEnterCriticalSection isn't available pre-NT4, so Windows 95 support is broken. Microsoft no longer supports Windows 95, or even mentions it in the list of supported OSes for its API functions, so... non-issue? Some of the data structure is available to us, so I bet it would be easy to implement the function manually.

3. ?? - I'm sure there are other issues that deserve a look.

I've given this a shot already while doing some concurrency testing with my ISAPI extension (PyISAPIe). First of all, nothing looks broken yet. I'm using my modified python26.dll to run all of my Python code and trying to find anywhere it could possibly break. For multiple concurrent requests against a single multithreaded ISAPI handler process, I see a statistically significant speed increase depending on how much Python code is executed. With more Python code executed (e.g. a Django page), the speedup was about 2x. I haven't tested with varied values of _Py_CheckInterval aside from finding a sweet spot for my specific purposes, but using 100 (the default) would likely make the performance difference more noticeable. A spin mutex also does well, but the results vary a lot more.

Just as a disclaimer, my tests were nowhere near scientific, but if anyone needs convincing I can come up with some actual measurements. I think at this point most of you are wondering more about what it would break.
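For illustration, the interlocked-then-wait pattern described above can be sketched in Python. `TwoPhaseLock` and its `contentions` counter are my own names for this sketch, not anything from thread_nt.h, and the real fast path is an interlocked CPU instruction rather than a non-blocking acquire:

```python
import threading

class TwoPhaseLock:
    """Sketch of the fast-path/slow-path pattern: try a cheap
    non-blocking acquire first, and only fall back to a blocking
    (kernel-level) wait under contention."""

    def __init__(self):
        self._lock = threading.Lock()
        self.contentions = 0  # how often the slow path was taken

    def acquire(self):
        if self._lock.acquire(False):   # fast path: lock was free
            return
        self.contentions += 1           # slow path: block until free
        self._lock.acquire()

    def release(self):
        self._lock.release()

lock = TwoPhaseLock()
lock.acquire()
lock.release()
print(lock.contentions)  # 0: the fast path succeeded
```

Both a critical section and the current event-based GIL follow this shape; the difference under discussion is only in how expensive the slow path is.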
Hopefully I haven't wasted anyone's time - I just wanted to share what I see as a possibly substantial improvement to Python's core. Let me know if you're interested in a patch to use for your own testing.

Cheers,
Phillip

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Making the GIL faster & lighter on Windows
> You should definitely open a bug entry in http://bugs.python.org. There, post
> your patch, some explanations and preferably a quick way (e.g. a simple script)
> of reproducing the speedups (without having to install a third-party library or
> extension, that is).

I'll get started on that. I'm assuming I should generate a patch from the trunk (2.7)? The file doesn't look different, but I want to make sure I get it from the right place.

> I wonder if the patch could be structured as a conditional compilation? You
> know how many different spots are touched, and how many lines per spot.
>
> If it could be, then theoretically it could be released and people could do
> lots of comparative stress testing with different workloads.

That would be easy to do, because I am just replacing the *NonRecursiveMutex functions.

> What about fairness? I don't know off-hand whether the GIL is
> fair, or whether critical sections are fair, but it needs to be
> considered.

If you define fairness in this context as not starving other threads while consuming resources, that is built into the interpreter via sys.setcheckinterval() and also anywhere the GIL is released for I/O. What might be interesting is to see whether releasing a critical section and immediately re-acquiring it every _Py_CheckInterval bytecode operations behaves in a similar manner (see ceval.c, line 869). My best guess right now is that it will behave as expected when not using the spin-based critical section.

AFAIK, the kernel processes waiters in a FIFO manner without regard to priority. Because a guarantee of mutual exclusion is absolutely necessary, it's up to applications to provide fairness. Python does a decent job of this.

- Phillip
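For illustration, here is a rough Python sketch of that periodic release/re-acquire (the names `CHECK_INTERVAL` and `worker` are mine; the real thing happens in the ceval.c bytecode loop, not in Python code). Whether the waiting thread actually gets in during a "flash" is exactly the fairness question: the lock itself makes no ordering promise.

```python
import threading

# Rough analogue of what ceval.c does every _Py_CheckInterval bytecodes:
# the running thread drops the lock and immediately re-takes it, giving
# any waiter a chance to run in between.
CHECK_INTERVAL = 100
gil = threading.Lock()
progress = []

def worker(name, steps):
    gil.acquire()
    for i in range(steps):
        progress.append(name)
        if i % CHECK_INTERVAL == CHECK_INTERVAL - 1:
            gil.release()     # "flash" the lock ...
            gil.acquire()     # ... and immediately re-acquire it

    gil.release()

t1 = threading.Thread(target=worker, args=("A", 500))
t2 = threading.Thread(target=worker, args=("B", 500))
t1.start(); t2.start()
t1.join(); t2.join()
print(len(progress))  # 1000: both threads complete all steps
```

Nothing here guarantees that A and B interleave at the flash points; whether they do depends on the primitive and the scheduler, which is why the question matters.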
Re: [Python-Dev] Making the GIL faster & lighter on Windows
> FWIW, Win32 CriticalSections are guaranteed to be fair, but they don't
> guarantee a defined order of wakeup among threads of equal priority.

Indeed, I should have quoted the MSDN docs: "The threads of a single process can use a critical section object for mutual-exclusion synchronization. There is no guarantee about the order in which threads will obtain ownership of the critical section, however, the system will be fair to all threads." http://msdn.microsoft.com/en-us/library/ms683472(VS.85).aspx

I read somewhere else that FIFO ordering is present, but obviously we shouldn't expect that if it's not documented as such.

> According to a past discussion on this list, the current implementation isn't:
> http://mail.python.org/pipermail/python-dev/2008-March/077814.html
> (at least on the poster's system)

I believe he's only talking about Linux. Apples & oranges when it comes to stuff like this, although it still justifies looking into what happens every _Py_CheckInterval on Windows.

- Phillip
Re: [Python-Dev] Making the GIL faster & lighter on Windows
> No: fairness in mutex synchronization means that every waiter for the
> mutex will eventually acquire it; it won't happen that one thread
> starves waiting for the mutex. This is something that the mutex needs to
> provide, not the application.

Right, I guess I was thinking of it in terms of needing to release the mutex at some point in order for it to be later acquired.

> Please trust Antoine that it's relevant: if the current implementation
> isn't fair on Linux, there is no need for the new implementation to be
> fair on Windows.

Fair enough.

--

While setting up my patch, I'm noticing something potentially bad for this idea that I overlooked until just now. I'm going to hold off on submitting a ticket unless others suggest it's a better idea to keep this discussion going there.

The thread module's lock object uses the same code used to lock and unlock the GIL. Replacing the current locking mechanism with a critical section would break the expected functionality of the lock object in two specific cases:

1. Blocking recursion: critical sections don't block on recursive acquisition, and there is no way to enforce that.
2. Releasing: currently any thread can release a lock, but only the owning thread can release a critical section.

Of course, blocking recursion is only meaningful with the current behavior of #2; otherwise it's an unrecoverable deadlock.

There are a few solutions to this. The first would be to implement only the GIL as a critical section. The problem then is the need to change all of the core code that does not use PyEval_Acquire/ReleaseLock (there is some, right?), which is the best place to use something other than the thread module's locking mechanism on the GIL. This is doable with some effort, but clearly not an option if there is any possibility that extensions are using something other than the PyThreadState_*, PyGILState_* and PyEval_* APIs to manipulate the GIL (are there others?).
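Case 2 is easy to demonstrate from Python itself. This sketch uses threading.Lock (which wraps the thread module's lock) to show a release performed by a thread that never acquired the lock, something an owner-checked primitive like a critical section would forbid:

```python
import threading

# Python's thread lock has no notion of an owner: any thread may
# release it. A critical section would reject this at ReleaseCS time.
lock = threading.Lock()
lock.acquire()            # main thread takes the lock

def release_from_elsewhere():
    lock.release()        # legal for Python locks: releaser != acquirer

t = threading.Thread(target=release_from_elsewhere)
t.start()
t.join()
print(lock.locked())  # False: the other thread released it
```

This ownerless behavior is exactly what lets one thread use a lock to signal another, which is why swapping in an owner-checked primitive underneath the thread module would be visible to existing code.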
After any of this, of course, I wonder what kind of crazy things might be expected of the GIL externally that require its behavior to remain as it is.

The second solution would be to use semaphores. I can't say yet whether it would be worth it performance-wise, so I will refrain from conjecture for the moment.

I like the first solution above... I don't know why non-recursion would be necessary for the GIL; clearly it would be a little more involved, but if I can demonstrate the performance gain, maybe it's worth my time.

- Phillip
Re: [Python-Dev] Making the GIL faster & lighter on Windows
Heads up to those who were following: I did my best to clearly outline the situation and direction in the tracker. http://bugs.python.org/issue6132

It includes a patch that will break the expected behavior of the thread lock object but make it possible to test GIL performance.

> Note that for the GIL, if you use a CriticalSection object, you should
> initialize its "spincount" to zero, because the GIL is almost always in
> contention. That is, if you don't get the GIL right away, you won't for a
> while.

If I'm not mistaken, calling InitializeCriticalSection rather than InitializeCriticalSectionAndSpinCount (gotta love those long function names) sets the spin count to zero. Performance-wise, I could tell when the spin count wasn't zero - spinning is too much of a gamble in most contention situations.

> I don't know what kernel primitive the Critical Section uses, but if it uses
> an Event object or something similar, we are in the same soup, so to say,
> because the CriticalSection's spinlocking feature buys us nothing.

Judging from the increase in speed and CPU utilization I've seen, I don't believe this is the case. My guess is that it's something similar to a futex.

- Phillip
Re: [Python-Dev] Making the GIL faster & lighter on Windows
The testing patch I submitted to the tracker includes a semaphore as well, and I did take some time to try it out. It seems that it's no better than the event object, either for a single thread or scaled to many threads... so this does appear to indicate that the WaitForXX functions are costly (which is expected) and scale terribly (which is unfortunate).

I had always believed event objects to be "slower", but I'm not seeing a difference here compared to semaphores. My guess is that these results could be very different if I were to test on, say, Windows 2000 instead of Vista.

- Phillip

2009/5/28 Kristján Valur Jónsson :
> You are right, a small experiment confirmed that it is set to 0 (see
> SetCriticalSectionSpinCount()).
> I had assumed that a small non-zero value might be chosen on multiprocessor
> machines.
>
> Do you think that the problem lies with the use of the "event" object as
> such? Have you tried using a "semaphore" or "mutex" instead? Or do you
> think that all of the synchronization primitives that rely on the
> WaitForMultipleObjects() api are subject to the same issue?
>
> Cheers,
> Kristján
>
> -----Original Message-----
> From: python-dev-bounces+kristjan=ccpgames@python.org On Behalf Of Phillip Sitbon
> Sent: 27. maí 2009 22:23
> To: python-dev
> Subject: Re: [Python-Dev] Making the GIL faster & lighter on Windows
>
> If I'm not mistaken, calling InitializeCriticalSection rather than
> InitializeCriticalSectionAndSpinCount (gotta love those long function
> names) sets the spin count to zero. I could tell when the spin count
> wasn't zero as far as performance is concerned - spinning is too much
> of a gamble in most contention situations.
>
>> I don't know what kernel primitive the Critical Section uses, but if it
>> uses an Event object or something similar, we are in the same soup, so to
>> say, because the CriticalSection's spinlocking feature buys us nothing.
>
> Judging from the increase in speed and CPU utilization I've seen, I
> don't believe this is the case. My guess is that it's something
> similar to a futex.
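For anyone wanting a quick, portable feel for handoff cost between two threads, here is a crude ping-pong microbenchmark in Python. It is my own sketch, not the C-level test from the tracker, and at the Python level most of the time goes to the interpreter rather than the primitive itself, so it only shows relative differences:

```python
import threading
import time

def pingpong_semaphores(n=1000):
    """Hand control back and forth n times between two threads using a
    pair of semaphores; returns elapsed seconds for the timed side."""
    ping, pong = threading.Semaphore(0), threading.Semaphore(0)

    def partner():
        for _ in range(n):
            ping.acquire()   # wait for the main thread's signal
            pong.release()   # answer back

    t = threading.Thread(target=partner)
    t.start()
    start = time.perf_counter()
    for _ in range(n):
        ping.release()
        pong.acquire()
    elapsed = time.perf_counter() - start
    t.join()
    return elapsed

# Two handoffs per round trip, so divide by 2*n for per-handoff cost.
print("%.1f us/handoff" % (pingpong_semaphores() / 2000 * 1e6))
```

Substituting a pair of threading.Event objects (set/wait/clear) for the semaphores gives a rough Event-vs-Semaphore comparison in the same spirit as the tests above.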
Re: [Python-Dev] FINAL PROPULSION OPEN SOURCE ENGINE VARIANT FOR THE F-35 JOINT STRIKE FIGHTER
>> I question the whole notion of using open source in military weapons.
>> It seems like a rather basic violation of operational security. Perhaps
>> your enemies will exploit your bugs instead of nicely reporting them
>> and submitting patches on SourceForge ;-)
>
> Eric Raymond would argue that it's probably the other way around --
> proprietary software doesn't have enough eyeballs to make it safe. :-)

I guess in some cases it wouldn't matter if it were open source: http://online.wsj.com/article/SB124027491029837401.html

I'm not sure of the seriousness (or mental stability) of Mr. "OMEGA RED", but I definitely got a chuckle from this :)

- Phillip
Re: [Python-Dev] pthread sem PyThread_acquire_lock
I'll do my best to try and explain/contribute, but please feel free to correct anything I get wrong.

I believe the "swallowing" he's referring to is the ignoring of errno EINTR. I don't think that's the correct place to handle signals to begin with; why not just use the signal module to deal with such a scenario? http://docs.python.org/dev/library/signal.html#module-signal

AFAIK, ignoring EINTR doesn't preclude the calling of signal handlers. There are many functions that don't return this value anymore, making them effectively restartable. I remember a number of years ago when returning EINTR (or not) wasn't part of any standard, so the only way to provide consistent behavior was to ignore it and loop. I'm not sure whether that is still the case. A great example is reading from a socket: whether or not it can be interrupted depends on the platform, so catching Ctrl+C often requires a timeout loop.

Also, remember that signals are asynchronous in the sense that they are handled outside the normal execution flow of a program. Checking for EINTR probably isn't the best way to determine whether a signal has been sent to the program.

Cheers,
Phillip

On Sat, Jun 27, 2009 at 3:58 PM, Florian Mayer wrote:
>
> Martin v. Löwis wrote:
>>
>> Can you please explain what you mean by "swallowing"?
>
> The signals don't get through when acquiring the lock.
>>
>> What is the sequence of actions triggering the case you are talking
>> about, what happens, and what do you expect to happen instead?
>
> Easiest way is to create a Queue.Queue and call .get() on it and try to
> Ctrl-C it.
>>
>> Also, how would you fix this (in principle, not what the specific
>> patch would look like)?
>
> Remove the loop that explicitly does this in a new function and use that
> one in threadmodule.c for lock.acquire.
>>
>> Regards,
>> Martin
>
> Best regards,
> Florian
>
> PS: Excuse me, I messed up things once again.
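The swallow-and-retry loop under discussion looks roughly like this in Python (my own sketch; note that since PEP 475, Python 3.5+ performs this retry automatically for most system calls, running any pending Python-level signal handlers before retrying, so the explicit loop is mostly of historical interest):

```python
import os

def read_retrying(fd, size):
    """Keep restarting a system call that fails with EINTR. The signal
    handler has already run by the time InterruptedError is raised, so
    looping here does not suppress it; it only hides the interruption
    from the caller."""
    while True:
        try:
            return os.read(fd, size)
        except InterruptedError:   # errno == EINTR
            continue

r, w = os.pipe()
os.write(w, b"hello")
print(read_retrying(r, 5))  # b'hello'
os.close(r)
os.close(w)
```

The complaint in this thread is precisely that when such a loop lives inside lock acquisition in C, a blocked `lock.acquire()` never returns to the eval loop where Python-level handlers (like the KeyboardInterrupt machinery) get a chance to run.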
Re: [Python-Dev] [Python-ideas] Remove GIL with CAS instructions?
I'd just like to point out some previous discussion about implementing the GIL as a critical section or semaphore on Windows, since it's come up here (although not quite the center of the OP's proposal, AFAICS):

http://bugs.python.org/issue6132
http://mail.python.org/pipermail/python-dev/2009-May/089746.html

Some of this is more low-level. I did see higher performance when using non-Event objects, although I have not had time to follow up and do a deeper analysis. The GIL "flashing" problem with critical sections can very likely be rectified with a call to Sleep(0) or YieldProcessor() for those who are worried about it.

On the subject of fairness, I tested various forms of the GIL on my multi-threaded ISAPI extension, where every millisecond counts under high concurrency, and fairness wasn't an issue for single- or multi-core systems. It may be anecdotal, but it also may be that the issue is somewhat overblown.

It seems like these discussions come up in one form or another a few times a year and don't really get anywhere - probably because many people find it easier to just run one instance of Python on each core/processor. IPC is cheap (cPickle rocks!), and Python's memory footprint is acceptable by today's standards. Still, it is an interesting topic to many, myself included.

Also, many people keep talking about inefficiencies due to threads waking up to a locked GIL. I'd like to see examples of this - most of the time, the OS should know that the thread is contending on the lock object, and it is skipped over. Granted, a thread may wake up just to release the GIL shortly thereafter, but that's why sys.setcheckinterval() is there for us to tinker with.

Anyway, enough of my $0.02.

- Phillip

2009/10/21 John Arbash Meinel :
> Kristján Valur Jónsson wrote:
> ...
>
>> This depends entirely on the platform and primitives used to implement the
>> GIL.
>> I'm interested in Windows.
>> There, I found this article:
>> http://fonp.blogspot.com/2007/10/fairness-in-win32-lock-objects.html
>> So, you may be on to something. Perhaps a simple C test is in order then?
>>
>> I did that. I found, on my dual-core Vista machine, running "release",
>> that both Mutexes and CriticalSections behaved as you describe, with no
>> "fairness". Using a "semaphore" seems to retain fairness, however.
>> "Fairness" was retained in debug builds too, strangely enough.
>>
>> Now, Python uses none of these. On Windows, it uses an "Event" object
>> coupled with an atomically updated counter. This also behaves fairly.
>>
>> The test application is attached.
>>
>> I think that you ought to substantiate your claims better, maybe with a
>> specific platform and using some test like the above.
>>
>> On the other hand, it shows that we must be careful what we use. There has
>> been some talk of using CriticalSections for the GIL on Windows. This test
>> ought to show the danger of that. The GIL is different from a regular lock.
>> It is a reverse-lock, really, and therefore may need to be implemented in
>> its own special way, if we want very fast mutexes for the rest of the system
>> (cc to python-dev)
>>
>> Cheers,
>> Kristján
>
> I can compile and run this, but I'm not sure I know how to interpret the
> results. If I understand it correctly, then everything but "Critical
> Sections" is fair on my Windows Vista machine.
>
> To run, I changed the line "#define EVENT" to EVENT, MUTEX, SEMAPHORE
> and CRIT. I then built and ran in the "Release" environment (using VS 2008
> Express).
>
> For all but CRIT, I saw things like:
> thread 5532 reclaims GIL
> thread 5532 working 51234 units
> thread 5532 worked 51234 units: 1312435761
> thread 5532 flashing GIL
> thread 5876 reclaims GIL
> thread 5876 working 51234 units
> thread 5876 worked 51234 units: 1312435761
> thread 5876 flashing GIL
>
> where there would be 4 lines for one thread, then 4 lines for the other
> thread.
> For CRIT, I saw something more like 50 lines for one thread, and then 50
> lines for the other thread.
>
> This is Vista Home Basic and VS 2008 Express Edition, on a 2-core
> machine.
>
> John
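The Sleep(0)/YieldProcessor() suggestion from earlier in this message can be approximated in Python with time.sleep(0), which gives up the remainder of the timeslice after releasing the lock so a waiter can actually grab it before we re-acquire. This is my own sketch of the idea, not the C patch:

```python
import threading
import time

gil = threading.Lock()
order = []

def worker(name, steps):
    for _ in range(steps):
        with gil:
            order.append(name)
        time.sleep(0)   # yield after release, before re-acquiring

threads = [threading.Thread(target=worker, args=(n, 50)) for n in "AB"]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(order))          # 100: each thread appended 50 entries
print(len(set(order)))     # 2: both threads got the lock
```

Without the yield, a releasing thread can win the re-acquire race indefinitely on some primitives, which is the "flashing" pathology John's CRIT runs show; the yield trades a little throughput for the chance of a handoff.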
Re: [Python-Dev] GIL behaviour under Windows
I know I already posted some relevant threads to the other discussion, but I wanted to point out a couple of specific comments on GIL fairness from that discussion:

http://mail.python.org/pipermail/python-dev/2009-May/089752.html
http://mail.python.org/pipermail/python-dev/2009-May/089755.html

- Phillip

On Thu, Oct 22, 2009 at 10:16 AM, Antoine Pitrou wrote:
> Tres Seaver palladion.com> writes:
>>
>> I read Sturla as saying there were 99,939 switches out of a possible
>> 100,000, with sys.checkinterval set to 100.
>
> Oops, you're right.
> But depending on the elapsed time (again :-)), it may be too high, because
> too many switches per second will add a lot of overhead and decrease
> performance.
>
> Regards,
> Antoine.