[Python-Dev] deja-vu .. python locking

2006-09-18 Thread Martin Devera
Hello,

as someone wrote in the FAQ, every so often someone starts a thread about
finer-grained locking in Python.
OK, here is one.

I don't want to start a flamewar. I only seek suggestions and constructive
criticism. I have some ideas which are new in this context (I believe), and
I only wanted to make them public in case someone finds them interesting.

Comments are welcome.
Martin


Round 1, Greg Stein's patch
  The patch removes the GIL from version 1.6 and replaces the locking of
  list, dict and other structures with finer-grained locking.
  The major slowdown seems to be in the list and dict structures; dicts
  are used for object attributes, and these are accessed quite often.

  Because (IIUC) a mutex struct is quite heavy, dicts and lists are
  locked via a pool of locks. When you lock such a pooled lock you
  really lock two locks: one locks the pool itself, and the other
  locks the pooled lock (the second locking can be omitted in the
  uncontended case because locks in the pool are kept in the locked state).
  One lock operation takes about 25 cycles on a UP P4 (mainly the pipeline
  flush during the memory barrier) and can be even more expensive (hundreds
  of cycles) due to cacheline migration between CPUs on an SMP machine.
  The "global" pool lock is subject to cacheline ping-pong, as it will
  often be reacquired by competing CPUs.
  In mappinglookup there is lookmapping guarded by this locking scheme;
  lookmapping itself takes about 20 cycles in the best (one hopes typical)
  case, plus the compareobj cost (for string keys, say 50..100 cycles).
  Thus locking/unlocking a read takes 50..100 cycles while the operation
  itself is 70..120 cycles.
  One might expect about a 50% slowdown in the dict read path.
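
  To make the double locking concrete, here is a simplified sketch (the
  names and the hashing are mine, not from the patch; the real patch can
  skip the second lock in the uncontended case):

  /* Simplified pool-of-locks: an object owns no mutex of its own, it
   * hashes to a slot in a global pool, and acquiring the pooled lock
   * costs two lock operations: the pool guard plus the slot lock. */
  #include <pthread.h>

  #define POOL_SIZE 64

  static pthread_mutex_t pool_guard = PTHREAD_MUTEX_INITIALIZER;
  static pthread_mutex_t pool[POOL_SIZE];   /* initialized at startup */

  static pthread_mutex_t *pooled_lock(void *ob)
  {
      pthread_mutex_t *slot = &pool[((unsigned long)ob >> 4) % POOL_SIZE];
      pthread_mutex_lock(&pool_guard);      /* lock #1: the pool itself */
      pthread_mutex_lock(slot);             /* lock #2: the pooled lock */
      pthread_mutex_unlock(&pool_guard);
      return slot;                          /* caller unlocks the slot later */
  }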

RCU-like locking
  The solution I have in mind is similar to RCU. In Python we have a
  quiescent state: when a thread returns to the main loop of the interpreter.
  Let's add an "owner_thread" field to every lockable object. It records the
  last thread (by id) which called any lockable method on the object.
  Each LOCK operation looks like this:

  while (ob->owner_thread != self_thread()) {
      unlock_mutex(thread_mutex[self_thread()])
      // wait for the owning thread to reach a quiescent state
      lock_mutex(thread_mutex[ob->owner_thread])
      ob->owner_thread = self_thread()
      unlock_mutex(thread_mutex[ob->owner_thread])
      lock_mutex(thread_mutex[self_thread()])
  }
  Unlock is not done - we own the object now and can use it without locking
  (until we return to the interpreter loop or call LOCK on another object).
  For non-shared objects the only penalty is the
  ob->owner_thread != self_thread() test. Not sure about Windows, but on
  recent Linuxes one can use the %gs register as the thread id, so the
  compare costs about 3 cycles (and owner_thread should be in the object's
  cacheline anyway).
  In the contended case there is some cacheline ping-pong on ob and the
  mutex, but that is expected.
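
  For concreteness, the per-object state could look like this (a sketch of
  one possible layout, not actual CPython code; self_thread() is the
  pseudocode primitive used above):

  /* Hypothetical object header: owner_thread sits next to the refcount
   * so the fast-path test touches a cacheline the interpreter is
   * dirtying anyway. */
  typedef struct {
      long ob_refcnt;                /* as in PyObject_HEAD */
      void *ob_type;                 /* as in PyObject_HEAD */
      unsigned long owner_thread;    /* id of the last owning thread */
  } lockable_head;

  /* Fast path of LOCK: the common non-shared case is one compare. */
  #define OWNED_BY_SELF(ob) \
      (((lockable_head *)(ob))->owner_thread == self_thread())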

Deadlocks
  Our object ownership is long-lived - from acquiring the object in LOCK to
  the thread's next quiescent state. Thus when two threads each want to step
  on the other's object, they will deadlock. A simple solution is to extend
  the set of quiescent states. One is when a thread releases its thread_mutex
  in the main loop (and immediately reacquires it). Additionally, a thread
  can release it just before it is going to wait on another thread's mutex,
  as in LOCK (already in the code above). If you use LOCK correctly, then
  while you are LOCKing an object you can't be in a vulnerable part of any
  OTHER object, so other threads may take ownership of your own objects
  during that time.
  One may also want to release the lock when going to lock a mutex in the
  threading package, and in the other places where the GIL is released today.
  However, I admit that I have done no formal proof regarding deadlock; I
  plan to do one if nobody finds another flaw in the proposal.
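
  For illustration, a quiescent point in the interpreter loop might look
  like this (a sketch in the same pseudocode as above, executed say once
  every N bytecodes):

  /* Briefly release our thread mutex so that any thread blocked in
   * LOCK waiting for us can take ownership of our objects. */
  static void quiescent_point(void)
  {
      unlock_mutex(thread_mutex[self_thread()]);
      /* another thread may complete its LOCK loop here */
      lock_mutex(thread_mutex[self_thread()]);
  }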

Big reader lock
  While the above scheme might work well, it would impose a performance
  penalty on shared dicts which are almost read-only (module.__dict__).
  For these, similar locking can be used, except that a writer has to wait
  until ALL other threads enter a quiescent state (taking all of their
  locks), then perform the change and unlock them all. Readers can read
  without any locking.
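
  The writer path could look roughly like this (a sketch in the same
  pseudocode; num_threads and the thread-id indexing are my assumptions):

  /* Write to an almost-read-only dict: take every other thread's mutex
   * so all readers are known to be quiescent, mutate, then release.
   * Readers never lock at all. */
  static void write_shared_dict(PyObject *d, PyObject *key, PyObject *val)
  {
      int t;
      for (t = 0; t < num_threads; t++)
          if (t != self_thread())
              lock_mutex(thread_mutex[t]);   /* wait for quiescence */
      PyDict_SetItem(d, key, val);           /* no reader is inside d now */
      for (t = 0; t < num_threads; t++)
          if (t != self_thread())
              unlock_mutex(thread_mutex[t]);
  }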

Compatibility with 3rd party modules
  I've read this argument on the py-dev list. Maybe I'm not understanding
  something, but would it be so complex for Py_InitModule4 to use an extra
  flag in apiver, for example?
  When at least one non-free-threaded module is loaded, locking is done
  the good old way...
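
  Something like this, where the flag bit and the helper are made up for
  illustration (Py_InitModule4's apiver argument is real):

  /* Hypothetical: a free-threading-ready module ORs a flag bit into
   * the apiver it passes to Py_InitModule4; module init strips and
   * tests it, and one missing flag disables fine-grained locking. */
  #define PY_FREETHREAD_FLAG 0x40000000

  static int
  module_is_freethreaded(int *apiver)
  {
      int ok = (*apiver & PY_FREETHREAD_FLAG) != 0;
      *apiver &= ~PY_FREETHREAD_FLAG;   /* keep the version check intact */
      return ok;                        /* if 0: fall back to the GIL */
  }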


Re: [Python-Dev] deja-vu .. python locking

2006-09-18 Thread Martin Devera
Martin v. Löwis wrote:
> Martin Devera schrieb:
>> RCU-like locking
>>   The solution I have in mind is similar to RCU. In Python we have a
>>   quiescent state: when a thread returns to the main loop of the interpreter.
> 
> There might be a terminology problem here. RCU is read-copy-update,
> right? I fail to see the copy (copy data structure to be modified)
> and update (replace original pointer with pointer to copy) part.
> Does this play a role in that scheme? If so, what specific structure
> is copied for, say, a list or a dict?
> 
> This confusion makes it very difficult for me to understand your
> proposal, so I can't comment much on it. If you think it could work,
> just go ahead and create an implementation.

That is why I used the word "similar". I see the similarity in the way it
achieves the safe "delete" phase of RCU. Probably I selected a bad title for
the text. It is because I was reading about the RCU implementation in the
Linux kernel and discovered that the idea of postponing critical code to
some safe point in the future might work in the Python interpreter.

So you are right. It is not RCU. It only uses a technique similar to the
one RCU uses for freeing the old copy of data.

It is based on the assumption that an object is typically used by a single
thread. You must lock it anyway, just in case another thread steps on it.
The idea is that each object is "owned" by a thread. The owner can use its
objects without locking. If a thread wants to use a foreign object, it has
to wait for the owning thread to go to some safe place (out of the
interpreter, into LOCK of another object...). It is done with a per-thread
lock, and that is necessary because the owner does no locking; thus you can
be sure that nobody is using the object while the former owner is somewhere
out of the object.

Regarding implementation, I wanted to gather some opinions before starting
to implement something as big as this patch. Probably someone can look at it
and say: hey, it is stupid, you forgot that FILL_IN... ;-)

I hope I explained it better this time; I know my English is not the best.
At least it's worse than my Python :-)

thanks for your time, Martin


Re: [Python-Dev] deja-vu .. python locking

2006-09-18 Thread Martin Devera
>> So you are right. It is not RCU. It only uses a technique similar to the
>> one RCU uses for freeing the old copy of data.
>>
>> It is based on the assumption that an object is typically used by a single thread.
> 
> Which thread owns builtins?  Or module dictionaries?  If two threads are
> running the same function and share no state except their globals, won't
> they constantly be thrashing on the module dictionary?  Likewise, if the
> same method is running in two different threads, won't they thrash on the
> class dictionary?

As I've written in the "Big reader lock" paragraph of the original proposal,
these objects could be handled by not blocking in the read path and waiting
for all other threads to "come home" before modifying.
The locking mode could be selected either by something like __locking__ or
by detecting the mode automatically.



Re: [Python-Dev] deja-vu .. python locking

2006-09-19 Thread Martin Devera
Greg Ewing wrote:
> Martin Devera wrote:
> 
>> As I've written in the "Big reader lock" paragraph of the original
>> proposal, these objects could be handled by not blocking in the read path
> 
> But as was just pointed out, because of refcounting,
> there's really no such thing as read-only access to
> an object. What *looks* like read-only access at the
> Python level involves refcount updates just from the
> act of touching the object.
> 

Yes, I was thinking about atomic inc/dec (locked inc/dec on x86)
as used in G. Stein's patch.
I have to admit that I haven't measured its performance; I was
hoping for a decent one. But from http://www.linuxjournal.com/article/6993
it seems that an atomic inc is rather expensive too (75ns on a 1.8GHz P4) :-(
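
Something like this reproduces that kind of measurement (a sketch; x86 GCC
with the __sync builtin assumed, compiled without optimization so the plain
loop is not folded away):

  /* Rough single-threaded timing of an atomic vs. a plain increment;
   * this shows only the local pipeline cost, not SMP cacheline traffic. */
  #include <stdio.h>

  static unsigned long long rdtsc(void)
  {
      unsigned int lo, hi;
      __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
      return ((unsigned long long)hi << 32) | lo;
  }

  int main(void)
  {
      static long counter;
      enum { N = 10000000 };
      unsigned long long t0, t1;
      long i;

      t0 = rdtsc();
      for (i = 0; i < N; i++)
          __sync_fetch_and_add(&counter, 1);   /* lock xadd */
      t1 = rdtsc();
      printf("atomic inc: %.1f cycles\n", (double)(t1 - t0) / N);

      t0 = rdtsc();
      for (i = 0; i < N; i++)
          counter++;                           /* plain inc */
      t1 = rdtsc();
      printf("plain inc:  %.1f cycles\n", (double)(t1 - t0) / N);
      return 0;
  }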

Greg, what change do you have in mind regarding that "3 instruction
addition" to refcounting?
thanks, Martin


Re: [Python-Dev] deja-vu .. python locking

2006-09-19 Thread Martin Devera
> Ah, I think I understand now. First the minor critique: I believe
> the locking algorithm isn't thread-safe:
> 
>   while (ob->owner_thread != self_thread()) {
>       unlock_mutex(thread_mutex[self_thread()])
>       // wait for the owning thread to reach a quiescent state
>       lock_mutex(thread_mutex[ob->owner_thread])
>       ob->owner_thread = self_thread()
>       unlock_mutex(thread_mutex[ob->owner_thread])
>       lock_mutex(thread_mutex[self_thread()])
>   }
> 
> If two threads are competing for the same object held by a third
> thread, they may simultaneously enter the while loop, and then
> simultaneously try to lock the owner_thread. Now, one will win,
> and own the object. Later, the other will gain the lock, and
> unconditionally overwrite ownership. This will cause two threads
> to own the object, which is an error.

oops .. well, that is a very stupid error on my side. Yes, you are
absolutely right; I'll have to rethink it. I hope it is possible
to do it in a correct way...

> The more fundamental critique is: Why? It seems you do this
> to improve efficiency, (implicitly) claiming that it is
> more efficient to keep holding the lock, instead of releasing
> and re-acquiring it each time.
> 
> I claim that this doesn't really matter: any reasonable
> mutex implementation will be "fast" if there is no lock
> contention. On locking, it will not invoke any system
> call if the lock is currently not held (but just
> atomically test-and-set some field of the lock); on
> unlocking, it will not invoke any system call if
> the wait list is empty. As you also need to test, there
> shouldn't be much of a performance difference.

I measured it. A lock op with futex-based Linux locking is the same speed
as a Windows critical section: about 30 cycles on my P4 1.8GHz in the
uncontended case.
As explained in the already mentioned http://www.linuxjournal.com/article/6993,
that seems to be due to the pipeline flush during the cmpxchg insn.
And there will be a cacheline transfer penalty on SMP which is much larger.
So mutex locking will take time comparable to the protected code itself
(assuming fast code like a dict/list read).
A single compare will take ten times less.
Am I missing something?
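
The loop was essentially this (a reconstructed sketch, not the exact code;
NPTL pthreads assumed, so the uncontended path is a pure userspace futex
operation; build with -lpthread):

  /* Rough cost of an uncontended pthread_mutex lock/unlock pair. */
  #include <pthread.h>
  #include <stdio.h>

  static unsigned long long rdtsc(void)
  {
      unsigned int lo, hi;
      __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
      return ((unsigned long long)hi << 32) | lo;
  }

  int main(void)
  {
      pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
      enum { N = 10000000 };
      unsigned long long t0, t1;
      long i;

      t0 = rdtsc();
      for (i = 0; i < N; i++) {
          pthread_mutex_lock(&m);
          pthread_mutex_unlock(&m);
      }
      t1 = rdtsc();
      printf("lock+unlock: %.1f cycles\n", (double)(t1 - t0) / N);
      return 0;
  }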

thanks, Martin


Re: [Python-Dev] deja-vu .. python locking

2006-09-19 Thread Martin Devera
Greg Ewing wrote:
> Martin Devera wrote:
> 
>> Greg, what change do you have in mind regarding that "3 instruction
>> addition" to refcounting ?
> 
> I don't have any change in mind. If even an atomic inc
> is too expensive, it seems there's no hope for us.

Just out of curiosity, would it be a big problem to remove refcounting and
live with garbage collection only? I'm not sure whether some py code depends
on exact refcnt behaviour (I guess it should not).
Probably not for the mainstream, but maybe as a compile-time option, as part
of a free-threading solution only for those who need it.
Even if you can do a fast atomic inc/dec, it forces the cacheline holding
the refcounter to ping-pong between the caches of the referencing CPUs (for
read-only class dicts, for example), so you can probably never get good SMP
scalability.
Consider a main memory latency of 100ns on an 8-way 2GHz SMP system where
parallel computation within the same py class is going on on all CPUs.
If each CPU touches the shared refcount every 6400 instructions or so (quite
realistic), then at least one CPU at any given time will be blocked on that
inc/dec, so you lose about one CPU's worth of work to overhead...
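
The effect is easy to demonstrate by comparing one shared counter with
per-thread padded counters (a sketch under my assumptions; Linux pthreads
and the GCC __sync builtin; build with -lpthread):

  /* NTHREADS CPUs hammering one shared counter vs. each its own
   * cacheline-padded counter. The shared version scales badly because
   * the line migrates between caches on every atomic inc - exactly
   * what a shared refcount does. */
  #include <pthread.h>
  #include <stdio.h>
  #include <sys/time.h>

  #define NTHREADS 8
  #define ITERS    10000000L

  static long shared_counter;
  static struct { long n; char pad[64 - sizeof(long)]; } local[NTHREADS];

  static void *hammer_shared(void *arg)
  {
      long i;
      (void)arg;
      for (i = 0; i < ITERS; i++)
          __sync_fetch_and_add(&shared_counter, 1);   /* line ping-pongs */
      return NULL;
  }

  static void *hammer_local(void *arg)
  {
      long id = (long)arg, i;
      for (i = 0; i < ITERS; i++)
          __sync_fetch_and_add(&local[id].n, 1);      /* line stays put */
      return NULL;
  }

  static void run(void *(*fn)(void *), const char *name)
  {
      pthread_t t[NTHREADS];
      struct timeval a, b;
      long i;
      gettimeofday(&a, NULL);
      for (i = 0; i < NTHREADS; i++)
          pthread_create(&t[i], NULL, fn, (void *)i);
      for (i = 0; i < NTHREADS; i++)
          pthread_join(t[i], NULL);
      gettimeofday(&b, NULL);
      printf("%s: %.2fs\n", name,
             (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6);
  }

  int main(void)
  {
      run(hammer_shared, "shared (ping-pong)");
      run(hammer_local,  "local  (padded)");
      return 0;
  }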