On 2013-06-28 20:26, Paolo Bonzini wrote: > This includes a (mangled) copy of the urcu-qsbr code from liburcu. > The main changes are: 1) removing dependencies on many other header files > in liburcu; 2) removing for simplicity the tentative busy waiting in > synchronize_rcu, which has limited performance effects; 3) replacing > futexes in synchronize_rcu with QemuEvents for Win32 portability. > The API is the same as liburcu, so it should be possible in the future > to require liburcu on POSIX systems for example and use our copy only > on Windows. > > Among the various versions available I chose urcu-qsbr, which has the > fastest rcu_read_{lock,unlock} but requires the program to manually > annotate quiescent points or intervals. QEMU threads usually have easily > identified quiescent periods, so this should not be a problem. > > Signed-off-by: Paolo Bonzini <pbonz...@redhat.com> > --- > docs/rcu.txt | 303 > +++++++++++++++++++++++++++++++++++++++++++++ > hw/9pfs/virtio-9p-synth.c | 1 + > include/qemu/queue.h | 13 ++ > include/qemu/rcu-pointer.h | 110 ++++++++++++++++ > include/qemu/rcu.h | 141 +++++++++++++++++++++ > include/qemu/thread.h | 3 - > libcacard/Makefile | 3 +- > util/Makefile.objs | 1 + > util/rcu.c | 195 +++++++++++++++++++++++++++++ > 9 files changed, 766 insertions(+), 4 deletions(-) > create mode 100644 docs/rcu.txt > create mode 100644 include/qemu/rcu-pointer.h > create mode 100644 include/qemu/rcu.h > create mode 100644 util/rcu.c > > diff --git a/docs/rcu.txt b/docs/rcu.txt > new file mode 100644 > index 0000000..4869ec7 > --- /dev/null > +++ b/docs/rcu.txt > @@ -0,0 +1,303 @@ > +Using RCU (Read-Copy-Update) for synchronization > +================================================ > + > +Read-copy update (RCU) is a synchronization mechanism that is used to > +protect read-mostly data structures. RCU is very efficient and scalable > +on the read side (it is wait-free), and thus can make the read paths > +extremely fast. 
> + > +RCU supports concurrency between a single writer and multiple readers, > +thus it is not used alone. Typically, the write-side will use a lock to > +serialize multiple updates, but other approaches are possible (e.g., > +restricting updates to a single task). In QEMU, when a lock is used, > +this will often be the "iothread mutex", also known as the "big QEMU > +lock" (BQL). Also, restricting updates to a single task is done in > +QEMU using the "bottom half" API. > + > +RCU is fundamentally a "wait-to-finish" mechanism. The read side marks > +sections of code with "critical sections", and the update side will wait > +for the execution of all *currently running* critical sections before > +proceeding, or before asynchronously executing a callback. > + > +The key point here is that only the currently running critical sections > +are waited for; critical sections that are started _after_ the beginning > +of the wait do not extend the wait, despite running concurrently with > +the updater. This is the reason why RCU is more scalable than, > +for example, reader-writer locks. It is so much more scalable that > +the system will have a single instance of the RCU mechanism; a single > +mechanism can be used for an arbitrary number of "things", without > +having to worry about things such as contention or deadlocks. > + > +How is this possible? The basic idea is to split updates into two phases, > +"removal" and "reclamation". During removal, we ensure that subsequent > +readers will not be able to get a reference to the old data. After > +removal has completed, a critical section will not be able to access > +the old data. Therefore, critical sections that begin after removal > +do not matter; as soon as all previous critical sections have finished, > +there cannot be any readers who hold references to the data structure, > +and it can now be safely reclaimed (e.g., freed or unref'ed).
> + > +Here is a picture: > + > + thread 1 thread 2 thread 3 > + ------------------- ------------------------ ------------------- > + enter RCU crit.sec. > + | finish removal phase > + | begin wait > + | | enter RCU crit.sec. > + exit RCU crit.sec. | | > + complete wait | > + begin reclamation phase | > + exit RCU crit.sec. > + > + > +Note how thread 3 is still executing its critical section when thread 2 > +starts reclaiming data. This is possible because the old version of the > +data structure was not accessible at the time thread 3 began executing > +that critical section. > + > + > +RCU API > +======= > + > +The core RCU API is small: > + > + void rcu_read_lock(void); > + > + Used by a reader to inform the reclaimer that the reader is > + entering an RCU read-side critical section. > + > + void rcu_read_unlock(void); > + > + Used by a reader to inform the reclaimer that the reader is > + exiting an RCU read-side critical section. Note that RCU > + read-side critical sections may be nested and/or overlapping. > + > + void synchronize_rcu(void); > + > + Blocks until all pre-existing RCU read-side critical sections > + on all threads have completed. This marks the end of the removal > + phase and the beginning of the reclamation phase. > + > + Note that it would be valid for another update to come while > + synchronize_rcu is running. Because of this, it is better that > + the updater releases any locks it may hold before calling > + synchronize_rcu. > + > + typeof(*p) rcu_dereference(p); > + typeof(p) rcu_assign_pointer(p, typeof(p) v); > + > + These macros are similar to atomic_mb_read() and atomic_mb_set() > + respectively. However, they make some assumptions on the code > + that calls them, which allows a more optimized implementation.
> + > + rcu_assign_pointer assumes that the update side is not going > + to read from the data structure after "publishing" the new > + values; that is, it assumes that all assignments happen at > + the very end of the removal phase. > + > + rcu_dereference assumes that whenever a single RCU critical > + section reads multiple shared data items, these reads are either > + data-dependent or need no ordering. This is almost always the > + case when using RCU. If this is not the case, you can use > + atomic_mb_read() or smp_rmb(). > + > + If you are going to be fetching multiple fields from the > + RCU-protected structure, repeated rcu_dereference() calls > + would look ugly and incur unnecessary overhead on Alpha CPUs. > + You can then do this: > + > + p = rcu_dereference(head); > + foo = p->foo; > + bar = p->bar; > + > + > +RCU QUIESCENT STATES > +==================== > + > +An efficient implementation of rcu_read_lock() and rcu_read_unlock() > +relies on the availability of fast thread-local storage. Unfortunately, > +this is not possible on all the systems supported by QEMU (in particular > +on many POSIX systems other than Linux and Solaris). > + > +For this reason, QEMU's RCU implementation resorts to manual annotation > +of "quiescent states", i.e. points where no RCU read-side critical > +section can be active. All threads that participate in the RCU mechanism > +need to annotate such points.
"... can be active" also implies that rcu_read_lock(); rcu_thread_offline(); is actually an illegal ordering, right? Can we add some optional debug code that performs runtime checking of this requirement? That should also catch (sooner or later) leaked RCU locks. Jan -- Siemens AG, Corporate Technology, CT RTC ITP SES-DE Corporate Competence Center Embedded Linux