On Wed, Sep 21, 2016 at 20:19:18 +0200, Paolo Bonzini wrote:
(snip)
> No, this is not true. Barriers order stores and loads within a thread
> _and_ establish synchronizes-with edges.
>
> In the example above you are violating causality:
>
> - cpu0 stores cpu->running before loading pending_cpus
>
> - because pending_cpus == 0, cpu1 stores pending_cpus = 1 after cpu0
> loads it
>
> - cpu1 loads cpu->running after it stores pending_cpus
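The quoted argument can be sketched in C11 atomics (a minimal model with hypothetical names, not the actual QEMU code): one side stores `running` before loading `pending_cpus`, the other stores `pending_cpus` before loading `running`, with a full barrier between each store/load pair. The fences forbid the outcome where both sides miss each other's store:

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool running;      /* stands in for cpu->running */
static atomic_int pending_cpus;  /* stands in for pending_cpus */

/* Fast path: returns true if this CPU may enter without the lock.
 * The store to running is ordered before the load of pending_cpus. */
static bool cpu_exec_start_fast(void)
{
    atomic_store_explicit(&running, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);            /* smp_mb() */
    return atomic_load_explicit(&pending_cpus, memory_order_relaxed) == 0;
}

/* Exclusive side: returns true if the CPU was seen running.
 * The store to pending_cpus is ordered before the load of running. */
static bool start_exclusive_sees_running(void)
{
    atomic_store_explicit(&pending_cpus, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);            /* smp_mb() */
    return atomic_load_explicit(&running, memory_order_relaxed);
}
```

Whichever order the two sides run in, at least one of them observes the other's store, which is exactly the synchronizes-with edge described above.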
OK. So I simplified the example to understand this better:
cpu0                  cpu1
----                  ----
{ x = y = 0; r0 and r1 are private variables }
x = 1                 y = 1
smp_mb()              smp_mb()
r0 = y                r1 = x
Turns out this is scenario 10 here: https://lwn.net/Articles/573436/
The source of my confusion was not paying due attention to smp_mb,
which is necessary for maintaining transitivity.
> > Is there a performance (scalability) reason behind this patch?
>
> Yes: it speeds up all cpu_exec_start/end, _not_ start/end_exclusive.
>
> With this patch, as long as there are no start/end_exclusive (which are
> supposed to be rare) there is no contention on multiple CPUs doing
> cpu_exec_start/end.
>
> Without it, as CPUs increase, the global cpu_list_mutex is going to
> become a bottleneck.
I see. Scalability-wise I wouldn't expect much improvement for full-system
MTTCG, given that the iothread lock is still acquired on every CPU loop
exit (just like in KVM). For user-mode, however, this should yield
measurable improvements =D
Thanks,
E.