Hi Claudio,

I'll rebase soon, tomorrow with a bit of luck ;).
Thanks,
Fred

On 07/10/2015 14:46, Claudio Fontana wrote:
> Hello Frederic,
>
> On 11.08.2015 08:27, Frederic Konrad wrote:
>> On 11/08/2015 08:15, Benjamin Herrenschmidt wrote:
>>> On Mon, 2015-08-10 at 17:26 +0200, [email protected] wrote:
>>>> From: KONRAD Frederic <[email protected]>
>>>>
>>>> This is the 7th round of the MTTCG patch series.
>>>>
>>>> It can be cloned from:
>>>> [email protected]:fkonrad/mttcg.git branch multi_tcg_v7.
>
> Would it be possible to rebase on latest qemu? I wonder if mttcg is
> diverging a bit too much from mainline, which will make it more difficult
> to rebase later. (Or did I get confused about all these repos?)
>
> Thank you!
>
> Claudio
>
>>>> This patch-set tries to address the different issues in the global
>>>> picture of MTTCG, presented on the wiki.
>>>>
>>>> == Needed patches for our work ==
>>>>
>>>> Some preliminaries are needed for our work:
>>>>  * current_cpu doesn't make sense in mttcg, so a tcg_executing flag is
>>>>    added to the CPUState.
>>> Can't you just make it a TLS?
>> True, that can be done as well. But the tcg_exec_flag has a second
>> meaning, saying "you can't start executing code right now because I want
>> to do a safe_work".
>>>>  * We need to run some work safely when all VCPUs are outside their
>>>>    execution loop. This is done with the async_run_safe_work_on_cpu
>>>>    function introduced in this series.
>>>>  * A QemuSpin lock is introduced (on posix only for now) to allow faster
>>>>    handling of atomic instructions.
>>> How do you handle the memory model? IE, ARM and PPC are OO while x86 is
>>> (mostly) in order, so emulating ARM/PPC on x86 is fine but emulating x86
>>> on ARM or PPC will lead to problems unless you generate memory barriers
>>> with every load/store...
>> For the moment we are trying to do the first case.
>>> At least on POWER7 and later on PPC we have the possibility of setting
>>> the attribute "Strong Access Ordering" with mremap/mprotect (I don't
>>> remember which one) which gives us x86-like memory semantics...
>>>
>>> I don't know if ARM supports something similar. On the other hand, when
>>> emulating ARM on PPC or vice-versa, we can probably get away with no
>>> barriers.
>>>
>>> Do you expose some kind of guest memory model info to the TCG backend so
>>> it can decide how to handle these things?
>>>
>>>> == Code generation and cache ==
>>>>
>>>> As QEMU stands, there is no protection at all against two threads
>>>> attempting to generate code at the same time or modifying a
>>>> TranslationBlock. The "protect TBContext with tb_lock" patch addresses
>>>> the issue of code generation and makes all the tb_* functions thread
>>>> safe (except tb_flush). This raised the question of one or multiple
>>>> caches. We chose to use one unified cache because it's easier as a
>>>> first step, and since the structure of QEMU effectively has a 'local'
>>>> cache per CPU in the form of the jump cache, we don't see the benefit
>>>> of having two pools of TBs.
>>>>
>>>> == Dirty tracking ==
>>>>
>>>> Protecting the IOs:
>>>> To allow all VCPU threads to run at the same time we need to drop the
>>>> global_mutex as soon as possible. The IO accesses need to take the
>>>> mutex. This is likely to change when
>>>> http://thread.gmane.org/gmane.comp.emulators.qemu/345258
>>>> is upstreamed.
>>>>
>>>> Invalidation of TranslationBlocks:
>>>> We can have all VCPUs running during an invalidation. Each VCPU is able
>>>> to clean its jump cache itself, as it is in CPUState, so that can be
>>>> handled by a simple call to async_run_on_cpu. However, tb_invalidate
>>>> also writes to the TranslationBlock, which is shared as we have only
>>>> one pool. Hence this part of invalidate requires all VCPUs to exit
>>>> before it can be done, and async_run_safe_work_on_cpu is introduced to
>>>> handle this case.
>>> What about the host MMU emulation? Is that multithreaded? It has
>>> potential issues when doing things like dirty bit updates into guest
>>> memory; those need to be done atomically. Also TLB invalidations on ARM
>>> and PPC are global, so they will need to invalidate the remote SW TLBs
>>> as well.
>>>
>>> Do you have a mechanism to synchronize with another thread? IE, make it
>>> pop out of TCG if already in and prevent it from getting in? That way
>>> you can "remotely" invalidate its TLB...
>> Yes, that's what the safe_work is doing: ask everybody to exit, prevent
>> VCPUs from resuming (tcg_exec_flag), and do the work when everybody is
>> outside cpu-exec.
>>
>>>> == Atomic instructions ==
>>>>
>>>> For now only ARM on x64 is supported, by using a cmpxchg instruction.
>>>> Specifically, the limitation of this approach is that it is harder to
>>>> support 64-bit ARM on a host architecture that is multi-core but only
>>>> supports 32-bit cmpxchg (we believe this could be the case for some PPC
>>>> cores).
>>> Right, on the other hand 64-bit will do fine. But then x86 has 2-value
>>> atomics nowadays, doesn't it? And that will be hard to emulate on
>>> anything. You might need to have some kind of global hashed lock list
>>> used by atomics (hash the physical address) as a fallback if you don't
>>> have a 1:1 match between host and guest capabilities.
>> VOS did a "Slow path for atomic instruction translation" series, which
>> you can find here:
>> https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg00971.html
>>
>> It will be used in the end.
>>
>> Thanks,
>> Fred
>>> Cheers,
>>> Ben.
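PS: to make the mechanisms discussed above a bit more concrete, a few rough
sketches follow. All of them are illustrative only, with invented names; none
of this is code from the series.

First, a toy model (plain pthreads, standalone, not QEMU code) of the
safe_work idea: vCPU threads may not (re)enter their execution loop while a
piece of work that needs every vCPU stopped is pending. The real patches also
kick the vCPUs out of the loop; that part is omitted here.

/* Toy model of the "safe work" idea: a flag keeps vCPU threads from entering
 * their execution loop while work that needs all of them stopped is pending.
 * All names are invented for the example. */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int vcpus_executing;     /* threads currently inside their exec loop */
static bool safe_work_pending;  /* set while a safe-work request is queued  */

/* Called by a vCPU thread before it enters its execution loop. */
static void cpu_exec_start(void)
{
    pthread_mutex_lock(&lock);
    while (safe_work_pending) {     /* don't start while work is pending */
        pthread_cond_wait(&cond, &lock);
    }
    vcpus_executing++;
    pthread_mutex_unlock(&lock);
}

/* Called by a vCPU thread when it leaves its execution loop. */
static void cpu_exec_end(void)
{
    pthread_mutex_lock(&lock);
    vcpus_executing--;
    pthread_cond_broadcast(&cond);  /* maybe the safe work can run now */
    pthread_mutex_unlock(&lock);
}

/* Run work that must not race with translated code, e.g. patching a shared
 * TranslationBlock. */
static void run_safe_work(void (*func)(void *), void *data)
{
    pthread_mutex_lock(&lock);
    safe_work_pending = true;
    while (vcpus_executing > 0) {   /* wait for everyone to drain out */
        pthread_cond_wait(&cond, &lock);
    }
    func(data);                     /* nobody is executing: safe */
    safe_work_pending = false;
    pthread_cond_broadcast(&cond);  /* let the vCPUs back in */
    pthread_mutex_unlock(&lock);
}

The "second meaning" of the tcg_exec_flag mentioned above corresponds to the
safe_work_pending check in cpu_exec_start(): a vCPU that has already left the
loop is not allowed back in until the work has run.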
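Second, a minimal sketch of what a spinlock of the QemuSpin kind boils down
to, using GCC/Clang __atomic builtins (the actual implementation in the
series may differ). The point is that the critical sections around emulated
atomics are only a few instructions long, so spinning is cheaper than
sleeping on a mutex.

/* Minimal test-and-test-and-set spinlock sketch -- illustrative only. */
#include <stdbool.h>

typedef struct {
    bool locked;
} spin_lock_t;

static inline void spin_lock(spin_lock_t *l)
{
    while (__atomic_test_and_set(&l->locked, __ATOMIC_ACQUIRE)) {
        while (__atomic_load_n(&l->locked, __ATOMIC_RELAXED)) {
            /* spin on plain loads until the lock looks free, then retry */
        }
    }
}

static inline void spin_unlock(spin_lock_t *l)
{
    __atomic_clear(&l->locked, __ATOMIC_RELEASE);
}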
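Finally, a rough sketch of the "global hashed lock list" fallback Ben
suggests for the cases where host and guest atomic widths don't match
(again, invented names; the "slow path" series linked above is the real work
in that direction).

/* Fallback for a guest cmpxchg wider than what the host supports: hash the
 * guest physical address to a lock and do the read-modify-write under it. */
#include <pthread.h>
#include <stdint.h>

#define ATOMIC_LOCK_BUCKETS 256

/* GCC/Clang range initializer. */
static pthread_mutex_t atomic_locks[ATOMIC_LOCK_BUCKETS] = {
    [0 ... ATOMIC_LOCK_BUCKETS - 1] = PTHREAD_MUTEX_INITIALIZER
};

static pthread_mutex_t *atomic_lock_for(uint64_t phys_addr)
{
    /* cache-line granularity: accesses to the same line serialise */
    return &atomic_locks[(phys_addr >> 6) % ATOMIC_LOCK_BUCKETS];
}

/* 64-bit guest cmpxchg emulated without a host 64-bit cmpxchg. */
static uint64_t slow_cmpxchg64(uint64_t *host_ptr, uint64_t phys_addr,
                               uint64_t expected, uint64_t desired)
{
    pthread_mutex_t *l = atomic_lock_for(phys_addr);
    uint64_t old;

    pthread_mutex_lock(l);
    old = *host_ptr;
    if (old == expected) {
        *host_ptr = desired;
    }
    pthread_mutex_unlock(l);
    return old;
}

Of course this only helps if every competing access to the location goes
through the locked path; a plain store from another vCPU can still race with
it, which is why a 1:1 match between host and guest atomics is preferable
whenever it is available.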
