> On 31. Oct 2025, at 10:18, Julien Grall <[email protected]> wrote:
>
>
>
> On 31/10/2025 00:20, Mohamed Mediouni wrote:
>>> On 31. Oct 2025, at 00:55, Julien Grall <[email protected]> wrote:
>>>
>>> Hi Mohamed,
>>>
>>> On 30/10/2025 18:33, Mohamed Mediouni wrote:
>>>>> On 30. Oct 2025, at 14:41, [email protected] wrote:
>>>>>
>>>>> Adding @[email protected] and replying to the questions he asked over
>>>>> #XenDevel:matrix.org.
>>>>>
>>>>> can you add some details on why the implementation cannot be optimized in
>>>>> KVM? Asking because I have never seen such an issue when running Xen on
>>>>> QEMU (without nested virt enabled).
>>>>> AFAIK when Xen is run on QEMU without virtualization, the instructions
>>>>> are emulated in QEMU, while with KVM the instructions should ideally run
>>>>> directly on hardware except in some special cases (those trapped by
>>>>> FGT/CGT), such as this one, where KVM maintains shadow page tables for
>>>>> each VM. It traps these instructions and emulates them with a callback
>>>>> such as handle_vmalls12e1is(). The way this callback is implemented, it
>>>>> has to iterate over the whole address space and clean up the page tables,
>>>>> which is a costly operation. Regardless of this, it should still be
>>>>> optimized in Xen, as invalidating a selective range would be much better
>>>>> than invalidating the whole 48-bit address space.
>>>>> Some details about your platform and use case would be helpful. I am
>>>>> interested to know whether you are using all the features for nested virt.
>>>>> I am using AWS G4. My use case is to run Xen as a guest hypervisor. Yes,
>>>>> most of the features are enabled except VHE and those which are disabled
>>>>> by KVM.
>>>> Hello,
>>>> You mean Graviton4 (for reference to others: this is from a bare metal instance)?
>>>> Interesting to see people caring about nested virt there :) - and
>>>> hopefully using it wasn’t too much of a pain for you to deal with.
>>>>>
>>>>> ; switch to current VMID
>>>>> ; first invalidate stage-1 TLB by guest VA for current VMID
>>>>> tlbi rvae1, guest_vaddr
>>>>> ; then invalidate stage-2 TLB by IPA range for current VMID
>>>>> tlbi ripas2e1, guest_paddr
>>>>> dsb ish
>>>>> isb
>>>>> ; switch back the VMID
>>>>> • This is where I am not quite sure, and I was hoping that someone
>>>>> with Arm expertise could sign off on this so that I can work on its
>>>>> implementation in Xen. This would be an optimization not only for
>>>>> virtualized hardware but also in general for Xen on arm64 machines.
>>>>>
>>>> Note that the documentation says
>>>>> The invalidation is not required to apply to caching structures that
>>>>> combine stage 1 and stage 2 translation table entries.
>>>> for TLBI RIPAS2E1, so the stage-2 invalidate by IPA alone is not
>>>> guaranteed to drop combined stage-1+stage-2 entries.
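
For what it's worth, here is a rough C sketch of the sequence being proposed,
as I read it. Everything in it is illustrative: the function name is made up,
a 4KB granule is assumed, and I used the non-range forms (vaae1is/ipas2e1is)
to keep the operand encoding simple; the range forms rvae1is/ripas2e1is would
additionally need the FEAT_TLBIRANGE base/scale/num encoding in the register.

    /*
     * Hedged sketch only, not actual Xen code: invalidate the TLB entries
     * for a single guest page under the current VMID instead of issuing
     * tlbi vmalls12e1is. Assumes the caller has already loaded the target
     * domain's VMID into VTTBR_EL2 and switches it back afterwards.
     */
    static inline void flush_guest_page_current_vmid(unsigned long gva,
                                                     unsigned long ipa)
    {
        asm volatile(
            "tlbi vaae1is, %0\n\t"   /* stage-1, this VA, any ASID, current VMID */
            "tlbi ipas2e1is, %1\n\t" /* stage-2, this IPA, current VMID */
            "dsb ish\n\t"            /* wait for the invalidations to complete */
            "isb"
            : : "r" (gva >> 12), "r" (ipa >> 12) : "memory" );
    }

The caveat quoted above about caching structures that combine stage 1 and
stage 2 applies to the ipas2e1is part of this sketch as well.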
>>>>> • The second place in Xen where this is problematic is when multiple
>>>>> vCPUs of the same domain are juggled on a single pCPU: TLBs are invalidated
>>>>> every time a different vCPU runs on a pCPU. I do not know how this can be
>>>>> optimized. Any support on this is appreciated.
>>>> One way to handle this is to make every invalidate within the VM a
>>>> broadcast TLB invalidate (HCR_EL2.FB is what you’re looking for) and then
>>>> forego that TLB maintenance as it’s no longer necessary. This should not
>>>> have a practical performance impact.
>>>
>>> To confirm my understanding, you are suggesting relying on the L2 guest to
>>> send the TLB flush. Did I understand that correctly? If so, wouldn't this
>>> open a security hole, because a misbehaving guest may never send the TLB
>>> flush?
>>>
>> Hello,
>> HCR_EL2.FB can be used to turn every TLB invalidate the guest issues (which
>> is a stage-1 one) into a broadcast TLB invalidate.
>
> Xen already sets HCR_EL2.FB. But I believe this only solves the problem
> where a vCPU is moved to another pCPU. This doesn't solve the problem where
> two vCPUs from the same VM are sharing the same pCPU.
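
Side note: FB is bit 9 of HCR_EL2; when it is set, the TLB/IC/branch-predictor
maintenance the guest issues at EL1 is upgraded to the broadcast (Inner
Shareable) variants. Purely as an illustration (the constant and helper names
below are made up, this is not how Xen's code is structured), setting it boils
down to:

    #define HCR_FB (1UL << 9) /* Force Broadcast of the guest's maintenance ops */

    static inline void force_broadcast_guest_maintenance(void)
    {
        unsigned long hcr;

        asm volatile("mrs %0, hcr_el2" : "=r" (hcr));
        hcr |= HCR_FB;
        asm volatile("msr hcr_el2, %0" : : "r" (hcr));
        asm volatile("isb");
    }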
>
> Per the Arm ARM, each CPU has its own private TLBs. So we have to flush
> between vCPUs of the same domain to avoid translations from vCPU 1 "leaking"
> to vCPU 2 (they may have conflicting page-tables).
Hm… it depends on whether the VM uses CnP or not (and whether the HW supports
it)… (Linux does use CnP…)
> KVM has similar logic, see "last_vcpu_ran" and "__kvm_flush_cpu_context()".
> That said... they are using "vmalle1" whereas we are using "vmalls12e1". So
> maybe we can relax it. Not sure if this would make any difference for
> performance, though.
vmalle1 avoids the problem here (because it only invalidates stage-1
translations).
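
To make the comparison concrete, KVM's trick boils down to remembering, per
pCPU, which vCPU ran there last and only flushing the local TLB when a
different vCPU of the same VM is about to run. A rough sketch of what the
equivalent could look like on the Xen side (the names below are made up, this
is not a patch, and it relies on HCR_EL2.FB already covering the case where a
vCPU migrates to another pCPU, as discussed above):

    #include <xen/percpu.h>
    #include <xen/sched.h>

    /* Hedged sketch, not actual Xen code. */
    struct pcpu_tlb_track {
        const struct vcpu *last_vcpu;   /* last vCPU that ran on this pCPU */
    };
    static DEFINE_PER_CPU(struct pcpu_tlb_track, tlb_track);

    static void flush_on_vcpu_switch(const struct vcpu *next)
    {
        struct pcpu_tlb_track *t = &this_cpu(tlb_track);

        if ( t->last_vcpu && t->last_vcpu != next &&
             t->last_vcpu->domain == next->domain )
        {
            /*
             * Another vCPU of the same domain (and hence the same VMID)
             * ran here last: do a local, stage-1-only invalidate for the
             * current VMID. A different domain runs under a different
             * VMID, so its entries cannot be hit and no flush is needed.
             */
            asm volatile("tlbi vmalle1\n\tdsb nsh\n\tisb" ::: "memory");
        }

        t->last_vcpu = next;
    }

The local (non-Inner-Shareable) tlbi/dsb nsh pair is enough here because the
flush only needs to take effect on this pCPU.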
> Cheers,
>
> --
> Julien Grall
>
>