> On 30. Oct 2025, at 14:41, [email protected] wrote:
>
> Adding @[email protected] and replying to his questions he asked over
> #XenDevel:matrix.org.
>
> Can you add some details on why the implementation cannot be optimized in KVM?
> Asking because I have never seen such an issue when running Xen on QEMU
> (without nested virt enabled).
> AFAIK, when Xen is run on QEMU without virtualization, instructions are
> emulated in QEMU, while with KVM, instructions should ideally run directly
> on hardware except in some special cases (those trapped by FGT/CGT), such as
> this one, where KVM maintains shadow page tables for each VM. It traps these
> instructions and emulates them with a callback such as handle_vmalls12e1is().
> The way this callback is implemented, it has to iterate over the whole
> address space and clean up the page tables, which is a costly operation.
> Regardless of this, it should still be optimized in Xen, as invalidating a
> selective range would be much better than invalidating the whole 48-bit
> address space.
> Some details about your platform and use case would be helpful. I am
> interested to know whether you are using all the features for nested virt.
> I am using AWS G4. My use case is to run Xen as a guest hypervisor. Yes, most
> of the features are enabled, except VHE and those which are disabled by KVM.
Hello,
You mean Graviton4 (for reference to others, from a bare metal instance)?
Interesting to see people caring about nested virt there :) - and hopefully
using it wasn’t too much of a pain for you to deal with.
>
> ; switch to current VMID
> tlbi rvae1, guest_vaddr    ; first invalidate stage-1 TLB by guest VA for the current VMID
> tlbi ripas2e1, guest_paddr ; then invalidate stage-2 TLB by IPA range for the current VMID
> dsb ish
> isb
> ; switch back the VMID
> • This is where I am not quite sure, and I was hoping that someone with
> Arm expertise could sign off on this so that I can work on its implementation
> in Xen. This would be an optimization not only for virtualized hardware but
> also in general for Xen on arm64 machines.
>
Note that for TLBIP RIPAS2E1 the documentation says
> The invalidation is not required to apply to caching structures that combine
> stage 1 and stage 2 translation table entries.
i.e. a stage-2-only invalidation by IPA may not, on its own, remove entries
that combine both stages, so those would still need stage-1 (or combined-stage)
maintenance.
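As a reference point for what the stage-2 half could look like, here is a
minimal sketch of a range-based invalidation helper, assuming FEAT_TLBIRANGE
(Armv8.4+), a 4KB granule and a translation unit built with a matching -march.
The function name and the simplified operand encoding below are mine, not
existing Xen code, and per the caveat above combined stage-1/stage-2 entries
may still need separate maintenance:

#include <stdint.h>

/*
 * Sketch only, not existing Xen code. Invalidate stage-2 TLB entries by IPA
 * range for the current VMID using TLBI RIPAS2E1IS (FEAT_TLBIRANGE).
 *
 * Operand layout for a 4KB granule: BaseADDR[36:0] = IPA >> 12,
 * TG[47:46] = 0b01, SCALE[45:44], NUM[43:39]. With SCALE = 0 the operation
 * covers (num + 1) * 2 granules starting at the base IPA.
 */
static inline void flush_guest_tlb_range_ipa(uint64_t ipa, unsigned int num)
{
    uint64_t arg = (ipa >> 12) & ((UINT64_C(1) << 37) - 1);

    arg |= UINT64_C(1) << 46;               /* TG = 4KB granule */
    arg |= (uint64_t)(num & 0x1f) << 39;    /* NUM, with SCALE left at 0 */

    asm volatile(
        "dsb ishst\n"                /* make prior PTE updates visible */
        "tlbi ripas2e1is, %0\n"      /* stage-2 by IPA range, inner shareable */
        "dsb ish\n"
        "isb\n"
        : : "r" (arg) : "memory");
}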
> • The second place in Xen where this is problematic is when multiple
> vCPUs of the same domain are juggled on a single pCPU: TLBs are invalidated
> every time a different vCPU runs on the pCPU. I do not know how this can be
> optimized. Any support on this is appreciated.
One way to handle this is to make every TLB invalidate within the VM a
broadcast invalidate (HCR_EL2.FB is what you’re looking for) and then forego
that per-switch TLB maintenance, as it’s no longer necessary. This should not
have a practical performance impact.
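To make the reasoning concrete, a rough sketch of the FB side, assuming Xen
programs HCR_EL2 per guest; the helper name below is hypothetical, and whether
FB is already part of Xen's default guest HCR flags should be checked against
the tree:

#include <stdint.h>

/*
 * Sketch only, not existing Xen code. HCR_EL2.FB (bit 9) forces the guest's
 * EL1 TLB maintenance (and IC IALLU) to be broadcast to the Inner Shareable
 * domain, even when the guest did not use the *IS variants.
 */
#define HCR_FB  (UINT64_C(1) << 9)

/* Hypothetical helper computing the HCR_EL2 value used while a guest runs. */
static inline uint64_t guest_hcr_flags(uint64_t base_flags)
{
    /*
     * With FB enforced, the stale stage-1 entries that the per-switch
     * flush_guest_tlb_local() in p2m_restore_state() guards against are
     * already removed by the guest's own (now broadcast) maintenance, so
     * that flush (the hunk quoted below) can be dropped.
     */
    return base_flags | HCR_FB;
}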
Thank you,
-Mohamed
>
>
> diff --git a/xen/arch/arm/mmu/p2m.c b/xen/arch/arm/mmu/p2m.c
> index 7642dbc7c5..e96ff92314 100644
> --- a/xen/arch/arm/mmu/p2m.c
> +++ b/xen/arch/arm/mmu/p2m.c
> @@ -247,7 +247,7 @@ void p2m_restore_state(struct vcpu *n)
> * when running multiple vCPU of the same domain on a single pCPU.
> */
> if ( *last_vcpu_ran != INVALID_VCPU_ID && *last_vcpu_ran != n->vcpu_id )
> - flush_guest_tlb_local();
> + ; // flush_guest_tlb_local();
> *last_vcpu_ran = n->vcpu_id;
> }
>
> Thanks & Regards,
> Haseeb Ashraf