Adding @[email protected]<mailto:[email protected]> and replying to his questions he 
asked over #XenDevel:matrix.org.

> Can you add some details on why the implementation cannot be optimized in
> KVM? Asking because I have never seen such an issue when running Xen on QEMU
> (without nested virt enabled).
AFAIK, when Xen is run on QEMU without virtualization, instructions are 
emulated by QEMU, while with KVM instructions should ideally run directly on 
hardware except in some special cases (those trapped by FGT/CGT), such as this 
one: KVM maintains shadow page tables for each VM, so it traps these 
instructions and emulates them with callbacks such as handle_vmalls12e1is(). 
The way this callback is implemented, it has to iterate over the whole address 
space and tear down the shadow page tables, which is a costly operation. 
Regardless of this, it should still be optimized in Xen, as invalidating a 
selective range would be much better than invalidating the whole 48-bit 
address space.
> Some details about your platform and use case would be helpful. I am
> interested to know whether you are using all the features for nested virt.
I am using AWS G4. My use case is to run Xen as a guest hypervisor. Yes, most 
of the features are enabled, except VHE and those which are disabled by KVM.

Regards,
Haseeb Ashraf
________________________________
From: Ashraf, Haseeb (DI SW EDA HAV SLS EPS RTOS LIN)
Sent: Thursday, October 30, 2025 11:12 AM
To: [email protected] <[email protected]>
Subject: Limitations for Running Xen on KVM Arm64

Hello Xen development community,

I wanted to discuss the limitations that I have faced while running Xen on KVM 
on Arm64 machines. I hope I am using the right mailing list.

The biggest limitation is the costly emulation of the instruction tlbi 
vmalls12e1is in KVM. The cost grows exponentially with the IPA size exposed by 
KVM for the VM hosting Xen. If I reduce the IPA size to 40 bits in KVM, the 
issue is barely observable, but with an IPA size of 48 bits it is 256x more 
costly than the former. Xen uses this instruction very frequently, the 
instruction is trapped and emulated by KVM, and performance is nowhere near 
bare-metal hardware. With a 48-bit IPA, domU creation with just 128 MB of RAM 
can take up to 200 minutes. I have identified two places in Xen that are 
problematic w.r.t. the usage of this instruction, and I hope to either reduce 
the frequency of this instruction or use a more targeted TLBI instruction 
instead of invalidating all stage-1 and stage-2 translations.


  1.
During the creation of a domU, the domU memory is first mapped into the dom0 
domain, the images are copied into it, and it is then unmapped. During 
unmapping, the TLB translations are invalidated one by one for each page being 
removed in the XENMEM_remove_from_physmap hypercall. Here is the code snippet 
where the decision to flush the TLBs is made during removal of a mapping.

diff --git a/xen/arch/arm/mmu/p2m.c b/xen/arch/arm/mmu/p2m.c
index 7642dbc7c5..e96ff92314 100644
--- a/xen/arch/arm/mmu/p2m.c
+++ b/xen/arch/arm/mmu/p2m.c
@@ -1103,7 +1103,8 @@ static int __p2m_set_entry(struct p2m_domain *p2m,

    if ( removing_mapping )
        /* Flush can be deferred if the entry is removed */
-        p2m->need_flush |= !!lpae_is_valid(orig_pte);
+        //p2m->need_flush |= !!lpae_is_valid(orig_pte);
+        p2m->need_flush |= false;
    else
    {
        lpae_t pte = mfn_to_p2m_entry(smfn, t, a);

This can be optimized either by introducing a batched version of this 
hypercall, i.e. XENMEM_remove_from_physmap_batch, and flushing the TLBs only 
once for all pages being removed, or by using a TLBI instruction that 
invalidates only the intended range of addresses instead of all stage-1 and 
stage-2 translations. I understand that no single TLBI instruction performs 
both stage-1 and stage-2 invalidation for a given address range, but perhaps a 
combination of instructions can be used, such as:

; switch to the current VMID
tlbi rvae1, guest_vaddr    ; first invalidate stage-1 TLB by guest VA for the current VMID
tlbi ripas2e1, guest_paddr ; then invalidate stage-2 TLB by IPA range for the current VMID
dsb ish
isb
; switch back the VMID

This is where I am not quite sure, and I was hoping that someone with Arm 
expertise could sign off on this so that I can work on its implementation in 
Xen. This would be an optimization not only for virtualized hardware but also 
in general for Xen on Arm64 machines.


  2.
The second place in Xen where this is problematic is when multiple vCPUs of 
the same domain are juggled on a single pCPU: the guest TLBs are invalidated 
every time a different vCPU runs on the pCPU. I do not know how this can be 
optimized; any support on this is appreciated.

diff --git a/xen/arch/arm/mmu/p2m.c b/xen/arch/arm/mmu/p2m.c
index 7642dbc7c5..e96ff92314 100644
--- a/xen/arch/arm/mmu/p2m.c
+++ b/xen/arch/arm/mmu/p2m.c
@@ -247,7 +247,7 @@ void p2m_restore_state(struct vcpu *n)
      * when running multiple vCPU of the same domain on a single pCPU.
      */
     if ( *last_vcpu_ran != INVALID_VCPU_ID && *last_vcpu_ran != n->vcpu_id )
-        flush_guest_tlb_local();
+        ; // flush_guest_tlb_local();

     *last_vcpu_ran = n->vcpu_id;
 }

Thanks & Regards,
Haseeb Ashraf
