(+ the other Arm maintainers)
On 31/10/2025 13:01, [email protected] wrote:
> Hello,
Hi,
Before answering the rest, would you be able to configure your e-mail
client to quote with '>' and avoid top-posting? Otherwise, it will
become quite difficult to follow the conversation after a few rounds.
> I have seen no such performance issue with nested KVM. For Xen, if
> this can be relaxed from vmalls12e1 to vmalle1, this would still be a
> huge performance improvement. I used Ftrace to get the execution time
> of each of these handler functions:
>
> handle_vmalls12e1is() min-max = 1464441 - 9495486 us
To clarify, Xen is using the local TLB version, so it should be
vmalls12e1. But it looks like KVM will treat it the same way, and I
wonder whether this could be optimized (I don't know much about the KVM
implementation though).
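
For reference, the difference between the two local flushes, written as
hypothetical EL2 helpers (only the TLBI mnemonics and barriers are
architectural; the helper names are made up):

/* Invalidate stage-1 entries for the current VMID (local PE only). */
static inline void flush_guest_tlb_s1_local(void)
{
    asm volatile("dsb nshst; tlbi vmalle1; dsb nsh; isb" : : : "memory");
}

/* Invalidate combined stage-1 and stage-2 entries for the current
   VMID (local PE only). This is the flavour Xen uses today. */
static inline void flush_guest_tlb_s1s2_local(void)
{
    asm volatile("dsb nshst; tlbi vmalls12e1; dsb nsh; isb" : : : "memory");
}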
> So, to summarize: using HCR_EL2.FB (which Xen already enables?) and
> then using vmalle1 instead of vmalls12e1 should resolve issue-2 for
> vCPUs switching on pCPUs.
I don't think HCR_EL2.FB would matter here: FB only forces TLB
maintenance executed at EL1 to be broadcast, whereas the flushes we are
discussing are issued by Xen at EL2.
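
If you want to double-check what Xen currently sets, something like
this would do (untested sketch; HCR_FB is a local define for bit 9 of
HCR_EL2, not necessarily the constant Xen uses):

#define HCR_FB (1UL << 9) /* HCR_EL2.FB: force broadcast of EL1 TLB maintenance */

static inline bool hcr_fb_enabled(void)
{
    unsigned long hcr;

    asm volatile("mrs %0, hcr_el2" : "=r" (hcr));
    return hcr & HCR_FB;
}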
> Coming back to issue-1, what do you think about creating a batch
> version of the hypercall XENMEM_remove_from_physmap (other batch
> versions exist, such as for XENMEM_add_to_physmap) and doing the TLB
> invalidation only once per hypercall?
Before going into batching, do you have any data showing how often
XENMEM_remove_from_physmap is called in your setup? Similarly, I would
be interested to know the number of TLB flushes within one hypercall
and whether the regions unmapped were contiguous.
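
To gather the second number, a crude counter in the flush path would be
enough (untested sketch; the names below are placeholders, not existing
Xen functions):

static unsigned long flushes_this_call;

/* Called from wherever the stage-2 flush is issued... */
static void account_flush(void)
{
    flushes_this_call++;
}

/* ... and dumped/reset when the hypercall completes. */
static void report_flushes(void)
{
    printk("XENMEM_remove_from_physmap: %lu TLB flushes\n", flushes_this_call);
    flushes_this_call = 0;
}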
In your previous e-mail you wrote:
> During the creation of a domU, the domU memory is first mapped into
> dom0, the images are copied into it, and it is then unmapped. During
> unmapping, the TLB translations are invalidated one by one for each
> page being unmapped in the XENMEM_remove_from_physmap hypercall. Here
> is the code snippet where the decision to flush TLBs is made during
> removal of the mapping.
Don't we map only the memory that is needed to copy the binaries? If
not, then I would suggest looking at that first.
I am asking because, even with batching, we may still need a few TLB
flushes because:
 * We need to avoid long-running operations, so the hypercall may
restart. So we will have to flush at minimum before every restart.
 * The current way we handle batching is that we process one item at a
time. As this may free memory (either leaf or intermediate
page-tables), we will need to flush the TLBs first to prevent the
domain from accessing the wrong memory. This could be solved by keeping
track of the list of memory to free (see the sketch below). But this is
going to require some work and I am not entirely sure it is worth it at
the moment.
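
To illustrate the deferred-free idea, a very rough sketch (none of the
helpers below exist today, and it ignores preemption and error
handling):

#define MAX_BATCH 32

struct deferred_free {
    struct page_info *pages[MAX_BATCH];
    unsigned int nr;
};

static void batch_remove(struct domain *d, gfn_t *gfns, unsigned int nr)
{
    struct deferred_free df = { .nr = 0 };
    unsigned int i;

    /* Unhook the mappings, but stash the pages instead of freeing them. */
    for ( i = 0; i < nr; i++ )
        p2m_remove_mapping_deferred(d, gfns[i], &df);

    /* A single flush for the whole batch... */
    p2m_flush_tlb(d);

    /* ... and only now is it safe to hand the memory back. */
    for ( i = 0; i < df.nr; i++ )
        free_domheap_page(df.pages[i]);
}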
> I just realized that ripas2e1 is a range TLBI instruction which is
> only supported from Armv8.4, as indicated by ID_AA64ISAR0_EL1.TLB ==
> 2. So, on older architectures, a full stage-2 invalidation would be
> required. For an architecture-independent solution, creating a batch
> version seems to be a better way.
I don't think we necessarily need a full stage-2 invalidation for
processors not supporting range TLBI. We could use a series of TLBI
IPAS2E1IS, which I think is what the range TLBI is meant to replace (so
long as the addresses are contiguous in the given address space).
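
Something along these lines (untested sketch, assuming a 4KB granule;
IPAS2E1IS takes IPA[47:12] in bits [35:0] of the register):

static void flush_guest_tlb_range(paddr_t ipa, paddr_t end)
{
    asm volatile("dsb ishst" : : : "memory");

    for ( ; ipa < end; ipa += PAGE_SIZE )
        asm volatile("tlbi ipas2e1is, %0" : : "r" (ipa >> PAGE_SHIFT));

    /* IPAS2E1IS only removes stage-2 entries; combined stage-1+2
       entries still require a stage-1 invalidation afterwards. */
    asm volatile("dsb ish; tlbi vmalle1is; dsb ish; isb" : : : "memory");
}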
On the KVM side, it would be worth looking at whether the
implementation can be optimized. Is it really walking block by block?
Can it skip over large holes (e.g. if we know a level-1 entry doesn't
exist, then we can increment by 1GB)?
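
In other words, something like this (pseudo-code; the *_present()
helpers are made up, and the sizes assume a 4KB granule):

/* With a 4KB granule, a level-1 entry covers 1GB and a level-2 entry 2MB. */
#define L1_SIZE (1UL << 30)
#define L2_SIZE (1UL << 21)

static void walk_ipa_range(paddr_t ipa, paddr_t end)
{
    while ( ipa < end )
    {
        if ( !l1_entry_present(ipa) )
        {
            /* Nothing mapped in this 1GB region: jump to the next
               level-1 boundary instead of visiting every page. */
            ipa = (ipa + L1_SIZE) & ~(L1_SIZE - 1);
            continue;
        }
        if ( !l2_entry_present(ipa) )
        {
            ipa = (ipa + L2_SIZE) & ~(L2_SIZE - 1);
            continue;
        }
        visit_leaf(ipa);
        ipa += PAGE_SIZE;
    }
}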
Cheers,
--
Julien Grall