On 2026/04/20 22:53, Marc Zyngier wrote:
On Mon, 20 Apr 2026 13:07:15 +0100,
Akihiko Odaki <[email protected]> wrote:

On 2026/04/20 18:51, Marc Zyngier wrote:
On Mon, 20 Apr 2026 09:36:16 +0100,
Akihiko Odaki <[email protected]> wrote:

On 2026/04/20 2:19, Marc Zyngier wrote:
On Sat, 18 Apr 2026 09:14:25 +0100,
Akihiko Odaki <[email protected]> wrote:

On a heterogeneous arm64 system, KVM's PMU emulation is based on the
features of a single host PMU instance. When a vCPU is migrated to a
pCPU with an incompatible PMU, counters such as PMCCNTR_EL0 stop
incrementing.

Although this behavior is permitted by the architecture, Windows does
not handle it gracefully and may crash with a division-by-zero error.

The current workaround requires VMMs to pin vCPUs to a set of pCPUs
that share a compatible PMU. This is difficult to implement correctly in
QEMU/libvirt, where pinning occurs after vCPU initialization, and it
also restricts the guest to a subset of available pCPUs.

Introduce the KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY attribute to
create a "fixed-counters-only" PMU. When set, KVM exposes a PMU that is
compatible with all pCPUs but that does not support programmable
event counters which may have different feature sets on different PMUs.

This allows Windows guests to run reliably on heterogeneous systems
without crashing, even without vCPU pinning, and enables VMMs to
schedule vCPUs across all available pCPUs, making full use of the host
hardware.

Much like KVM_ARM_VCPU_PMU_V3_IRQ and other read-write attributes, this
attribute provides a getter that facilitates kernel and userspace
debugging/testing.

OK, so that's the sales pitch. But how is it implemented? I would like
to be able to read a high-level description of the implementation
trade-offs.

Implementation-wise it is very trivial. Essentially the following
addition in kvm_arm_pmu_v3_get_attr() is the entire implementation:
+       case KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY:
+               if (test_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY,
&vcpu->kvm->arch.flags))
+                       return 0;

Both its functionality and code complexity is trivial. So we can argue that:
- the functionality is too trivial to be useful or
- the interface/implementation complexity is so trivial that it does not
    incur maintenance burden

In this case the selftest uses the getter so I was more inclined to
have it, but adding one just for the selftest sounds too ad-hoc, so
here I looked into other attributes to ensure that it was not
introducing inconsistency with existing interfaces.

As the result, I found there are other read-write attributes; in fact
there are more read-write attributes than write-only ones.

You're completely missing the point. I'm referring to the whole of the
commit message, which is more of a marketing slide than a technical
description.

In terms of implementation, the obvious tradeoff is that it adds more
code to implement the feature. One thing to note is that
kvm_vcpu_load_pmu() is added and is called each time a vCPU migrates
across pCPUs. The heavy part, making the KVM_REQ_RELOAD_PMU request,
only happens when the feature is enabled.

Well, that's what I want to see. The repeated blurb about Windows
being broken is cover letter material, but not fir for a commit
message.

I understand. I'll leave the Windows issue the cover letter and write focused patch messages after splitting patches.


[...]

"for the first time" gives the impression that it will work if you try
again. I'd rather we say that "This feature is incompatible with the
existence of a PMU event filter".

The following sequence will work:
1. Set KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY
2. Set KVM_ARM_VCPU_PMU_V3_FILTER
3. Set KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY

This is to make the behavior conistent with KVM_ARM_VCPU_PMU_V3_SET_PMU.

I don't think this is correct. Filtering is completely at odds with
this patch, and I don't want to have to reason about the combination.

kvm_arm_pmu_v3_set_pmu() has the following condition:

if (kvm_vm_has_ran_once(kvm) ||
     (kvm->arch.pmu_filter && kvm->arch.arm_pmu != arm_pmu)) {
        ret = -EBUSY;
        break;
}

kvm_arm_pmu_v3_set_pmu_fixed_counters_only() has the corresponding
condition for consistency:

if (kvm_vm_has_ran_once(kvm) ||
     (kvm->arch.pmu_filter &&
      !test_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY,
               &kvm->arch.flags)))
        return -EBUSY;

We can of course kill the PMU event filter for
FIXED_COUNTERS_ONLY. The filter is effectively no-op with
FIXED_COUNTERS_ONLY and I don't think that consistency matters much.

We shouldn't allow weird combinations in the UAPI. Since it makes no
sense to have both fixed-function *and* filters, we should make them
mutually exclusive.

Then I think it makes sense to make KVM_ARM_VCPU_PMU_V3_SET_NR_COUNTERS mutually exclusive for consistency, but I found something weird with it.

The documentation says it already "mandates that a PMU has explicitly been selected via KVM_ARM_VCPU_PMU_V3_SET_PMU", but apparently that's not properly implemented.

kvm_arm_pmu_v3_set_nr_counters() has the following check:
        if (!kvm->arch.arm_pmu)
                return -EINVAL;

I suspect it is intended to check if a PMU has explicitly been selected, but this check is effectively no-op because kvm_arm_set_default_pmu() has already set kvm->arch.arm_pmu before reaching there.

Furthermore, tools/testing/selftests/kvm/arm64/vpmu_counter_access.c seems to expect KVM_ARM_VCPU_PMU_V3_SET_NR_COUNTERS to work without selecting a PMU via KVM_ARM_VCPU_PMU_V3_SET_PMU.

How should we deal with the discrepancy between the documentation and the implementation/selftest?


[...]

I expect migration will be handled with the conventional register
getters and setters, but please share if you have a concern.

At the very least I want to see some documentation explaining that.

What kind of documentation do you expect?

A description of what counters are exposed by this feature, and what
architectural features they are dependent on.

I'll update the attribute documentation accordingly.


If we change kvm_vcpu_load_pmu() to avoid for_each_set_bit(), there
would be a good chance to forget updating it when mechanically
updating existing for_each_set_bit() instances, so it is a candidate
for documentation. But I don't have a good idea where to place it
either.

The moment we introduce FEAT_PMUv3_ICNTR, the whole PMUv3 emulation
will catch fire anyway, as we already limit ourselves to 32 bits for
reset and nesting switch.

So this will be a major redesign anyway. If you are really worried,
leave a comment in there.

I'll leave one line comment in the implementation corresponding to the attribute documentation.

Regards,
Akihiko Odaki

Reply via email to