Re: linux-next: KVM/s390x regression

David Hildenbrand Fri, 17 Oct 2025 15:42:45 -0700

On 18.10.25 00:15, David Hildenbrand wrote:

On 17.10.25 23:56, Balbir Singh wrote:

On 10/18/25 04:07, David Hildenbrand wrote:

On 17.10.25 17:20, Christian Borntraeger wrote:



Am 17.10.25 um 17:07 schrieb David Hildenbrand:

On 17.10.25 17:01, Christian Borntraeger wrote:

Am 17.10.25 um 16:54 schrieb David Hildenbrand:

On 17.10.25 16:49, Christian Borntraeger wrote:

This patch triggers a regression for s390x KVM: QEMU guests can no longer start.

error: kvm run failed Cannot allocate memory
PSW=mask 0000000180000000 addr 000000007fd00600
R00=0000000000000000 R01=0000000000000000 R02=0000000000000000 R03=0000000000000000
R04=0000000000000000 R05=0000000000000000 R06=0000000000000000 R07=0000000000000000
R08=0000000000000000 R09=0000000000000000 R10=0000000000000000 R11=0000000000000000
R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
C00=00000000000000e0 C01=0000000000000000 C02=0000000000000000 C03=0000000000000000
C04=0000000000000000 C05=0000000000000000 C06=0000000000000000 C07=0000000000000000
C08=0000000000000000 C09=0000000000000000 C10=0000000000000000 C11=0000000000000000
C12=0000000000000000 C13=0000000000000000 C14=00000000c2000000 C15=0000000000000000

KVM on s390x does not use THP so far, will investigate. Does anyone have a 
quick idea?


Only when running KVM guests and apart from that everything else seems to be 
fine?


We have other weirdness in linux-next, but in different areas. Could that
somehow be related to us disabling THP for the KVM address space?


Not sure ... it's a bit weird. I mean, when KVM disables THPs we essentially 
just remap everything to be mapped by PTEs. So there shouldn't be any PMDs in 
that whole process.

Remapping a file THP (shmem) implies zapping the THP completely.


I assume your kernel config has CONFIG_ZONE_DEVICE and
CONFIG_ARCH_ENABLE_THP_MIGRATION set, right?


yes.


I'd rule out copy_huge_pmd() and zap_huge_pmd() as well.


What happens if you revert the change in mm/pgtable-generic.c?


That partial revert seems to fix the issue:
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 0c847cdf4fd3..567e2d084071 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
               if (pmdvalp)
                    *pmdvalp = pmdval;
-       if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
+       if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))


Okay, but that means that effectively we stumble over a PMD entry that is not a 
migration entry but still non-present.

And I would expect that it's a page table, because otherwise the change
wouldn't make a difference.

And the weird thing is that this only triggers sometimes, because if
it always triggered, nothing would ever work.

Is there some weird scenario where s390x might set a leaf page table mapped in
a PMD to non-present?


Good point

Staring at the definition of pmd_present() on s390x it's really just

      return (pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT) != 0;


Maybe this is happening in the gmap code only and not actually in the core-mm 
code?



I am not an s390 expert, but just looking at the code

So the check on s390 is effectively:

segment_entry/present = false or segment_entry_empty/invalid = true


pmd_present() == true iff _SEGMENT_ENTRY_PRESENT is set

because

        return (pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT) != 0;

is the same as

        return pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT;

But that means we have something where _SEGMENT_ENTRY_PRESENT is not set.

I suspect that can only be the gmap tables.

Likely __gmap_link() does not set _SEGMENT_ENTRY_PRESENT, which is fine
because it's a software managed bit for "ordinary" page tables, not gmap
tables.
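
To make that suspicion concrete, here is a small userspace model, not kernel code. The bit values, the migration-entry encoding and all model_* names are made up for illustration and are not the s390x definitions; the only thing taken from the thread is the shape of the two checks in the diff above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, for illustration only (not the real s390x values). */
#define MODEL_ENTRY_INVALID  0x0020ULL  /* stand-in for the hardware invalid bit */
#define MODEL_ENTRY_PRESENT  0x8000ULL  /* stand-in for the software present bit */
#define MODEL_ENTRY_EMPTY    MODEL_ENTRY_INVALID
#define MODEL_MIGRATION_TAG  0x4000ULL  /* made-up marker for a migration entry */

typedef uint64_t model_pmd_t;

static bool model_pmd_none(model_pmd_t pmd)
{
    return pmd == MODEL_ENTRY_EMPTY;
}

/* Same shape as the s390x pmd_present() quoted in this thread. */
static bool model_pmd_present(model_pmd_t pmd)
{
    return (pmd & MODEL_ENTRY_PRESENT) != 0;
}

/* Made-up encoding: non-present and carrying the migration tag. */
static bool model_is_migration_entry(model_pmd_t pmd)
{
    return !model_pmd_present(pmd) && (pmd & MODEL_MIGRATION_TAG) != 0;
}

/* Check after the partial revert: bail on none or migration entries. */
static bool bails_with_reverted_check(model_pmd_t pmd)
{
    return model_pmd_none(pmd) || model_is_migration_entry(pmd);
}

/* Check as in linux-next: bail on none or anything non-present. */
static bool bails_with_next_check(model_pmd_t pmd)
{
    return model_pmd_none(pmd) || !model_pmd_present(pmd);
}

/* Example entries: table origin ORed with whatever software bits are set. */
static const model_pmd_t model_user_table = 0x100000ULL | MODEL_ENTRY_PRESENT;
static const model_pmd_t model_gmap_table = 0x200000ULL; /* no software bits */
static const model_pmd_t model_migration  = MODEL_ENTRY_INVALID | MODEL_MIGRATION_TAG;
```

In this model an ordinary user entry passes both checks and a migration entry fails both, but an entry that points to a table without the software present bit passes the reverted check and is rejected by the linux-next one. That would match what the partial revert shows, under the assumption that gmap entries really lack that bit.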

Which raises the question why someone would wrongly use
pte_offset_map()/__pte_offset_map() on the gmap tables.

I cannot immediately spot any such usage in kvm/gmap code, though.


Ah, it's all that pte_alloc_map_lock() stuff in gmap.c.

Oh my.

So we're mapping a user PTE table that is linked into the gmap tables
through a PMD table that does not have the right sw bits set we would
expect in a user PMD table.

What's also scary is that pte_alloc_map_lock() would try to pte_alloc()
a user page table in the gmap, which sounds completely wrong?

Yeah, when walking the gmap and wanting to lock the linked user PTE
table, we should probably never use the pte_*map variants but obtain
the lock through pte_lockptr().

All magic we end up doing with RCU etc in __pte_offset_map_lock()
does not apply to the gmap PMD table.
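
A rough userspace model of the concern with the allocating variant, not kernel code. All model_* names are hypothetical and the lock is a plain counter standing in for the PTE table spinlock; it only illustrates "allocate if missing, then lock" versus "lock what is already linked, never allocate".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a PTE table with its lock. */
struct model_table {
    int lock_depth; /* counter standing in for the table spinlock */
};

/* Hypothetical stand-in for the entry that links a table. */
struct model_slot {
    struct model_table *table; /* NULL means nothing is linked */
};

/*
 * Models the pte_alloc_map_lock() pattern: if nothing is linked,
 * allocate and install a table as a side effect, then lock it.
 */
static struct model_table *model_alloc_map_lock(struct model_slot *slot)
{
    if (!slot->table) {
        slot->table = calloc(1, sizeof(*slot->table));
        if (!slot->table)
            return NULL;
    }
    slot->table->lock_depth++;
    return slot->table;
}

/*
 * Models taking the lock of an already linked table only:
 * never allocates, and reports failure if nothing is linked.
 */
static struct model_table *model_lock_linked(struct model_slot *slot)
{
    if (!slot->table)
        return NULL;
    slot->table->lock_depth++;
    return slot->table;
}

static void model_unlock(struct model_table *table)
{
    table->lock_depth--;
}
```

With an empty slot, model_lock_linked() fails and leaves the slot untouched, while model_alloc_map_lock() silently installs a new table. The second behaviour is the part that looks wrong for a gmap walk.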

--
Cheers

David / dhildenb
