On 18.10.25 00:15, David Hildenbrand wrote:
On 17.10.25 23:56, Balbir Singh wrote:
On 10/18/25 04:07, David Hildenbrand wrote:
On 17.10.25 17:20, Christian Borntraeger wrote:
Am 17.10.25 um 17:07 schrieb David Hildenbrand:
On 17.10.25 17:01, Christian Borntraeger wrote:
Am 17.10.25 um 16:54 schrieb David Hildenbrand:
On 17.10.25 16:49, Christian Borntraeger wrote:
This patch triggers a regression for s390x KVM, as QEMU guests can no longer start:
error: kvm run failed Cannot allocate memory
PSW=mask 0000000180000000 addr 000000007fd00600
R00=0000000000000000 R01=0000000000000000 R02=0000000000000000 R03=0000000000000000
R04=0000000000000000 R05=0000000000000000 R06=0000000000000000 R07=0000000000000000
R08=0000000000000000 R09=0000000000000000 R10=0000000000000000 R11=0000000000000000
R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
C00=00000000000000e0 C01=0000000000000000 C02=0000000000000000 C03=0000000000000000
C04=0000000000000000 C05=0000000000000000 C06=0000000000000000 C07=0000000000000000
C08=0000000000000000 C09=0000000000000000 C10=0000000000000000 C11=0000000000000000
C12=0000000000000000 C13=0000000000000000 C14=00000000c2000000 C15=0000000000000000
KVM on s390x does not use THP so far, will investigate. Does anyone have a
quick idea?
Only when running KVM guests and apart from that everything else seems to be
fine?
We have other weirdness in linux-next, but in different areas. Could that
somehow be related to us disabling THP for the kvm address space?
Not sure ... it's a bit weird. I mean, when KVM disables THPs we essentially
just remap everything to be mapped by PTEs. So there shouldn't be any PMDs in
that whole process.
Remapping a file THP (shmem) implies zapping the THP completely.
I assume your kernel config has CONFIG_ZONE_DEVICE and
CONFIG_ARCH_ENABLE_THP_MIGRATION set, right?
yes.
I'd rule out copy_huge_pmd() and zap_huge_pmd() as well.
What happens if you revert the change in mm/pgtable-generic.c?
That partial revert seems to fix the issue
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 0c847cdf4fd3..567e2d084071 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 	if (pmdvalp)
 		*pmdvalp = pmdval;
-	if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
+	if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
Okay, but that means that effectively we stumble over a PMD entry that is not a
migration entry but still non-present.
And I would expect that it's a page table, because otherwise the change
wouldn't make a difference.
And the weird thing is that this only triggers sometimes, because if
it would always trigger nothing would ever work.
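To make the difference between the two checks concrete, here is a minimal userspace sketch. Everything in it is an assumption for illustration: the model_* names, the bit positions and the entry values are hypothetical and are not the real s390x layout or kernel helpers. It only shows that an entry which points to a page table but lacks the software present bit is rejected by the !pmd_present() form of the check and accepted by the is_pmd_migration_entry() form.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, for illustration only (NOT the real s390x values). */
#define MODEL_PMD_PRESENT   (UINT64_C(1) << 0)  /* software "present" bit */
#define MODEL_PMD_MIGRATION (UINT64_C(1) << 1)  /* marks a migration entry */

typedef uint64_t model_pmd_t;

static bool model_pmd_none(model_pmd_t pmd)
{
	return pmd == 0;
}

static bool model_pmd_present(model_pmd_t pmd)
{
	return (pmd & MODEL_PMD_PRESENT) != 0;
}

/* In this model a migration entry is non-present and carries the marker bit. */
static bool model_is_migration_entry(model_pmd_t pmd)
{
	return !model_pmd_present(pmd) && (pmd & MODEL_PMD_MIGRATION) != 0;
}

/* Form of the check with the patch applied: reject anything non-present. */
static bool model_rejected_patched(model_pmd_t pmd)
{
	return model_pmd_none(pmd) || !model_pmd_present(pmd);
}

/* Form of the check after the partial revert: reject only migration entries. */
static bool model_rejected_reverted(model_pmd_t pmd)
{
	return model_pmd_none(pmd) || model_is_migration_entry(pmd);
}
```

With a made-up entry such as 0x7000 (table address bits set, neither marker bit set), model_rejected_patched() returns true while model_rejected_reverted() returns false; for an empty entry, an ordinary present entry, or a migration entry, both forms agree.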
Is there some weird scenario where s390x might set a leaf page table mapped in
a PMD to non-present?
Good point
Staring at the definition of pmd_present() on s390x it's really just
return (pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT) != 0;
Maybe this is happening in the gmap code only and not actually in the core-mm
code?
I am not an s390 expert, but just looking at the code
So the check on s390 is effectively:
segment_entry/present = false or segment_entry_empty/invalid = true
pmd_present() == true iff _SEGMENT_ENTRY_PRESENT is set
because
return (pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT) != 0;
is the same as
return pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT;
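As a small illustration of that equivalence, the following userspace sketch compares the two forms. It assumes a bool return type to model the truth-value context, and MODEL_PRESENT is a hypothetical placeholder bit, not the real _SEGMENT_ENTRY_PRESENT value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical placeholder for the software present bit. */
#define MODEL_PRESENT (UINT64_C(1) << 0)

/* Explicit comparison against zero. */
static bool model_present_explicit(uint64_t val)
{
	return (val & MODEL_PRESENT) != 0;
}

/* Implicit conversion to bool: any nonzero value becomes true. */
static bool model_present_implicit(uint64_t val)
{
	return val & MODEL_PRESENT;
}
```

Both functions return the same result for every input, so either form only tells us whether the bit is set.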
But that means we have something where _SEGMENT_ENTRY_PRESENT is not set.
I suspect that can only be the gmap tables.
Likely __gmap_link() does not set _SEGMENT_ENTRY_PRESENT, which is fine
because it's a software managed bit for "ordinary" page tables, not gmap
tables.
Which raises the question why someone would wrongly use
pte_offset_map()/__pte_offset_map() on the gmap tables.
I cannot immediately spot any such usage in kvm/gmap code, though.
Ah, it's all that pte_alloc_map_lock() stuff in gmap.c.
Oh my.
So we're mapping a user PTE table that is linked into the gmap tables
through a PMD table that does not have the sw bits set that we would
expect in a user PMD table.
What's also scary is that pte_alloc_map_lock() would try to pte_alloc()
a user page table in the gmap, which sounds completely wrong?
Yeah, when walking the gmap and wanting to lock the linked user PTE
table, we should probably never use the pte_*map variants but obtain
the lock through pte_lockptr().
All magic we end up doing with RCU etc in __pte_offset_map_lock()
does not apply to the gmap PMD table.
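A rough userspace model of that distinction, purely as a sketch: the structures, the model_* helpers and the plain bool standing in for a spinlock are all hypothetical and are not kernel code. The point is only that a validating walk refuses an entry without the expected software bit, whereas a direct lock lookup (in the spirit of pte_lockptr()) derives the lock from the table the entry points to and does not look at those bits.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a PTE table with its own lock (a bool, not a spinlock). */
struct model_pte_table {
	bool locked;
	uint64_t pte[4];
};

/* Hypothetical model of a PMD entry linking a PTE table. */
struct model_pmd {
	struct model_pte_table *table;
	bool sw_present;	/* software bit a user PMD table would have set */
};

/* Validating walk: checks the entry first and gives up if it looks wrong. */
static struct model_pte_table *model_map_lock(struct model_pmd *pmd)
{
	if (pmd->table == NULL || !pmd->sw_present)
		return NULL;
	pmd->table->locked = true;
	return pmd->table;
}

/* Direct lookup: returns the lock of the linked table, ignoring software bits. */
static bool *model_lockptr(struct model_pmd *pmd)
{
	return &pmd->table->locked;
}
```

For a model entry that links a table but has sw_present == false, model_map_lock() returns NULL, while model_lockptr() still yields the lock of the linked table.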
--
Cheers
David / dhildenb