On Fri, Aug 15, 2025 at 08:51:21AM +1000, Balbir Singh wrote: > On 8/13/25 10:07, Mika Penttilä wrote: > > > > On 8/13/25 02:36, Balbir Singh wrote: > > > >> On 8/12/25 15:35, Mika Penttilä wrote: > >>> Hi, > >>> > >>> On 8/12/25 05:40, Balbir Singh wrote: > >>> > >>>> MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during > >>>> migrate_vma_setup() and MIGRATE_PFN_COMPOUND will make migrating > >>>> device pages as compound pages during device pfn migration. > >>>> > >>>> migrate_device code paths go through the collect, setup > >>>> and finalize phases of migration. > >>>> > >>>> The entries in src and dst arrays passed to these functions still > >>>> remain at a PAGE_SIZE granularity. When a compound page is passed, > >>>> the first entry has the PFN along with MIGRATE_PFN_COMPOUND > >>>> and other flags set (MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID), the > >>>> remaining entries (HPAGE_PMD_NR - 1) are filled with 0's. This > >>>> representation allows for the compound page to be split into smaller > >>>> page sizes. > >>>> > >>>> migrate_vma_collect_hole(), migrate_vma_collect_pmd() are now THP > >>>> page aware. Two new helper functions migrate_vma_collect_huge_pmd() > >>>> and migrate_vma_insert_huge_pmd_page() have been added. > >>>> > >>>> migrate_vma_collect_huge_pmd() can collect THP pages, but if for > >>>> some reason this fails, there is fallback support to split the folio > >>>> and migrate it. > >>>> > >>>> migrate_vma_insert_huge_pmd_page() closely follows the logic of > >>>> migrate_vma_insert_page() > >>>> > >>>> Support for splitting pages as needed for migration will follow in > >>>> later patches in this series. > >>>> > >>>> Cc: Andrew Morton <[email protected]> > >>>> Cc: David Hildenbrand <[email protected]> > >>>> Cc: Zi Yan <[email protected]> > >>>> Cc: Joshua Hahn <[email protected]> > >>>> Cc: Rakie Kim <[email protected]> > >>>> Cc: Byungchul Park <[email protected]> > >>>> Cc: Gregory Price <[email protected]> > >>>> Cc: Ying Huang <[email protected]> > >>>> Cc: Alistair Popple <[email protected]> > >>>> Cc: Oscar Salvador <[email protected]> > >>>> Cc: Lorenzo Stoakes <[email protected]> > >>>> Cc: Baolin Wang <[email protected]> > >>>> Cc: "Liam R. 
Howlett" <[email protected]> > >>>> Cc: Nico Pache <[email protected]> > >>>> Cc: Ryan Roberts <[email protected]> > >>>> Cc: Dev Jain <[email protected]> > >>>> Cc: Barry Song <[email protected]> > >>>> Cc: Lyude Paul <[email protected]> > >>>> Cc: Danilo Krummrich <[email protected]> > >>>> Cc: David Airlie <[email protected]> > >>>> Cc: Simona Vetter <[email protected]> > >>>> Cc: Ralph Campbell <[email protected]> > >>>> Cc: Mika Penttilä <[email protected]> > >>>> Cc: Matthew Brost <[email protected]> > >>>> Cc: Francois Dugast <[email protected]> > >>>> > >>>> Signed-off-by: Balbir Singh <[email protected]> > >>>> --- > >>>> include/linux/migrate.h | 2 + > >>>> mm/migrate_device.c | 457 ++++++++++++++++++++++++++++++++++------ > >>>> 2 files changed, 396 insertions(+), 63 deletions(-) > >>>> > >>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h > >>>> index acadd41e0b5c..d9cef0819f91 100644 > >>>> --- a/include/linux/migrate.h > >>>> +++ b/include/linux/migrate.h > >>>> @@ -129,6 +129,7 @@ static inline int migrate_misplaced_folio(struct > >>>> folio *folio, int node) > >>>> #define MIGRATE_PFN_VALID (1UL << 0) > >>>> #define MIGRATE_PFN_MIGRATE (1UL << 1) > >>>> #define MIGRATE_PFN_WRITE (1UL << 3) > >>>> +#define MIGRATE_PFN_COMPOUND (1UL << 4) > >>>> #define MIGRATE_PFN_SHIFT 6 > >>>> > >>>> static inline struct page *migrate_pfn_to_page(unsigned long mpfn) > >>>> @@ -147,6 +148,7 @@ enum migrate_vma_direction { > >>>> MIGRATE_VMA_SELECT_SYSTEM = 1 << 0, > >>>> MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1, > >>>> MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2, > >>>> + MIGRATE_VMA_SELECT_COMPOUND = 1 << 3, > >>>> }; > >>>> > >>>> struct migrate_vma { > >>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c > >>>> index 0ed337f94fcd..6621bba62710 100644 > >>>> --- a/mm/migrate_device.c > >>>> +++ b/mm/migrate_device.c > >>>> @@ -14,6 +14,7 @@ > >>>> #include <linux/pagewalk.h> > >>>> #include <linux/rmap.h> > >>>> #include <linux/swapops.h> > >>>> +#include <asm/pgalloc.h> > >>>> #include <asm/tlbflush.h> > >>>> #include "internal.h" > >>>> > >>>> @@ -44,6 +45,23 @@ static int migrate_vma_collect_hole(unsigned long > >>>> start, > >>>> if (!vma_is_anonymous(walk->vma)) > >>>> return migrate_vma_collect_skip(start, end, walk); > >>>> > >>>> + if (thp_migration_supported() && > >>>> + (migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) && > >>>> + (IS_ALIGNED(start, HPAGE_PMD_SIZE) && > >>>> + IS_ALIGNED(end, HPAGE_PMD_SIZE))) { > >>>> + migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE | > >>>> + MIGRATE_PFN_COMPOUND; > >>>> + migrate->dst[migrate->npages] = 0; > >>>> + migrate->npages++; > >>>> + migrate->cpages++; > >>>> + > >>>> + /* > >>>> + * Collect the remaining entries as holes, in case we > >>>> + * need to split later > >>>> + */ > >>>> + return migrate_vma_collect_skip(start + PAGE_SIZE, end, > >>>> walk); > >>>> + } > >>>> + > >>>> for (addr = start; addr < end; addr += PAGE_SIZE) { > >>>> migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE; > >>>> migrate->dst[migrate->npages] = 0; > >>>> @@ -54,57 +72,151 @@ static int migrate_vma_collect_hole(unsigned long > >>>> start, > >>>> return 0; > >>>> } > >>>> > >>>> -static int migrate_vma_collect_pmd(pmd_t *pmdp, > >>>> - unsigned long start, > >>>> - unsigned long end, > >>>> - struct mm_walk *walk) > >>>> +/** > >>>> + * migrate_vma_collect_huge_pmd - collect THP pages without splitting > >>>> the > >>>> + * folio for device private pages. 
> >>>> + * @pmdp: pointer to pmd entry > >>>> + * @start: start address of the range for migration > >>>> + * @end: end address of the range for migration > >>>> + * @walk: mm_walk callback structure > >>>> + * > >>>> + * Collect the huge pmd entry at @pmdp for migration and set the > >>>> + * MIGRATE_PFN_COMPOUND flag in the migrate src entry to indicate that > >>>> + * migration will occur at HPAGE_PMD granularity > >>>> + */ > >>>> +static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long > >>>> start, > >>>> + unsigned long end, struct > >>>> mm_walk *walk, > >>>> + struct folio *fault_folio) > >>>> { > >>>> + struct mm_struct *mm = walk->mm; > >>>> + struct folio *folio; > >>>> struct migrate_vma *migrate = walk->private; > >>>> - struct folio *fault_folio = migrate->fault_page ? > >>>> - page_folio(migrate->fault_page) : NULL; > >>>> - struct vm_area_struct *vma = walk->vma; > >>>> - struct mm_struct *mm = vma->vm_mm; > >>>> - unsigned long addr = start, unmapped = 0; > >>>> spinlock_t *ptl; > >>>> - pte_t *ptep; > >>>> + swp_entry_t entry; > >>>> + int ret; > >>>> + unsigned long write = 0; > >>>> > >>>> -again: > >>>> - if (pmd_none(*pmdp)) > >>>> + ptl = pmd_lock(mm, pmdp); > >>>> + if (pmd_none(*pmdp)) { > >>>> + spin_unlock(ptl); > >>>> return migrate_vma_collect_hole(start, end, -1, walk); > >>>> + } > >>>> > >>>> if (pmd_trans_huge(*pmdp)) { > >>>> - struct folio *folio; > >>>> - > >>>> - ptl = pmd_lock(mm, pmdp); > >>>> - if (unlikely(!pmd_trans_huge(*pmdp))) { > >>>> + if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) { > >>>> spin_unlock(ptl); > >>>> - goto again; > >>>> + return migrate_vma_collect_skip(start, end, > >>>> walk); > >>>> } > >>>> > >>>> folio = pmd_folio(*pmdp); > >>>> if (is_huge_zero_folio(folio)) { > >>>> spin_unlock(ptl); > >>>> - split_huge_pmd(vma, pmdp, addr); > >>>> - } else { > >>>> - int ret; > >>>> + return migrate_vma_collect_hole(start, end, -1, > >>>> walk); > >>>> + } > >>>> + if (pmd_write(*pmdp)) > >>>> + write = MIGRATE_PFN_WRITE; > >>>> + } else if (!pmd_present(*pmdp)) { > >>>> + entry = pmd_to_swp_entry(*pmdp); > >>>> + folio = pfn_swap_entry_folio(entry); > >>>> + > >>>> + if (!is_device_private_entry(entry) || > >>>> + !(migrate->flags & > >>>> MIGRATE_VMA_SELECT_DEVICE_PRIVATE) || > >>>> + (folio->pgmap->owner != migrate->pgmap_owner)) { > >>>> + spin_unlock(ptl); > >>>> + return migrate_vma_collect_skip(start, end, > >>>> walk); > >>>> + } > >>>> > >>>> - folio_get(folio); > >>>> + if (is_migration_entry(entry)) { > >>>> + migration_entry_wait_on_locked(entry, ptl); > >>>> spin_unlock(ptl); > >>>> - /* FIXME: we don't expect THP for fault_folio */ > >>>> - if (WARN_ON_ONCE(fault_folio == folio)) > >>>> - return migrate_vma_collect_skip(start, > >>>> end, > >>>> - walk); > >>>> - if (unlikely(!folio_trylock(folio))) > >>>> - return migrate_vma_collect_skip(start, > >>>> end, > >>>> - walk); > >>>> - ret = split_folio(folio); > >>>> - if (fault_folio != folio) > >>>> - folio_unlock(folio); > >>>> - folio_put(folio); > >>>> - if (ret) > >>>> - return migrate_vma_collect_skip(start, > >>>> end, > >>>> - walk); > >>>> + return -EAGAIN; > >>>> } > >>>> + > >>>> + if (is_writable_device_private_entry(entry)) > >>>> + write = MIGRATE_PFN_WRITE; > >>>> + } else { > >>>> + spin_unlock(ptl); > >>>> + return -EAGAIN; > >>>> + } > >>>> + > >>>> + folio_get(folio); > >>>> + if (folio != fault_folio && unlikely(!folio_trylock(folio))) { > >>>> + spin_unlock(ptl); > >>>> + folio_put(folio); > >>>> + return migrate_vma_collect_skip(start, 
end, walk); > >>>> + } > >>>> + > >>>> + if (thp_migration_supported() && > >>>> + (migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) && > >>>> + (IS_ALIGNED(start, HPAGE_PMD_SIZE) && > >>>> + IS_ALIGNED(end, HPAGE_PMD_SIZE))) { > >>>> + > >>>> + struct page_vma_mapped_walk pvmw = { > >>>> + .ptl = ptl, > >>>> + .address = start, > >>>> + .pmd = pmdp, > >>>> + .vma = walk->vma, > >>>> + }; > >>>> + > >>>> + unsigned long pfn = page_to_pfn(folio_page(folio, 0)); > >>>> + > >>>> + migrate->src[migrate->npages] = migrate_pfn(pfn) | write > >>>> + | MIGRATE_PFN_MIGRATE > >>>> + | MIGRATE_PFN_COMPOUND; > >>>> + migrate->dst[migrate->npages++] = 0; > >>>> + migrate->cpages++; > >>>> + ret = set_pmd_migration_entry(&pvmw, folio_page(folio, > >>>> 0)); > >>>> + if (ret) { > >>>> + migrate->npages--; > >>>> + migrate->cpages--; > >>>> + migrate->src[migrate->npages] = 0; > >>>> + migrate->dst[migrate->npages] = 0; > >>>> + goto fallback; > >>>> + } > >>>> + migrate_vma_collect_skip(start + PAGE_SIZE, end, walk); > >>>> + spin_unlock(ptl); > >>>> + return 0; > >>>> + } > >>>> + > >>>> +fallback: > >>>> + spin_unlock(ptl); > >>>> + if (!folio_test_large(folio)) > >>>> + goto done; > >>>> + ret = split_folio(folio); > >>>> + if (fault_folio != folio) > >>>> + folio_unlock(folio); > >>>> + folio_put(folio); > >>>> + if (ret) > >>>> + return migrate_vma_collect_skip(start, end, walk); > >>>> + if (pmd_none(pmdp_get_lockless(pmdp))) > >>>> + return migrate_vma_collect_hole(start, end, -1, walk); > >>>> + > >>>> +done: > >>>> + return -ENOENT; > >>>> +} > >>>> + > >>>> +static int migrate_vma_collect_pmd(pmd_t *pmdp, > >>>> + unsigned long start, > >>>> + unsigned long end, > >>>> + struct mm_walk *walk) > >>>> +{ > >>>> + struct migrate_vma *migrate = walk->private; > >>>> + struct vm_area_struct *vma = walk->vma; > >>>> + struct mm_struct *mm = vma->vm_mm; > >>>> + unsigned long addr = start, unmapped = 0; > >>>> + spinlock_t *ptl; > >>>> + struct folio *fault_folio = migrate->fault_page ? > >>>> + page_folio(migrate->fault_page) : NULL; > >>>> + pte_t *ptep; > >>>> + > >>>> +again: > >>>> + if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) { > >>>> + int ret = migrate_vma_collect_huge_pmd(pmdp, start, > >>>> end, walk, fault_folio); > >>>> + > >>>> + if (ret == -EAGAIN) > >>>> + goto again; > >>>> + if (ret == 0) > >>>> + return 0; > >>>> } > >>>> > >>>> ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl); > >>>> @@ -222,8 +334,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > >>>> mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0; > >>>> } > >>>> > >>>> - /* FIXME support THP */ > >>>> - if (!page || !page->mapping || PageTransCompound(page)) > >>>> { > >>>> + if (!page || !page->mapping) { > >>>> mpfn = 0; > >>>> goto next; > >>>> } > >>>> @@ -394,14 +505,6 @@ static bool migrate_vma_check_page(struct page > >>>> *page, struct page *fault_page) > >>>> */ > >>>> int extra = 1 + (page == fault_page); > >>>> > >>>> - /* > >>>> - * FIXME support THP (transparent huge page), it is bit more > >>>> complex to > >>>> - * check them than regular pages, because they can be mapped > >>>> with a pmd > >>>> - * or with a pte (split pte mapping). > >>>> - */ > >>>> - if (folio_test_large(folio)) > >>>> - return false; > >>>> - > >>> You cannot remove this check unless support normal mTHP folios migrate to > >>> device, > >>> which I think this series doesn't do, but maybe should? > >>> > >> mTHP needs to be a follow up series, there are comments in the cover > >> letter. 
I had an
> >> assert earlier to prevent other folio sizes, but I've removed it and the
> >> interface
> >> to drivers does not allow for mTHP device private folios.
> >>
> > pte mapped device private THPs with other sizes also can be created as a
> > result of pmd and folio splits.
> > You should handle them in one place consistently not by different drivers.
> > As pointed by Matthew, there's the problem with the fault_page if not split
> > and just ignored.
> >
> I've not run into this with my testing, let me try with more mTHP sizes
> enabled. I'll wait on Matthew
> to post his test case or any results, issues seen
>

I've hit this. In the code I shared privately, I split THPs in the
page-collection path; that was omitted in v2 and v3, and I believe you'll
need those changes. The code I'm referring to had the comment below:

/*
 * XXX: No clean way to support higher-order folios that don't match PMD
 * boundaries for now - split them instead. Once mTHP support lands, add
 * proper support for this case.
 *
 * The test, which exposed this as problematic, remapped (memremap) a
 * large folio to an unaligned address, resulting in the folio being
 * found in the middle of the PTEs. The requested number of pages was
 * less than the folio size. Likely to be handled gracefully by upper
 * layers eventually, but not yet.
 */

I triggered it with some odd mremap operations, which IIRC caused the CPU
page-fault handler to spin indefinitely: a large device folio had been
moved into the middle of a PMD. Upstream could hit the same problem if a
device fault handler enforces a must-migrate-to-device policy and mremap
moves a large CPU folio into the middle of a PMD.

I'm in the middle of other work; when I circle back, I'll try to create a
selftest that reproduces this. My current test is a fairly convoluted IGT
with a bunch of threads doing remap nonsense, but I'll try to distill it
into a concise selftest.
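
Roughly, the CPU side of the reproducer I have in mind is something like
the sketch below (untested and purely illustrative; PMD_SIZE_2M and
PAGE_SIZE_4K are local stand-ins for 2MiB THPs and 4KiB pages, and the
device side, i.e. a fault handler that insists on migrating the whole
range to the device, is driver specific and not shown):

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define PMD_SIZE_2M	(2UL << 20)
#define PAGE_SIZE_4K	4096UL

int main(void)
{
	char *raw, *src, *dst, *target;

	/* Over-allocate so we can pick a PMD-aligned source address. */
	raw = mmap(NULL, 3 * PMD_SIZE_2M, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return 1;
	src = (char *)(((uintptr_t)raw + PMD_SIZE_2M - 1) &
		       ~(PMD_SIZE_2M - 1));

	/* Ask for THPs and fault the range in so we (hopefully) get one. */
	madvise(src, PMD_SIZE_2M, MADV_HUGEPAGE);
	memset(src, 0xaa, PMD_SIZE_2M);

	/* Pick a destination that is deliberately not PMD aligned. */
	dst = mmap(NULL, 3 * PMD_SIZE_2M, PROT_NONE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (dst == MAP_FAILED)
		return 1;
	target = (char *)(((uintptr_t)dst + PMD_SIZE_2M - 1) &
			  ~(PMD_SIZE_2M - 1)) + PAGE_SIZE_4K;

	/*
	 * Moving the THP to an unaligned address forces it to be PTE
	 * mapped, leaving the large folio in the middle of a PMD. A
	 * subsequent device fault that tries to migrate the surrounding
	 * PMD range then finds the folio in the middle of the PTEs.
	 */
	if (mremap(src, PMD_SIZE_2M, PMD_SIZE_2M,
		   MREMAP_MAYMOVE | MREMAP_FIXED, target) == MAP_FAILED)
		return 1;

	return 0;
}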

Matt

> Balbir Singh
>