On Thu, 21 May 2026 12:01:17 +0800 "Li Zhe" <[email protected]> wrote:
> memmap_init_zone_device() can spend a substantial amount of time
> initializing large ZONE_DEVICE ranges because it repeats nearly
> identical struct page setup for every PFN.
>
> This series reduces that overhead in seven steps.

Cool, thanks, we all love speedups.

> The first patch factors the reusable pieces out of
> __init_zone_device_page() so later patches can share the same logic
> without changing the existing slow path.
>
> The second patch adds set_page_section_from_pfn(), so generic callers
> can update section bits from a PFN without open-coding
> SECTION_IN_PAGE_FLAGS checks.
>
> The third patch adds a template-based fast path for ZONE_DEVICE head
> pages. Instead of rebuilding the same struct page state for every PFN,
> it prepares a reusable page template once and copies it to each
> destination page.
>
> The fourth patch extends the same template-based approach to compound
> tails, so pfns_per_compound > 1 can also benefit from the fast path.
>
> The fifth patch introduces memcpy_streaming() and
> memcpy_streaming_drain() as a generic interface for write-once
> streaming copies, with a memcpy() fallback for architectures that do
> not provide a specialized backend.
>
> The sixth patch extends x86 memcpy_flushcache() small fixed-size
> fastpaths so struct-page-sized streaming copies can stay on the inline
> path.
>
> The last patch switches the zone-device template-copy path over to
> memcpy_streaming(). It refreshes PFN-dependent fields in the reusable
> template before each copy, keeps pageblock-aligned PFNs on regular
> memcpy(), and drains streaming stores before later normal stores update
> overlapping or dependent metadata.
>
> The optimized path is disabled when the page_ref_set tracepoint is
> enabled, sanitized builds remain on the slow path so their
> instrumented stores are preserved, and the fast path falls back to the
> existing slow path if sizeof(struct page) is not an integral number of
> u64 words.
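
To make the template idea above concrete, here is a minimal userspace sketch of it. Everything in it is an assumption for illustration: `struct demo_page`, its fields, and the `demo_*` helpers are hypothetical stand-ins, not the kernel's struct page or the code in this series. It only shows the shape of "build the state once, refresh the PFN-dependent field, copy", plus the fallback when the structure is not a whole number of u64 words.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct page; NOT the kernel's layout. */
struct demo_page {
	uint64_t flags;		/* identical for every page in the range */
	uint64_t pfn_tag;	/* PFN-dependent, refreshed before each copy */
	uint64_t refcount;	/* identical for every page in the range */
	uint64_t pgmap;		/* identical for every page in the range */
};

/* Slow path: rebuild every field for every page. */
static void demo_init_slow(struct demo_page *dst, size_t n, uint64_t start_pfn,
			   uint64_t flags, uint64_t pgmap)
{
	for (size_t i = 0; i < n; i++) {
		dst[i].flags = flags;
		dst[i].pfn_tag = start_pfn + i;
		dst[i].refcount = 1;
		dst[i].pgmap = pgmap;
	}
}

/* The template copy is only attempted for a whole number of u64 words. */
static bool demo_template_ok(void)
{
	return sizeof(struct demo_page) % sizeof(uint64_t) == 0;
}

/* Fast path: prepare one template, then refresh and copy it per page. */
static void demo_init_range(struct demo_page *dst, size_t n, uint64_t start_pfn,
			    uint64_t flags, uint64_t pgmap)
{
	if (!demo_template_ok()) {
		demo_init_slow(dst, n, start_pfn, flags, pgmap);
		return;
	}

	struct demo_page tmpl = {
		.flags = flags,
		.refcount = 1,
		.pgmap = pgmap,
	};

	for (size_t i = 0; i < n; i++) {
		tmpl.pfn_tag = start_pfn + i;	/* only this field changes */
		memcpy(&dst[i], &tmpl, sizeof(tmpl));
	}
}
```

Both paths should leave the destination array in the same state; the sketch uses plain memcpy() and does not model the streaming stores or the drain step described for the last patch.
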
>
> Testing
> =======
>
> Tests were run in a VM on an Intel Ice Lake server.
>
> Two PMEM configurations were used:
> - a 100 GB fsdax namespace configured with map=dev, which exercises
>   the nd_pmem rebind path (pfns_per_compound == 1)
> - a 100 GB devdax namespace configured with align=2097152, which
>   exercises the dax_pmem rebind path (pfns_per_compound > 1)
>
> For each configuration, the corresponding driver was unbound and
> rebound 30 times. Memmap initialization latency was collected from the
> pr_debug() output of memmap_init_zone_device().
>
> The first bind is reported separately, and the average of subsequent
> rebinds is used as the steady-state result.

How closely does this workload resemble any real-world user workload?

> Performance
> ===========
>
> nd_pmem rebind, 100 GB fsdax namespace, map=dev
>   Base(v7.1-rc3):
>     First binding: 1486 ms
>     Average of subsequent rebinds: 273.52 ms
>   With patches 1-3 applied:
>     First binding: 1422 ms
>     Average of subsequent rebinds: 245.73 ms
>   Full series:
>     First binding: 1389 ms
>     Average of subsequent rebinds: 111.08 ms
>
> dax_pmem rebind, 100 GB devdax namespace, align=2097152
>   Base(v7.1-rc3):
>     First binding: 1515 ms
>     Average of subsequent rebinds: 313.45 ms
>   With patches 1-4 applied:
>     First binding: 1422 ms
>     Average of subsequent rebinds: 256.56 ms
>   Full series:
>     First binding: 1294 ms
>     Average of subsequent rebinds: 110.24 ms

The improvements appear to range between "modest" and "large", but what
I'd like to understand is how frequently users perform these operations
in real-world workloads.

IOW, (and this is always the bottom line), how valuable is this
patchset to our users?
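
For reference, the relative reductions implied by the quoted numbers can be worked out with a small helper like the one below. This is only a sketch: `reduction_pct()` and `print_reductions()` are hypothetical names, and the inputs are the base and full-series figures copied from the tables above, not new measurements.

```c
#include <stdio.h>

/* Percentage reduction in latency when going from base_ms to new_ms. */
static double reduction_pct(double base_ms, double new_ms)
{
	return (base_ms - new_ms) / base_ms * 100.0;
}

/* Print base vs. full-series reductions for the four quoted pairs. */
static void print_reductions(void)
{
	printf("nd_pmem  first bind:   %.1f%%\n", reduction_pct(1486.0, 1389.0));
	printf("nd_pmem  steady state: %.1f%%\n", reduction_pct(273.52, 111.08));
	printf("dax_pmem first bind:   %.1f%%\n", reduction_pct(1515.0, 1294.0));
	printf("dax_pmem steady state: %.1f%%\n", reduction_pct(313.45, 110.24));
}
```

Calling print_reductions() from a main() prints one line per pair; the first-bind pairs land at the "modest" end and the steady-state rebind pairs at the "large" end.
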
> mm: factor zone-device page init helpers out of
> __init_zone_device_page
> mm: add a set_page_section_from_pfn() helper
> mm: add a template-based fast path for zone-device page init
> mm: extend the template fast path to zone-device compound tails
> string: introduce memcpy_streaming() helpers
> x86/string: extend memcpy_flushcache() fixed-size fastpaths
> mm: use memcpy_streaming() in zone-device template copies
>
> arch/x86/include/asm/string_64.h | 100 +++++++++++++---
> include/linux/mm.h | 19 ++-
> include/linux/string.h | 18 +++
> mm/mm_init.c | 198 +++++++++++++++++++++++++++----
> 4 files changed, 294 insertions(+), 41 deletions(-)

I won't take any action at this stage - let's await reviewer input. If
none is forthcoming then please remind me and I'll figure out what to
do.

The ever-present reviewer called "Sashiko" has thoughts to offer:
https://sashiko.dev/#/patchset/[email protected]
Please take a look and decide whether there's useful material in there.