The template fast path still leaves the actual copy sequence up to the
compiler. Use the streaming-copy helpers introduced in the previous
patches for the ZONE_DEVICE template-copy path so common mm code can
request a write-once copy primitive without embedding arch-specific
store layout in the generic layer.

ZONE_DEVICE memmap initialization is a write-once path: each struct page
is populated once and is not expected to be reused from cache
immediately afterwards. A regular cached copy can therefore incur
write-allocate traffic and pollute the cache without much benefit.
Using memcpy_streaming() lets this path use an architecture-optimized
streaming copy where available, while still degrading to memcpy() on
architectures that do not provide a specialized implementation.
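
As a rough illustration of the fallback behaviour described above, the
userspace sketch below shows one way a streaming copy could degrade to
memcpy(). The names sketch_memcpy_streaming() and SKETCH_HAVE_STREAMING
are hypothetical, and the x86-64 path assumes the SSE2 _mm_stream_si64()
intrinsic as the backend. It is not the helper added earlier in this
series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && defined(__SSE2__)
#include <emmintrin.h>
#define SKETCH_HAVE_STREAMING 1
#else
#define SKETCH_HAVE_STREAMING 0
#endif

/* Hypothetical userspace stand-in; not the kernel's memcpy_streaming(). */
static void sketch_memcpy_streaming(void *dst, const void *src, size_t len)
{
	assert(dst != NULL && src != NULL);
#if SKETCH_HAVE_STREAMING
	/* Only stream naturally aligned buffers made of whole 8-byte words. */
	if (((uintptr_t)dst % 8) == 0 && (len % 8) == 0) {
		long long *d = dst;
		const unsigned char *s = src;
		size_t i;

		for (i = 0; i < len / 8; i++) {
			long long word;

			/* Load through memcpy() so src may be unaligned. */
			memcpy(&word, s + i * 8, sizeof(word));
			/* Non-temporal store: bypasses write-allocate. */
			_mm_stream_si64(&d[i], word);
		}
		/* Ordering against later stores is left to the caller. */
		return;
	}
#endif
	/* No specialized backend, or unsuitable buffer: plain cached copy. */
	memcpy(dst, src, len);
}
```

Callers see the same result on either path; only the cache behaviour of
the stores differs.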

Update the PFN-dependent section bits and page->virtual state in the
reusable template before each copy instead of patching the destination
page afterwards. This keeps the hot path as a single streaming copy for
the common case and avoids post-copy normal stores to cachelines that
were just written through the streaming path. Keep pageblock-aligned PFNs
on memcpy() so pageblock initialization can immediately read back page
metadata without introducing a read-after-streaming dependency.
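
The template-first ordering described above can be modelled in userspace
roughly as follows. Everything here is a simplified assumption: struct
sketch_page, the bit layout of its flags word, and the SKETCH_* constants
are hypothetical and do not mirror struct page. Both copy branches use
memcpy() in the sketch; the return value only reports which branch the
patch would take.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGEBLOCK_PAGES   512UL  /* assumed pageblock size in pages */
#define SKETCH_PFN_SECTION_SHIFT 15     /* assumed PFNs-per-section shift */
#define SKETCH_SECTION_BIT       48     /* assumed section field position */
#define SKETCH_INVARIANT_MASK    ((1ULL << SKETCH_SECTION_BIT) - 1)

/* Hypothetical, much smaller stand-in for struct page. */
struct sketch_page {
	uint64_t flags;  /* low bits invariant, high bits hold the section */
	uint64_t pgmap;  /* invariant across pages sharing the template */
};

/* Refresh only the PFN-dependent bits, in the template itself. */
static void sketch_update_template(struct sketch_page *tmpl, unsigned long pfn)
{
	uint64_t section = (uint64_t)pfn >> SKETCH_PFN_SECTION_SHIFT;

	tmpl->flags = (tmpl->flags & SKETCH_INVARIANT_MASK) |
		      (section << SKETCH_SECTION_BIT);
}

/* Returns true when the cached (non-streaming) copy would be chosen. */
static bool sketch_init_from_template(struct sketch_page *page,
				      unsigned long pfn,
				      struct sketch_page *tmpl)
{
	bool cached = (pfn % SKETCH_PAGEBLOCK_PAGES) == 0;

	assert(page != tmpl);
	sketch_update_template(tmpl, pfn);
	/* One copy, no stores to *page afterwards. */
	memcpy(page, tmpl, sizeof(*page));
	return cached;
}
```

Because the destination is written exactly once, nothing touches its
cachelines with ordinary stores after the copy.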

When the streaming backend uses non-temporal stores, order them before
entering memmap_init_compound(), before prep_compound_head() updates the
overlapping compound metadata, and before returning from
memmap_init_zone_device().
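
A minimal userspace sketch of where such a drain sits relative to a later
ordinary store, assuming an x86-64 backend where the drain is an sfence.
The sketch_* names are hypothetical and the non-x86 branches are plain
stores with a no-op drain.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) && defined(__SSE2__)
#include <emmintrin.h>
#define SKETCH_NT 1
#else
#define SKETCH_NT 0
#endif

/* Hypothetical drain: sfence on x86-64, nothing elsewhere. */
static void sketch_streaming_drain(void)
{
#if SKETCH_NT
	_mm_sfence();
#endif
}

/* Write n words with non-temporal stores where available. */
static void sketch_stream_words(long long *dst, const long long *src, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
#if SKETCH_NT
		_mm_stream_si64(&dst[i], src[i]);
#else
		dst[i] = src[i];
#endif
	}
}

/* Stream the descriptors, drain, then update overlapping metadata. */
static void sketch_init_then_patch(long long *desc, const long long *tmpl,
				   size_t n, long long meta)
{
	assert(n > 0);
	sketch_stream_words(desc, tmpl, n);
	sketch_streaming_drain();  /* order streamed writes first */
	desc[0] = meta;            /* ordinary store to a streamed line */
}
```

On a single thread the final values are the same with or without the
fence; the sketch only shows the position of the drain in the sequence.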

Keep sanitized builds on the slow path so KASAN/KMSAN retain their
instrumented stores.

Tested in a VM with a 100 GB fsdax namespace device configured with
map=dev and a 100 GB devdax namespace (align=2097152) on an Intel Ice
Lake server.

Test procedure:
Rebind the nd_pmem and dax_pmem drivers 30 times and collect the memmap
initialization time from the pr_debug() output of
memmap_init_zone_device().

Base (v7.1-rc3):
  First binding for nd_pmem driver: 1486 ms
  Average of subsequent rebinds: 273.52 ms

  First binding for dax_pmem driver: 1515 ms
  Average of subsequent rebinds: 313.45 ms

With this series:
  First binding for nd_pmem driver: 1389 ms
  Average of subsequent rebinds: 111.08 ms

  First binding for dax_pmem driver: 1294 ms
  Average of subsequent rebinds: 110.24 ms

This reduces the average rebind time by about 59.4% for nd_pmem and
64.8% for dax_pmem.

Signed-off-by: Li Zhe <[email protected]>
---
 mm/mm_init.c | 83 +++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 60 insertions(+), 23 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 17a84d4cda01..08feb24795b8 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1075,13 +1075,15 @@ static void __ref zone_device_page_init_slow(struct page *page,
 static inline bool zone_device_page_init_optimization_enabled(void)
 {
        /*
-        * The template fast path copies a preinitialized struct page as an
-        * array of u64 words. Skip it when the page_ref_set tracepoint is
-        * enabled, and fall back to the slow path if struct page is not an
-        * integral number of u64 words.
+        * The template fast path copies a preinitialized struct page from
+        * a reusable template. Keep sanitized builds on the slow path so
+        * their instrumented stores remain intact, skip the fast path when
+        * the page_ref_set tracepoint is enabled, and fall back if
+        * struct page is not an integral number of u64 words.
         */
-       return !page_ref_tracepoint_active(page_ref_set) &&
-               IS_ALIGNED(sizeof(struct page), sizeof(u64));
+       return !IS_ENABLED(CONFIG_KASAN) && !IS_ENABLED(CONFIG_KMSAN) &&
+              !page_ref_tracepoint_active(page_ref_set) &&
+              IS_ALIGNED(sizeof(struct page), sizeof(u64));
 }
 
 static inline void zone_device_template_head_page_init(struct page *template,
@@ -1104,30 +1106,42 @@ static inline void zone_device_template_tail_page_init(struct page *template,
 }
 
 /*
- * The copied template already provides the PFN-invariant portion of a
- * ZONE_DEVICE struct page. Fix up the fields that still depend on @pfn
- * after the copy, namely the section bits and page->virtual when present.
+ * 'template' is a reusable page prototype rather than a strictly immutable
+ * object. Most ZONE_DEVICE fields stay constant across the pages covered by
+ * the current template, but section bits and page->virtual may still depend
+ * on the PFN. Refresh those PFN-dependent fields in the template before
+ * copying it into @page.
  */
-static inline void zone_device_page_init_finish(struct page *page,
-                                                       unsigned long pfn)
+static inline void zone_device_page_update_template(struct page *template,
+                                                   unsigned long pfn)
 {
-       set_page_section_from_pfn(page, pfn);
+       set_page_section_from_pfn(template, pfn);
 #ifdef WANT_PAGE_VIRTUAL
        if (!is_highmem_idx(ZONE_DEVICE))
-               set_page_address(page, __va(pfn << PAGE_SHIFT));
+               set_page_address(template, __va(pfn << PAGE_SHIFT));
 #endif
 }
 
 static void zone_device_page_init_from_template(struct page *page,
-               unsigned long pfn, const struct page *template)
+               unsigned long pfn, struct page *template)
 {
-       const u64 *src = (const u64 *)template;
-       u64 *dst = (u64 *)page;
-       unsigned int i;
+       /*
+        * 'template' carries the invariant portion of a ZONE_DEVICE struct
+        * page. Update the PFN-dependent fields in place before copying it
+        * to the destination page.
+        *
+        * pageblock-aligned pages immediately feed
+        * init_pageblock_migratetype(), which reads back page metadata via
+        * helpers like page_zone(page). Avoid a read-after-streaming
+        * dependency for these rare pages by using regular cached stores
+        * instead of non-temporal ones.
+        */
+       zone_device_page_update_template(template, pfn);
+       if (unlikely(pageblock_aligned(pfn)))
+               memcpy(page, template, sizeof(*page));
+       else
+               memcpy_streaming(page, template, sizeof(*page));
 
-       for (i = 0; i < sizeof(struct page) / sizeof(u64); i++)
-               dst[i] = src[i];
-       zone_device_page_init_finish(page, pfn);
        zone_device_page_init_pageblock(page, pfn);
 }
 
@@ -1168,9 +1182,10 @@ static void __ref memmap_init_compound(struct page *head,
        __SetPageHead(head);
 
        /*
-        * A tail template can be reused for all tail pages in the same compound page
-        * because shared state for compound tails is pre-set by prep_compound_tail().
-        * The per-page page->virtual and section in flags are fixed up after copying.
+        * All tails of the same compound page share the state established by
+        * prep_compound_tail(). Reuse one tail template for the whole range
+        * and refresh only the PFN-dependent fields in that template before
+        * each copy.
         */
        if (use_template)
                zone_device_template_tail_page_init(&template, head_pfn + 1,
@@ -1189,6 +1204,15 @@ static void __ref memmap_init_compound(struct page *head,
                        set_page_count(page, 0);
                }
        }
+
+       /*
+        * prep_compound_head() updates compound metadata in struct folio fields
+        * that alias the first tail-page descriptors. When the tail pages above
+        * were populated with non-temporal stores, order those writes before the
+        * overlapping metadata updates below.
+        */
+       if (use_template)
+               memcpy_streaming_drain();
        prep_compound_head(head, order);
 }
 
@@ -1237,10 +1261,23 @@ void __ref memmap_init_zone_device(struct zone *zone,
                if (pfns_per_compound == 1)
                        continue;
 
+               /*
+                * Compound-head setup immediately updates head->flags, so make
+                * the streaming template copy visible before entering
+                * memmap_init_compound().
+                */
+               if (use_template)
+                       memcpy_streaming_drain();
+
                memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
                                     compound_nr_pages(altmap, pgmap),
                                     use_template);
        }
+       /*
+        * Drain any remaining non-temporal stores before returning.
+        */
+       if (use_template)
+               memcpy_streaming_drain();
 
        pr_debug("%s initialised %lu pages in %ums\n", __func__,
                nr_pages, jiffies_to_msecs(jiffies - start));
-- 
2.20.1
