On Tue, Nov 18, 2025 at 06:02:02PM +1100, Alistair Popple wrote:
> On 2025-11-13 at 06:29 +1100, Gregory Price <[email protected]> wrote...
> > - Why? (In short: shunting to DAX is a failed pattern for users)
> > - Other designs I considered (mempolicy, cpusets, zone_device)
>
> I'm interested in the contrast with zone_device, and in particular why
> device_coherent memory doesn't end up being a good fit for this.
>

I did consider zone_device briefly, but if you want sparse allocation
you end up essentially re-implementing some form of buddy allocator.
That seemed less than ideal, to say the least.

Additionally, pgmap use precludes these pages from using LRU/Reclaim,
and some devices may very well be compatible with such patterns.
(I think compression will be, but it still needs work)

> > - Why mempolicy.c and cpusets as-is are insufficient
> > - SPM types seeking this form of interface (Accelerator, Compression)
>
> I'm sure you can guess my interest is in GPUs which also have memory
> some people consider should only be used for specific purposes :-)
> Currently our coherent GPUs online this as a normal NUMA node, for
> which we have also generally found mempolicy, cpusets, etc. inadequate
> as well, so it will be interesting to hear what shortcomings you have
> been running into (I'm less familiar with the Compression cases you
> talk about here though).
>

The TL;DR: cpusets as-designed doesn't really allow the concept of
"Nothing can access XYZ node except specific things", because this
would involve removing a node from the root cpuset.mems - and that
restriction can't be loosened. mempolicy is more of a suggestion and
can be completely overridden; it is entirely ignored by things like
demotion/reclaim/etc.

I plan to discuss a bit of the specifics at LPC, but a lot of this
stems from the zone-iteration logic in page_alloc.c and the rather...
ermm... "complex" nature of how mempolicy and cpusets interact with
each other. I may add some additional notes on this thread prior to
LPC, given that time may be too short to get into the nasty bits in
the session.
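To make that concrete, here's a rough sketch of the kind of filter
that would need to exist in the zonelist walk. mt_spm_nodemask is the
hack from the RFC; __GFP_SPM and the helper itself are invented purely
for illustration:

        /*
         * Hypothetical sketch only - not from the RFC patches. The
         * idea: SPM nodes are invisible to ordinary allocations unless
         * the caller explicitly opts in. __GFP_SPM is a made-up flag;
         * mt_spm_nodemask is the nodemask hack from the example.
         */
        static inline bool zone_allowed_for_alloc(struct zone *zone,
                                                  gfp_t gfp_mask)
        {
                int nid = zone_to_nid(zone);

                if (node_isset(nid, mt_spm_nodemask) &&
                    !(gfp_mask & __GFP_SPM))
                        return false;

                return true;
        }

There is no single choke point like this today - cpusets and mempolicy
each filter at different layers - which is most of what I want to walk
through at LPC.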
> > - Platform extensions that would be nice to see (SPM-only Bits)
> >
> > Open Questions
> > - Single SPM nodemask, or multiple based on features?
> > - Apply SPM/SysRAM bit on-boot only or at-hotplug?
> > - Allocate extra "possible" NUMA nodes for flexibility?
>
> I guess this might make hotplug easier? Particularly in cases where FW
> hasn't created the nodes.
>

In cases where you need to reach back to the device for some signal,
you likely need to have the driver for that device manage the
alloc/free patterns - so this may (or may not) generalize to
1-device-per-node.

In the scenario where you want some flexibility in managing regions,
this may require multiple nodes per device. Maybe one device provides
multiple types of memory - you'd want those on separate nodes.

This doesn't seem like something you need to solve right away, just
something for folks to consider.

> > - Should SPM Nodes be zone-restricted? (MOVABLE only?)
>
> For device based memory I think so - otherwise you can never guarantee
> devices can be removed or drivers (if required to access the memory)
> can be unbound, as you can't migrate things off the memory.
>

Zones in this scenario are a bit of a square-peg/round-hole. Forcing
everything into ZONE_MOVABLE means you can't do page pinning or things
like 1GB gigantic pages. But the device driver should be capable of
managing hotplug anyway, so what's the point of ZONE_MOVABLE? :shrug:

> > The ZSwap example demonstrates this with the `mt_spm_nodemask`. This
> > hack treats all spm nodes as-if they are compressed memory nodes, and
> > we bypass the software compression logic in zswap in favor of simply
> > copying memory directly to the allocated page. In a real design
>
> So in your example (I get it's a hack) is the main advantage that you
> can use all the same memory allocation policies (eg. cgroups) when
> needing to allocate the pages? Given this is ZSwap I guess these pages
> would never be mapped directly into user-space but would anything in
> the design prevent that?

This is, in fact, the long term intent. As long as the device can
manage inline decompression with reasonable latencies, there's no
reason you shouldn't be able to leave the pages mapped Read-Only in
user-space. The driver would be responsible for migrating on
write-fault, similar to a NUMA Hint Fault on the existing transparent
page placement system.
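Very roughly, and purely as a sketch (the function name is invented,
and migration_target_control is mm-internal, so a real driver couldn't
call this directly), that write-fault path would look something like:

        static vm_fault_t spm_handle_write_fault(struct vm_fault *vmf)
        {
                struct folio *folio = page_folio(vmf->page);
                struct migration_target_control mtc = {
                        .nid = numa_node_id(), /* or a smarter target */
                        .gfp_mask = GFP_HIGHUSER_MOVABLE,
                };
                LIST_HEAD(pagelist);

                /*
                 * Reads are served in place (the device decompresses
                 * inline); the first write moves the page back to
                 * normal system RAM, much like a NUMA hint fault.
                 */
                if (!folio_isolate_lru(folio))
                        return VM_FAULT_RETRY;

                list_add(&folio->lru, &pagelist);
                migrate_pages(&pagelist, alloc_migration_target, NULL,
                              (unsigned long)&mtc, MIGRATE_SYNC,
                              MR_NUMA_MISPLACED, NULL);

                /* re-fault on the new, writable mapping */
                return VM_FAULT_RETRY;
        }

i.e. reads get served in place by the device, and only the first write
pays the migration cost.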
> For example could a driver say allocate SPM memory and then explicitly
> migrate an existing page to it?

You might even extend migrate_pages with a new flag that simply drops
the writable flag from the page table mapping, and abstract that entire
complexity out of the driver :]
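Something like the below, purely hypothetical - MR_SPM_READONLY is
invented, and remove_migration_pte() doesn't currently take a reason.
The only real change would be where migration re-establishes the PTEs
(remove_migration_pte() in mm/migrate.c):

        pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
        if (reason == MR_SPM_READONLY)          /* invented flag */
                pte = pte_wrprotect(pte);       /* drop the writable bit */

The driver then just sees a read-only page land on the SPM node and
handles the eventual write fault, per the above.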
~Gregory
