On Fri, Sep 26, 2025 at 8:45 PM <[email protected]> wrote:
>
> Michał Cłapiński wrote:
> [..]
> > > As Mike says you would lose 128K at the end, but that indeed becomes
> > > losing that 1GB given alignment constraints.
> > >
> > > However, I think that could be solved by just separately vmalloc'ing the
> > > label space for this. Then instead of kernel parameters to sub-divide a
> > > region, you just have an initramfs script to do the same.
> > >
> > > Does that meet your needs?
> >
> > Sorry, I'm having trouble imagining this.
> > If I wanted 500 1GB chunks, I would request a region of 500GB+space
> > for the label? Or is that a label and info-blocks?
>
> You would specify a memmap= range of 500GB+128K*.
>
> Force attach that range to Mike's RAMDAX driver.
>
> [ modprobe -r nd_e820, don't build nd_e820, or modprobe policy blocks nd_e820 ]
> echo ramdax > /sys/bus/platform/devices/e820_pmem/driver_override
> echo e820_pmem > /sys/bus/platform/drivers/ramdax/bind
>
> * forget what I said about vmalloc() previously, not needed
>
> > Then on each boot the kernel would check if there is an actual
> > label/info-blocks in that space and if yes, it would recreate my
> > devices (including the fsdax/devdax type)?
>
> Right, if that range is persistent the kernel would automatically parse
> the label space each boot and divide up the 500GB region space into
> namespaces.
>
> 128K of label space gives you 509 potential namespaces.

That's not enough for us. We would need roughly an order of magnitude
more. Sorry for being vague about this, but I can't discuss the actual
machine sizes.

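For reference, here is a rough back-of-the-envelope sketch of where the 509 comes from and how the count scales with the label area size. The constants are my assumptions about the label layout, not something stated in this thread: 256-byte labels, two index blocks each made of a 72-byte header plus a one-bit-per-slot bitmap rounded up to 256 bytes, and one slot kept free for updates. usable_namespaces is just a hypothetical helper name.

```shell
#!/bin/sh
# Estimate usable namespace labels for a label area of the given size (bytes).
# All constants below are assumptions, not taken from this thread.
usable_namespaces() (
    config_size=$1
    label_size=256      # assumed size of one namespace label
    index_align=256     # assumed alignment of an index block
    index_hdr=72        # assumed index header size, excluding the bitmap

    # Upper bound on slots if the whole area held labels.
    tmp_nslot=$((config_size / label_size))
    # One index block: header + 1 bit per slot, rounded up to index_align.
    index_size=$(( (index_hdr + (tmp_nslot + 7) / 8 + index_align - 1) / index_align * index_align ))
    # Two index blocks are carved out of the label area.
    nslot=$(( (config_size - 2 * index_size) / label_size ))
    # Assume one slot stays free so a label can be rewritten safely.
    echo $((nslot - 1))
)

usable_namespaces $((128 * 1024))        # prints 509
usable_namespaces $((2 * 1024 * 1024))   # prints 8181
```

Under those assumptions, ~10x more namespaces would need a label area in the low megabytes. Whether the driver can take a label area other than 128K is something I have not checked.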
> > One of the requirements for live update is that the kexec reboot has
> > to be fast. My solution introduced a delay of tens of milliseconds
> > since the actual device creation is asynchronous. Manually dividing a
> > region into thousands of devices from userspace would be very slow but
>
> Wait, 500GB Region / 1GB Namespace = thousands of Namespaces?

I was talking about devices, and AFAIK 1 namespace equals 5 devices for
us currently (nd/{namespace, pfn, btt, dax}, dax/dax). Though device
creation is asynchronous, so I guess the actual device count is not
important.

> > I would have to do that only on the first boot, right?
>
> Yes, the expectation is to only incur that overhead once. It also allows
> for VMs to be able to look up their capacity by name. So you do not need
> a separate mapping of 1GB Namespace blocks to VMs. Just give some VMs
> bigger Namespaces than others by name.

Sure, I can do that at first. But after some time fragmentation will
happen, right? At some point I will have to give VMs a bunch of
smaller namespaces here and there.

Btw, one more thing I don't understand: why are maintainers so strongly
against adding new kernel parameters?