> > Based on the suggestions here, can we consider something like the
> > following?
> >
> > 1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark
> > the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity
> > structure to make it hotpluggable.
>
> Is that "devnode=on" parameter required? Can't we simply expose any node
> that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE?
>
> Right now, with "ordinary", fixed-location memory devices
> (DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an srat entry that
> covers the device memory region for these devices with
> MEM_AFFINITY_HOTPLUGGABLE. We use the highest NUMA node in the machine,
> which does not quite work IIRC. All applicable nodes that don't have
> boot memory would need MEM_AFFINITY_HOTPLUGGABLE for Linux to create them.

Yeah, you're right that it isn't required. Exposing any node without boot
memory as MEM_AFFINITY_HOTPLUGGABLE seems like a better approach than
using "devnode=on".
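If we go that route, something along the following lines in the SRAT build
path should be all that is needed. This is only a rough, untested sketch;
the helper name and the call site (e.g. from build_srat() in
hw/i386/acpi-build.c) are placeholders:

/*
 * Rough sketch, untested: give every NUMA node without boot memory a
 * zero base/size SRAT entry flagged hotpluggable, mirroring what the
 * bare-metal firmware exposes.  Helper name and call site are made up.
 */
#include "qemu/osdep.h"
#include "hw/boards.h"
#include "hw/acpi/aml-build.h"
#include "sysemu/numa.h"

static void build_srat_memoryless_nodes(GArray *table_data, MachineState *ms)
{
    int i;

    for (i = 0; i < ms->numa_state->num_nodes; i++) {
        if (ms->numa_state->nodes[i].node_mem) {
            continue;   /* node has boot memory; covered by existing entries */
        }
        build_srat_memory(table_data, 0, 0, i,
                          MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
    }
}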
> In your example, which memory ranges would we use for these nodes in SRAT?

We are setting the Base Address and the Size to 0 in the SRAT memory
affinity structures. This is done through the following:

    build_srat_memory(table_data, 0, 0, i,
                      MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);

This results in the following logs in the VM from the Linux ACPI SRAT
parsing code:

[    0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 3 PXM 3 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 4 PXM 4 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 5 PXM 5 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 6 PXM 6 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 7 PXM 7 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 8 PXM 8 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 9 PXM 9 [mem 0x00000000-0xffffffffffffffff] hotplug

I would reiterate that we are just emulating the bare-metal behavior here.

> I don't see how these numa-node args on a vfio-pci device have any
> general utility. They're only used to create a firmware table, so why
> don't we be explicit about it and define the firmware table as an
> object? For example:
>
>     -numa node,nodeid=2 \
>     -numa node,nodeid=3 \
>     -numa node,nodeid=4 \
>     -numa node,nodeid=5 \
>     -numa node,nodeid=6 \
>     -numa node,nodeid=7 \
>     -numa node,nodeid=8 \
>     -numa node,nodeid=9 \
>     -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=nvgrace0 \
>     -object nvidia-gpu-mem-acpi,devid=nvgrace0,nodeset=2-9 \

Yeah, that is fine with me. If we agree on this approach, I can go
implement it.

> There are some suggestions in this thread that CXL could have similar
> requirements, but I haven't found any evidence that these
> dev-mem-pxm-{start,count} attributes in the _DSD are standardized in
> any way. If they are, maybe this would be a dev-mem-pxm-acpi object
> rather than an NVIDIA specific one.

Maybe Jason or Jonathan can chime in on this?

> It seems like we could almost meet the requirement for this table via
> -acpitable, but I think we'd like to avoid the VM orchestration tool
> from creating, compiling, and passing ACPI data blobs into the VM.
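Coming back to the nvidia-gpu-mem-acpi object above: the _DSD it would
have to generate should be small. A rough, untested sketch using QEMU's
AML builder, assuming the dev-mem-pxm-{start,count} property names from
the bare-metal firmware (the function name and the wiring into the
proposed object are placeholders):

/*
 * Rough sketch, untested: emit a _DSD carrying the PXM range for the
 * device's coherent memory.  How this hooks into the proposed
 * nvidia-gpu-mem-acpi object is left open.
 */
#include "qemu/osdep.h"
#include "hw/acpi/aml-build.h"

static void build_dev_mem_pxm_dsd(Aml *dev, uint32_t pxm_start,
                                  uint32_t pxm_count)
{
    Aml *dsd = aml_package(2);
    Aml *props = aml_package(2);
    Aml *prop;

    /* Device Properties UUID from the _DSD spec */
    aml_append(dsd, aml_touuid("DAFFD814-6EBA-4D8C-8A91-BC9BBF4AA301"));

    prop = aml_package(2);
    aml_append(prop, aml_string("dev-mem-pxm-start"));
    aml_append(prop, aml_int(pxm_start));
    aml_append(props, prop);

    prop = aml_package(2);
    aml_append(prop, aml_string("dev-mem-pxm-count"));
    aml_append(prop, aml_int(pxm_count));
    aml_append(props, prop);

    aml_append(dsd, props);
    aml_append(dev, aml_name_decl("_DSD", dsd));
}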