Re: [PATCH v8 0/3] Poisoned memory recovery on reboot

2025-02-13 Thread William Roche
On 2/11/25 23:35, Peter Xu wrote: On Tue, Feb 11, 2025 at 09:27:04PM +, William Roche wrote: From: William Roche Here is a very simplified version of my fix only dealing with the recovery of huge pages on VM reset. --- This set of patches fixes an existing bug with hardware memory

[PATCH v8 0/3] Poisoned memory recovery on reboot

2025-02-11 Thread William Roche
From: William Roche Here is a very simplified version of my fix only dealing with the recovery of huge pages on VM reset. --- This set of patches fixes an existing bug with hardware memory errors impacting hugetlbfs memory backed VMs and its recovery on VM reset. When using hugetlbfs large

[PATCH v8 1/3] system/physmem: handle hugetlb correctly in qemu_ram_remap()

2025-02-11 Thread William Roche
From: William Roche The list of hwpoison pages used to remap the memory on reset is based on the backend real page size. To correctly handle hugetlb, we must mmap(MAP_FIXED) a complete hugetlb page; hugetlb pages cannot be partially mapped. Signed-off-by: William Roche Co-developed-by: David
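The hugetlb constraint described in this patch can be sketched as follows. This is a minimal illustration, not the QEMU implementation: the names align_down(), remap_range and hugetlb_remap_range() are hypothetical, and the page size is assumed to be a power of two. Given the address of a poisoned location, the range handed to mmap(MAP_FIXED) has to start on a hugetlb page boundary and cover the whole page.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round addr down to a multiple of page_size.
 * Assumes page_size is a non-zero power of two. */
static inline uint64_t align_down(uint64_t addr, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return addr & ~(page_size - 1);
}

/* Hypothetical type: the range that would be remapped. */
struct remap_range {
    uint64_t start;   /* aligned start of the backing page */
    uint64_t length;  /* always one full backing page */
};

/* Compute the full-page range containing a poisoned address.
 * With hugetlb backing, page_size is the huge page size, so the
 * whole huge page is covered rather than a single base page. */
static inline struct remap_range hugetlb_remap_range(uint64_t poisoned_addr,
                                                     uint64_t page_size)
{
    struct remap_range r = { align_down(poisoned_addr, page_size), page_size };
    return r;
}
```

With a 2 MiB huge page, a poisoned address of 0x40001000 yields the range starting at 0x40000000 with a length of 0x200000.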

[PATCH v8 3/3] target/arm/kvm: Report memory errors injection

2025-02-11 Thread William Roche
From: William Roche Generate an x86 similar error injection message on ras enabled ARM platforms. ARM qemu only deals with action required memory errors signaled with SIGBUS/BUS_MCEERR_AR, and will report a message on every memory error relayed to the VM. A message like: Guest Memory Error at

[PATCH v8 2/3] system/physmem: poisoned memory discard on reboot

2025-02-11 Thread William Roche
From: William Roche Repair poisoned memory location(s), calling ram_block_discard_range(): punching a hole in the backend file when necessary and regenerating a usable memory. If the kernel doesn't support the madvise calls used by this function and we are dealing with anonymous memory,

Re: [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page

2025-02-11 Thread William Roche
On 2/10/25 17:48, Peter Xu wrote: On Fri, Feb 07, 2025 at 07:02:22PM +0100, William Roche wrote: [...] So the main reason is a KVM "weakness" with kvm_send_hwpoison_signal(), and the second reason is to have richer error messages. This seems true, and I also remember something whe

Re: [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page

2025-02-07 Thread William Roche
On 2/5/25 18:07, Peter Xu wrote: On Wed, Feb 05, 2025 at 05:27:13PM +0100, William Roche wrote: [...] The HMP command "info ramblock" is implemented with the ram_block_format() function which returns a message buffer built with a string for each ramblock (protected by the RCU_READ_

Re: [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page

2025-02-05 Thread William Roche
On 2/4/25 18:01, Peter Xu wrote: On Sat, Feb 01, 2025 at 09:57:23AM +, William Roche wrote: From: William Roche In case of a large page impacted by a memory error, provide an information about the impacted large page before the memory error injection message. This message would also

Re: [PATCH v7 6/6] hostmem: Handle remapping of RAM

2025-02-05 Thread William Roche
On 2/4/25 21:16, Peter Xu wrote: On Tue, Feb 04, 2025 at 07:55:52PM +0100, David Hildenbrand wrote: Ah, and now I remember where these 3 patches originate from: virtio-mem handling. For virtio-mem I want to register also a remap handler, for example, to perform the custom preallocation handling

Re: [PATCH v7 2/6] system/physmem: poisoned memory discard on reboot

2025-02-05 Thread William Roche
On 2/4/25 18:09, Peter Xu wrote: On Sat, Feb 01, 2025 at 09:57:22AM +, William Roche wrote: From: William Roche Repair poisoned memory location(s), calling ram_block_discard_range(): punching a hole in the backend file when necessary and regenerating a usable memory. If the kernel

[PATCH v7 4/6] numa: Introduce and use ram_block_notify_remap()

2025-02-01 Thread William Roche
From: David Hildenbrand Notify registered listeners about the remap at the end of qemu_ram_remap() so e.g., a memory backend can re-apply its settings correctly. Signed-off-by: David Hildenbrand Signed-off-by: William Roche --- hw/core/numa.c | 11 +++ include/exec/ramlist.h

[PATCH v7 6/6] hostmem: Handle remapping of RAM

2025-02-01 Thread William Roche
From: William Roche Let's register a RAM block notifier and react on remap notifications. Simply re-apply the settings. Exit if something goes wrong. Merging and dump settings are handled by the remap notification in addition to memory policy and preallocation. Co-developed-by:

[PATCH v7 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap()

2025-02-01 Thread William Roche
From: William Roche The list of hwpoison pages used to remap the memory on reset is based on the backend real page size. To correctly handle hugetlb, we must mmap(MAP_FIXED) a complete hugetlb page; hugetlb pages cannot be partially mapped. Signed-off-by: William Roche Co-developed-by: David

[PATCH v7 3/6] accel/kvm: Report the loss of a large memory page

2025-02-01 Thread William Roche
From: William Roche In case of a large page impacted by a memory error, provide an information about the impacted large page before the memory error injection message. This message would also appear on ras enabled ARM platforms, with the introduction of an x86 similar error injection message

[PATCH v7 0/6] Poisoned memory recovery on reboot

2025-02-01 Thread William Roche
From: William Roche Hello David, Here is the version with the small nits corrected. And the 'Acked-by' entries you gave me for patch 1 and 2. --- This set of patches fixes several problems with hardware memory errors impacting hugetlbfs memory backed VMs and the generic memory reco

[PATCH v7 5/6] hostmem: Factor out applying settings

2025-02-01 Thread William Roche
From: David Hildenbrand We want to reuse the functionality when remapping RAM. Signed-off-by: David Hildenbrand Signed-off-by: William Roche --- backends/hostmem.c | 155 - 1 file changed, 82 insertions(+), 73 deletions(-) diff --git a/backends

[PATCH v7 2/6] system/physmem: poisoned memory discard on reboot

2025-02-01 Thread William Roche
From: William Roche Repair poisoned memory location(s), calling ram_block_discard_range(): punching a hole in the backend file when necessary and regenerating a usable memory. If the kernel doesn't support the madvise calls used by this function and we are dealing with anonymous memory,

Re: [PATCH v6 3/6] accel/kvm: Report the loss of a large memory page

2025-02-01 Thread William Roche
On 1/30/25 18:02, David Hildenbrand wrote: On 27.01.25 22:31, William Roche wrote: From: William Roche In case of a large page impacted by a memory error, provide an information about the impacted large page before the memory error injection message. This message would also appear on ras

[PATCH v6 6/6] hostmem: Handle remapping of RAM

2025-01-27 Thread William Roche
From: William Roche Let's register a RAM block notifier and react on remap notifications. Simply re-apply the settings. Exit if something goes wrong. Merging and dump settings are handled by the remap notification in addition to memory policy and preallocation. Co-developed-by:

[PATCH v6 4/6] numa: Introduce and use ram_block_notify_remap()

2025-01-27 Thread William Roche
From: David Hildenbrand Notify registered listeners about the remap at the end of qemu_ram_remap() so e.g., a memory backend can re-apply its settings correctly. Signed-off-by: David Hildenbrand Signed-off-by: William Roche --- hw/core/numa.c | 11 +++ include/exec/ramlist.h

[PATCH v6 3/6] accel/kvm: Report the loss of a large memory page

2025-01-27 Thread William Roche
From: William Roche In case of a large page impacted by a memory error, provide an information about the impacted large page before the memory error injection message. This message would also appear on ras enabled ARM platforms, with the introduction of an x86 similar error injection message

[PATCH v6 5/6] hostmem: Factor out applying settings

2025-01-27 Thread William Roche
From: David Hildenbrand We want to reuse the functionality when remapping RAM. Signed-off-by: David Hildenbrand Signed-off-by: William Roche --- backends/hostmem.c | 155 - 1 file changed, 82 insertions(+), 73 deletions(-) diff --git a/backends

[PATCH v6 2/6] system/physmem: poisoned memory discard on reboot

2025-01-27 Thread William Roche
From: William Roche Repair poisoned memory location(s), calling ram_block_discard_range(): punching a hole in the backend file when necessary and regenerating a usable memory. If the kernel doesn't support the madvise calls used by this function and we are dealing with anonymous memory,

[PATCH v6 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap()

2025-01-27 Thread William Roche
From: William Roche The list of hwpoison pages used to remap the memory on reset is based on the backend real page size. To correctly handle hugetlb, we must mmap(MAP_FIXED) a complete hugetlb page; hugetlb pages cannot be partially mapped. Co-developed-by: David Hildenbrand Signed-off-by

[PATCH v6 0/6] Poisoned memory recovery on reboot

2025-01-27 Thread William Roche
From: William Roche Hello David, I'm back on this topic. --- This set of patches fixes several problems with hardware memory errors impacting hugetlbfs memory backed VMs and the generic memory recovery on VM reset. When using hugetlbfs large pages, any large page location being impacted

Re: [PATCH v5 6/6] hostmem: Handle remapping of RAM

2025-01-27 Thread William Roche
On 1/14/25 15:11, David Hildenbrand wrote: On 10.01.25 22:14, William Roche wrote: From: David Hildenbrand You can make yourself the author and just make me a Co-developed-by here. LGTM! Ok done. Thanks.

Re: [PATCH v5 0/6] Poisoned memory recovery on reboot

2025-01-27 Thread William Roche
On 1/14/25 15:12, David Hildenbrand wrote: On 10.01.25 22:13, William Roche wrote: From: William Roche Hello David, I'm keeping the description of the patch set you already reviewed: Hi, one request, can you send it out next time (v6) *not* as reply to the previous thread, but just

Re: [PATCH v5 3/6] accel/kvm: Report the loss of a large memory page

2025-01-27 Thread William Roche
On 1/14/25 15:09, David Hildenbrand wrote: On 10.01.25 22:14, William Roche wrote: From: William Roche In case of a large page impacted by a memory error, enhance the existing Qemu error message which indicates that the error is injected in the VM, adding "on lost large page SIZE

Re: [PATCH v5 2/6] system/physmem: poisoned memory discard on reboot

2025-01-27 Thread William Roche
On 1/14/25 15:07, David Hildenbrand wrote: On 10.01.25 22:14, William Roche wrote: From: William Roche Repair poisoned memory location(s), calling ram_block_discard_range(): punching a hole in the backend file when necessary and regenerating a usable memory. If the kernel doesn't suppor

Re: [PATCH v4 2/7] system/physmem: poisoned memory discard on reboot

2025-01-27 Thread William Roche
On 1/14/25 15:00, David Hildenbrand wrote: If we can get the current set of fixes integrated, I'll submit another fix proposal to take the fd_offset into account in a second time. (Not enlarging the current set) But here is what I'm thinking about. That we can discuss later if you want: @@ -3

Re: [PATCH v5 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap()

2025-01-27 Thread William Roche
On 1/14/25 15:02, David Hildenbrand wrote: On 10.01.25 22:14, William Roche wrote: From: William Roche The list of hwpoison pages used to remap the memory on reset is based on the backend real page size. When dealing with hugepages, we create a single entry for the entire page. To correctly

[PATCH v3 0/1] fallocate missing fd_offset

2025-01-22 Thread William Roche
From: William Roche Working on the poisoned memory recovery mechanisms with David Hildenbrand, it appeared that the file hole punching done with the memory discard functions are missing the file offset value fd_offset to correctly modify the right file location. Note that guest_memfd would not

[PATCH v3 1/1] system/physmem: take into account fd_offset for file fallocate

2025-01-22 Thread William Roche
From: William Roche Punching a hole in a file with fallocate needs to take into account the fd_offset value for a correct file location. But guest_memfd internal use doesn't currently consider fd_offset. Fixes: 4b870dc4d0c0 ("hostmem-file: add offset option") Signed-off-by
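A minimal sketch of the offset handling this patch describes, assuming Linux and glibc. block_to_file_offset() and punch_hole() are hypothetical names used for illustration and are not QEMU functions; fallocate() with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE is the standard Linux call for punching a hole. The point is that when a RAM block maps a file starting at fd_offset, an offset within the block has to be shifted by fd_offset before it is used as a file offset.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for fallocate() and the FALLOC_FL_* flags */
#endif
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical helper: translate an offset inside the RAM block into
 * an offset inside the backing file. fd_offset is where the block's
 * mapping starts in the file. */
static inline off_t block_to_file_offset(uint64_t start, uint64_t fd_offset)
{
    return (off_t)(start + fd_offset);
}

/* Hypothetical wrapper: punch a hole of 'length' bytes for the block
 * range beginning at 'start'. Returns 0 on success, -1 with errno set
 * on failure, as fallocate() does. */
static inline int punch_hole(int fd, uint64_t start, uint64_t fd_offset,
                             uint64_t length)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     block_to_file_offset(start, fd_offset), (off_t)length);
}
```

Without the shift, a block mapped at a non-zero file offset would have the hole punched at the wrong place in the file, which is the failure mode the patch description points at.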

Re: [PATCH v2 1/1] system/physmem: take into account fd_offset for file fallocate

2025-01-22 Thread William Roche
On 1/22/25 09:01, David Hildenbrand wrote: On 21.01.25 23:54, William Roche wrote: From: William Roche [...] --- a/system/physmem.c +++ b/system/physmem.c @@ -3655,6 +3655,7 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)   need_madvise = (rb->page_s

[PATCH v2 1/1] system/physmem: take into account fd_offset for file fallocate

2025-01-21 Thread William Roche
From: William Roche Punching a hole in a file with fallocate needs to take into account the fd_offset value for a correct file location. But guest_memfd internal use doesn't currently consider fd_offset. Fixes: 4b870dc4d0c0 ("hostmem-file: add offset option") Signed-off-by

[PATCH v2 0/1] fallocate missing fd_offset

2025-01-21 Thread William Roche
From: William Roche Working on the poisoned memory recovery mechanisms with David Hildenbrand, it appeared that the file hole punching done with the memory discard functions are missing the file offset value fd_offset to correctly modify the right file location. Note that guest_memfd would not

Re: [PATCH 1/1] system/physmem: take into account fd_offset for file fallocate

2025-01-21 Thread William Roche
Thank you Peter and David for your feedback. On 1/21/25 19:25, David Hildenbrand wrote: On 21.01.25 19:17, Peter Xu wrote: On Tue, Jan 21, 2025 at 05:59:56PM +, William Roche wrote: From: William Roche Punching a hole in a file with fallocate needs to take into account the fd_offset

[PATCH 1/1] system/physmem: take into account fd_offset for file fallocate

2025-01-21 Thread William Roche
From: William Roche Punching a hole in a file with fallocate needs to take into account the fd_offset value for a correct file location. Fixes: 4b870dc4d0c0 ("hostmem-file: add offset option") Signed-off-by: William Roche --- system/physmem.c | 14 -- 1 file changed, 8

[PATCH 0/1] fallocate missing fd_offset

2025-01-21 Thread William Roche
From: William Roche Working on the poisoned memory recovery mechanisms with David Hildenbrand, it appeared that the file hole punching done with the memory discard functions are missing the file offset value fd_offset to correctly modify the right file location. I'm not sure that guest_

[PATCH v5 6/6] hostmem: Handle remapping of RAM

2025-01-10 Thread William Roche
David Hildenbrand Signed-off-by: William Roche --- backends/hostmem.c | 34 ++ include/system/hostmem.h | 1 + system/physmem.c | 4 3 files changed, 35 insertions(+), 4 deletions(-) diff --git a/backends/hostmem.c b/backends/hostmem.c index 46d80

[PATCH v5 4/6] numa: Introduce and use ram_block_notify_remap()

2025-01-10 Thread William Roche
From: David Hildenbrand Notify registered listeners about the remap at the end of qemu_ram_remap() so e.g., a memory backend can re-apply its settings correctly. Signed-off-by: David Hildenbrand Signed-off-by: William Roche --- hw/core/numa.c | 11 +++ include/exec/ramlist.h

[PATCH v5 3/6] accel/kvm: Report the loss of a large memory page

2025-01-10 Thread William Roche
From: William Roche In case of a large page impacted by a memory error, enhance the existing Qemu error message which indicates that the error is injected in the VM, adding "on lost large page SIZE@ADDR". Include also a similar message to the ARM platform. In the case of a large pag

[PATCH v5 5/6] hostmem: Factor out applying settings

2025-01-10 Thread William Roche
From: David Hildenbrand We want to reuse the functionality when remapping RAM. Signed-off-by: David Hildenbrand Signed-off-by: William Roche --- backends/hostmem.c | 155 - 1 file changed, 82 insertions(+), 73 deletions(-) diff --git a/backends

[PATCH v5 2/6] system/physmem: poisoned memory discard on reboot

2025-01-10 Thread William Roche
From: William Roche Repair poisoned memory location(s), calling ram_block_discard_range(): punching a hole in the backend file when necessary and regenerating a usable memory. If the kernel doesn't support the madvise calls used by this function and we are dealing with anonymous memory,

[PATCH v5 0/6] Poisoned memory recovery on reboot

2025-01-10 Thread William Roche
From: William Roche Hello David, I'm keeping the description of the patch set you already reviewed: --- This set of patches fixes several problems with hardware memory errors impacting hugetlbfs memory backed VMs and the generic memory recovery on VM reset. When using hugetlbfs large

[PATCH v5 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap()

2025-01-10 Thread William Roche
From: William Roche The list of hwpoison pages used to remap the memory on reset is based on the backend real page size. When dealing with hugepages, we create a single entry for the entire page. To correctly handle hugetlb, we must mmap(MAP_FIXED) a complete hugetlb page; hugetlb pages cannot

Re: [PATCH v4 1/7] hwpoison_page_list and qemu_ram_remap are based on pages

2025-01-10 Thread William Roche
On 1/8/25 22:34, David Hildenbrand wrote: On 14.12.24 14:45, William Roche wrote: From: William Roche Subject should likely start with "system/physmem:". Maybe "system/physmem: handle hugetlb correctly in qemu_ram_remap()" I updated the commit title The list of

Re: [PATCH v4 5/7] hostmem: Factor out applying settings

2025-01-10 Thread William Roche
On 1/8/25 22:58, David Hildenbrand wrote: On 14.12.24 14:45, William Roche wrote: From: David Hildenbrand We want to reuse the functionality when remapping or resizing RAM. We should drop the "or resizing of RAM." part, as that does no longer apply. Commit message corrected.

Re: [PATCH v4 7/7] system/physmem: Memory settings applied on remap notification

2025-01-10 Thread William Roche
On 1/8/25 22:53, David Hildenbrand wrote: On 14.12.24 14:45, William Roche wrote: From: William Roche Merging and dump settings are handled by the remap notification in addition to memory policy and preallocation. Signed-off-by: William Roche ---   system/physmem.c | 2 --   1 file changed

Re: [PATCH v4 6/7] hostmem: Handle remapping of RAM

2025-01-10 Thread William Roche
On 1/8/25 22:51, David Hildenbrand wrote: On 14.12.24 14:45, William Roche wrote: From: David Hildenbrand Let's register a RAM block notifier and react on remap notifications. Simply re-apply the settings. Exit if something goes wrong. Note: qemu_ram_remap() will not remap when RAM_PRE

Re: [PATCH v4 2/7] system/physmem: poisoned memory discard on reboot

2025-01-10 Thread William Roche
On 1/8/25 22:44, David Hildenbrand wrote: On 14.12.24 14:45, William Roche wrote: +/* Try to simply remap the given location */ +static void qemu_ram_remap_mmap(RAMBlock *block, void* vaddr, size_t size, +    ram_addr_t offset) Can you make the parameters match

Re: [PATCH v4 0/7] Poisoned memory recovery on reboot

2025-01-10 Thread William Roche
On 1/8/25 22:22, David Hildenbrand wrote: On 14.12.24 14:45, William Roche wrote: From: William Roche Hello David, Hi! Let me start reviewing today a bit (it's already late, and I'll continue tomorrow. Here is a new version of our code and an updated description of the

[PATCH v4 7/7] system/physmem: Memory settings applied on remap notification

2024-12-14 Thread William Roche
From: William Roche Merging and dump settings are handled by the remap notification in addition to memory policy and preallocation. Signed-off-by: William Roche --- system/physmem.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/system/physmem.c b/system/physmem.c index 9fc74a5699

[PATCH v4 2/7] system/physmem: poisoned memory discard on reboot

2024-12-14 Thread William Roche
From: William Roche Repair poisoned memory location(s), calling ram_block_discard_range(): punching a hole in the backend file when necessary and regenerating a usable memory. If the kernel doesn't support the madvise calls used by this function and we are dealing with anonymous memory,

[PATCH v4 6/7] hostmem: Handle remapping of RAM

2024-12-14 Thread William Roche
ff-by: David Hildenbrand Signed-off-by: William Roche --- backends/hostmem.c | 34 ++ include/sysemu/hostmem.h | 1 + 2 files changed, 35 insertions(+) diff --git a/backends/hostmem.c b/backends/hostmem.c index bf85d716e5..863f6da11d 100644 --- a/bac

[PATCH v4 4/7] numa: Introduce and use ram_block_notify_remap()

2024-12-14 Thread William Roche
From: David Hildenbrand Notify registered listeners about the remap at the end of qemu_ram_remap() so e.g., a memory backend can re-apply its settings correctly. Signed-off-by: David Hildenbrand Signed-off-by: William Roche --- hw/core/numa.c | 11 +++ include/exec/ramlist.h

[PATCH v4 5/7] hostmem: Factor out applying settings

2024-12-14 Thread William Roche
From: David Hildenbrand We want to reuse the functionality when remapping or resizing RAM. Signed-off-by: David Hildenbrand Signed-off-by: William Roche --- backends/hostmem.c | 155 - 1 file changed, 82 insertions(+), 73 deletions(-) diff --git a

[PATCH v4 3/7] accel/kvm: Report the loss of a large memory page

2024-12-14 Thread William Roche
From: William Roche In case of a large page impacted by a memory error, enhance the existing Qemu error message which indicates that the error is injected in the VM, adding "on lost large page SIZE@ADDR". Include also a similar message to the ARM platform. In the case of a large pag

[PATCH v4 0/7] Poisoned memory recovery on reboot

2024-12-14 Thread William Roche
From: William Roche Hello David, Here is a new version of our code and an updated description of the patch set: --- This set of patches fixes several problems with hardware memory errors impacting hugetlbfs memory backed VMs and the generic memory recovery on VM reset. When using hugetlbfs

[PATCH v4 1/7] hwpoison_page_list and qemu_ram_remap are based on pages

2024-12-14 Thread William Roche
From: William Roche The list of hwpoison pages used to remap the memory on reset is based on the backend real page size. When dealing with hugepages, we create a single entry for the entire page. Co-developed-by: David Hildenbrand Signed-off-by: William Roche --- accel/kvm/kvm-all.c

Re: [PATCH v3 0/7] hugetlbfs memory HW error fixes

2024-12-06 Thread William Roche
On 12/3/24 16:00, David Hildenbrand wrote: On 03.12.24 15:39, William Roche wrote: [...] Our new Qemu code is testing first the fallocate+MADV_DONTNEED procedure for standard sized pages (in ram_block_discard_range()) and only folds back to the mmap() use if it fails. So maybe my proposal to

Re: [PATCH v3 0/7] hugetlbfs memory HW error fixes

2024-12-03 Thread William Roche
On 12/3/24 15:08, David Hildenbrand wrote: [...] Let me take a look at your tool below if I can find an explanation of what is happening, because it's weird :) [...] At the end of this email, I included the source code of a simplistic test case that shows that the page is replaced in the c

Re: [PATCH v3 0/7] hugetlbfs memory HW error fixes

2024-12-02 Thread William Roche
On 12/2/24 17:00, David Hildenbrand wrote: On 02.12.24 16:41, William Roche wrote: Hello David, Hi, sorry for reviewing yet, I was rather sick the last 1.5 weeks. I hope you get well soon! I've finally tested many page mapping possibilities and tried to identify the error inje

Re: [PATCH v3 0/7] hugetlbfs memory HW error fixes

2024-12-02 Thread William Roche
Hello David, I've finally tested many page mapping possibilities and tried to identify the error injection reaction on these pages to see if mmap() can be used to recover the impacted area. I'm using the latest upstream kernel I have for that: 6.12.0-rc7.master.20241117.ol9.x86_64 But I also g

[PATCH v3 4/7] numa: Introduce and use ram_block_notify_remap()

2024-11-25 Thread William Roche
From: David Hildenbrand Notify registered listeners about the remap at the end of qemu_ram_remap() so e.g., a memory backend can re-apply its settings correctly. Signed-off-by: David Hildenbrand Signed-off-by: William Roche --- hw/core/numa.c | 11 +++ include/exec/ramlist.h

[PATCH v3 3/7] accel/kvm: Report the loss of a large memory page

2024-11-25 Thread William Roche
From: William Roche In case of a large page impacted by a memory error, complete the existing Qemu error message to indicate that the error is injected in the VM. Also include a similar message to the ARM platform. Only in the case of a large page impacted, we now report: ...Memory Error at QEMU

[PATCH v3 6/7] hostmem: Handle remapping of RAM

2024-11-25 Thread William Roche
ff-by: David Hildenbrand Signed-off-by: William Roche --- backends/hostmem.c | 34 ++ include/sysemu/hostmem.h | 1 + 2 files changed, 35 insertions(+) diff --git a/backends/hostmem.c b/backends/hostmem.c index bf85d716e5..863f6da11d 100644 --- a/bac

[PATCH v3 7/7] system/physmem: Memory settings applied on remap notification

2024-11-25 Thread William Roche
From: William Roche Merging and dump settings are handled by the remap notification in addition to memory policy and preallocation. If preallocation is set on a memory block, qemu_prealloc_mem() call is needed also after a ram_block_discard_range() use for this block. Signed-off-by: William

[PATCH v3 5/7] hostmem: Factor out applying settings

2024-11-25 Thread William Roche
From: David Hildenbrand We want to reuse the functionality when remapping or resizing RAM. Signed-off-by: David Hildenbrand Signed-off-by: William Roche --- backends/hostmem.c | 155 - 1 file changed, 82 insertions(+), 73 deletions(-) diff --git a

[PATCH v3 2/7] system/physmem: poisoned memory discard on reboot

2024-11-25 Thread William Roche
From: William Roche Repair memory locations, calling ram_block_discard_range(), punching a hole in the backend file when necessary and regenerate a usable memory. Fall back to unmap/remap the memory location(s) if the kernel doesn't support the madvise calls used by ram_block_discard_

[PATCH v3 0/7] hugetlbfs memory HW error fixes

2024-11-25 Thread William Roche
From: William Roche Hi David, Here is a new version of our code, but I still need to double check the mmap behavior in case of a memory error impact on: - a clean page of an empty file or populated file - already mapped using MAP_SHARED or MAP_PRIVATE to see if mmap() can recover the area or

[PATCH v3 1/7] hwpoison_page_list and qemu_ram_remap are based of pages

2024-11-25 Thread William Roche
From: William Roche The list of hwpoison pages used to remap the memory on reset is based on the backend real page size. When dealing with hugepages, we create a single entry for the entire page. Co-developed-by: David Hildenbrand Signed-off-by: William Roche --- accel/kvm/kvm-all.c

Re: [PATCH v2 3/7] accel/kvm: Report the loss of a large memory page

2024-11-15 Thread William Roche
On 12.11.24 19:17, William Roche wrote: On 11/12/24 12:13, David Hildenbrand wrote: On 07.11.24 11:21, William Roche wrote: From: William Roche When an entire large page is impacted by an error (hugetlbfs case), report better the size and location of this large memory hole, so give a wa

Re: [PATCH v2 3/7] accel/kvm: Report the loss of a large memory page

2024-11-12 Thread William Roche
On 11/12/24 12:13, David Hildenbrand wrote: On 07.11.24 11:21, William Roche wrote: From: William Roche When an entire large page is impacted by an error (hugetlbfs case), report better the size and location of this large memory hole, so give a warning message when this page is first hit

Re: [PATCH v2 2/7] system/physmem: poisoned memory discard on reboot

2024-11-12 Thread William Roche
On 11/12/24 12:07, David Hildenbrand wrote: On 07.11.24 11:21, William Roche wrote: From: William Roche We take into account the recorded page sizes to repair the memory locations, calling ram_block_discard_range() to punch a hole in the backend file when necessary and regenerate a usable

Re: [PATCH v2 1/7] accel/kvm: Keep track of the HWPoisonPage page_size

2024-11-12 Thread William Roche
On 11/12/24 11:30, David Hildenbrand wrote: On 07.11.24 11:21, William Roche wrote: From: William Roche When a memory page is added to the hwpoison_page_list, include the page size information. This size is the backend real page size. To better deal with hugepages, we create a single entry

Re: [PATCH v2 6/7] hostmem: Handle remapping of RAM

2024-11-12 Thread William Roche
On 11/12/24 14:45, David Hildenbrand wrote: On 07.11.24 11:21, William Roche wrote: From: David Hildenbrand Let's register a RAM block notifier and react on remap notifications. Simply re-apply the settings. Warn only when something goes wrong. Note: qemu_ram_remap() will not remap

[PATCH v2 1/7] accel/kvm: Keep track of the HWPoisonPage page_size

2024-11-07 Thread William Roche
From: William Roche When a memory page is added to the hwpoison_page_list, include the page size information. This size is the backend real page size. To better deal with hugepages, we create a single entry for the entire page. Signed-off-by: William Roche --- accel/kvm/kvm-all.c | 8

[PATCH v2 5/7] hostmem: Factor out applying settings

2024-11-07 Thread William Roche
From: David Hildenbrand We want to reuse the functionality when remapping or resizing RAM. Signed-off-by: David Hildenbrand Signed-off-by: William Roche --- backends/hostmem.c | 155 - 1 file changed, 82 insertions(+), 73 deletions(-) diff --git a

[PATCH v2 6/7] hostmem: Handle remapping of RAM

2024-11-07 Thread William Roche
igned-off-by: David Hildenbrand Signed-off-by: William Roche --- backends/hostmem.c | 29 + include/sysemu/hostmem.h | 1 + 2 files changed, 30 insertions(+) diff --git a/backends/hostmem.c b/backends/hostmem.c index bf85d716e5..fbd8708664 100644 --- a/bac

[PATCH v2 2/7] system/physmem: poisoned memory discard on reboot

2024-11-07 Thread William Roche
From: William Roche We take into account the recorded page sizes to repair the memory locations, calling ram_block_discard_range() to punch a hole in the backend file when necessary and regenerate a usable memory. Fall back to unmap/remap the memory location(s) if the kernel doesn't suppor

[PATCH v2 7/7] system/physmem: Memory settings applied on remap notification

2024-11-07 Thread William Roche
From: William Roche Merging and dump settings are handled by the remap notification in addition to memory policy and preallocation. If preallocation is set on a memory block, qemu_prealloc_mem() call is needed also after a ram_block_discard_range() use for this block. Signed-off-by: William

[PATCH v2 3/7] accel/kvm: Report the loss of a large memory page

2024-11-07 Thread William Roche
From: William Roche When an entire large page is impacted by an error (hugetlbfs case), report better the size and location of this large memory hole, so give a warning message when this page is first hit: Memory error: Loosing a large page (size: X) at QEMU addr Y and GUEST addr Z Signed-off

[PATCH v2 4/7] numa: Introduce and use ram_block_notify_remap()

2024-11-07 Thread William Roche
From: David Hildenbrand Notify registered listeners about the remap at the end of qemu_ram_remap() so e.g., a memory backend can re-apply its settings correctly. Signed-off-by: David Hildenbrand Signed-off-by: William Roche --- hw/core/numa.c | 11 +++ include/exec/ramlist.h

[PATCH v2 0/7] hugetlbfs memory HW error fixes

2024-11-07 Thread William Roche
From: William Roche Hi David, Here is an updated description of the patch set: --- This set of patches fixes several problems with hardware memory errors impacting hugetlbfs memory backed VMs. When using hugetlbfs large pages, any large page location being impacted by an HW memory error

Re: [PATCH v1 2/4] accel/kvm: Keep track of the HWPoisonPage page_size

2024-10-29 Thread William Roche
On 10/28/24 17:42, David Hildenbrand wrote: On 26.10.24 01:27, William Roche wrote: On 10/23/24 09:28, David Hildenbrand wrote: On 22.10.24 23:35, William Roche wrote: From: William Roche Add the page size information to the hwpoison_page_list elements. As the kernel doesn't always r

Re: [PATCH v1 3/4] system/physmem: Largepage punch hole before reset of memory pages

2024-10-29 Thread William Roche
On 10/28/24 18:01, David Hildenbrand wrote: On 26.10.24 01:27, William Roche wrote: On 10/23/24 09:30, David Hildenbrand wrote: On 22.10.24 23:35, William Roche wrote: From: William Roche When the VM reboots, a memory reset is performed calling qemu_ram_remap() on all hwpoisoned pages

Re: [PATCH v1 2/4] accel/kvm: Keep track of the HWPoisonPage page_size

2024-10-25 Thread William Roche
On 10/23/24 09:28, David Hildenbrand wrote: On 22.10.24 23:35, William Roche wrote: From: William Roche Add the page size information to the hwpoison_page_list elements. As the kernel doesn't always report the actual poisoned page size, we adjust this size from the backend real page siz

Re: [PATCH v1 3/4] system/physmem: Largepage punch hole before reset of memory pages

2024-10-25 Thread William Roche
On 10/23/24 09:30, David Hildenbrand wrote: On 22.10.24 23:35, William Roche wrote: From: William Roche When the VM reboots, a memory reset is performed calling qemu_ram_remap() on all hwpoisoned pages. While we take into account the recorded page sizes to repair the memory locations, a

Re: [PATCH v1 2/4] accel/kvm: Keep track of the HWPoisonPage page_size

2024-10-25 Thread William Roche
On 10/23/24 09:28, David Hildenbrand wrote: On 22.10.24 23:35, William Roche wrote: From: William Roche Add the page size information to the hwpoison_page_list elements. As the kernel doesn't always report the actual poisoned page size, we adjust this size from the backend real page

[PATCH v1 4/4] accel/kvm: Report the loss of a large memory page

2024-10-22 Thread William Roche
From: William Roche On HW memory error, we need to report better what the impact of this error is. So when an entire large page is impacted by an error (like the hugetlbfs case), we give a warning message when this page is first hit: Memory error: Loosing a large page (size: X) at QEMU addr Y

[PATCH v1 3/4] system/physmem: Largepage punch hole before reset of memory pages

2024-10-22 Thread William Roche
From: William Roche When the VM reboots, a memory reset is performed calling qemu_ram_remap() on all hwpoisoned pages. While we take into account the recorded page sizes to repair the memory locations, a large page also needs to punch a hole in the backend file to regenerate a usable memory

[PATCH v1 0/4] hugetlbfs memory HW error fixes

2024-10-22 Thread William Roche
From: William Roche This set of patches fixes several problems with hardware memory errors impacting hugetlbfs memory backed VMs. When using hugetlbfs large pages, any large page location being impacted by an HW memory error results in poisoning the entire page, suddenly making a large chunk of

[PATCH v1 1/4] accel/kvm: SIGBUS handler should also deal with si_addr_lsb

2024-10-22 Thread William Roche
From: William Roche The SIGBUS signal siginfo reporting a HW memory error provides a si_addr_lsb field with an indication of the impacted memory page size. This information should be used to track the hwpoisoned page sizes. Signed-off-by: William Roche --- accel/kvm/kvm-all.c| 6
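As a rough illustration of the si_addr_lsb idea in the snippet above (a sketch only, not the code from the patch): on Linux, a SIGBUS raised for a hardware memory error (BUS_MCEERR_AR or BUS_MCEERR_AO) carries si_addr_lsb, the least significant bit of the reported address, so the affected range spans 2^si_addr_lsb bytes. The helper name poison_size_from_lsb is hypothetical and does not appear in QEMU.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patch: convert the si_addr_lsb
 * value reported in a SIGBUS siginfo into a size in bytes.
 * si_addr_lsb is the least significant bit of the reported address,
 * so the affected range is 2^lsb bytes (12 -> 4 KiB, 21 -> 2 MiB).
 */
static inline uint64_t poison_size_from_lsb(unsigned int lsb)
{
    assert(lsb < 64); /* shifting a 64-bit value by 64 or more is undefined */
    return UINT64_C(1) << lsb;
}
```

A SIGBUS handler would call this with the si_addr_lsb field of its siginfo_t argument. The snippets elsewhere in this thread note that the kernel does not always report the actual poisoned page size, so the result should be treated as a hint rather than the final size.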

[PATCH v1 2/4] accel/kvm: Keep track of the HWPoisonPage page_size

2024-10-22 Thread William Roche
From: William Roche Add the page size information to the hwpoison_page_list elements. As the kernel doesn't always report the actual poisoned page size, we adjust this size from the backend real page size. We take into account the recorded page size to adjust the size and location of the m
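One plausible reading of "adjust the size and location" above, sketched under assumptions: once the backend page size is known, the poisoned address is rounded down to a page boundary and the whole page is recorded. The struct and function names below are illustrative only and are not QEMU's.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record of a poisoned range; field names are illustrative. */
struct poison_range {
    uint64_t start; /* address rounded down to the backend page boundary */
    uint64_t size;  /* backend page size in bytes */
};

/*
 * Hypothetical helper: given a faulting address and the backend page size
 * (assumed to be a power of two, e.g. 4 KiB or 2 MiB), return the whole
 * page that contains the address.
 */
static inline struct poison_range poison_range_for(uint64_t addr,
                                                   uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    struct poison_range r = { addr & ~(page_size - 1), page_size };
    return r;
}
```

With a 2 MiB hugetlbfs page, an error reported at 0x40201000 would map to the range starting at 0x40200000, which matches the cover letter's point that a single error poisons the entire large page.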

Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

2024-10-10 Thread William Roche
On 10/9/24 17:45, Peter Xu wrote: On Thu, Sep 19, 2024 at 06:52:37PM +0200, William Roche wrote: Hello David, I hope my last week email answered your interrogations about:     - retrieving the valid data from the lost hugepage     - the need of smaller pages to replace a failed large page

Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

2024-09-19 Thread William Roche
Hello David, I hope my last week email answered your interrogations about:     - retrieving the valid data from the lost hugepage     - the need of smaller pages to replace a failed large page     - the interaction of memory error and VM migration     - the non-symmetrical access to a poisoned me

Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

2024-09-12 Thread William Roche
On 9/12/24 00:07, David Hildenbrand wrote: Hi again, This is a Qemu RFC to introduce the possibility to deal with hardware memory errors impacting hugetlbfs memory backed VMs. When using hugetlbfs large pages, any large page location being impacted by an HW memory error results in poisoning th

Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

2024-09-10 Thread William Roche
On 9/10/24 13:36, David Hildenbrand wrote: On 10.09.24 12:02, William Roche wrote: From: William Roche Hi, Apologies for the noise; resending as I missed CC'ing the maintainers of the changed files Hello, This is a Qemu RFC to introduce the possibility to deal with hardware m

[RFC RESEND 0/6] hugetlbfs largepage RAS project

2024-09-10 Thread William Roche
From: William Roche Apologies for the noise; resending as I missed CC'ing the maintainers of the changed files Hello, This is a Qemu RFC to introduce the possibility to deal with hardware memory errors impacting hugetlbfs memory backed VMs. When using hugetlbfs large pages, any large
