Re: [PATCH v4 1/2] cxl/core: introduce device reporting poison hanlding

2024-08-21 Thread Shiyang Ruan via
On 2024/8/9 2:28, Fan Ni wrote: On Thu, Aug 08, 2024 at 11:13:27PM +0800, Shiyang Ruan wrote: CXL device can find&report memory problems, even before MCE is detected by CPU. AFAIK, the current kernel only traces POISON error event from FW-First/OS-First path, but it doesn't handle them, neither

[PATCH v4 0/2] cxl: add device reporting poison handler

2024-08-08 Thread Shiyang Ruan via
This patchset includes "cxl/core: introduce poison creation hanlding" and "cxl: avoid duplicated report from MCE & device", which were posted separately. Here are changes since last version of each patch: P1: 1. since its async memory_failure(), set the flag to 0 2. also handle CXL_EVENT_TRA

[PATCH v4 2/2] cxl: avoid duplicated report from MCE & device

2024-08-08 Thread Shiyang Ruan via
Since CXL device is a memory device, while CPU is consuming a poison page of CXL device, it always triggers a MCE (via interrupt #18) and calls memory_failure() to handle POISON page, no matter which-First path is configured. CXL device could also find and report the POISON, kernel now not only tr

[PATCH v4 1/2] cxl/core: introduce device reporting poison hanlding

2024-08-08 Thread Shiyang Ruan via
CXL device can find&report memory problems, even before MCE is detected by CPU. AFAIK, the current kernel only traces POISON error event from FW-First/OS-First path, but it doesn't handle them, neither notify processes who are using the POISON page like MCE does. Thus, user have to read logs from

Re: [RFC PATCH] cxl: avoid duplicating report from MCE & device

2024-07-22 Thread Shiyang Ruan via
On 2024/7/20 0:04, Dave Jiang wrote: On 7/1/24 7:12 PM, Shiyang Ruan wrote: On 2024/6/25 21:56, Shiyang Ruan wrote: On 2024/6/22 1:51, Dan Williams wrote: Shiyang Ruan wrote: Background: Since CXL device is a memory device, while CPU consumes a poison page of CXL device, it always triggers a MCE

Re: [RFC PATCH] cxl: avoid duplicating report from MCE & device

2024-07-19 Thread Shiyang Ruan via
On 2024/6/19 0:53, Shiyang Ruan wrote: This patch adds a new notifier_block and MCE_PRIO_CXL, for CXL memdev to check whether the current poison page has been reported (if yes, stop the notifier chain, won't call the following memory_failure() to report), into `x86_mce_decoder_chain`. In this way
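The de-duplication idea in this message (a higher-priority CXL notifier that stops the chain when the device has already reported the page) can be modelled in plain userspace C. This is only a rough sketch under stated assumptions, not the patch's code: the return codes, the fixed-size table, and every function name below are hypothetical stand-ins, and the real `x86_mce_decoder_chain` uses the kernel notifier API instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical return codes for this model (not the kernel's NOTIFY_* values). */
enum { CHAIN_CONTINUE = 0, CHAIN_STOP = 1 };

struct handler {
    int priority;                 /* higher priority runs first */
    int (*call)(uint64_t pfn);
};

/* Pages the (simulated) CXL device has already reported as poisoned. */
#define MAX_REPORTED 16
static uint64_t reported_pfns[MAX_REPORTED];
static size_t reported_count;

/* Counts how often the (simulated) memory_failure() step is reached. */
static int memory_failure_calls;

static void device_reports_poison(uint64_t pfn)
{
    if (reported_count < MAX_REPORTED)
        reported_pfns[reported_count++] = pfn;
}

/* Stand-in for the CXL notifier: stop the chain if the page is known. */
static int cxl_handler(uint64_t pfn)
{
    for (size_t i = 0; i < reported_count; i++)
        if (reported_pfns[i] == pfn)
            return CHAIN_STOP;
    return CHAIN_CONTINUE;
}

/* Stand-in for the later handler that would call memory_failure(). */
static int memory_failure_handler(uint64_t pfn)
{
    (void)pfn;
    memory_failure_calls++;
    return CHAIN_CONTINUE;
}

/* Chain kept sorted by descending priority, CXL check first. */
static const struct handler chain[] = {
    { 2, cxl_handler },
    { 1, memory_failure_handler },
};

/* Walk the chain for one machine-check event on the given page. */
static void handle_mce(uint64_t pfn)
{
    for (size_t i = 0; i < sizeof(chain) / sizeof(chain[0]); i++)
        if (chain[i].call(pfn) == CHAIN_STOP)
            break;
}
```

In this model, calling `device_reports_poison(pfn)` before `handle_mce(pfn)` keeps the second report from reaching the memory-failure step, while an MCE on a page the device never reported still goes through.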

Re: [RFC PATCH] cxl: avoid duplicating report from MCE & device

2024-07-01 Thread Shiyang Ruan via
On 2024/6/25 21:56, Shiyang Ruan wrote: On 2024/6/22 1:51, Dan Williams wrote: Shiyang Ruan wrote: Background: Since CXL device is a memory device, while CPU consumes a poison page of CXL device, it always triggers a MCE by interrupt (INT18), no matter which-First path is configured. This is the

Re: [RFC PATCH] cxl: avoid duplicating report from MCE & device

2024-06-25 Thread Shiyang Ruan via
On 2024/6/22 4:44, Luck, Tony wrote: So who actually cares about recovering poisoned volatile memory? I'd like to understand more on how significant a use case this is. Whilst I can conjecture that it's an extreme case of wanting to avoid losing the ability to create 1GiB or larger pages due to po

Re: [RFC PATCH] cxl: avoid duplicating report from MCE & device

2024-06-25 Thread Shiyang Ruan via
On 2024/6/22 1:51, Dan Williams wrote: Shiyang Ruan wrote: Background: Since CXL device is a memory device, while CPU consumes a poison page of CXL device, it always triggers a MCE by interrupt (INT18), no matter which-First path is configured. This is the first report. Then currently, in FW-Fi

Re: [RFC PATCH] cxl: avoid duplicating report from MCE & device

2024-06-21 Thread Shiyang Ruan via
On 2024/6/20 23:51, Dave Jiang wrote: On 6/19/24 2:24 AM, Shiyang Ruan wrote: On 2024/6/19 7:35, Dave Jiang wrote: On 6/18/24 9:53 AM, Shiyang Ruan wrote: Background: Since CXL device is a memory device, while CPU consumes a poison page of CXL device, it always triggers a MCE by interrupt (IN

Re: [RFC PATCH] cxl: avoid duplicating report from MCE & device

2024-06-21 Thread Shiyang Ruan via
On 2024/6/21 1:02, Jonathan Cameron wrote: On Wed, 19 Jun 2024 00:53:10 +0800 Shiyang Ruan wrote: Background: Since CXL device is a memory device, while CPU consumes a poison page of CXL device, it always triggers a MCE by interrupt (INT18), no matter which-First path is configured. This is th

Re: [RFC PATCH] cxl: avoid duplicating report from MCE & device

2024-06-19 Thread Shiyang Ruan via
On 2024/6/19 7:35, Dave Jiang wrote: On 6/18/24 9:53 AM, Shiyang Ruan wrote: Background: Since CXL device is a memory device, while CPU consumes a poison page of CXL device, it always triggers a MCE by interrupt (INT18), no matter which-First path is configured. This is the first report. Then

[RFC PATCH] cxl: avoid duplicating report from MCE & device

2024-06-18 Thread Shiyang Ruan via
Background: Since CXL device is a memory device, while CPU consumes a poison page of CXL device, it always triggers a MCE by interrupt (INT18), no matter which-First path is configured. This is the first report. Then currently, in FW-First path, the poison event is transferred according to th

Re: [PATCH v3 2/2] cxl/core: add poison creation event handler

2024-05-28 Thread Shiyang Ruan via
On 2024/5/24 23:15, Shiyang Ruan wrote: On 2024/5/22 14:45, Dan Williams wrote: Shiyang Ruan wrote: [..] My expectation is MF_ACTION_REQUIRED is not appropriate for CXL event reported errors since action is only required for direct consumption events and those need not be reported through the devi

Re: [PATCH v3 2/2] cxl/core: add poison creation event handler

2024-05-24 Thread Shiyang Ruan via
On 2024/5/22 14:45, Dan Williams wrote: Shiyang Ruan wrote: [..] My expectation is MF_ACTION_REQUIRED is not appropriate for CXL event reported errors since action is only required for direct consumption events and those need not be reported through the device event queue. Got it. I'm not very

Re: [PATCH v3 2/2] cxl/core: add poison creation event handler

2024-05-20 Thread Shiyang Ruan via
On 2024/5/3 19:32, Shiyang Ruan wrote: On 2024/4/24 2:40, Dan Williams wrote: Shiyang Ruan wrote: Currently driver only traces cxl events, poison creation (for both vmem and pmem type) on cxl memdev is silent. As it should be. OS needs to be notified then it could handle poison pages in time.

Re: [PATCH v3 1/2] cxl/core: correct length of DPA field masks

2024-05-03 Thread Shiyang Ruan via
On 2024/5/1 5:00, Alison Schofield wrote: On Wed, Apr 17, 2024 at 03:50:52PM +0800, Shiyang Ruan wrote: The length of Physical Address in General Media Event Record/DRAM Event Record is 64-bit, so the field mask should be defined as such length. Otherwise, this causes cxl_general_media and cxl_dr

Re: [PATCH v3 2/2] cxl/core: add poison creation event handler

2024-05-03 Thread Shiyang Ruan via
On 2024/4/24 2:40, Dan Williams wrote: Shiyang Ruan wrote: Currently driver only traces cxl events, poison creation (for both vmem and pmem type) on cxl memdev is silent. As it should be. OS needs to be notified then it could handle poison pages in time. No, it was always the case that late

Re: [PATCH v3 2/2] cxl/core: add poison creation event handler

2024-05-03 Thread Shiyang Ruan via
On 2024/4/24 1:57, Ira Weiny wrote: Shiyang Ruan wrote: Currently driver only traces cxl events, poison creation (for both vmem and pmem type) on cxl memdev is silent. OS needs to be notified then it could handle poison pages in time. Per CXL spec, the device error event could be signaled throu

Re: [PATCH v3 1/2] cxl/core: correct length of DPA field masks

2024-04-25 Thread Shiyang Ruan via
On 2024/4/24 5:04, Ira Weiny wrote: Alison Schofield wrote: On Wed, Apr 17, 2024 at 03:50:52PM +0800, Shiyang Ruan wrote: [snip] diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h index e5f13260fc52..cdfce932d5b1 100644 --- a/drivers/cxl/core/trace.h +++ b/drivers/cxl/core/trace

Re: [PATCH v3 2/2] cxl/core: add poison creation event handler

2024-04-18 Thread Shiyang Ruan via
On 2024/4/18 1:30, Dave Jiang wrote: On 4/17/24 12:50 AM, Shiyang Ruan wrote: Currently driver only traces cxl events, poison creation (for both vmem and pmem type) on cxl memdev is silent. OS needs to be notified then it could handle poison pages in time. Per CXL spec, the device error event

[PATCH v3 2/2] cxl/core: add poison creation event handler

2024-04-17 Thread Shiyang Ruan via
Currently driver only traces cxl events, poison creation (for both vmem and pmem type) on cxl memdev is silent. OS needs to be notified then it could handle poison pages in time. Per CXL spec, the device error event could be signaled through FW-First and OS-First methods. So, add poison creation

[PATCH v3 1/2] cxl/core: correct length of DPA field masks

2024-04-17 Thread Shiyang Ruan via
The length of Physical Address in General Media Event Record/DRAM Event Record is 64-bit, so the field mask should be defined as such length. Otherwise, this causes cxl_general_media and cxl_dram tracepoints to mask off the upper-32-bits of DPA addresses. The cxl_poison event is unaffected. If use
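The effect described in this patch (a too-narrow mask clearing the upper 32 bits of a 64-bit address) can be reproduced in plain C. The macros and helper names below are hypothetical and chosen only for illustration; they are not the definitions from `drivers/cxl/core/trace.h`. The sketch assumes the low 6 bits of the field carry flags, and shows that an `unsigned int` mask is zero-extended before the AND, so everything above bit 31 is lost.

```c
#include <stdint.h>

/* Hypothetical masks for illustration only; not the kernel's definitions. */
#define DPA_FLAGS_MASK_32  0x3FU                  /* 32-bit unsigned constant */
#define DPA_MASK_32        (~DPA_FLAGS_MASK_32)   /* 0xFFFFFFC0, still 32-bit */
#define DPA_MASK_64        (~UINT64_C(0x3F))      /* 0xFFFFFFFFFFFFFFC0 */

/* The 32-bit mask is zero-extended to 64 bits, so bits 63:32 are cleared. */
static uint64_t dpa_with_narrow_mask(uint64_t pa)
{
    return pa & DPA_MASK_32;
}

/* The 64-bit mask clears only the low flag bits and keeps the full address. */
static uint64_t dpa_with_wide_mask(uint64_t pa)
{
    return pa & DPA_MASK_64;
}
```

For a raw field value of `0x240000041` (address `0x240000040` plus one flag bit), the narrow mask yields `0x40000040`, while the wide mask yields the intended `0x240000040`.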

[PATCH v3 0/2] cxl: add poison creation event handler

2024-04-17 Thread Shiyang Ruan via
Changes: RFCv2 -> v3: 1. patch1: removed changes for flags 2. changed the main idea of this patchset: not for injection event handling, but for creation; 3. removed GET_POISON_LIST command while receiving POISON event; 4. dropped poison report in debugfs; 5. added DER event handler to handle P

Re: [RFC PATCH v2 4/6] cxl/core: report poison when injecting from debugfs

2024-04-03 Thread Shiyang Ruan via
On 2024/3/30 9:52, Dan Williams wrote: Shiyang Ruan wrote: Poison injection from debugfs is silent too. Add calling cxl_mem_report_poison() to make it able to do memory_failure(). Why does this needs to be signalled? It is a debug interface, the debugger can also trigger a read after the injec

Re: [RFC PATCH v2 3/6] cxl/core: add report option for cxl_mem_get_poison()

2024-04-03 Thread Shiyang Ruan via
On 2024/3/30 9:50, Dan Williams wrote: Shiyang Ruan wrote: The GMER only has "Physical Address" field, no such one indicates length. So, when a poison event is received, we could use GET_POISON_LIST command to get the poison list. Now driver has cxl_mem_get_poison(), so reuse it and add a parame

Re: [RFC PATCH v2 1/6] cxl/core: correct length of DPA field masks

2024-04-01 Thread Shiyang Ruan via
On 2024/3/30 9:37, Dan Williams wrote: Shiyang Ruan wrote: The length of Physical Address in General Media Event Record/DRAM Event Record is 64-bit, so the field mask should be defined as such length. Otherwise, this causes cxl_general_media and cxl_dram tracepoints to mask off the upper-32-bits

[RFC PATCH v2 3/6] cxl/core: add report option for cxl_mem_get_poison()

2024-03-28 Thread Shiyang Ruan via
The GMER only has "Physical Address" field, no such one indicates length. So, when a poison event is received, we could use GET_POISON_LIST command to get the poison list. Now driver has cxl_mem_get_poison(), so reuse it and add a parameter 'bool report', report poison record to MCE if set true.

[RFC PATCH v2 5/6] cxl: add definition for transaction types

2024-03-28 Thread Shiyang Ruan via
The transaction types are defined in General Media Event Record/DRAM Event per CXL rev 3.0 Section 8.2.9.2.1.1; Table 8-43 and Section 8.2.9.2.1.2; Table 8-44. Add them for Event Record handler use. Signed-off-by: Shiyang Ruan --- include/linux/cxl-event.h | 17 +++-- 1 file changed

[RFC PATCH v2 4/6] cxl/core: report poison when injecting from debugfs

2024-03-28 Thread Shiyang Ruan via
Poison injection from debugfs is silent too. Add calling cxl_mem_report_poison() to make it able to do memory_failure(). Signed-off-by: Shiyang Ruan --- drivers/cxl/core/memdev.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c index e976

[RFC PATCH v2 0/6] cxl: add poison event handler

2024-03-28 Thread Shiyang Ruan via
Changes: RFCv1 -> RFCv2: 1. update commit message of PATCH 1 2. use memory_failure_queue() instead of MCE 3. also report poison in debugfs when injecting poison 4. correct DPA->HPA logic: find memdev's endpoint decoder to find the region it belongs to 5. distinguish transaction_type of GMER, o

[RFC PATCH v2 1/6] cxl/core: correct length of DPA field masks

2024-03-28 Thread Shiyang Ruan via
The length of Physical Address in General Media Event Record/DRAM Event Record is 64-bit, so the field mask should be defined as such length. Otherwise, this causes cxl_general_media and cxl_dram tracepoints to mask off the upper-32-bits of DPA addresses. The cxl_poison event is unaffected. If use

[RFC PATCH v2 6/6] cxl/core: add poison injection event handler

2024-03-28 Thread Shiyang Ruan via
Currently driver only traces cxl events, poison injection (for both vmem and pmem type) on cxl memdev is silent. OS needs to be notified then it could handle poison range in time. Per CXL spec, the device error event could be signaled through FW-First and OS-First methods. So, add poison event h

[RFC PATCH v2 2/6] cxl/core: introduce cxl_mem_report_poison()

2024-03-28 Thread Shiyang Ruan via
If poison is detected(reported from cxl memdev), OS should be notified to handle it. So, introduce this helper function for later use: 1. translate DPA to HPA; 2. enqueue records into memory_failure's work queue; Signed-off-by: Shiyang Ruan --- Currently poison injection from debugfs always

[PATCH] monitor/hmp-cmds-target.c: append a space in error message in gpa2hva()

2024-03-18 Thread Shiyang Ruan via
From: Yao Xingtao In qemu monitor mode, when we use gpa2hva command to print the host virtual address corresponding to a guest physical address, if the gpa is not in RAM, the error message is below: (qemu) gpa2hva 0x75000 Memory at address 0x75000is not RAM a space is missed between '0x

Re: [RFC PATCH 5/5] cxl/core: add poison injection event handler

2024-03-14 Thread Shiyang Ruan via
On 2024/2/14 0:51, Jonathan Cameron wrote: + +void cxl_event_handle_record(struct cxl_memdev *cxlmd, +enum cxl_event_log_type type, +enum cxl_event_type event_type, +const uuid_t *uuid, union cxl_event *evt) +{ +

Re: [RFC PATCH 3/5] cxl/core: introduce cxl_mem_report_poison()

2024-03-14 Thread Shiyang Ruan via
On 2024/2/10 14:46, Dan Williams wrote: Shiyang Ruan wrote: If poison is detected(reported from cxl memdev), OS should be notified to handle it. Introduce this function: 1. translate DPA to HPA; 2. construct a MCE instance; (TODO: more details need to be filled) 3. log it into MCE event

Re: [RFC PATCH 4/5] cxl/core: add report option for cxl_mem_get_poison()

2024-03-14 Thread Shiyang Ruan via
On 2024/2/10 14:49, Dan Williams wrote: Shiyang Ruan wrote: When a poison event is received, driver uses GET_POISON_LIST command to get the poison list. Now driver has cxl_mem_get_poison(), so reuse it and add a parameter 'bool report', report poison record to MCE if set true. If the memory er

Re: [RFC PATCH 1/5] cxl/core: correct length of DPA field masks

2024-02-19 Thread Shiyang Ruan via
On 2024/2/10 14:34, Dan Williams wrote: Shiyang Ruan wrote: The length of Physical Address in General Media Event Record/DRAM Event Record is 64-bit, so the field mask should be defined as such length. Can you include this user visible side-effect of this change. Looks like this could cause usa

[RFC PATCH 1/5] cxl/core: correct length of DPA field masks

2024-02-09 Thread Shiyang Ruan via
The length of Physical Address in General Media Event Record/DRAM Event Record is 64-bit, so the field mask should be defined as such length. Signed-off-by: Shiyang Ruan --- drivers/cxl/core/trace.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/cxl/core/trace.

[RFC PATCH SET] cxl: add poison event handler

2024-02-09 Thread Shiyang Ruan via
Currently driver only trace cxl events, poison injection on cxl memdev is silent. OS needs to be notified then it could handle poison range in time. Per CXL spec, the device error event could be signaled through FW-First and OS-First methods. This draft patchset adds poison event handler in OS-F

[RFC PATCH 2/5] cxl/core: introduce cxl_memdev_dpa_to_hpa()

2024-02-09 Thread Shiyang Ruan via
When a memdev is assigned to a region, its Device Physical Address will be mapped to Host Physical Address. Introduce this helper function to translate HPA from a given memdev and its DPA. Signed-off-by: Shiyang Ruan --- drivers/cxl/core/memdev.c | 12 drivers/cxl/cxlmem.h |

[RFC PATCH 3/5] cxl/core: introduce cxl_mem_report_poison()

2024-02-09 Thread Shiyang Ruan via
If poison is detected(reported from cxl memdev), OS should be notified to handle it. Introduce this function: 1. translate DPA to HPA; 2. construct a MCE instance; (TODO: more details need to be filled) 3. log it into MCE event queue; After that, MCE mechanism can walk over its notifier cha

[RFC PATCH 2/2] hw/cxl/type3: send a GMER while injecting poison

2024-02-09 Thread Shiyang Ruan via
Send a signal to OS to let it able to handle the poison range. TODO: This is an rough draft, will add more parameters for qmp_cxl_inject_poison() to set to GMER. Signed-off-by: Shiyang Ruan --- hw/mem/cxl_type3.c | 5 + 1 file changed, 5 insertions(+) diff --git a/hw/mem/cxl_type3.c b/hw/m

[RFC PATCH 5/5] cxl/core: add poison injection event handler

2024-02-09 Thread Shiyang Ruan via
Currently driver only trace cxl events, poison injection on cxl memdev is silent. OS needs to be notified then it could handle poison range in time. Per CXL spec, the device error event could be signaled through FW-First and OS-First methods. So, add poison event handler in OS-First method: -

[RFC PATCH 4/5] cxl/core: add report option for cxl_mem_get_poison()

2024-02-09 Thread Shiyang Ruan via
When a poison event is received, driver uses GET_POISON_LIST command to get the poison list. Now driver has cxl_mem_get_poison(), so reuse it and add a parameter 'bool report', report poison record to MCE if set true. Signed-off-by: Shiyang Ruan --- drivers/cxl/core/mbox.c | 7 +-- driver

[RFC PATCH 1/2] hw/cxl/type3: add missing flag bit for GMER

2024-02-09 Thread Shiyang Ruan via
The "Volatile" should be set if current device is a volatile memory. Per CXL Spec r3.0 8.2.9.2.1.1, Table 8-43. Signed-off-by: Shiyang Ruan --- hw/mem/cxl_type3.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c index 52647b4ac7..d8fb63b1de 100644