On 2024/8/9 2:28, Fan Ni wrote:
On Thu, Aug 08, 2024 at 11:13:27PM +0800, Shiyang Ruan wrote:
A CXL device can find and report memory problems even before the CPU detects
an MCE. AFAIK, the current kernel only traces POISON error events
from the FW-First/OS-First path, but it doesn't handle them, nor
This patchset includes "cxl/core: introduce poison creation hanlding"
and "cxl: avoid duplicated report from MCE & device", which were posted
separately. Here are changes since last version of each patch:
P1: 1. since it's an async memory_failure(), set the flag to 0
2. also handle CXL_EVENT_TRA
Since a CXL device is a memory device, when the CPU consumes a poisoned
page of a CXL device, it always triggers an MCE (via interrupt #18) and
calls memory_failure() to handle the POISON page, no matter which-First path
is configured. A CXL device can also find and report the POISON; the kernel
now not only tr
A CXL device can find and report memory problems even before the CPU detects
an MCE. AFAIK, the current kernel only traces POISON error events
from the FW-First/OS-First path, but it doesn't handle them, nor does it
notify processes that are using the POISON page like MCE does.
Thus, users have to read logs from
On 2024/7/20 0:04, Dave Jiang wrote:
On 7/1/24 7:12 PM, Shiyang Ruan wrote:
On 2024/6/25 21:56, Shiyang Ruan wrote:
On 2024/6/22 1:51, Dan Williams wrote:
Shiyang Ruan wrote:
Background:
Since a CXL device is a memory device, when the CPU consumes a poisoned page of a
CXL device, it always triggers an MCE
On 2024/6/19 0:53, Shiyang Ruan wrote:
This patch adds a new notifier_block, with priority MCE_PRIO_CXL, to
`x86_mce_decoder_chain`, so that the CXL memdev can check whether the
current poison page has already been reported. If it has, the notifier
chain stops and the following memory_failure() is not called. In this way
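As a rough illustration of the ordering described in this snippet, here is a small userspace model. It is not the kernel's notifier API: handlers are kept sorted by priority, and a higher-priority handler can stop the walk so the later one never runs. Every name below (chain_register, cxl_check, failure_handler, and so on) is a hypothetical stand-in chosen for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model only; these names are hypothetical, not kernel symbols. */
enum chain_result { CHAIN_CONTINUE, CHAIN_STOP };

struct handler {
	int priority;                          /* higher priority runs first */
	enum chain_result (*fn)(uint64_t pfn);
	struct handler *next;
};

static struct handler *chain_head;

/* Insert a handler so the list stays sorted by descending priority. */
static void chain_register(struct handler *h)
{
	struct handler **pos = &chain_head;

	while (*pos && (*pos)->priority >= h->priority)
		pos = &(*pos)->next;
	h->next = *pos;
	*pos = h;
}

/* Walk the chain until a handler asks to stop. */
static void chain_call(uint64_t pfn)
{
	for (struct handler *h = chain_head; h; h = h->next)
		if (h->fn(pfn) == CHAIN_STOP)
			break;
}

#define MAX_REPORTED 16
static uint64_t reported[MAX_REPORTED];  /* pages the device already reported */
static size_t nr_reported;
static unsigned int failures_handled;    /* times the fallback handler ran */

static void device_report(uint64_t pfn)
{
	if (nr_reported < MAX_REPORTED)
		reported[nr_reported++] = pfn;
}

/* Stops the chain if the page was already reported by the device. */
static enum chain_result cxl_check(uint64_t pfn)
{
	for (size_t i = 0; i < nr_reported; i++)
		if (reported[i] == pfn)
			return CHAIN_STOP;
	return CHAIN_CONTINUE;
}

/* Stand-in for the later handler that would deal with the failure. */
static enum chain_result failure_handler(uint64_t pfn)
{
	(void)pfn;
	failures_handled++;
	return CHAIN_CONTINUE;
}

static struct handler cxl_handler = { .priority = 2, .fn = cxl_check };
static struct handler mf_handler = { .priority = 1, .fn = failure_handler };

static void chain_demo(void)
{
	chain_register(&mf_handler);
	chain_register(&cxl_handler);   /* registered later, still runs first */
	device_report(0x1000);
	chain_call(0x1000);             /* already reported: chain stops early */
	assert(failures_handled == 0);
	chain_call(0x2000);             /* not reported: fallback handler runs */
	assert(failures_handled == 1);
}
```

Registration order does not matter in this model, only the priority value, which is the property the snippet relies on when it adds a dedicated priority for the CXL check.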
On 2024/6/25 21:56, Shiyang Ruan wrote:
On 2024/6/22 1:51, Dan Williams wrote:
Shiyang Ruan wrote:
Background:
Since a CXL device is a memory device, when the CPU consumes a poisoned page of a
CXL device, it always triggers an MCE by interrupt (INT18), no matter
which-First path is configured. This is the
On 2024/6/22 4:44, Luck, Tony wrote:
So who actually cares about recovering poisoned volatile memory?
I'd like to understand more on how significant a use case this is.
Whilst I can conjecture that it's an extreme case of wanting to avoid
losing the ability to create 1GiB or larger pages due to po
On 2024/6/22 1:51, Dan Williams wrote:
Shiyang Ruan wrote:
Background:
Since a CXL device is a memory device, when the CPU consumes a poisoned page of a
CXL device, it always triggers an MCE by interrupt (INT18), no matter
which-First path is configured. This is the first report. Then
currently, in FW-Fi
On 2024/6/20 23:51, Dave Jiang wrote:
On 6/19/24 2:24 AM, Shiyang Ruan wrote:
On 2024/6/19 7:35, Dave Jiang wrote:
On 6/18/24 9:53 AM, Shiyang Ruan wrote:
Background:
Since a CXL device is a memory device, when the CPU consumes a poisoned page of a
CXL device, it always triggers an MCE by interrupt (IN
On 2024/6/21 1:02, Jonathan Cameron wrote:
On Wed, 19 Jun 2024 00:53:10 +0800
Shiyang Ruan wrote:
Background:
Since a CXL device is a memory device, when the CPU consumes a poisoned page of a
CXL device, it always triggers an MCE by interrupt (INT18), no matter
which-First path is configured. This is th
On 2024/6/19 7:35, Dave Jiang wrote:
On 6/18/24 9:53 AM, Shiyang Ruan wrote:
Background:
Since a CXL device is a memory device, when the CPU consumes a poisoned page of a
CXL device, it always triggers an MCE by interrupt (INT18), no matter
which-First path is configured. This is the first report. Then
Background:
Since a CXL device is a memory device, when the CPU consumes a poisoned page of a
CXL device, it always triggers an MCE by interrupt (INT18), no matter
which-First path is configured. This is the first report. Then
currently, in FW-First path, the poison event is transferred according
to th
On 2024/5/24 23:15, Shiyang Ruan wrote:
On 2024/5/22 14:45, Dan Williams wrote:
Shiyang Ruan wrote:
[..]
My expectation is MF_ACTION_REQUIRED is not appropriate for CXL event
reported errors since action is only required for direct consumption
events and those need not be reported through the devi
On 2024/5/22 14:45, Dan Williams wrote:
Shiyang Ruan wrote:
[..]
My expectation is MF_ACTION_REQUIRED is not appropriate for CXL event
reported errors since action is only required for direct consumption
events and those need not be reported through the device event queue.
Got it.
I'm not very
On 2024/5/3 19:32, Shiyang Ruan wrote:
On 2024/4/24 2:40, Dan Williams wrote:
Shiyang Ruan wrote:
Currently the driver only traces CXL events; poison creation (for both vmem
and pmem types) on a CXL memdev is silent.
As it should be.
The OS needs to be notified so that it can handle poison pages in time.
On 2024/5/1 5:00, Alison Schofield wrote:
On Wed, Apr 17, 2024 at 03:50:52PM +0800, Shiyang Ruan wrote:
The Physical Address field in the General Media Event Record/DRAM Event
Record is 64 bits long, so the field mask should be defined with the same width.
Otherwise, this causes cxl_general_media and cxl_dr
On 2024/4/24 2:40, Dan Williams wrote:
Shiyang Ruan wrote:
Currently the driver only traces CXL events; poison creation (for both vmem
and pmem types) on a CXL memdev is silent.
As it should be.
The OS needs to be notified so that it can handle poison pages in time.
No, it was always the case that late
On 2024/4/24 1:57, Ira Weiny wrote:
Shiyang Ruan wrote:
Currently the driver only traces CXL events; poison creation (for both vmem
and pmem types) on a CXL memdev is silent. The OS needs to be notified so that it
can handle poison pages in time. Per the CXL spec, the device error event
could be signaled throu
On 2024/4/24 5:04, Ira Weiny wrote:
Alison Schofield wrote:
On Wed, Apr 17, 2024 at 03:50:52PM +0800, Shiyang Ruan wrote:
[snip]
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index e5f13260fc52..cdfce932d5b1 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace
On 2024/4/18 1:30, Dave Jiang wrote:
On 4/17/24 12:50 AM, Shiyang Ruan wrote:
Currently the driver only traces CXL events; poison creation (for both vmem
and pmem types) on a CXL memdev is silent. The OS needs to be notified so that it
can handle poison pages in time. Per the CXL spec, the device error event
Currently the driver only traces CXL events; poison creation (for both vmem
and pmem types) on a CXL memdev is silent. The OS needs to be notified so that it
can handle poison pages in time. Per the CXL spec, the device error event
could be signaled through FW-First and OS-First methods.
So, add poison creation
The Physical Address field in the General Media Event Record/DRAM Event
Record is 64 bits long, so the field mask should be defined with the same width.
Otherwise, this causes cxl_general_media and cxl_dram tracepoints to
mask off the upper-32-bits of DPA addresses. The cxl_poison event is
unaffected.
If use
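To make the truncation concrete, here is a minimal sketch of the general pitfall, assuming a mask written as a 32-bit constant. The macro and function names are hypothetical and are not the definitions in drivers/cxl/core/trace.h; the point is only that a 32-bit unsigned mask is zero-extended before the AND, so bits 63..32 of a 64-bit address are cleared, while a 64-bit mask preserves them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical masks for illustration; not the kernel's actual definitions. */
#define SKETCH_DPA_MASK_32 0xFFFFFFC0U            /* 32-bit unsigned constant */
#define SKETCH_DPA_MASK_64 0xFFFFFFFFFFFFFFC0ULL  /* 64-bit constant, bits 63..6 */

/* The 32-bit mask is zero-extended to 64 bits, so the upper half is lost. */
static uint64_t mask_dpa_narrow(uint64_t dpa)
{
	return dpa & SKETCH_DPA_MASK_32;
}

/* The 64-bit mask clears only the low six flag bits. */
static uint64_t mask_dpa_wide(uint64_t dpa)
{
	return dpa & SKETCH_DPA_MASK_64;
}

static void mask_demo(void)
{
	const uint64_t dpa = 0x1234567800000FC5ULL;

	assert(mask_dpa_narrow(dpa) == 0x0000000000000FC0ULL);
	assert(mask_dpa_wide(dpa) == 0x1234567800000FC0ULL);
}
```

With an address above 4 GiB, the narrow variant silently reports a different (lower) address, which is the kind of user-visible effect the commit message describes for the tracepoints.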
Changes: RFCv2 -> v3:
1. patch1: removed changes for flags
2. changed the main idea of this patchset: not for injection event
handling, but for creation;
3. removed GET_POISON_LIST command while receiving POISON event;
4. dropped poison report in debugfs;
5. added DER event handler to handle P
On 2024/3/30 9:52, Dan Williams wrote:
Shiyang Ruan wrote:
Poison injection from debugfs is silent too. Add a call to
cxl_mem_report_poison() so that it can trigger memory_failure().
Why does this need to be signalled? It is a debug interface, the
debugger can also trigger a read after the injec
On 2024/3/30 9:50, Dan Williams wrote:
Shiyang Ruan wrote:
The GMER only has a "Physical Address" field; no field indicates the length.
So, when a poison event is received, we could use GET_POISON_LIST command
to get the poison list. Now driver has cxl_mem_get_poison(), so
reuse it and add a parame
On 2024/3/30 9:37, Dan Williams wrote:
Shiyang Ruan wrote:
The Physical Address field in the General Media Event Record/DRAM Event
Record is 64 bits long, so the field mask should be defined with the same width.
Otherwise, this causes cxl_general_media and cxl_dram tracepoints to
mask off the upper-32-bits
The GMER only has a "Physical Address" field; no field indicates the length.
So, when a poison event is received, we could use GET_POISON_LIST command
to get the poison list. Now driver has cxl_mem_get_poison(), so
reuse it and add a parameter 'bool report', report poison record to MCE
if set true.
The transaction types are defined in General Media Event Record/DRAM Event
per CXL rev 3.0 Section 8.2.9.2.1.1; Table 8-43 and
Section 8.2.9.2.1.2; Table 8-44. Add them for Event Record handler use.
Signed-off-by: Shiyang Ruan
---
include/linux/cxl-event.h | 17 +++--
1 file changed
Poison injection from debugfs is silent too. Add a call to
cxl_mem_report_poison() so that it can trigger memory_failure().
Signed-off-by: Shiyang Ruan
---
drivers/cxl/core/memdev.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index e976
Changes:
RFCv1 -> RFCv2:
1. update commit message of PATCH 1
2. use memory_failure_queue() instead of MCE
3. also report poison in debugfs when injecting poison
4. correct DPA->HPA logic:
find memdev's endpoint decoder to find the region it belongs to
5. distinguish transaction_type of GMER, o
The Physical Address field in the General Media Event Record/DRAM Event
Record is 64 bits long, so the field mask should be defined with the same width.
Otherwise, this causes cxl_general_media and cxl_dram tracepoints to
mask off the upper-32-bits of DPA addresses. The cxl_poison event is
unaffected.
If use
Currently the driver only traces CXL events; poison injection (for both vmem
and pmem types) on a CXL memdev is silent. The OS needs to be notified so that it
can handle the poison range in time. Per the CXL spec, the device error event
could be signaled through FW-First and OS-First methods.
So, add poison event h
If poison is detected (reported from a CXL memdev), the OS should be notified to
handle it. So, introduce this helper function for later use:
1. translate DPA to HPA;
2. enqueue records into memory_failure's work queue;
Signed-off-by: Shiyang Ruan
---
Currently poison injection from debugfs always
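The two steps listed in this snippet can be sketched in userspace as follows. This is only a model under stated assumptions: the DPA-to-HPA mapping is taken to be a single linear, non-interleaved window, and a plain array stands in for the memory_failure work queue. The struct, function names, and page shift below are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12   /* assumes 4 KiB pages */
#define QUEUE_DEPTH 8

/* Hypothetical linear window mapping device addresses to host addresses. */
struct dpa_window {
	uint64_t dpa_base;
	uint64_t hpa_base;
	uint64_t size;
};

/* Step 1: translate DPA to HPA; fails if the DPA is outside the window. */
static bool dpa_to_hpa(const struct dpa_window *w, uint64_t dpa, uint64_t *hpa)
{
	if (dpa < w->dpa_base || dpa - w->dpa_base >= w->size)
		return false;
	*hpa = w->hpa_base + (dpa - w->dpa_base);
	return true;
}

/* Array standing in for the queue of page frames awaiting handling. */
static uint64_t pending_pfns[QUEUE_DEPTH];
static size_t nr_pending;

/* Step 2: enqueue the page frame number of the poisoned host address. */
static bool report_poison(const struct dpa_window *w, uint64_t dpa)
{
	uint64_t hpa;

	if (!dpa_to_hpa(w, dpa, &hpa) || nr_pending == QUEUE_DEPTH)
		return false;
	pending_pfns[nr_pending++] = hpa >> SKETCH_PAGE_SHIFT;
	return true;
}

static void poison_demo(void)
{
	const struct dpa_window w = {
		.dpa_base = 0x0,
		.hpa_base = 0x100000000ULL,
		.size = 0x10000000ULL,
	};

	assert(report_poison(&w, 0x2040));          /* inside the window */
	assert(pending_pfns[0] == 0x100002);
	assert(!report_poison(&w, 0x10000000ULL));  /* one byte past the end */
}
```

A real translation has to account for the region and decoder configuration (the RFCv2 changelog mentions finding the memdev's endpoint decoder), so the linear window here should be read as the simplest possible case.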
From: Yao Xingtao
In qemu monitor mode, when we use gpa2hva command to print the host
virtual address corresponding to a guest physical address, if the gpa is
not in RAM, the error message is below:
(qemu) gpa2hva 0x75000
Memory at address 0x75000is not RAM
a space is missed between '0x
On 2024/2/14 0:51, Jonathan Cameron wrote:
+
+void cxl_event_handle_record(struct cxl_memdev *cxlmd,
+enum cxl_event_log_type type,
+enum cxl_event_type event_type,
+const uuid_t *uuid, union cxl_event *evt)
+{
+
On 2024/2/10 14:46, Dan Williams wrote:
Shiyang Ruan wrote:
If poison is detected (reported from a CXL memdev), the OS should be notified to
handle it. Introduce this function:
1. translate DPA to HPA;
2. construct a MCE instance; (TODO: more details need to be filled)
3. log it into MCE event
On 2024/2/10 14:49, Dan Williams wrote:
Shiyang Ruan wrote:
When a poison event is received, driver uses GET_POISON_LIST command
to get the poison list. Now driver has cxl_mem_get_poison(), so
reuse it and add a parameter 'bool report', report poison record to MCE
if set true.
If the memory er
On 2024/2/10 14:34, Dan Williams wrote:
Shiyang Ruan wrote:
The Physical Address field in the General Media Event Record/DRAM Event
Record is 64 bits long, so the field mask should be defined with the same width.
Can you include the user-visible side effect of this change? Looks like
this could cause usa
The Physical Address field in the General Media Event Record/DRAM Event
Record is 64 bits long, so the field mask should be defined with the same width.
Signed-off-by: Shiyang Ruan
---
drivers/cxl/core/trace.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/cxl/core/trace.
Currently the driver only traces CXL events; poison injection on a CXL memdev
is silent. The OS needs to be notified so that it can handle the poison range
in time. Per the CXL spec, the device error event could be signaled through
FW-First and OS-First methods.
This draft patchset adds poison event handler in OS-F
When a memdev is assigned to a region, its Device Physical Address will be
mapped to Host Physical Address. Introduce this helper function to
translate HPA from a given memdev and its DPA.
Signed-off-by: Shiyang Ruan
---
drivers/cxl/core/memdev.c | 12
drivers/cxl/cxlmem.h |
If poison is detected (reported from a CXL memdev), the OS should be notified to
handle it. Introduce this function:
1. translate DPA to HPA;
2. construct a MCE instance; (TODO: more details need to be filled)
3. log it into MCE event queue;
After that, MCE mechanism can walk over its notifier cha
Send a signal to the OS so that it can handle the poison range.
TODO: This is a rough draft, will add more parameters for
qmp_cxl_inject_poison() to set to GMER.
Signed-off-by: Shiyang Ruan
---
hw/mem/cxl_type3.c | 5 +
1 file changed, 5 insertions(+)
diff --git a/hw/mem/cxl_type3.c b/hw/m
Currently the driver only traces CXL events; poison injection on a CXL memdev
is silent. The OS needs to be notified so that it can handle the poison range
in time. Per the CXL spec, the device error event could be signaled through
FW-First and OS-First methods.
So, add poison event handler in OS-First method:
-
When a poison event is received, driver uses GET_POISON_LIST command
to get the poison list. Now driver has cxl_mem_get_poison(), so
reuse it and add a parameter 'bool report', report poison record to MCE
if set true.
Signed-off-by: Shiyang Ruan
---
drivers/cxl/core/mbox.c | 7 +--
driver
"Volatile" should be set if the current device is volatile memory.
Per CXL Spec r3.0 8.2.9.2.1.1, Table 8-43.
Signed-off-by: Shiyang Ruan
---
hw/mem/cxl_type3.c | 6 ++
1 file changed, 6 insertions(+)
diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
index 52647b4ac7..d8fb63b1de 100644