On Mon, 21 Jul 2025 18:22:27 +0100
<shiju.j...@huawei.com> wrote:

> From: Davidlohr Bueso <d...@stgolabs.net>

I tweaked the title to mention Post Package Repair. If anyone is ever
looking for that particular maintenance command they might want to know
it is in here from the title.

>
> This adds initial support for the Maintenance command, specifically
> the soft and hard PPR operations on a dpa. The implementation allows
> to be executed at runtime, therefore semantically, data is retained
> and CXL.mem requests are correctly processed.
>
> Keep track of the requests upon a general media or DRAM event.
>
> Post Package Repair (PPR) maintenance operations may be supported by CXL
> devices that implement CXL.mem protocol. A PPR maintenance operation
> requests the CXL device to perform a repair operation on its media.
> For example, a CXL device with DRAM components that support PPR features
> may implement PPR Maintenance operations. DRAM components may support two
> types of PPR, hard PPR (hPPR), for a permanent row repair, and Soft PPR
> (sPPR), for a temporary row repair. Soft PPR is much faster than hPPR,
> but the repair is lost with a power cycle.
>
> CXL spec 3.2 section 8.2.10.7.1.2 describes the device's sPPR (soft PPR)
> maintenance operation and section 8.2.10.7.1.3 describes the device's
> hPPR (hard PPR) maintenance operation feature.
>
> CXL spec 3.2 section 8.2.10.7.2.1 describes the sPPR feature discovery and
> configuration.
>
> CXL spec 3.2 section 8.2.10.7.2.2 describes the hPPR feature discovery and
> configuration.
>
> CXL spec 3.2 section 8.2.10.2.1.4 Table 8-60 describes the Memory Sparing
> Event Record.
>
> Signed-off-by: Davidlohr Bueso <d...@stgolabs.net>
> Co-developed-by: Shiju Jose <shiju.j...@huawei.com>
> Signed-off-by: Shiju Jose <shiju.j...@huawei.com>
> Signed-off-by: Jonathan Cameron <jonathan.came...@huawei.com>
> Signed-off-by: Shiju Jose <shiju.j...@huawei.com>

2x SoB for Shiju.

Main question I have on this is why we currently track things marked
for maintenance in injected error records, but don't act on that in
any way.
I'm also thinking we could easily pass on any provided geometry so the
sparing record reflects what was injected if that's where it came from.
If we do PPR on something not injected we might still want to make some
plausible geometry up.

> +static void cxl_mbox_create_mem_sparing_event_records(CXLType3Dev *ct3d,
> +                                uint8_t class, uint8_t sub_class)
> +{
> +    CXLEventSparing event_rec = {};
> +
> +    cxl_assign_event_header(&event_rec.hdr,
> +                            &sparing_uuid,
> +                            (1 << CXL_EVENT_TYPE_INFO),
> +                            sizeof(event_rec),
> +                            cxl_device_get_timestamp(&ct3d->cxl_dstate),
> +                            1, class, 1, sub_class, 0, 0, 0, 0);
> +
> +    event_rec.flags = 0;
> +    event_rec.result = 0;
> +    event_rec.validity_flags = CXL_MSER_VALID_CHANNEL |
> +                               CXL_MSER_VALID_RANK |
> +                               CXL_MSER_VALID_NIB_MASK |
> +                               CXL_MSER_VALID_BANK_GROUP |
> +                               CXL_MSER_VALID_BANK |
> +                               CXL_MSER_VALID_ROW |
> +                               CXL_MSER_VALID_COLUMN |
> +                               CXL_MSER_VALID_SUB_CHANNEL;
> +
> +    event_rec.res_avail = 1;
> +    event_rec.channel = 2;
> +    event_rec.rank = 5;
> +    st24_le_p(event_rec.nibble_mask, 0xA59C);
> +    event_rec.bank_group = 2;
> +    event_rec.bank = 4;
> +    st24_le_p(event_rec.row, 13);
> +    event_rec.column = 23;
> +    event_rec.sub_channel = 7;

At some point we should cycle back and make up some 'geometry' for the
memory so we can map different DPAs to different places. This is fine
for now though.

> +
> +    if (cxl_event_insert(&ct3d->cxl_dstate,
> +                         CXL_EVENT_TYPE_INFO,
> +                         (CXLEventRecordRaw *)&event_rec)) {
> +        cxl_event_irq_assert(ct3d);
> +    }
> +}
> +
> +
> +static void cxl_perform_ppr(CXLType3Dev *ct3d, uint64_t dpa)
> +{
> +    CXLMaintenance *ent, *next;
> +
> +    QLIST_FOREACH_SAFE(ent, &ct3d->maint_list, node, next) {

If we did want to generate the right geometry to match the injected
event we'd want to retrieve it here (having stashed it in the ent)

> +        if (dpa == ent->dpa) {
> +            QLIST_REMOVE(ent, node);

What is this actually for at the moment? We track them on a list but
don't enforce anything with it?
I don't think we should enforce this as you can issue PPR on stuff
that was never in error if you like.

> +            g_free(ent);
> +            break;
> +        }
> +    }
> +
> +    /* Produce a Memory Sparing Event Record */
> +    if (ct3d->soft_ppr_attrs.sppr_op_mode &
> +        CXL_MEMDEV_SPPR_OP_MODE_MEM_SPARING_EV_REC_EN) {
> +        cxl_mbox_create_mem_sparing_event_records(ct3d,
> +            CXL_MEMDEV_MAINT_CLASS_SPARING,
> +            CXL_MEMDEV_MAINT_SUBCLASS_CACHELINE_SPARING);
> +    }
> +}
>  /* Component ID is device specific. Define this as a string. */
>  void qmp_cxl_inject_general_media_event(const char *path, CxlEventLog log,
>                                          uint32_t flags, bool has_maint_op_class,
> @@ -1715,6 +1756,11 @@ void qmp_cxl_inject_general_media_event(const char *path, CxlEventLog log,
>          error_setg(errp, "Unhandled error log type");
>          return;
>      }
> +    if (rc == CXL_EVENT_TYPE_INFO &&
> +        (flags & CXL_EVENT_REC_FLAGS_MAINT_NEEDED)) {
> +        error_setg(errp, "Informational event cannot require maintenance");
> +        return;
> +    }
>      enc_log = rc;
>
>      memset(&gem, 0, sizeof(gem));
> @@ -1773,6 +1819,10 @@ void qmp_cxl_inject_general_media_event(const char *path, CxlEventLog log,
>      if (cxl_event_insert(cxlds, enc_log, (CXLEventRecordRaw *)&gem)) {
>          cxl_event_irq_assert(ct3d);
>      }
> +
> +    if (flags & CXL_EVENT_REC_FLAGS_MAINT_NEEDED) {
> +        cxl_maintenance_insert(ct3d, dpa);

Same as below.

> +    }
>
>      memset(&dram, 0, sizeof(dram));
> @@ -1935,6 +1990,10 @@ void qmp_cxl_inject_dram_event(const char *path, CxlEventLog log,
>      if (cxl_event_insert(cxlds, enc_log, (CXLEventRecordRaw *)&dram)) {
>          cxl_event_irq_assert(ct3d);
>      }
> +
> +    if (flags & CXL_EVENT_REC_FLAGS_MAINT_NEEDED) {
> +        cxl_maintenance_insert(ct3d, dpa);

We make up the geometry details for the sparing record, but we 'could'
store them here if they were injected and hence spit out an appropriate
sparing record? Do you think it's worth doing at this stage?

Jonathan

> +    }
>  }
>
>  #define CXL_MMER_VALID_COMPONENT BIT(0)
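
For reference, a minimal sketch of the "stash the geometry in the ent"
idea raised above, in plain C with a hand-rolled list rather than QLIST
so that it stands alone. Every name in it (SketchGeometry, SketchMaint,
sketch_maint_insert, sketch_maint_take) is hypothetical and not taken
from the patch, and the fallback values simply mirror the hard-coded
ones in the quoted sparing record. It is an illustration under those
assumptions, not a proposed implementation.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical geometry stashed next to a DPA; not a QEMU type. */
typedef struct SketchGeometry {
    uint8_t channel;
    uint8_t rank;
    uint8_t bank_group;
    uint8_t bank;
    uint32_t row;       /* only 24 bits are used in the event record */
    uint16_t column;
} SketchGeometry;

/* Hypothetical stand-in for the maintenance list entry. */
typedef struct SketchMaint {
    uint64_t dpa;
    bool has_geometry;  /* false if the injection supplied none */
    SketchGeometry geo;
    struct SketchMaint *next;
} SketchMaint;

/* Track a DPA flagged as needing maintenance; geo may be NULL. */
static int sketch_maint_insert(SketchMaint **head, uint64_t dpa,
                               const SketchGeometry *geo)
{
    SketchMaint *ent = calloc(1, sizeof(*ent));

    if (!ent) {
        return -1;
    }
    ent->dpa = dpa;
    if (geo) {
        ent->has_geometry = true;
        ent->geo = *geo;
    }
    ent->next = *head;
    *head = ent;
    return 0;
}

/*
 * On PPR: drop the entry for dpa if one exists and report the geometry
 * to put in the sparing record. Falls back to made-up values when the
 * DPA was never injected or carried no geometry, so PPR is not gated
 * on the list. Returns true if the DPA had been tracked.
 */
static bool sketch_maint_take(SketchMaint **head, uint64_t dpa,
                              SketchGeometry *out)
{
    static const SketchGeometry fallback = {
        .channel = 2, .rank = 5, .bank_group = 2,
        .bank = 4, .row = 13, .column = 23,
    };
    SketchMaint **pp;

    for (pp = head; *pp; pp = &(*pp)->next) {
        SketchMaint *ent = *pp;

        if (ent->dpa == dpa) {
            *out = ent->has_geometry ? ent->geo : fallback;
            *pp = ent->next;    /* unlink before freeing */
            free(ent);
            return true;
        }
    }
    *out = fallback;
    return false;
}
```

With something along these lines the PPR path would fill the sparing
record from the returned geometry whether or not the DPA was tracked,
which keeps PPR on never-injected addresses working.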