Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
New issues: - [Medium] The `set-error-threshold` operation modifies state but only replies with a generic ACK instead of returning the finalized object state, violating Netlink uAPI guidelines. Pre-existing issues: - [Critical] Lockless lookup and dereference of `drm_ras_node` via `xa_load()` leads to a Use-After-Free if the node is concurrently unregistered. -- commit 90a4ea5c0139594375be477d76556dae20fa84e0 Author: Raag Jadav <[email protected]> Subject: drm/ras: Introduce error threshold Add get-error-threshold and set-error-threshold command support which allows querying/setting error threshold of the counter. Threshold in RAS context means the number of errors the hardware is expected to accumulate before it raises them to software. > diff --git a/Documentation/netlink/specs/drm_ras.yaml > b/Documentation/netlink/specs/drm_ras.yaml > index e113056f8c016..9cf7f9cde2425 100644 > --- a/Documentation/netlink/specs/drm_ras.yaml > +++ b/Documentation/netlink/specs/drm_ras.yaml > @@ -124,3 +128,31 @@ operations: [ ... ] > + - > + name: set-error-threshold > + doc: >- > + Set error threshold of a given counter. > + attribute-set: error-counter-attrs > + flags: [admin-perm] > + do: > + request: > + attributes: > + - node-id > + - error-id > + - error-threshold [Severity: Medium] Does this operation need a reply block to return the finalized threshold? According to the Generic Netlink subsystem uAPI guidelines, operations that modify state must reply with a full message carrying identifying information about the object. If the driver applies bounds checking and clamps or rounds the threshold, user space won't know the actual applied value since the command only returns a generic ACK. Is it possible to include a reply block here? > diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c > index 467a169026fc1..d60c40ac5427a 100644 > --- a/drivers/gpu/drm/drm_ras.c > +++ b/drivers/gpu/drm/drm_ras.c > @@ -168,6 +189,40 @@ static int get_node_error_counter(u32 node_id, u32 > error_id, > return node->query_error_counter(node, error_id, name, value); > } > > +static int get_node_error_threshold(u32 node_id, u32 error_id, const char > **name, u32 *threshold) > +{ > + struct drm_ras_node *node; > + > + node = xa_load(&drm_ras_xa, node_id); > + if (!node) > + return -ENOENT; > + > + if (!node->query_error_threshold) > + return -EOPNOTSUPP; > + > + if (error_id < node->error_counter_range.first || error_id > > node->error_counter_range.last) > + return -EINVAL; > + > + return node->query_error_threshold(node, error_id, name, threshold); [Severity: Critical] This is a pre-existing issue, but is it safe to dereference the node pointer here without RCU protection or a reference count? The xa_load() function drops the RCU read lock before returning. If a concurrent thread unregisters the node and frees it, could this result in a use-after-free when calling node->query_error_threshold()? > +} > + > +static int set_node_error_threshold(u32 node_id, u32 error_id, u32 threshold) > +{ > + struct drm_ras_node *node; > + > + node = xa_load(&drm_ras_xa, node_id); > + if (!node) > + return -ENOENT; > + > + if (!node->set_error_threshold) > + return -EOPNOTSUPP; > + > + if (error_id < node->error_counter_range.first || error_id > > node->error_counter_range.last) > + return -EINVAL; > + > + return node->set_error_threshold(node, error_id, threshold); [Severity: Critical] This is a pre-existing issue, but similarly to get_node_error_threshold(), the node pointer is retrieved locklessly via xa_load() and immediately dereferenced. Could a concurrent unregistration of the device free the node while it is being accessed here? > +} [ ... ] > @@ -358,6 +466,59 @@ int drm_ras_nl_clear_error_counter_doit(struct sk_buff > *skb, > return node->clear_error_counter(node, error_id); > } > [ ... ] > +int drm_ras_nl_set_error_threshold_doit(struct sk_buff *skb, struct > genl_info *info) > +{ > + u32 node_id, error_id, threshold; > + > + if (!info->attrs || > + GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID) || > + GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID) || > + GENL_REQ_ATTR_CHECK(info, > DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD)) > + return -EINVAL; > + > + node_id = > nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]); > + error_id = > nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]); > + threshold = > nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD]); > + > + return set_node_error_threshold(node_id, error_id, threshold); [Severity: Medium] This operation completes by returning the success code from set_node_error_threshold(), which results in a generic ACK being sent to user space. Should this prepare and send a netlink response message containing the actual applied threshold instead of relying on the ACK? > +} -- Sashiko AI review ยท https://sashiko.dev/#/patchset/[email protected]?part=2
