Thank you for your contribution! Sashiko AI review found 2 potential issue(s) 
to consider:

New issues:
- [Medium] The `set-error-threshold` operation modifies state but only replies 
with a generic ACK instead of returning the finalized object state, violating 
Netlink uAPI guidelines.

Pre-existing issues:
- [Critical] Lockless lookup and dereference of `drm_ras_node` via `xa_load()` 
leads to a Use-After-Free if the node is concurrently unregistered.
--

commit 90a4ea5c0139594375be477d76556dae20fa84e0
Author: Raag Jadav <[email protected]>
Subject: drm/ras: Introduce error threshold

Add get-error-threshold and set-error-threshold command support which
allows querying/setting error threshold of the counter. Threshold in RAS
context means the number of errors the hardware is expected to accumulate
before it raises them to software.

> diff --git a/Documentation/netlink/specs/drm_ras.yaml 
> b/Documentation/netlink/specs/drm_ras.yaml
> index e113056f8c016..9cf7f9cde2425 100644
> --- a/Documentation/netlink/specs/drm_ras.yaml
> +++ b/Documentation/netlink/specs/drm_ras.yaml
> @@ -124,3 +128,31 @@ operations:
[ ... ]
> +    -
> +      name: set-error-threshold
> +      doc: >-
> +           Set error threshold of a given counter.
> +      attribute-set: error-counter-attrs
> +      flags: [admin-perm]
> +      do:
> +        request:
> +          attributes:
> +            - node-id
> +            - error-id
> +            - error-threshold

[Severity: Medium]
Does this operation need a reply block to return the finalized threshold?

According to the Generic Netlink subsystem uAPI guidelines, operations that
modify state must reply with a full message carrying identifying information
about the object.

If the driver applies bounds checking and clamps or rounds the threshold,
user space won't know the actual applied value since the command only
returns a generic ACK. Is it possible to include a reply block here?

> diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
> index 467a169026fc1..d60c40ac5427a 100644
> --- a/drivers/gpu/drm/drm_ras.c
> +++ b/drivers/gpu/drm/drm_ras.c
> @@ -168,6 +189,40 @@ static int get_node_error_counter(u32 node_id, u32 
> error_id,
>       return node->query_error_counter(node, error_id, name, value);
>  }
>  
> +static int get_node_error_threshold(u32 node_id, u32 error_id, const char 
> **name, u32 *threshold)
> +{
> +     struct drm_ras_node *node;
> +
> +     node = xa_load(&drm_ras_xa, node_id);
> +     if (!node)
> +             return -ENOENT;
> +
> +     if (!node->query_error_threshold)
> +             return -EOPNOTSUPP;
> +
> +     if (error_id < node->error_counter_range.first || error_id > 
> node->error_counter_range.last)
> +             return -EINVAL;
> +
> +     return node->query_error_threshold(node, error_id, name, threshold);

[Severity: Critical]
This is a pre-existing issue, but is it safe to dereference the node pointer
here without RCU protection or a reference count?

The xa_load() function drops the RCU read lock before returning. If a
concurrent thread unregisters the node and frees it, could this result in a
use-after-free when calling node->query_error_threshold()?

> +}
> +
> +static int set_node_error_threshold(u32 node_id, u32 error_id, u32 threshold)
> +{
> +     struct drm_ras_node *node;
> +
> +     node = xa_load(&drm_ras_xa, node_id);
> +     if (!node)
> +             return -ENOENT;
> +
> +     if (!node->set_error_threshold)
> +             return -EOPNOTSUPP;
> +
> +     if (error_id < node->error_counter_range.first || error_id > 
> node->error_counter_range.last)
> +             return -EINVAL;
> +
> +     return node->set_error_threshold(node, error_id, threshold);

[Severity: Critical]
This is a pre-existing issue, but similarly to get_node_error_threshold(),
the node pointer is retrieved locklessly via xa_load() and immediately
dereferenced.

Could a concurrent unregistration of the device free the node while it is
being accessed here?

> +}
[ ... ]
> @@ -358,6 +466,59 @@ int drm_ras_nl_clear_error_counter_doit(struct sk_buff 
> *skb,
>       return node->clear_error_counter(node, error_id);
>  }
>  
[ ... ]
> +int drm_ras_nl_set_error_threshold_doit(struct sk_buff *skb, struct 
> genl_info *info)
> +{
> +     u32 node_id, error_id, threshold;
> +
> +     if (!info->attrs ||
> +         GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID) ||
> +         GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID) ||
> +         GENL_REQ_ATTR_CHECK(info, 
> DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD))
> +             return -EINVAL;
> +
> +     node_id = 
> nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
> +     error_id = 
> nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
> +     threshold = 
> nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD]);
> +
> +     return set_node_error_threshold(node_id, error_id, threshold);

[Severity: Medium]
This operation completes by returning the success code from
set_node_error_threshold(), which results in a generic ACK being sent to
user space.

Should this prepare and send a netlink response message containing the
actual applied threshold instead of relying on the ACK?

> +}

-- 
Sashiko AI review ยท 
https://sashiko.dev/#/patchset/[email protected]?part=2

Reply via email to