Hi PoAn,
Thanks for the KIP! I was actually planning on including the paused metric as
part of KIP-1307 so thanks for taking it on - I’m interested in seeing it land.
I have a chart that tracks the current assigned-partitions alongside a custom
gauge defined as (config, nowMs) -> consumer.paused().size(), but I wanted to
push this upstream. It may be worth noting under Rejected Alternatives that
others in the community already maintain custom metrics like this, which shows
there is value in having it in the main client.
Some thoughts, one of which I see Chia-Ping has already highlighted.
AK1: Naming.
AK1_1: As Chia-Ping suggested, we should converge on paused-* for the metric
names because it makes searches easier. On that note, I think we should name
the count paused-partitions to align with its counterpart assigned-partitions.
AK1_2: partition-paused, which denotes a boolean, could be named
paused-partitions-state with values 1/0.
AK1_3: partition-paused-time-ms could be paused-partitions-time-seconds rather
than millis. I think millis is too fine a granularity for what are essentially
rebalance-like events that are potentially only triggered between poll() calls.
We have precedent with last-poll-seconds-ago and last-heartbeat-seconds-ago.
AK1_4: Plural everywhere, because even per-partition metrics like records-lag
and records-lead-min use the plural, with partition="{partition}",
topic="{topic}" etc.
AK2: I think the paused-partitions metric should be in the
consumer-coordinator-metrics group along with assigned-partitions, rather than
consumer-metrics.
AK3: Flapping
One scenario I was thinking of was flapping. I have a delayed-retry consumer
that checks the head of a queue to see whether it’s time to process it; if not,
I pause the partition until it’s time to pick it up again. If that happens
frequently enough, a bug could leave me constantly bouncing between pause() and
resume().
I think 4 more metrics might help here: a meter sensor with
paused-partitions-rate and paused-partitions-total at INFO, plus DEBUG
per-partition equivalents; -rate to track any increase in the per-second rate
of pause() calls, and -total for cumulative analysis after the fact.
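To make the rate/total idea concrete, here is a rough sketch in plain Java with
no Kafka dependency. PauseFlapTracker and its methods are hypothetical names of
mine, not anything in the client or the KIP, and the rate here is a simple
lifetime average rather than a windowed rate, so treat it as an illustration of
the bookkeeping only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, NOT part of the Kafka client: counts pause() calls
// overall and per partition so that flapping shows up as a rising rate.
final class PauseFlapTracker {
    private final Map<String, Long> pausesByPartition = new HashMap<>();
    private final long startMs;
    private long total;

    PauseFlapTracker(long startMs) {
        this.startMs = startMs;
    }

    // Call once for every partition passed to pause().
    void recordPause(String topicPartition) {
        total++;
        pausesByPartition.merge(topicPartition, 1L, Long::sum);
    }

    // Analogue of the proposed paused-partitions-total.
    long total() {
        return total;
    }

    // Analogue of the proposed per-partition (DEBUG) total.
    long totalFor(String topicPartition) {
        return pausesByPartition.getOrDefault(topicPartition, 0L);
    }

    // Analogue of paused-partitions-rate: pauses per second since startMs.
    // Assumption: lifetime average, not a sampled window.
    double ratePerSecond(long nowMs) {
        long elapsedMs = Math.max(1L, nowMs - startMs);
        return total * 1000.0 / elapsedMs;
    }
}
```

A healthy delayed-retry consumer should show a low, steady rate; a partition
whose per-partition total climbs much faster than its peers is the flapping
case I would want to alert on.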
AK4: It might be worth adding a note on cardinality; I see we’re very sparing
with per-partition metrics (only 7 of them so far).
Thanks,
Aditya
On 2026/04/06 11:03:35 PoAn Yang wrote:
> Hello everyone,
>
> I would like to start a discussion thread on KIP-1304. In this KIP, we plan
> to add new consumer metrics about paused partitions.
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1304%3A+Add+consumer+metric+about+paused+partitions
>
> Please take a look and feel free to share any thoughts.
>
> Thanks,
> PoAn