Hi PoAn,

Thanks for the KIP! I was actually planning on including the paused metric as 
part of KIP-1307 so thanks for taking it on - I’m interested in seeing it land. 
I have a chart that tracks the current assigned-partitions alongside a custom
gauge, (config, nowMs) -> consumer.paused().size(), but I wanted to push this
upstream. It may be worth noting under Rejected Alternatives that there are
others like me in the community working around this with custom metrics, so
there is value in having it in the main client.

Some thoughts, one of which I see Chia-Ping has already highlighted.

AK1: Naming.
AK1_1: Like Chia-Ping suggested, we should converge on paused-* for the metric
names because it makes searches easier. On that note, I think we should name it
paused-partitions to align with its counterpart assigned-partitions.
AK1_2: partition-paused, which denotes a boolean, could be named
paused-partitions-state with values 1/0.
AK1_3: partition-paused-time-ms could be paused-partitions-time-seconds rather
than millis. I think millis is too fine a granularity for what are essentially
rebalance-like events, and pause()/resume() are potentially only called between
poll() calls. We have precedent with last-polled-seconds-ago and
last-heartbeat-seconds-ago.
AK1_4: Use the plural everywhere, because even per-partition metrics like
records-lag and records-lead-min use the plural, tagged with
partition="{partition}", topic="{topic}", etc.

AK2: I think the paused-partitions metric should be in the 
consumer-coordinator-metrics group along with assigned-partitions, rather than 
consumer-metrics.

AK3: Flapping
One scenario I was thinking of was flapping. I have a delayed retry consumer
that checks the head of a queue to see whether it’s time to process it; if not,
I pause the partition until it’s time to pick it up again. But if this happens
frequently enough, there might be a bug where I constantly alternate between
pause() and resume().

I think 4 more metrics might help here: a meter sensor with
paused-partitions-rate and paused-partitions-total at INFO, plus per-partition
versions of both at DEBUG. -rate tracks any increase in the per-second rate of
pause() calls, and -total supports cumulative analysis after the fact.
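
To make AK3 concrete, here is a rough, self-contained Java sketch of what the
-rate and -total pair would report. PauseMeter, its trailing window, and its
method names are hypothetical and only for illustration; this is not Kafka's
Sensor/Meter code, and the real metrics would presumably be built on the
existing client metrics classes.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical stand-in for a meter sensor (NOT Kafka's implementation):
 * counts paused partitions and exposes a cumulative total plus a
 * per-second rate over a fixed trailing window.
 */
final class PauseMeter {
    private final long windowMs;
    // One timestamp per paused partition that may still fall inside the window.
    private final Deque<Long> recent = new ArrayDeque<>();
    private long total = 0;

    PauseMeter(long windowMs) {
        this.windowMs = windowMs;
    }

    /** Record that `count` partitions were paused at time nowMs. */
    void record(int count, long nowMs) {
        for (int i = 0; i < count; i++) {
            recent.addLast(nowMs);
        }
        total += count;
    }

    /** Analogue of paused-partitions-total: everything recorded so far. */
    long total() {
        return total;
    }

    /** Analogue of paused-partitions-rate: pauses/sec over the trailing window. */
    double rate(long nowMs) {
        // Drop entries that have aged out of the window ending at nowMs.
        while (!recent.isEmpty() && recent.peekFirst() <= nowMs - windowMs) {
            recent.removeFirst();
        }
        return recent.size() / (windowMs / 1000.0);
    }
}
```

With a 10s window, four pauses recorded within 2 seconds give a total of 4 and
a rate of 0.4/sec. A flapping consumer would show the rate staying elevated
while the total keeps climbing.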

AK4: It might be worth adding a note on cardinality; I see we’re very sparing
with per-partition metrics (only 7 of them so far).

Thanks,
Aditya

On 2026/04/06 11:03:35 PoAn Yang wrote:
> Hello everyone,
> 
> I would like to start a discussion thread on KIP-1304. In this KIP, we plan 
> to add new consumer metrics about paused partitions.
> 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1304%3A+Add+consumer+metric+about+paused+partitions
> 
> Please take a look and feel free to share any thoughts.
> 
> Thanks,
> PoAn
