navina commented on PR #9800: URL: https://github.com/apache/pinot/pull/9800#issuecomment-1317817172
> I'm not sure running the periodic task every minute or so is a good idea!

Agreed here :) We will likely not set it to query every minute.

> If we choose to emit the metric on the server side, then we can change the gauge as soon as the events are consumed. It's just up to the metric & monitoring system (outside pinot) to aggregate the metric values (e.g. finding max value) for different replicas of each partition.

Agreed that we can detect it sooner, but there doesn't seem to be a good way to aggregate it in the monitoring layer in the presence of a rebalance (clean or unclean) or when consuming segments are redistributed for any other reason. We have also noted that sometimes all consuming segments get into the ERROR state (maybe the consumer crashed or hung) and yet the monitoring metric `LLC_PARTITION_CONSUMING` doesn't detect it (@npawar may have more context). Moreover, adding a metric in the segment data manager feels like tiptoeing across a minefield. A much cleaner way would be to emit the metric at the partition level, either directly from the connector plugin or from the server (tagged with a stable replica id rather than the server tag). I believe there are some dependency issues to be sorted out before getting there.

> I believe we do invoke the code to remove a metric each time a partition completes consumption.

This works well in a stable state and with clean operations. But it doesn't cover unclean shutdowns or crashes in production, and it has generally been observed to be unreliable.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
