moonming commented on issue #12275: URL: https://github.com/apache/apisix/issues/12275#issuecomment-2933136673
@zhaoqiang1980 Thanks for the detailed report! This is a known issue. It occurs more often when the shared dictionary (prometheus-metrics) is configured with a relatively small size. Under high concurrency and with a large number of metrics, the shared dict becomes a hotspot and introduces lock contention.

The most straightforward mitigation is to increase the size of the shared dict to reduce contention.

I think a more robust solution would be to implement a graceful degradation mechanism in the prometheus plugin. For example, when it detects that the shared memory is full and lock contention is impacting performance, it could temporarily pause metrics collection for 5 minutes. This may result in the loss of some metrics, but it would prevent the CPU from hitting 100% and affecting overall system stability.

We’d love to hear what others think about this approach.
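
To make the degradation idea concrete, here is a minimal sketch in Python. It is not APISIX or plugin code: the class `PausableMetrics`, its `max_keys` capacity limit, and the injectable `clock` are all hypothetical names chosen for illustration, and a plain dict stands in for the shared dict. It models only the "store is full" trigger; detecting lock contention is left out.

```python
import time
from typing import Callable, Dict


class PausableMetrics:
    """Toy counter store that pauses all collection for a while once it is full.

    Hypothetical illustration only; a dict stands in for the shared dict.
    """

    def __init__(
        self,
        max_keys: int,
        pause_seconds: float = 300.0,  # the 5 minutes suggested above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_keys = max_keys
        self.pause_seconds = pause_seconds
        self.clock = clock
        self.counters: Dict[str, int] = {}
        self.paused_until = 0.0
        self.dropped = 0  # number of updates discarded so far

    def incr(self, key: str, value: int = 1) -> bool:
        """Increment a counter; return False if the update was dropped."""
        now = self.clock()

        # While paused, drop the update without touching the store at all.
        if now < self.paused_until:
            self.dropped += 1
            return False

        # A new key that does not fit means the store is full: start a pause.
        if key not in self.counters and len(self.counters) >= self.max_keys:
            self.paused_until = now + self.pause_seconds
            self.dropped += 1
            return False

        self.counters[key] = self.counters.get(key, 0) + value
        return True
```

The point of the design is that the hot path during a pause is a single timestamp comparison, so a full store stops generating work instead of being retried on every request. The trade-off is the one already mentioned: updates arriving during the pause are lost, which `dropped` makes visible.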
