Wubabalala opened a new pull request, #16167:
URL: https://github.com/apache/dubbo/pull/16167

   ## What is the purpose of the change?
   
   Fixes #16148
   
   Dynamically added metric series in `MetricsNameCountSampler` subclasses 
(e.g., `ThreadRejectMetricsCountSampler`, `ErrorCodeSampler`) are not exported 
to Prometheus when the first event arrives after the initial reporter sync 
cycle.
   
   ## Root Cause Analysis
   
   `MetricsNameCountSampler.samplesChanged` is initialized to `true` and 
consumed by the reporter's first `calSamplesChanged()` poll (CAS `true → 
false`). At that point, `sample()` returns nothing because no actual metric 
series exist yet.
   
   When the first real event arrives later (e.g., a thread pool rejection), 
`SimpleMetricsCountSampler.inc()` creates a new metric series entry in 
`metricCounter`. However, `MetricsNameCountSampler.samplesChanged` is only 
updated during metric name registration — it is never set back to `true` when a 
new metric series is first created at runtime. The reporter sees `false` on 
subsequent polls and never re-registers the new metric series to the Prometheus 
registry.
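
The sequence above can be reproduced with a minimal sketch. `LateSeriesBugSketch` is a hypothetical, simplified stand-in written for illustration, not the actual Dubbo class. It only borrows the names `samplesChanged`, `metricCounter`, `calSamplesChanged()` and `inc()` from the description, and `seriesCount()` is an invented helper.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the pre-fix sampler; not the real Dubbo code.
class LateSeriesBugSketch {
    // Starts as true so the reporter registers whatever exists at startup.
    private final AtomicBoolean samplesChanged = new AtomicBoolean(true);
    private final Map<String, AtomicLong> metricCounter = new ConcurrentHashMap<>();

    // Reporter poll: consumes the flag with a CAS from true to false.
    public boolean calSamplesChanged() {
        return samplesChanged.compareAndSet(true, false);
    }

    // Pre-fix behavior: creates the series on first use but never raises the flag.
    public void inc(String seriesKey) {
        metricCounter.computeIfAbsent(seriesKey, k -> new AtomicLong()).incrementAndGet();
    }

    // Illustrative helper: number of series a sample() call would have to export.
    public int seriesCount() {
        return metricCounter.size();
    }

    public static void main(String[] args) {
        LateSeriesBugSketch sampler = new LateSeriesBugSketch();

        // First reporter poll: flag is consumed while no series exist yet.
        System.out.println(sampler.calSamplesChanged()); // true
        System.out.println(sampler.seriesCount());       // 0

        // First real event arrives after the poll.
        sampler.inc("thread.pool.reject");

        // Later polls see false, so the new series is never re-registered.
        System.out.println(sampler.calSamplesChanged()); // false
        System.out.println(sampler.seriesCount());       // 1
    }
}
```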
   
   ## Brief changelog
   
   - **`SimpleMetricsCountSampler`**: Refactored `getAtomicCounter()` into 
`incrementAndGetCreated()` which returns `true` when a new metric series is 
created for the first time (detected via reference equality against the 
candidate `AtomicLong`).
   - **`MetricsNameCountSampler`**: Overrode `inc()` to call 
`incrementAndGetCreated()` and set `samplesChanged = true` only when a new 
metric series is created. This avoids unnecessary re-registration for updates 
to already-registered series. Also made `addMetricName()` idempotent.
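
A rough sketch of the idea behind these two changes follows. The class names `CountSamplerSketch` and `NameCountSamplerSketch`, the string key, and the `count()` helper are hypothetical. Using `computeIfAbsent` to obtain the reference to compare is an assumption about how the reference-equality check could be done, and the actual PR code may differ.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for SimpleMetricsCountSampler.
class CountSamplerSketch {
    protected final Map<String, AtomicLong> metricCounter = new ConcurrentHashMap<>();

    // Increments the series and reports whether this call created it.
    public boolean incrementAndGetCreated(String seriesKey) {
        AtomicLong candidate = new AtomicLong();
        // The map keeps the candidate only if the key was absent.
        AtomicLong actual = metricCounter.computeIfAbsent(seriesKey, k -> candidate);
        actual.incrementAndGet();
        // Reference equality: true only for the call whose candidate was stored.
        return actual == candidate;
    }

    public void inc(String seriesKey) {
        incrementAndGetCreated(seriesKey);
    }

    // Illustrative helper for reading a counter value.
    public long count(String seriesKey) {
        AtomicLong counter = metricCounter.get(seriesKey);
        return counter == null ? 0L : counter.get();
    }
}

// Hypothetical stand-in for MetricsNameCountSampler after the fix.
class NameCountSamplerSketch extends CountSamplerSketch {
    private final AtomicBoolean samplesChanged = new AtomicBoolean(true);

    @Override
    public void inc(String seriesKey) {
        // Raise the flag only when a brand-new series appeared.
        if (incrementAndGetCreated(seriesKey)) {
            samplesChanged.set(true);
        }
    }

    public boolean calSamplesChanged() {
        return samplesChanged.compareAndSet(true, false);
    }

    public static void main(String[] args) {
        NameCountSamplerSketch sampler = new NameCountSamplerSketch();
        System.out.println(sampler.calSamplesChanged()); // true  (initial flag consumed)
        sampler.inc("error.code.A");
        System.out.println(sampler.calSamplesChanged()); // true  (first late event)
        sampler.inc("error.code.A");
        System.out.println(sampler.calSamplesChanged()); // false (existing series)
        sampler.inc("error.code.B");
        System.out.println(sampler.calSamplesChanged()); // true  (new series)
    }
}
```

Because `ConcurrentHashMap.computeIfAbsent` stores at most one value per absent key, only one of several racing callers can observe its own candidate, so the flag is raised once per new series in this sketch.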
   
   ## How to verify
   
   **Thread pool reject path:**
   
   1. Start a Dubbo 3.3.x provider with `dubbo.metrics.enable-threadpool: true` 
and `dubbo.protocol.threads: 2`.
   2. Wait 3+ minutes (so the initial `samplesChanged` flag is consumed).
   3. Send enough concurrent requests to trigger thread pool exhaustion.
   4. Check `/actuator/prometheus` for `dubbo_thread_pool_reject_thread_count` 
— it should now appear.
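
For step 1, the two properties might look like this in a Spring Boot `application.yml`. This assumes the provider is configured through Spring Boot, which the `/actuator/prometheus` endpoint in step 4 suggests. Registry, protocol name, and actuator exposure settings are omitted.

```yaml
# Fragment only: the two settings named in step 1, in nested YAML form.
dubbo:
  metrics:
    enable-threadpool: true
  protocol:
    threads: 2
```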
   
   **Error code path** is covered by the unit test 
`ErrorCodeSampleTest.testErrorCodeMetricChangesAfterFirstLateEvent`, which 
verifies the same timing sequence with error code events.
   
   ## New tests
   
   - `ErrorCodeSampleTest.testErrorCodeMetricChangesAfterFirstLateEvent`: 
Verifies the exact timing sequence — initial flag consumed, then first error 
event sets flag, repeat of same error does not, new error code sets flag again.
   - `PrometheusMetricsThreadPoolTest.testThreadPoolRejectMetricsExportedAfterLateFirstEvent`: 
End-to-end test: simulates the full startup → first poll → late event 
timeline and asserts that the Prometheus scrape output contains the late-arriving 
reject metric series.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

