Re: [PR] MINOR: Fix rate metric spikes [kafka]

via GitHub Wed, 22 May 2024 10:22:54 -0700


junrao commented on code in PR #15889:
URL: https://github.com/apache/kafka/pull/15889#discussion_r1608743345



##########
clients/src/main/java/org/apache/kafka/common/metrics/stats/SampledStat.java:
##########
@@ -50,10 +50,11 @@ public void record(MetricConfig config, double value, long timeMs) {
             sample = advance(config, timeMs);
         update(sample, config, value, timeMs);
         sample.eventCount += 1;
+        sample.lastEventMs = timeMs;
     }
 
     private Sample advance(MetricConfig config, long timeMs) {
-        this.current = (this.current + 1) % config.samples();
+        this.current = (this.current + 1) % (config.samples() + 1);

Review Comment:
   It would be useful to add a comment explaining why we keep one more sample 
than configured.
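
A minimal sketch of the index arithmetic in question, for illustration only. `SampleRing` and `visited` are hypothetical names, not part of Kafka; the sketch mirrors only the patched modulo expression. It shows that wrapping over `samples() + 1` makes the index rotate through one more slot than configured. One plausible reading, not confirmed in this thread, is that the extra slot keeps an older sample available while the newest window is only partially filled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the index arithmetic in SampledStat.advance();
// this is not the Kafka class.
final class SampleRing {
    private final int configuredSamples;
    private int current = 0;

    SampleRing(int configuredSamples) {
        this.configuredSamples = configuredSamples;
    }

    // Same arithmetic as the patched line: the index wraps over
    // configuredSamples + 1 slots instead of configuredSamples.
    int advance() {
        current = (current + 1) % (configuredSamples + 1);
        return current;
    }

    // Returns the slot indices produced by `steps` consecutive advance() calls.
    static List<Integer> visited(int configuredSamples, int steps) {
        SampleRing ring = new SampleRing(configuredSamples);
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < steps; i++) {
            indices.add(ring.advance());
        }
        return indices;
    }
}
```

With two configured samples, `SampleRing.visited(2, 4)` yields the slots 1, 2, 0, 1, so three slots are in rotation rather than two.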



##########
clients/src/main/java/org/apache/kafka/common/metrics/stats/Rate.java:
##########
@@ -72,29 +67,12 @@ public long windowSize(MetricConfig config, long now) {
         stat.purgeObsoleteSamples(config, now);
 
         /*
-         * Here we check the total amount of time elapsed since the oldest non-obsolete window.
-         * This give the total windowSize of the batch which is the time used for Rate computation.
-         * However, there is an issue if we do not have sufficient data for e.g. if only 1 second has elapsed in a 30 second
-         * window, the measured rate will be very high.
-         * Hence we assume that the elapsed time is always N-1 complete windows plus whatever fraction of the final window is complete.
-         *
-         * Note that we could simply count the amount of time elapsed in the current window and add n-1 windows to get the total time,
-         * but this approach does not account for sleeps. SampledStat only creates samples whenever record is called,
-         * if no record is called for a period of time that time is not accounted for in windowSize and produces incorrect results.
+         * Purging process above guarantees to keep all events starting from
+         * earliest(monitoredWindow start, oldestSample start). Use the largest as windowSize.
          */
-        long totalElapsedTimeMs = now - stat.oldest(now).lastWindowMs;
-        // Check how many full windows of data we have currently retained
-        int numFullWindows = (int) (totalElapsedTimeMs / config.timeWindowMs());
-        int minFullWindows = config.samples() - 1;
-
-        // If the available windows are less than the minimum required, add the difference to the totalElapsedTime
-        if (numFullWindows < minFullWindows)
-            totalElapsedTimeMs += (minFullWindows - numFullWindows) * config.timeWindowMs();
-
-        // If window size is being calculated at the exact beginning of the window with no prior samples, the window size
-        // will result in a value of 0. Calculation of rate over a window is size 0 is undefined, hence, we assume the
-        // minimum window size to be at least 1ms.
-        return Math.max(totalElapsedTimeMs, 1);
+        long monitoredWindow = config.timeWindowMs() * config.samples();

Review Comment:
   Hmm, this changes the existing logic a bit. The existing logic makes sure 
that we include at least config.samples() - 1 full windows. The last one could 
be partial.
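
To make the difference concrete, here is a rough sketch that restates the removed lines from the hunk above with plain parameters. `WindowSizeSketch` and its method names are hypothetical and stand in for `MetricConfig` and `SampledStat`; the hunk does not show how the patch combines `monitoredWindow` with the oldest sample start, so that part is deliberately left out.

```java
// Hypothetical helper that restates the removed lines with plain parameters
// in place of MetricConfig and SampledStat; not part of Kafka.
final class WindowSizeSketch {
    // Removed logic: elapsed time since the oldest sample, padded up to
    // (samples - 1) full windows, and never less than 1 ms.
    static long legacyWindowSize(long nowMs, long oldestWindowStartMs,
                                 long timeWindowMs, int samples) {
        long totalElapsedTimeMs = nowMs - oldestWindowStartMs;
        int numFullWindows = (int) (totalElapsedTimeMs / timeWindowMs);
        int minFullWindows = samples - 1;
        if (numFullWindows < minFullWindows)
            totalElapsedTimeMs += (minFullWindows - numFullWindows) * timeWindowMs;
        return Math.max(totalElapsedTimeMs, 1);
    }

    // The added line: the full configured span, samples * timeWindowMs.
    static long monitoredWindow(long timeWindowMs, int samples) {
        return timeWindowMs * samples;
    }
}
```

For example, 500 ms into the first of two 1-second windows, `legacyWindowSize(500, 0, 1000, 2)` gives 1500 ms (one assumed full window plus the partial one), while `monitoredWindow(1000, 2)` is 2000 ms.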



##########
clients/src/test/java/org/apache/kafka/common/metrics/stats/RateTest.java:
##########
@@ -64,4 +69,30 @@ public void testRateWithNoPriorAvailableSamples(int numSample, int sampleWindowS
         double expectedRatePerSec = sampleValue / windowSize;
         assertEquals(expectedRatePerSec, observedRate, EPS);
     }
+
+    // Record an event every 100 ms on average, moving some 1 ms back or forth for fine-grained
+    // window control. The expected rate, hence, is 10-11 events/sec depending on the moment of
+    // measurement. Start assertions from the second window.
+    @Test
+    public void testRateIsConsistentAfterTheFirstWindow() {
+        MetricConfig config = new MetricConfig().timeWindow(1, SECONDS).samples(2);
+        List<Integer> steps = Arrays.asList(0, 99, 100, 100, 100, 100, 100, 100, 100, 100, 100);
+
+        // start the first window and record events at 0,99,199,...,999 ms 
+        for (int stepMs : steps) {
+            time.sleep(stepMs);
+            rate.record(config, 1, time.milliseconds());
+        }
+
+        // making a gap of 100 ms between windows
+        time.sleep(101);
+
+        // start the second window and record events at 0,99,199,...,999 ms
+        for (int stepMs : steps) {
+            time.sleep(stepMs);
+            rate.record(config, 1, time.milliseconds());
+            double observedRate = rate.measure(config, time.milliseconds());

Review Comment:
   Yes, it's probably useful to assert that taking a second measurement with no 
time change leads to the same value. This is more for preventing future 
incorrect changes and it's also low overhead. 
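
One possible shape for such a check, sketched with a hypothetical helper (`RepeatableMeasure.measureTwice` is not an existing Kafka or JUnit method). It takes the measurement as a function of the timestamp, calls it twice with the same timestamp, and fails if the two results differ.

```java
import java.util.function.LongToDoubleFunction;

// Hypothetical test helper; the name and shape are illustrative only.
final class RepeatableMeasure {
    // Calls `measure` twice with the same timestamp and fails if the
    // second result differs from the first. Returns the measured value.
    static double measureTwice(LongToDoubleFunction measure, long nowMs) {
        double first = measure.applyAsDouble(nowMs);
        double second = measure.applyAsDouble(nowMs);
        if (Double.compare(first, second) != 0) {
            throw new AssertionError(
                "measurement changed without a time change: " + first + " vs " + second);
        }
        return first;
    }
}
```

In the test quoted above this could be used roughly as `measureTwice(now -> rate.measure(config, now), time.milliseconds())`, assuming `rate`, `config`, and `time` are the objects from that test.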



##########
clients/src/test/java/org/apache/kafka/common/metrics/stats/RateTest.java:
##########
@@ -64,4 +69,31 @@ public void testRateWithNoPriorAvailableSamples(int numSample, int sampleWindowS
         double expectedRatePerSec = sampleValue / windowSize;
         assertEquals(expectedRatePerSec, observedRate, EPS);
     }
+
+    // Record an event every 100 ms on average, moving some 1 ms back or forth for fine-grained
+    // window control. The expected rate, hence, is 10-11 events/sec depending on the moment of
+    // measurement. Start assertions from the second window. This test is to address past issue,
+    // when measurements in the end of the sample led to value spikes.

Review Comment:
   How about changing "This test is to address past issue, when measurements in 
the end of the sample led to value spikes." to something like "This test covers 
the case where a sample window partially overlaps with the monitored window."?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

