haridsv opened a new pull request, #8120:
URL: https://github.com/apache/hbase/pull/8120

   Tool: Claude Code
   Model: Opus 4.6 (1M context)
   Prompt: You need to diagnose the HBase test TestAsyncTableScan that is 
flapping. It had multiple failures and errors in the PR validation build, but 
none when I ran locally. The source code is at @../../src/apache/hbase/ and the 
logs for good run is at @good-run and bad run is at @bad-run. Can you compare 
the logs and try to understand why the run failed and look at the code for a 
potential fix by focussing solely on these logs alone?
   
   Summary of the fix by AI:
   
    ## Summary
   
   **Root cause**: A race condition between span export and test span isolation 
causes stale `TracedScanResultConsumer#onComplete` spans from a previous test 
to appear in the current test's span exporter.
   
   **The race**: In `TracedScanResultConsumer.onComplete()`, the 
`TraceUtil.trace()` wrapper calls `delegate.onComplete()` (which sets 
`finished=true` and calls `notifyAll()`) **before** calling `span.end()` (which 
exports the span). The test thread can therefore wake up from `getAll()`, 
finish the current test, and start the next test (which calls `clearSpans()`) 
before the ForkJoinPool thread calls `span.end()`. That late call exports the 
stale span into the next test's freshly cleared exporter.
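
   The ordering described above can be forced deterministically in a small, 
self-contained sketch. It does not use the HBase or OpenTelemetry classes: the 
`EXPORTED` list, the two latches, and the span name string are hypothetical 
stand-ins for the shared exporter, the `finished`/`notifyAll()` handshake, and 
the preemption window respectively.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

public class StaleSpanRaceSketch {
  // Hypothetical stand-in for the in-memory span exporter shared by all tests.
  static final List<String> EXPORTED = new CopyOnWriteArrayList<>();

  static List<String> reproduce() {
    EXPORTED.clear();
    // Stands in for delegate.onComplete() setting finished=true and notifying.
    CountDownLatch finished = new CountDownLatch(1);
    // Forces the unlucky scheduling: the callback thread stalls before span.end().
    CountDownLatch preempted = new CountDownLatch(1);

    Thread callbackThread = new Thread(() -> {
      finished.countDown(); // delegate.onComplete(): wakes the test thread
      try {
        preempted.await(); // simulated preemption between the two calls
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      EXPORTED.add("onComplete[test-1]"); // span.end(): the export happens late
    });
    callbackThread.start();

    try {
      finished.await(); // test 1: getAll() returns and the test finishes
      EXPORTED.clear(); // test 2: starts by clearing the exporter
      preempted.countDown(); // the callback thread resumes only now
      callbackThread.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    // Whatever is left is what test 2 would observe in its "clean" exporter.
    return new ArrayList<>(EXPORTED);
  }

  public static void main(String[] args) {
    System.out.println(reproduce()); // prints [onComplete[test-1]]
  }
}
```

   In the real test the stall is not forced, so the stale span shows up only 
when the scheduler happens to pause the callback thread in that window.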
   
   **Why it flaps**: The race is timing-dependent. On CI (slower machines), the 
ForkJoinPool thread is more likely to be preempted between 
`delegate.onComplete()` and `span.end()`, giving the test thread time to 
proceed to the next test. Locally (faster machines), the window is usually too 
small to hit.
   
   **Why only `scan=batch`**: The batch scan mode produces more results (each 
result is split in two), making each test slightly longer and increasing the 
probability that the ForkJoinPool thread scheduling creates the right timing 
for the race.
   
   **The fix**: Modified `assertTraceContinuity()` and `assertTraceError()` in 
`TestAsyncTableScan.java` to include `hasParentSpanId(scanOperationSpanId)` in 
the matchers used for both filtering and `waitForSpan`. This ensures the 
assertions only consider spans that belong to the current test's SCAN 
operation, ignoring any stale spans from previous tests. The `waitForSpan` with 
the more specific matcher also correctly waits for the current test's span 
rather than being satisfied by a stale one.
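
   The effect of the extra parent-span condition can be illustrated with a 
minimal sketch. It uses plain Java rather than the Hamcrest matchers in the 
actual test; the `Span` class, the sample span IDs, and the 
`matchingSpanIds` helper are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParentSpanFilterSketch {
  // Hypothetical, simplified span: just a name, its own ID, and its parent's ID.
  static final class Span {
    final String name;
    final String spanId;
    final String parentSpanId;

    Span(String name, String spanId, String parentSpanId) {
      this.name = name;
      this.spanId = spanId;
      this.parentSpanId = parentSpanId;
    }
  }

  static final String ON_COMPLETE = "TracedScanResultConsumer#onComplete";

  // Sample exporter contents: one stale span from the previous test's scan
  // ("scan-1") and one span that belongs to the current test's scan ("scan-2").
  static final List<Span> EXPORTED = Arrays.asList(
    new Span(ON_COMPLETE, "c1", "scan-1"),
    new Span(ON_COMPLETE, "c2", "scan-2"));

  // Name-only matching, as before the fix: both spans qualify.
  static long countByNameOnly() {
    return EXPORTED.stream().filter(s -> s.name.equals(ON_COMPLETE)).count();
  }

  // Name plus parent-span matching, in the spirit of hasParentSpanId(...):
  // only spans parented by the current SCAN operation qualify.
  static List<String> matchingSpanIds(String scanOperationSpanId) {
    return EXPORTED.stream()
      .filter(s -> s.name.equals(ON_COMPLETE))
      .filter(s -> scanOperationSpanId.equals(s.parentSpanId))
      .map(s -> s.spanId)
      .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(countByNameOnly()); // prints 2
    System.out.println(matchingSpanIds("scan-2")); // prints [c2]
  }
}
```

   Because the stale span has a different parent, it no longer matches, and a 
wait built on the same condition keeps waiting until the current test's own 
span arrives.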

