haridsv opened a new pull request, #8120:
URL: https://github.com/apache/hbase/pull/8120
Tool: Claude Code
Model: Opus 4.6 (1M context)
Prompt: You need to diagnose the HBase test TestAsyncTableScan that is
flapping. It had multiple failures and errors in the PR validation build, but
none when I ran locally. The source code is at @../../src/apache/hbase/ and the
logs for good run is at @good-run and bad run is at @bad-run. Can you compare
the logs and try to understand why the run failed and look at the code for a
potential fix by focussing solely on these logs alone?
Summary of the fix by AI:
## Summary
**Root cause**: A race condition between span export and test span isolation
causes stale `TracedScanResultConsumer#onComplete` spans from a previous test
to appear in the current test's span exporter.
**The race**: In `TracedScanResultConsumer.onComplete()`, the
`TraceUtil.trace()` wrapper calls `delegate.onComplete()` (which sets
`finished=true` and calls `notifyAll()`) **before** calling `span.end()` (which
exports the span). The test thread can wake up from `getAll()`, finish the
current test, and start the next test (which calls `clearSpans()`). Only then
does the ForkJoinPool thread call `span.end()`, exporting the stale span into
the next test's freshly cleared exporter.
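
The interleaving described above can be reproduced deterministically in a small stand-alone sketch. This is not the HBase or OpenTelemetry code: the class `StaleSpanRaceSketch`, the `exporter` list, and the two latches are hypothetical stand-ins, and the latches force the preemption that on CI happens only by chance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

public class StaleSpanRaceSketch {
  // Hypothetical stand-in for the in-memory span exporter shared by all tests.
  static final List<String> exporter = new CopyOnWriteArrayList<>();

  /** Forces the bad interleaving and returns what the exporter holds afterwards. */
  static List<String> runForcedInterleaving() {
    exporter.clear();
    CountDownLatch scanFinished = new CountDownLatch(1);    // delegate.onComplete() ran
    CountDownLatch nextTestStarted = new CountDownLatch(1); // next test cleared spans

    // Plays the role of the ForkJoinPool thread running the traced callback.
    Thread callback = new Thread(() -> {
      scanFinished.countDown();          // delegate.onComplete(): wakes the test thread
      try {
        nextTestStarted.await();         // simulated preemption between the two calls
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      exporter.add("TracedScanResultConsumer#onComplete (test 1)"); // span.end()
    });
    callback.start();

    try {
      scanFinished.await();              // test 1: getAll() returns, test 1 finishes
      exporter.clear();                  // test 2 setup: clearSpans()
      nextTestStarted.countDown();       // the callback thread resumes only now
      callback.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return new ArrayList<>(exporter);
  }

  public static void main(String[] args) {
    System.out.println(runForcedInterleaving());
  }
}
```

In this sketch the exporter ends up holding the span from test 1 even though test 2 cleared it, which is the stale span the failing assertions picked up.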
**Why it flaps**: The race is timing-dependent. On CI (slower machines), the
ForkJoinPool thread is more likely to be preempted between
`delegate.onComplete()` and `span.end()`, giving the test thread time to
proceed to the next test. Locally (faster machines), the window is usually too
small to hit.
**Why only `scan=batch`**: The batch scan mode produces more results (each
result is split in two), which makes each test slightly longer and increases
the chance that ForkJoinPool thread scheduling produces the interleaving
described above.
**The fix**: Modified `assertTraceContinuity()` and `assertTraceError()` in
`TestAsyncTableScan.java` to include `hasParentSpanId(scanOperationSpanId)` in
the matchers used for both filtering and `waitForSpan`. This ensures the
assertions only consider spans that belong to the current test's SCAN
operation, ignoring any stale spans from previous tests. The `waitForSpan` with
the more specific matcher also correctly waits for the current test's span
rather than being satisfied by a stale one.
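
The idea behind the stricter matcher can be sketched without the test framework. The code below is an illustration under stated assumptions, not the actual change: `ParentSpanFilterSketch`, its `Span` class, and the span IDs are hypothetical, and a plain loop stands in for the `hasParentSpanId(scanOperationSpanId)` matcher used in the real test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParentSpanFilterSketch {
  /** Hypothetical, simplified stand-in for an exported span. */
  static final class Span {
    final String name;
    final String spanId;
    final String parentSpanId;

    Span(String name, String spanId, String parentSpanId) {
      this.name = name;
      this.spanId = spanId;
      this.parentSpanId = parentSpanId;
    }
  }

  static final String ON_COMPLETE = "TracedScanResultConsumer#onComplete";

  /** Keeps only onComplete spans whose parent is the given SCAN operation span. */
  static List<Span> onCompleteSpansFor(List<Span> exported, String scanOperationSpanId) {
    List<Span> matches = new ArrayList<>();
    for (Span s : exported) {
      // Matching on the name alone would also accept a stale span from an earlier test.
      if (ON_COMPLETE.equals(s.name) && scanOperationSpanId.equals(s.parentSpanId)) {
        matches.add(s);
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    List<Span> exported = Arrays.asList(
      new Span(ON_COMPLETE, "stale-1", "scan-test-1"),    // leaked from the previous test
      new Span(ON_COMPLETE, "current-1", "scan-test-2")); // belongs to the current test
    List<Span> matches = onCompleteSpansFor(exported, "scan-test-2");
    System.out.println(matches.size() + " match: " + matches.get(0).spanId);
  }
}
```

Because the parent span ID is unique to each SCAN operation, a leaked span from the previous test no longer satisfies the predicate, so it can neither fail the assertion nor end the wait early.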