gnodet commented on PR #2971: URL: https://github.com/apache/cxf/pull/2971#issuecomment-4135324651
_Claude Code on behalf of Guillaume Nodet_ @reta Thanks for the clarification about retries vs. silent ignoring. I investigated the root cause of why these tests can fail. Here's what I found: ### Root Cause 1: Thread-safety bug in `TestSpanHandler` (Brave/Micrometer) `TestSpanHandler.SPANS` ([source](https://github.com/apache/cxf/blob/main/systests/tracing/src/test/java/org/apache/cxf/systest/brave/TestSpanHandler.java#L29)) uses a plain `ArrayList`: ```java private static final List<MutableSpan> SPANS = new ArrayList<>(12); ``` The client span is added from the main thread (via `BraveClientStartInterceptor.handleFault()`) and the server span is added from the server thread (via `BraveProvider` ContainerResponseFilter). Without synchronization: - `ArrayList.add()` from two threads can lose an element - `ArrayList.size()` polled by Awaitility may never see the update due to JMM visibility This affects `MicrometerTracingTest` too since it extends `AbstractBraveTracingTest` and uses the same `TestSpanHandler`. **Fix**: Change to `CopyOnWriteArrayList` (or `Collections.synchronizedList()`). Writes are rare (span completions), reads frequent (Awaitility polling), making COW ideal. ### Root Cause 2: Timing under CI load (all implementations) The `testThatNewSpanIsCreatedOnClientTimeout` test sets a 100ms client timeout against a server endpoint that does `Thread.sleep(500)`. The server span is recorded ~400-500ms after the client span. Under CI load with CPU starvation and GC pauses, this 500ms sleep can take significantly longer. The OTel tests use `InMemorySpanExporter` (thread-safe), so their failures are purely timing-related. This explains why bumping the Awaitility timeout helps but doesn't eliminate failures. ### Verified: Both spans ARE properly ended I traced through the full chain and confirmed that both client and server spans are correctly ended: - **Client span**: `MessageSenderEndingInterceptor` catches `SocketTimeoutException`, wraps it in `Fault` (RuntimeException), `PhaseInterceptorChain.doIntercept()` catches it, calls `unwind()` → `handleFault()` on `BraveClientStartInterceptor` → span is ended - **Server span**: The `ContainerResponseFilter` runs *before* response entity writing, so even if the client disconnected, the server span is ended before any I/O error ### Suggested improvement Making `TestSpanHandler` thread-safe should eliminate the Brave/Micrometer failures. The OTel timing-related failures would still need the increased timeout (or a redesign of the test to reduce the timing gap, e.g. using a shorter `Thread.sleep` in `BookStore.getBooksLong()`). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
