[I] flaky TableChangeWatcherTest.testSchemaChanges due to CuratorCache initial sync race [fluss]

via GitHub Wed, 25 Mar 2026 18:47:17 -0700


platinumhamburg opened a new issue, #2935:
URL: https://github.com/apache/fluss/issues/2935


   ### Search before asking
   
   - [x] I searched in the [issues](https://github.com/apache/fluss/issues) and 
found nothing similar.
   
   
   ### Fluss version
   
   0.9.0 (latest release)
   
   ### Please describe the bug 🐞
   
   ## Description
   
   `TableChangeWatcherTest.testSchemaChanges` intermittently fails on CI with a 
missing `CreateTableEvent` for one of the 10 created tables.
   
   ## Observed Failure
   
   ```
   Expecting actual:
     [SchemaChangeEvent{table_0}, CreateTableEvent{table_0}, ...,
      SchemaChangeEvent{table_4}, SchemaChangeEvent{table_5}, 
CreateTableEvent{table_5}, ...]
   but could not find:
     [CreateTableEvent{table_4}]
   ```
   
   Key observations from CI log:
   - 20 `SchemaChangeEvent`s received (expected 10) — table_4 has a 
**duplicate** SchemaChangeEvent
   - 19 `CreateTableEvent`s received (expected 10) — table_4's 
`CreateTableEvent` is **missing**
   - The missing event is replaced by an extra `SchemaChangeEvent`, suggesting 
stale data interference
   
   ## Root Cause Hypothesis
   
   `testTableChanges` and `testSchemaChanges` both create tables named 
`table_0` through `table_9`. In `beforeEach`, the database is dropped and 
recreated, and a new `TableChangeWatcher` (backed by `CuratorCache`) is 
started. However, `CuratorCache.start()` performs an asynchronous initial tree 
scan.
   
   If the initial sync has not completed before new tables are created, the 
cache may hold stale `oldData` from the previous test's ZK nodes. When the new 
table_4 node is written, `CuratorCache` fires `NODE_CHANGED` with non-empty 
`oldData` (from the stale cache), causing the event to be routed to 
`processTableRegistrationChange()` instead of `processCreateTable()`:
   
   ```java
   // TableChangeWatcher.java:134-140
   if (oldData != null && oldData.getData() != null && oldData.getData().length 
> 0) {
       processTableRegistrationChange(tablePath, newData);  // ← wrong branch
   } else {
       processCreateTable(tablePath, newData);  // ← correct branch
   }
   ```
   
   This explains why:
   - Only 1 table is affected (depends on timing of initial sync vs table 
creation)
   - `SchemaChangeEvent` is duplicated (initial sync picks up stale schema node)
   - `CreateTableEvent` is lost (routed to wrong handler)
   
   ## Reproducibility
   
   This race cannot be reliably reproduced locally (50/50 passes) because local 
ZooKeeper is fast enough that `CuratorCache` initial sync completes before any 
table creation begins. The race only manifests on slow CI hosts with shared 
resources.
   
   
   ### Solution
   
   - Add `awaitInitialized()` to `TableChangeWatcher` using 
`CuratorCacheListener.initialized()` callback
   - Call `awaitInitialized()` in test `beforeEach` after `start()` to ensure 
cache is fully synced
   - Clear stale events after initial sync completes
   
   ### Are you willing to submit a PR?
   
   - [x] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] flaky TableChangeWatcherTest.testSchemaChanges due to CuratorCache initial sync race [fluss]

Reply via email to