klsince opened a new pull request, #15420: URL: https://github.com/apache/pinot/pull/15420
While committing large upsert segments, we often found that the new consuming thread got stuck waiting on `snapshotWLock`. After looking into the current use of `snapshotRWLock` and snapshots, it seems safe to update snapshots without taking the lock at all (or at least doing so doesn't make things worse).

Threads and operations that take the snapshot RW lock today:

1. The new consuming thread takes the WLock and then updates the on-disk snapshots (it is the single writer of on-disk snapshots).
2. Other threads (Helix threads or the previous consuming thread) take the RLock to run the addSegment/removeSegment/replaceSegment methods, i.e. they potentially update the in-memory validDocIds bitmaps, but they don't update on-disk snapshots.

So for those segment operations:

1. `removeSegment` doesn't need the RLock, as removing a segment doesn't change other segments' validDocIds bitmaps.
2. `replaceSegment` and `addSegment` may change other segments' validDocIds bitmaps. But it should be safe for the new consuming thread to take snapshots concurrently while either method is ongoing, because the new segments processed by `replaceSegment` and `addSegment` will be loaded on the server eventually, so on-disk snapshots eventually get in sync with the segments' in-memory bitmaps. If the server crashes now and restarts, it'll continue to load the new segments, so the table partition won't miss valid docs. And before that, on-disk snapshots never have fewer valid docs than the in-memory bitmaps, as required by the minion compaction task and the segment preloading feature.

What if the new segments get removed before being fully loaded? Without the snapshot lock, the existing segments' in-memory bitmaps and on-disk snapshots may see fewer valid docs than before, because their docs can get invalidated by a newly added segment that is removed immediately afterward. But even with the snapshot lock today, removing a segment this way can cause a similar issue: the existing segments' in-memory bitmaps see fewer valid docs, while their on-disk snapshots still see the same valid docs because the snapshot update is blocked by the lock. Since the in-memory bitmaps have already changed, queries already see fewer valid docs, so the damage is done whether the lock is used or not. Restarting servers to recompute the segments' bitmaps fixes the issue in both cases.
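To illustrate the contention this PR describes, here is a minimal Java sketch of the current locking scheme. It is not Pinot's actual code: only the lock name `snapshotRWLock` comes from the description above, while the class `SnapshotLockSketch` and its methods are hypothetical stand-ins. The sketch assumes the snapshot writer needs the write lock and that segment operations hold the read lock for their whole duration, and it uses a non-blocking `tryLock` so the effect can be observed without hanging.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class SnapshotLockSketch {
  // Stand-in for the snapshot read-write lock discussed in the PR.
  private final ReentrantReadWriteLock snapshotRWLock = new ReentrantReadWriteLock();

  /** Hypothetical snapshot writer: returns true only if the write lock is free right now. */
  public boolean tryTakeSnapshot() {
    if (!snapshotRWLock.writeLock().tryLock()) {
      return false; // a segment operation currently holds the read lock
    }
    try {
      // Persisting validDocIds snapshots to disk would happen here.
      return true;
    } finally {
      snapshotRWLock.writeLock().unlock();
    }
  }

  /** Hypothetical segment operation (add/replace/remove) running under the read lock. */
  public void withSegmentOperation(Runnable body) {
    snapshotRWLock.readLock().lock();
    try {
      body.run();
    } finally {
      snapshotRWLock.readLock().unlock();
    }
  }

  /** Returns true if a consuming thread fails to get the write lock during a segment operation. */
  public static boolean writerBlockedDuringSegmentOp() {
    SnapshotLockSketch sketch = new SnapshotLockSketch();
    boolean[] acquired = new boolean[1];
    sketch.withSegmentOperation(() -> {
      // Simulate the new consuming thread trying to snapshot while the operation is ongoing.
      Thread consumer = new Thread(() -> acquired[0] = sketch.tryTakeSnapshot());
      consumer.start();
      try {
        consumer.join(); // join makes the consumer's write to acquired[0] visible here
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    return !acquired[0];
  }

  public static void main(String[] args) {
    System.out.println("blocked during segment op: " + writerBlockedDuringSegmentOp());
    System.out.println("free when idle: " + new SnapshotLockSketch().tryTakeSnapshot());
  }
}
```

With a blocking `lock()` in place of `tryLock()`, the consuming thread would instead wait until the segment operation releases the read lock, which is the stall that large segment commits make visible.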