tlrx commented on PR #16086: URL: https://github.com/apache/lucene/pull/16086#issuecomment-4497386359
Thanks @romseygeek and @mikemccand for your feedback! Before addressing code review comments, I'll try to reply to the higher-level questions: > I'm not convinced we understand the root cause here, so I think we may be fixing a "not actually a problem". At Amazon we've also seen some evidence that merge abort was insanely slow, but somehow never followed up on it / got to root cause. Can we first spend some time back on the issue doing that? > It's hard for me to believe `checkIntegrity` can take minutes on any Lucene index file unless the storage system holding the index is insanely slow. Or, the index is insanely massive. Both are possible :) Yes, the storage systems were indeed insanely slow when it happened. I've seen this in two situations: disks saturated with IO (caused I think by many concurrent force merges running on non top notch SSDs) and storage backed by object storage like S3 with high read latency hiccups, as David describes in https://github.com/apache/lucene/issues/13354#issuecomment-4490313048. In those situations, the Lucene index needed to be closed for system maintenance purpose or to be reopened on a more performant machine. > For example, is there maybe a bug that's turned off the entire abort checking best effort system? I don't think there is a bug in the existing abort checking system itself. The existing `merge.checkAborted()` calls work correctly at the points where they are checked (there are 2 or 3 of such calls before and in `IndexWriter.mergeMiddle()`), but once the merge thread starts the integrity checks on the segment readers to merge, then it is committed to finishing reading all the files associated to all reader to merge before it can observe the abort flag again. Note that this change adds the `checkAborted()` method to `MergeState`, so we could maybe add additional checks on the aborted flag before checking the integrity of a reader. That would be less intrusive and already an improvement in my opinion, since the merge will be dropped later anyway. > do any of our unit tests check for prompt merge aborts? I found some tests that abort merges but they don't check that the abort happens promptly. Specifically, there is no test that verifies how "quickly" a merge reacts to the abort signal after it has entered `checkIntegrity` / `checksumEntireFile`. > I'm also not a fan of only doing stored fields -- other Lucene index files can be ginormous (vectors) -- can we impl for those as well? Yes I mentioned in the PR description that other files could benefit from this but wanted to keep the radius of changes low at first. For vectors, I wonder if more `checkAborted()` calls could be added too when building large graphs during merges. > For merges, `IndexWriter` already wraps the provided `Directory` to track which files are created for Codec file-tracking purposes, I think? If so, we could in theory wrap that to also instrument every `IndexInput` with this "check every N seconds", maybe. I explored the `MergeScheduler#wrapForMerge` path but noticed that many readers were not reopening the files/inputs for integrity checks but instead use clones of already opened inputs, making it hard to wrap the index inputs. The typical pattern seems to be reader -> reader merge instance > clone inputs -> seek to zero (then skip by reading to compute the checksum). I also explored wrapping the readers (`StoredFieldsReader`, `DocValuesProducer etc`) using `MergePolicy.OneMerge#wrapForMerge(CodecReader)` to override the checkIntegry methods but noticed that it would disable some bulk merges optimizations down the road. > Can we subclass `FilterIndexInput` instead [...] > since things can vary drastically depending on the env, let's use time/seconds as the "check every", not bytes? We can hardwire this to something reasonable that's surely in the noise of overall merge cost. If we're only accounting for time spent reading bytes for checksumming purposes without tying it to the `OneMerge` aborted flag, then I agree we can subclass `FilterIndexInput` directly in `checksumEntireFile`. So I think this leaves us with different options: 1. Implement a FilterIndexInput in CodecUtil.checksumEntireFile() that periodically checks elapsed time 2. Use the checkAborted() method added to MergeState in this pull request before each reader.checkIntegrity() call in the various writer/consumer merge methods. 3. Explore wrapping the merge Directory via MergeScheduler.wrapForMerge to return wrapped IndexInputs that periodically checks checkAborted (may require to reopen files instead of relying on clones) 4. Continue with the current proposal as-is. I think we could start with option 2 as a simple first improvement in this PR (doesn't interrupt in-progress checksums but avoids starting new ones on an already-aborted merge) and follow up with option 1 or 3 to interrupt long-running checksums. Does that sounds reasonable? If so I can rework the PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
