mikemccand commented on code in PR #12872: URL: https://github.com/apache/lucene/pull/12872#discussion_r1418947126
########## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ########## @@ -873,13 +876,26 @@ public int compare(String a, String b) { if (0 == result.numBadSegments) { result.clean = true; } else { - msg( - infoStream, - "WARNING: " - + result.numBadSegments - + " broken segments (containing " - + result.totLoseDocCount - + " documents) detected"); + if (result.numBadSegmentsWithoutSegmentInfo > 0) { + msg( + infoStream, + "WARNING: " + + result.numBadSegments + + " broken segments (containing " Review Comment: Maybe say `containing at least`? ########## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ########## @@ -664,7 +667,7 @@ public int compare(String a, String b) { // failure (fastFail == false). so if we get here, we should // always have a valid lastCommit: assert lastCommit != null; - if (lastCommit == null) { + if (lastCommit == null) { // TODO: Remove/move unreachable if statement? Review Comment: Hmm this might still happen? E.g. some transient `IOException` or an Access Denied or so? ########## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ########## @@ -957,6 +974,9 @@ private Status.SegmentInfoStatus testSegment( SegmentReader reader = null; try { + if (info.info.maxDoc() == 0) { + throw new CheckIndexException(" this segment has missing or corrupt segment info"); Review Comment: Remove the extra space before `this`? Also, can we include specifics about which segment we were unable to read the `.si` for in the exception message? ########## lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java: ########## @@ -389,13 +389,39 @@ private static void parseSegmentInfos( } long totalDocs = 0; + SegmentInfo info; for (int seg = 0; seg < numSegments; seg++) { String segName = input.readString(); byte[] segmentID = new byte[StringHelper.ID_LENGTH]; input.readBytes(segmentID, 0, segmentID.length); Codec codec = readCodec(input); - SegmentInfo info = - codec.segmentInfoFormat().read(directory, segName, segmentID, IOContext.READ); + try { + info = codec.segmentInfoFormat().read(directory, segName, segmentID, IOContext.READ); + } catch (FileNotFoundException | NoSuchFileException e) { Review Comment: Hmm can we upgrade this to catch any "normal" (not crazy things like OOME) `Throwable`? E.g. if the `.si` file had some bits scrambled or what not we will get a more exotic exception or maybe an exception because the checksum is mismatched. Can we somehow return the root cause exception here and include it in `CheckIndexException` that we raise above? ########## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ########## @@ -957,6 +974,9 @@ private Status.SegmentInfoStatus testSegment( SegmentReader reader = null; try { + if (info.info.maxDoc() == 0) { Review Comment: Ahh I see, you now separately handle the `maxDoc == 0` case, here. But, what if an index has an illegal 0 doc segment, and an intact `.si`? We will now (incorrectly) state that it is missing its `SegmentInfo`? Maybe we could add a separate boolean somewhere to track `segmentInfoMissing` or so? ########## lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java: ########## @@ -389,13 +389,39 @@ private static void parseSegmentInfos( } long totalDocs = 0; + SegmentInfo info; for (int seg = 0; seg < numSegments; seg++) { String segName = input.readString(); byte[] segmentID = new byte[StringHelper.ID_LENGTH]; input.readBytes(segmentID, 0, segmentID.length); Codec codec = readCodec(input); - SegmentInfo info = - codec.segmentInfoFormat().read(directory, segName, segmentID, IOContext.READ); + try { + info = codec.segmentInfoFormat().read(directory, segName, segmentID, IOContext.READ); + } catch (FileNotFoundException | NoSuchFileException e) { + // https://github.com/apache/lucene/issues/7820 - If we throw a CorruptIndex exception here, + // a missing segment info will cause parseSegmentInfos() to short-circuit and stop reading + // the rest of the segments. So, when encountering a segment with a missing + // segment info (.si) file, we initialize an empty segment info (containing 0 docs) + // to keep track of the segment name, so that we can continue reading other segments, and + // the segment with the missing .si file can be later validated or exorcised by CheckIndex() + info = Review Comment: Hmm I don't like that this will suppress corruption when `SegmentInfos` is being loaded NOT from `CheckIndex`? This is the wrong layer to have such leniency? This method should throw the exception back up to the caller and caller should handle properly? Could we instead 1) create a subclass of `CorruptIndexException`, maybe `CorruptSegmentInfoException` or so, 2) require that caller pass in the segment name, and root cause exception, when creating it, and 3) catch any `Throwable` here and re-throw that subclass? Then we can fix `CheckIndex` to specifically catch that, pull the segment name out, and do its thing. ########## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ########## @@ -948,7 +965,7 @@ private Status.SegmentInfoStatus testSegment( segInfoStat.maxDoc = info.info.maxDoc(); final Version version = info.info.getVersion(); - if (info.info.maxDoc() <= 0) { + if (info.info.maxDoc() < 0) { Review Comment: Hmm why did you have to remove the `0` case? Lucene considers this corruption (IW should never flush or merge to a 0 doc segment). ########## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ########## @@ -4270,11 +4290,21 @@ public int doCheck(Options opts) throws IOException, InterruptedException { + result.totLoseDocCount + " documents would be lost, if -exorcise were specified\n"); } else { - opts.out.println("WARNING: " + result.totLoseDocCount + " documents will be lost\n"); - opts.out.println( - "NOTE: will write new segments file in 5 seconds; this will remove " - + result.totLoseDocCount - + " docs from the index. YOU WILL LOSE DATA. THIS IS YOUR LAST CHANCE TO CTRL+C!"); + if (result.numBadSegmentsWithoutSegmentInfo > 0) { + opts.out.println( + "WARNING: " + result.totLoseDocCount + " or more documents will be lost\n"); Review Comment: Can we change the wording to `at least ...`? I think that's stronger / more alarming on casual read than ` or more`. Or maybe `more than XX documents will be lost`, since we (almost certainly) know the lost segment(s) have `maxDoc > 0`? Whichever wording we use, let's be consistent in the thrown exception message above too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org