[ https://issues.apache.org/jira/browse/LUCENE-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308535#comment-17308535 ]
Uwe Schindler commented on LUCENE-9867:
---------------------------------------

Hi,

I would really try to reproduce this with another filesystem than XFS. I know that XFS has some problems with certain combinations of fsync and preallocation of blocks. Maybe also read the kernel thread about "fsyncgate": https://danluu.com/fsyncgate/

The interesting thing that I did not know, and where for example the Postgres developers had to fix their code: if you call fsync, you get back an error if something went wrong (although this should not happen on disk full). The problem is that this failed fsync resets the kernel's internal error state, so later calls to fsync succeed even though the problem still persists. "But then we retried the checkpoint, which retried the fsync(). The retry succeeded, because the prior fsync() cleared the AS_EIO bad page flag."

The problem here could be related to that: maybe Lucene calls fsync during the cleanup of the IndexWriter (when the out-of-space condition happened) but ignores the error. Later it tries to do a commit, and that succeeds, leaving the index in a corrupt state?

The other thing to look at (maybe it's more related to our current issue, because we seem to be missing commit points, i.e. renamed files): regarding atomic renames, the ext4 filesystem is a bit better to use, as it handles file renames especially well thanks to a compatibility layer that works around the broken "publish commit using rename" behaviour. As we have no full control over what Linux does from Java's NIO point of view (we can only best-guess how to fsync the directory), we still rely on this.
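For reference, the "publish commit using rename" pattern discussed above can be sketched in plain NIO like this. This is a simplified illustration, not Lucene's actual implementation (Lucene's real code lives in FSDirectory and IOUtils); the class and method names are made up for the example:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public class PublishCommit {

    /**
     * Durably publish a commit file: write under a temporary name, fsync the
     * file, atomically rename it into place, then fsync the directory so the
     * rename itself reaches disk. Any IOException from force() must be treated
     * as fatal for this commit -- retrying fsync() is unsafe, because (per the
     * "fsyncgate" discussion) the kernel may have cleared the error state and
     * a retried fsync() can succeed even though the data was never written.
     */
    static void publish(Path dir, String name, byte[] data) throws IOException {
        Path tmp = dir.resolve(name + ".tmp");
        try (FileChannel ch = FileChannel.open(tmp,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(data));
            ch.force(true); // fsync file contents and metadata; do not retry on failure
        }
        // Atomic rename: readers observe either the old commit or the new one.
        Files.move(tmp, dir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
        // fsync the directory so the rename survives a crash. Opening a
        // directory for read works on Linux but may throw elsewhere, which is
        // why the comment above calls this a "best guess" from Java NIO.
        try (FileChannel dirCh = FileChannel.open(dir, StandardOpenOption.READ)) {
            dirCh.force(true);
        } catch (IOException e) {
            // Platform does not support fsync on a directory; nothing more to do.
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("publish-demo");
        publish(dir, "segments_1", "commit data".getBytes(StandardCharsets.UTF_8));
        System.out.println(Files.exists(dir.resolve("segments_1")) ? "published" : "missing");
    }
}
```

On ext4 with auto_da_alloc, the rename-over pattern triggers an implicit flush of the temp file's data, which papers over applications that skip the explicit force() call; XFS offers no such workaround, so the explicit fsync-before-rename matters more there.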
See "auto_da_alloc" on ext4: https://www.kernel.org/doc/Documentation/filesystems/ext4.txt

> CorruptIndexException after failed segment merge caused by No space left on device
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-9867
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9867
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/store
>    Affects Versions: 8.5
>            Reporter: Alexander L
>            Priority: Major
>
> Failed segment merge caused by "No space left on device" can't be recovered, and Lucene fails with CorruptIndexException after restart. The expectation is that Lucene will be able to restart automatically without manual intervention.
> We have 2 indexing patterns:
> * Create and commit an empty index, then start a long initial indexing process (might take hours), performing a second commit at the end
> * Using an existing index, add no more than 4k documents and commit after that
> Right now we don't have evidence to suggest which pattern caused this issue, but we definitely witnessed a similar situation for the second pattern, although it was a bit different - caused by {{OutOfMemoryError: Java Heap Space}}, with a missing {{_q.cfe}} file which produced only {{NoSuchFileException}}, not {{CorruptIndexException}}. Please let me know if we need a separate ticket for that.
> Lucene version: 8.5.0
> Java version: OpenJDK 11
> OS: CentOS Linux 7
> Kernel: Linux 3.10.0-1160.11.1.el7.x86_64
> Virtualization: kvm
> Filesystem: xfs
> Failed merge stacktrace:
> {code:java}
> 2021-02-02T08:51:51.679+0000
> org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
>     at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:704)
>     at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:684)
> Caused by: java.io.IOException: No space left on device
>     at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>     at java.base/sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:62)
>     at java.base/sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:113)
>     at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:79)
>     at java.base/sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:280)
>     at java.base/java.nio.channels.Channels.writeFullyImpl(Channels.java:74)
>     at java.base/java.nio.channels.Channels.writeFully(Channels.java:97)
>     at java.base/java.nio.channels.Channels$1.write(Channels.java:172)
>     at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:416)
>     at java.base/java.util.zip.CheckedOutputStream.write(CheckedOutputStream.java:74)
>     at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
>     at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:127)
>     at org.apache.lucene.store.OutputStreamIndexOutput.writeBytes(OutputStreamIndexOutput.java:53)
>     at org.apache.lucene.store.RateLimitedIndexOutput.writeBytes(RateLimitedIndexOutput.java:73)
>     at org.apache.lucene.util.compress.LZ4.encodeLiterals(LZ4.java:159)
>     at org.apache.lucene.util.compress.LZ4.encodeSequence(LZ4.java:172)
>     at org.apache.lucene.util.compress.LZ4.compress(LZ4.java:441)
>     at org.apache.lucene.codecs.compressing.CompressionMode$LZ4FastCompressor.compress(CompressionMode.java:165)
>     at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.flush(CompressingStoredFieldsWriter.java:229)
>     at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.finishDocument(CompressingStoredFieldsWriter.java:159)
>     at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:636)
>     at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:229)
>     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:106)
>     at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4463)
>     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4057)
>     at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625)
>     at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662)
> {code}
> Followed by failed startup:
> {code:java}
> 2021-02-02T08:52:07.926+0000
> org.apache.lucene.index.CorruptIndexException: Unexpected file read error while reading index. (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/data/5f91aa0b07ce4d5e7beffaa2/segments_578fu")))
>     at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:291)
>     at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:846)
> Caused by: java.nio.file.NoSuchFileException: /data/5f91aa0b07ce4d5e7beffaa2/_6lfem.si
>     at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>     at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:182)
>     at java.base/java.nio.channels.FileChannel.open(FileChannel.java:292)
>     at java.base/java.nio.channels.FileChannel.open(FileChannel.java:345)
>     at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:81)
>     at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:157)
>     at org.apache.lucene.codecs.lucene70.Lucene70SegmentInfoFormat.read(Lucene70SegmentInfoFormat.java:91)
>     at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:353)
>     at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:289)
>     ... 33 common frames omitted
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org