[
https://issues.apache.org/jira/browse/LUCENE-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308535#comment-17308535
]
Uwe Schindler commented on LUCENE-9867:
---------------------------------------
Hi,
I would really try to reproduce this with a filesystem other than XFS. I know
that XFS has problems with some combinations of fsync and preallocation of
blocks.
Maybe also read the kernel mailing list thread about "fsyncgate":
https://danluu.com/fsyncgate/
The interesting thing, which I did not know and which for example the Postgres
developers had to fix in their code: if you call "fsync" and something goes
wrong, you get back an error (although this should not happen on disk full).
The problem is that the failed fsync resets the kernel's internal error state,
so later calls to fsync succeed even though the data was never durably
written. "But then we retried the checkpoint, which retried the fsync(). The
retry succeeded, because the prior fsync() cleared the AS_EIO bad page flag."
The problem here could be related to that: maybe Lucene calls fsync during the
cleanup of the IndexWriter (when the out-of-space error happened) but ignores
the error. Later it tries to do a commit and this succeeds, leaving the index
in a corrupt state?
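To illustrate the fsyncgate pitfall (this is a minimal sketch, not Lucene's actual code; the class and method names are invented for illustration): the first failing FileChannel.force() must be treated as fatal for the written data, because the kernel may already have dropped the dirty pages and cleared the error flag, so a retry could "succeed" without anything reaching disk.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FsyncOnce {
    // Write bytes and fsync them. On ANY fsync failure the data must be
    // considered lost -- do NOT retry force(); rewrite the data instead.
    static void writeDurably(Path file, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ch.write(ByteBuffer.wrap(data));
            try {
                ch.force(true); // fsync file contents and metadata
            } catch (IOException e) {
                // fsyncgate: a second force() could "succeed" because the
                // prior failure cleared the kernel's error state, even
                // though the data never reached disk. Propagate, never
                // swallow or retry.
                throw new IOException("fsync failed, data must be rewritten", e);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("fsync-demo", ".bin");
        writeDurably(f, "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(Files.readAllBytes(f), StandardCharsets.UTF_8));
    }
}
```

The key design point is that the error is propagated to the caller, who must treat the whole write as failed, exactly the opposite of the retry-the-checkpoint pattern that bit Postgres.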
The other thing to look at (maybe it is more related to our current issue,
because we seem to miss commit points, aka renamed files):
With atomic renames, the ext4 filesystem is a bit better to use, as it handles
file renames in a safer way because of a compatibility layer that works around
the broken "publish commit using rename" behaviour. As we have no full control
over what Linux does from Java NIO's point of view (we can only make a best
guess and fsync the directory), we still rely on this. See "auto_da_alloc" in
the ext4 documentation:
https://www.kernel.org/doc/Documentation/filesystems/ext4.txt
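The "publish commit using rename" pattern with a best-guess directory fsync can be sketched like this (an illustration under my assumptions, not Lucene's actual implementation, though Lucene does something similar internally when publishing a new segments_N file):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public class AtomicPublish {
    // Publish data under `name` in `dir`: write to a temp file, fsync it,
    // atomically rename it into place, then fsync the directory so the
    // rename itself survives a crash.
    static void publish(Path dir, String name, byte[] data) throws IOException {
        Path tmp = dir.resolve(name + ".tmp");
        try (FileChannel ch = FileChannel.open(tmp,
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(data));
            ch.force(true); // fsync the file before it becomes visible
        }
        // Atomic rename: readers see either the old file or the new one,
        // never a half-written file.
        Files.move(tmp, dir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
        // Best-guess fsync of the parent directory to make the rename
        // durable; not all platforms allow opening a directory.
        try (FileChannel d = FileChannel.open(dir, StandardOpenOption.READ)) {
            d.force(true);
        } catch (IOException ignored) {
            // e.g. on Windows a directory cannot be fsynced this way
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("publish-demo");
        publish(dir, "segments_1", "commit".getBytes(StandardCharsets.UTF_8));
        System.out.println(Files.exists(dir.resolve("segments_1")));
    }
}
```

This is exactly the sequence where ext4's auto_da_alloc heuristic helps if an application skips one of the fsync steps, while on other filesystems a crash between rename and directory fsync can leave a missing or zero-length commit file.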
> CorruptIndexException after failed segment merge caused by No space left on device
> ----------------------------------------------------------------------------------
>
> Key: LUCENE-9867
> URL: https://issues.apache.org/jira/browse/LUCENE-9867
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/store
> Affects Versions: 8.5
> Reporter: Alexander L
> Priority: Major
>
> Failed segment merge caused by "No space left on device" can't be recovered
> and Lucene fails with CorruptIndexException after restart. The expectation is
> that Lucene will be able to restart automatically without manual intervention.
> We have 2 indexing patterns:
> * Create and commit an empty index, then start long initial indexing process
> (might take hours), perform a second commit in the end
> * Using existing index, add no more than 4k documents and commit after that
> Right now we don't have evidence to suggest which pattern caused this issue,
> but we definitely witnessed a similar situation for the second pattern,
> although it was a bit different - caused by {{OutOfMemoryError: Java Heap
> Space}}, with missing {{_q.cfe}} file which produced only
> {{NoSuchFileException}}, not {{CorruptIndexException}}. Please let me know if
> we need a separate ticket for that.
> Lucene version: 8.5.0
> Java version: OpenJDK 11
> OS: CentOS Linux 7
> Kernel: Linux 3.10.0-1160.11.1.el7.x86_64
> Virtualization: kvm
> Filesystem: xfs
> Failed merge stacktrace:
> {code:java}
> 2021-02-02T08:51:51.679+0000
> org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
> 	at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:704)
> 	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:684)
> Caused by: java.io.IOException: No space left on device
> 	at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> 	at java.base/sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:62)
> 	at java.base/sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:113)
> 	at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:79)
> 	at java.base/sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:280)
> 	at java.base/java.nio.channels.Channels.writeFullyImpl(Channels.java:74)
> 	at java.base/java.nio.channels.Channels.writeFully(Channels.java:97)
> 	at java.base/java.nio.channels.Channels$1.write(Channels.java:172)
> 	at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:416)
> 	at java.base/java.util.zip.CheckedOutputStream.write(CheckedOutputStream.java:74)
> 	at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
> 	at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:127)
> 	at org.apache.lucene.store.OutputStreamIndexOutput.writeBytes(OutputStreamIndexOutput.java:53)
> 	at org.apache.lucene.store.RateLimitedIndexOutput.writeBytes(RateLimitedIndexOutput.java:73)
> 	at org.apache.lucene.util.compress.LZ4.encodeLiterals(LZ4.java:159)
> 	at org.apache.lucene.util.compress.LZ4.encodeSequence(LZ4.java:172)
> 	at org.apache.lucene.util.compress.LZ4.compress(LZ4.java:441)
> 	at org.apache.lucene.codecs.compressing.CompressionMode$LZ4FastCompressor.compress(CompressionMode.java:165)
> 	at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.flush(CompressingStoredFieldsWriter.java:229)
> 	at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.finishDocument(CompressingStoredFieldsWriter.java:159)
> 	at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:636)
> 	at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:229)
> 	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:106)
> 	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4463)
> 	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4057)
> 	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625)
> 	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662)
> {code}
> Followed by failed startup:
> {code:java}
> 2021-02-02T08:52:07.926+0000
> org.apache.lucene.index.CorruptIndexException: Unexpected file read error while reading index. (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/data/5f91aa0b07ce4d5e7beffaa2/segments_578fu")))
> 	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:291)
> 	at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:846)
> Caused by: java.nio.file.NoSuchFileException: /data/5f91aa0b07ce4d5e7beffaa2/_6lfem.si
> 	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
> 	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
> 	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
> 	at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:182)
> 	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:292)
> 	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:345)
> 	at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:81)
> 	at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:157)
> 	at org.apache.lucene.codecs.lucene70.Lucene70SegmentInfoFormat.read(Lucene70SegmentInfoFormat.java:91)
> 	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:353)
> 	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:289)
> 	... 33 common frames omitted
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)