[ https://issues.apache.org/jira/browse/GEODE-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17206437#comment-17206437 ]
ASF GitHub Bot commented on GEODE-8536: --------------------------------------- DonalEvans commented on a change in pull request #5553: URL: https://github.com/apache/geode/pull/5553#discussion_r499011949 ########## File path: geode-lucene/src/main/java/org/apache/geode/cache/lucene/internal/IndexRepositoryFactory.java ########## @@ -44,6 +44,7 @@ private static final Logger logger = LogService.getLogger(); public static final String FILE_REGION_LOCK_FOR_BUCKET_ID = "FileRegionLockForBucketId:"; public static final String APACHE_GEODE_INDEX_COMPLETE = "APACHE_GEODE_INDEX_COMPLETE"; + protected static final int GET_INDEX_WRITER_MAX_ATTEMPTS = 10; Review comment: As I understand it, the timing window to hit the IOException is quite small and difficult to hit, since this problem only shows up in about 1 in 1000 runs of the test I used to diagnose the issue. If the fileAndChunkRegion was unavailable for a long period of time, I would expect to see this issue reproduce more often. After running some experiments, I was able to increase the number of retries to 200 without any noticeable negative effects, which would increase the time window during which IOExceptions would have to be consistently encountered and an exception thrown to 1 second, which should help reduce the chances of encountering it. However, I don't think it's possible to know for certain how long the fileAndChunkRegion might be unavailable, since that could change based on the operation being used on it, the size of the region, current system resources etc. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > StackOverflow can occur when Lucene IndexWriter is unable to be created > ----------------------------------------------------------------------- > > Key: GEODE-8536 > URL: https://issues.apache.org/jira/browse/GEODE-8536 > Project: Geode > Issue Type: Bug > Components: functions, lucene > Affects Versions: 1.12.0, 1.13.0, 1.14.0 > Reporter: Donal Evans > Priority: Major > Labels: pull-request-available > > If, during a call to IndexRepositoryFactory.computeIndexRepository(), an > IOException is encountered when attempting to construct an IndexWriter, the > function retry logic will reattempt the execution. This allows transient > exceptions caused by concurrent modification of the fileAndChunk region to be > ignored and subsequent executions to succeed (see GEODE-7703). However, if > the IOException is consistently thrown, the infinitely retrying function can > cause a StackOverflow: > {noformat} > java.lang.StackOverflowError > at > org.apache.geode.SystemFailure.startWatchDog(SystemFailure.java:320) > at > org.apache.geode.SystemFailure.notifyWatchDog(SystemFailure.java:758) > at org.apache.geode.SystemFailure.setFailure(SystemFailure.java:813) > at > org.apache.geode.SystemFailure.initiateFailure(SystemFailure.java:790) > at > org.apache.geode.internal.InternalDataSerializer.invokeToData(InternalDataSerializer.java:2251) > at > org.apache.geode.internal.InternalDataSerializer.basicWriteObject(InternalDataSerializer.java:2031) > at > org.apache.geode.DataSerializer.writeObject(DataSerializer.java:2839) > at > org.apache.geode.internal.cache.partitioned.PartitionedRegionFunctionStreamingMessage.toData(PartitionedRegionFunctionStreamingMessage.java:192) > at > org.apache.geode.internal.serialization.internal.DSFIDSerializerImpl.invokeToData(DSFIDSerializerImpl.java:213) > at > org.apache.geode.internal.serialization.internal.DSFIDSerializerImpl.write(DSFIDSerializerImpl.java:137) > at > org.apache.geode.internal.InternalDataSerializer.writeDSFID(InternalDataSerializer.java:1484) > at > org.apache.geode.internal.tcp.MsgStreamer.writeMessage(MsgStreamer.java:247) > at > org.apache.geode.distributed.internal.direct.DirectChannel.sendToMany(DirectChannel.java:306) > at > org.apache.geode.distributed.internal.direct.DirectChannel.sendToOne(DirectChannel.java:182) > at > org.apache.geode.distributed.internal.direct.DirectChannel.send(DirectChannel.java:511) > at > org.apache.geode.distributed.internal.DistributionImpl.directChannelSend(DistributionImpl.java:346) > at > org.apache.geode.distributed.internal.DistributionImpl.send(DistributionImpl.java:291) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.sendViaMembershipManager(ClusterDistributionManager.java:2058) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.sendOutgoing(ClusterDistributionManager.java:1986) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.sendMessage(ClusterDistributionManager.java:2023) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.putOutgoing(ClusterDistributionManager.java:1083) > at > org.apache.geode.internal.cache.execute.PartitionedRegionFunctionResultWaiter.getPartitionedDataFrom(PartitionedRegionFunctionResultWaiter.java:89) > at > org.apache.geode.internal.cache.PartitionedRegion.executeOnAllBuckets(PartitionedRegion.java:4079) > at > org.apache.geode.internal.cache.PartitionedRegion.executeFunction(PartitionedRegion.java:3583) > at > org.apache.geode.internal.cache.execute.PartitionedRegionFunctionExecutor.executeFunction(PartitionedRegionFunctionExecutor.java:220) > at > org.apache.geode.internal.cache.execute.AbstractExecution.execute(AbstractExecution.java:376) > at > org.apache.geode.internal.cache.execute.AbstractExecution.execute(AbstractExecution.java:359) > at > org.apache.geode.internal.cache.execute.LocalResultCollectorImpl.getResultInternal(LocalResultCollectorImpl.java:139) > at > org.apache.geode.internal.cache.execute.ResultCollectorHolder.getResult(ResultCollectorHolder.java:53) > at > org.apache.geode.internal.cache.execute.LocalResultCollectorImpl.getResult(LocalResultCollectorImpl.java:112) > at > org.apache.geode.internal.cache.partitioned.PRFunctionStreamingResultCollector.getResultInternal(PRFunctionStreamingResultCollector.java:219) > at > org.apache.geode.internal.cache.execute.ResultCollectorHolder.getResult(ResultCollectorHolder.java:53) > at > org.apache.geode.internal.cache.partitioned.PRFunctionStreamingResultCollector.getResult(PRFunctionStreamingResultCollector.java:88) > at > org.apache.geode.internal.cache.execute.LocalResultCollectorImpl.getResultInternal(LocalResultCollectorImpl.java:141) > at > org.apache.geode.internal.cache.execute.ResultCollectorHolder.getResult(ResultCollectorHolder.java:53) > at > org.apache.geode.internal.cache.execute.LocalResultCollectorImpl.getResult(LocalResultCollectorImpl.java:112) > at > org.apache.geode.internal.cache.partitioned.PRFunctionStreamingResultCollector.getResultInternal(PRFunctionStreamingResultCollector.java:219) > at > org.apache.geode.internal.cache.execute.ResultCollectorHolder.getResult(ResultCollectorHolder.java:53) > {noformat} > The underlying exception in this case is a FileNotFoundException thrown when > attempting to retrieve a Lucene file from the fileAndChunk region. > {noformat} > [warn 2020/07/28 23:49:55.375 PDT <Pooled Waiting Message Processor 2> > tid=0xab] Exception thrown while constructing Lucene Index for bucket:16 for > file region:/_PR/_Bindex#_partitionedRegion.files_16 > org.apache.lucene.index.CorruptIndexException: Unexpected file read error > while reading index. (resource=BufferedChecksumIndexInput(segments_4s)) > at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:290) > at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:165) > at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:974) > at > org.apache.geode.cache.lucene.internal.IndexRepositoryFactory.buildIndexWriter(IndexRepositoryFactory.java:152) > at > org.apache.geode.cache.lucene.internal.IndexRepositoryFactory.finishComputingRepository(IndexRepositoryFactory.java:116) > at > org.apache.geode.cache.lucene.internal.IndexRepositoryFactory.computeIndexRepository(IndexRepositoryFactory.java:65) > at > org.apache.geode.cache.lucene.internal.PartitionedRepositoryManager.computeRepository(PartitionedRepositoryManager.java:151) > at > org.apache.geode.cache.lucene.internal.PartitionedRepositoryManager.lambda$computeRepository$1(PartitionedRepositoryManager.java:170) > at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1892) > at > org.apache.geode.cache.lucene.internal.PartitionedRepositoryManager.computeRepository(PartitionedRepositoryManager.java:162) > at > org.apache.geode.cache.lucene.internal.LuceneBucketListener.lambda$afterPrimary$0(LuceneBucketListener.java:40) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at > org.apache.geode.distributed.internal.ClusterOperationExecutors.runUntilShutdown(ClusterOperationExecutors.java:442) > at > org.apache.geode.distributed.internal.ClusterOperationExecutors.doWaitingThread(ClusterOperationExecutors.java:411) > at > org.apache.geode.logging.internal.executors.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:119) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: _2p.si > at > org.apache.geode.cache.lucene.internal.filesystem.FileSystem.getFile(FileSystem.java:101) > at > org.apache.geode.cache.lucene.internal.directory.RegionDirectory.openInput(RegionDirectory.java:115) > at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:137) > at > org.apache.lucene.codecs.lucene62.Lucene62SegmentInfoFormat.read(Lucene62SegmentInfoFormat.java:89) > at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:357) > at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:288) > ... 16 more > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)