[ https://issues.apache.org/jira/browse/GEODE-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17206437#comment-17206437 ]

ASF GitHub Bot commented on GEODE-8536:
---------------------------------------

DonalEvans commented on a change in pull request #5553:
URL: https://github.com/apache/geode/pull/5553#discussion_r499011949



##########
File path: geode-lucene/src/main/java/org/apache/geode/cache/lucene/internal/IndexRepositoryFactory.java
##########
@@ -44,6 +44,7 @@
   private static final Logger logger = LogService.getLogger();
   public static final String FILE_REGION_LOCK_FOR_BUCKET_ID = "FileRegionLockForBucketId:";
   public static final String APACHE_GEODE_INDEX_COMPLETE = "APACHE_GEODE_INDEX_COMPLETE";
+  protected static final int GET_INDEX_WRITER_MAX_ATTEMPTS = 10;

Review comment:
       As I understand it, the timing window for hitting the IOException is
quite small, since this problem shows up in only about 1 in 1000 runs of the
test I used to diagnose the issue. If the fileAndChunkRegion were unavailable
for a long period of time, I would expect to see the issue reproduce more
often. After running some experiments, I was able to increase the number of
retries to 200 without any noticeable negative effects; that would stretch the
window during which IOExceptions would have to be consistently encountered,
before an exception is finally thrown, to about 1 second, which should help
reduce the chances of encountering it. However, I don't think it's possible to
know for certain how long the fileAndChunkRegion might be unavailable, since
that could change based on the operation being performed on it, the size of
the region, current system resources, etc.
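
For illustration, here is a minimal sketch of the bounded-retry approach under
discussion. The constant mirrors the one added in this PR, but the helper
names and the 5 ms pause are assumptions (the pause is inferred from the
"200 retries in about 1 second" estimate above), not the actual Geode
implementation:

{noformat}
import java.io.IOException;

public class IndexWriterRetrySketch {
  // Mirrors the constant added in this PR.
  protected static final int GET_INDEX_WRITER_MAX_ATTEMPTS = 10;

  // Hypothetical factory standing in for IndexWriter construction,
  // which may throw transiently while the fileAndChunkRegion is busy.
  interface WriterFactory {
    Object create() throws IOException;
  }

  public Object getIndexWriterWithRetry(WriterFactory factory) throws IOException {
    IOException lastException = null;
    for (int attempt = 0; attempt < GET_INDEX_WRITER_MAX_ATTEMPTS; attempt++) {
      try {
        return factory.create();
      } catch (IOException e) {
        lastException = e; // remember the most recent failure
        try {
          Thread.sleep(5); // assumed pause; 200 attempts * 5 ms ~= 1 second
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
    // Give up with a bounded failure instead of retrying forever,
    // which is what previously led to the StackOverflowError.
    throw lastException;
  }
}
{noformat}

Bounding the retries turns a potentially unbounded recursion into a fixed,
tunable time window; choosing the attempt count is then a trade-off between
riding out transient unavailability and failing fast.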






> StackOverflow can occur when Lucene IndexWriter is unable to be created
> -----------------------------------------------------------------------
>
>                 Key: GEODE-8536
>                 URL: https://issues.apache.org/jira/browse/GEODE-8536
>             Project: Geode
>          Issue Type: Bug
>          Components: functions, lucene
>    Affects Versions: 1.12.0, 1.13.0, 1.14.0
>            Reporter: Donal Evans
>            Priority: Major
>              Labels: pull-request-available
>
> If, during a call to IndexRepositoryFactory.computeIndexRepository(), an 
> IOException is encountered when attempting to construct an IndexWriter, the 
> function retry logic will reattempt the execution. This allows transient 
> exceptions caused by concurrent modification of the fileAndChunk region to be 
> ignored and subsequent executions to succeed (see GEODE-7703). However, if 
> the IOException is consistently thrown, the infinitely retrying function can 
> cause a StackOverflowError:
> {noformat}
> java.lang.StackOverflowError
>         at org.apache.geode.SystemFailure.startWatchDog(SystemFailure.java:320)
>         at org.apache.geode.SystemFailure.notifyWatchDog(SystemFailure.java:758)
>         at org.apache.geode.SystemFailure.setFailure(SystemFailure.java:813)
>         at org.apache.geode.SystemFailure.initiateFailure(SystemFailure.java:790)
>         at org.apache.geode.internal.InternalDataSerializer.invokeToData(InternalDataSerializer.java:2251)
>         at org.apache.geode.internal.InternalDataSerializer.basicWriteObject(InternalDataSerializer.java:2031)
>         at org.apache.geode.DataSerializer.writeObject(DataSerializer.java:2839)
>         at org.apache.geode.internal.cache.partitioned.PartitionedRegionFunctionStreamingMessage.toData(PartitionedRegionFunctionStreamingMessage.java:192)
>         at org.apache.geode.internal.serialization.internal.DSFIDSerializerImpl.invokeToData(DSFIDSerializerImpl.java:213)
>         at org.apache.geode.internal.serialization.internal.DSFIDSerializerImpl.write(DSFIDSerializerImpl.java:137)
>         at org.apache.geode.internal.InternalDataSerializer.writeDSFID(InternalDataSerializer.java:1484)
>         at org.apache.geode.internal.tcp.MsgStreamer.writeMessage(MsgStreamer.java:247)
>         at org.apache.geode.distributed.internal.direct.DirectChannel.sendToMany(DirectChannel.java:306)
>         at org.apache.geode.distributed.internal.direct.DirectChannel.sendToOne(DirectChannel.java:182)
>         at org.apache.geode.distributed.internal.direct.DirectChannel.send(DirectChannel.java:511)
>         at org.apache.geode.distributed.internal.DistributionImpl.directChannelSend(DistributionImpl.java:346)
>         at org.apache.geode.distributed.internal.DistributionImpl.send(DistributionImpl.java:291)
>         at org.apache.geode.distributed.internal.ClusterDistributionManager.sendViaMembershipManager(ClusterDistributionManager.java:2058)
>         at org.apache.geode.distributed.internal.ClusterDistributionManager.sendOutgoing(ClusterDistributionManager.java:1986)
>         at org.apache.geode.distributed.internal.ClusterDistributionManager.sendMessage(ClusterDistributionManager.java:2023)
>         at org.apache.geode.distributed.internal.ClusterDistributionManager.putOutgoing(ClusterDistributionManager.java:1083)
>         at org.apache.geode.internal.cache.execute.PartitionedRegionFunctionResultWaiter.getPartitionedDataFrom(PartitionedRegionFunctionResultWaiter.java:89)
>         at org.apache.geode.internal.cache.PartitionedRegion.executeOnAllBuckets(PartitionedRegion.java:4079)
>         at org.apache.geode.internal.cache.PartitionedRegion.executeFunction(PartitionedRegion.java:3583)
>         at org.apache.geode.internal.cache.execute.PartitionedRegionFunctionExecutor.executeFunction(PartitionedRegionFunctionExecutor.java:220)
>         at org.apache.geode.internal.cache.execute.AbstractExecution.execute(AbstractExecution.java:376)
>         at org.apache.geode.internal.cache.execute.AbstractExecution.execute(AbstractExecution.java:359)
>         at org.apache.geode.internal.cache.execute.LocalResultCollectorImpl.getResultInternal(LocalResultCollectorImpl.java:139)
>         at org.apache.geode.internal.cache.execute.ResultCollectorHolder.getResult(ResultCollectorHolder.java:53)
>         at org.apache.geode.internal.cache.execute.LocalResultCollectorImpl.getResult(LocalResultCollectorImpl.java:112)
>         at org.apache.geode.internal.cache.partitioned.PRFunctionStreamingResultCollector.getResultInternal(PRFunctionStreamingResultCollector.java:219)
>         at org.apache.geode.internal.cache.execute.ResultCollectorHolder.getResult(ResultCollectorHolder.java:53)
>         at org.apache.geode.internal.cache.partitioned.PRFunctionStreamingResultCollector.getResult(PRFunctionStreamingResultCollector.java:88)
>         at org.apache.geode.internal.cache.execute.LocalResultCollectorImpl.getResultInternal(LocalResultCollectorImpl.java:141)
>         at org.apache.geode.internal.cache.execute.ResultCollectorHolder.getResult(ResultCollectorHolder.java:53)
>         at org.apache.geode.internal.cache.execute.LocalResultCollectorImpl.getResult(LocalResultCollectorImpl.java:112)
>         at org.apache.geode.internal.cache.partitioned.PRFunctionStreamingResultCollector.getResultInternal(PRFunctionStreamingResultCollector.java:219)
>         at org.apache.geode.internal.cache.execute.ResultCollectorHolder.getResult(ResultCollectorHolder.java:53)
> {noformat}
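> The trace above shows why the failure manifests as a StackOverflowError 
> rather than a hang: getResult() retries by calling execute() again, and 
> execute() waits on getResult(), so each failed attempt adds stack frames 
> instead of looping. A minimal, self-contained sketch of that mutually 
> recursive pattern (hypothetical names, not the actual Geode classes):
> {noformat}
> public class RecursiveRetrySketch {
>   static int depth = 0;
>
>   // Stands in for ResultCollector.getResult(): on failure it
>   // re-executes the function instead of returning an error.
>   static void getResult() {
>     depth++;
>     boolean failed = true; // stands in for the persistent IOException
>     if (failed) {
>       execute();
>     }
>   }
>
>   // Stands in for Execution.execute(): it blocks on the result
>   // collector, re-entering getResult() and deepening the stack.
>   static void execute() {
>     getResult();
>   }
>
>   public static void main(String[] args) {
>     try {
>       getResult();
>     } catch (StackOverflowError e) {
>       System.out.println("StackOverflowError after " + depth + " nested retries");
>     }
>   }
> }
> {noformat}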
> The underlying exception in this case is a FileNotFoundException thrown when 
> attempting to retrieve a Lucene file from the fileAndChunk region.
> {noformat}
> [warn 2020/07/28 23:49:55.375 PDT <Pooled Waiting Message Processor 2> tid=0xab] Exception thrown while constructing Lucene Index for bucket:16 for file region:/_PR/_Bindex#_partitionedRegion.files_16
> org.apache.lucene.index.CorruptIndexException: Unexpected file read error while reading index. (resource=BufferedChecksumIndexInput(segments_4s))
> at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:290)
> at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:165)
> at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:974)
> at org.apache.geode.cache.lucene.internal.IndexRepositoryFactory.buildIndexWriter(IndexRepositoryFactory.java:152)
> at org.apache.geode.cache.lucene.internal.IndexRepositoryFactory.finishComputingRepository(IndexRepositoryFactory.java:116)
> at org.apache.geode.cache.lucene.internal.IndexRepositoryFactory.computeIndexRepository(IndexRepositoryFactory.java:65)
> at org.apache.geode.cache.lucene.internal.PartitionedRepositoryManager.computeRepository(PartitionedRepositoryManager.java:151)
> at org.apache.geode.cache.lucene.internal.PartitionedRepositoryManager.lambda$computeRepository$1(PartitionedRepositoryManager.java:170)
> at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1892)
> at org.apache.geode.cache.lucene.internal.PartitionedRepositoryManager.computeRepository(PartitionedRepositoryManager.java:162)
> at org.apache.geode.cache.lucene.internal.LuceneBucketListener.lambda$afterPrimary$0(LuceneBucketListener.java:40)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at org.apache.geode.distributed.internal.ClusterOperationExecutors.runUntilShutdown(ClusterOperationExecutors.java:442)
> at org.apache.geode.distributed.internal.ClusterOperationExecutors.doWaitingThread(ClusterOperationExecutors.java:411)
> at org.apache.geode.logging.internal.executors.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:119)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: _2p.si
> at org.apache.geode.cache.lucene.internal.filesystem.FileSystem.getFile(FileSystem.java:101)
> at org.apache.geode.cache.lucene.internal.directory.RegionDirectory.openInput(RegionDirectory.java:115)
> at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:137)
> at org.apache.lucene.codecs.lucene62.Lucene62SegmentInfoFormat.read(Lucene62SegmentInfoFormat.java:89)
> at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:357)
> at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:288)
> ... 16 more
> {noformat}


