iamsanjay opened a new issue, #13373:
URL: https://github.com/apache/lucene/issues/13373

   ### Description
   
   As discussed on the mailing list, `DataOutput.writeGroupVInts` throws an `ArithmeticException` (integer overflow). The goal is to find the root cause and to improve the exception message.
   
   ```
   Exception in thread "Lucene Merge Thread #202" org.apache.lucene.index.MergePolicy$MergeException: java.lang.ArithmeticException: integer overflow
       at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:735)
       at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:727)
   Caused by: java.lang.ArithmeticException: integer overflow
       at java.base/java.lang.Math.toIntExact(Math.java:1135)
       at org.apache.lucene.store.DataOutput.writeGroupVInts(DataOutput.java:354)
       at org.apache.lucene.codecs.lucene99.Lucene99PostingsWriter.finishTerm(Lucene99PostingsWriter.java:379)
       at org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:173)
       at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.write(Lucene90BlockTreeTermsWriter.java:1097)
       at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter.write(Lucene90BlockTreeTermsWriter.java:398)
       at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:95)
       at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:205)
       at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:209)
       at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:298)
       at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:137)
       at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5252)
       at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4740)
       at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:6541)
       at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:639)
       at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:700)
   ```
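The innermost frame in the trace is `Math.toIntExact`, which throws `ArithmeticException` whenever its `long` argument does not fit in an `int`. The sketch below only illustrates that behaviour of the JDK method; `ToIntExactDemo`, `narrow`, and `overflows` are hypothetical names for this example and are not part of Lucene.

```java
// Illustration only: shows the JDK call that fails in the trace, not Lucene's own code.
final class ToIntExactDemo {
    /** Narrows a long to an int, failing loudly instead of silently wrapping around. */
    static int narrow(long value) {
        // Throws ArithmeticException when value is outside the int range.
        return Math.toIntExact(value);
    }

    /** Returns true when narrowing the value to an int would overflow. */
    static boolean overflows(long value) {
        try {
            narrow(value);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Largest value that still fits in an int.
        System.out.println(overflows(Integer.MAX_VALUE));
        // One past the int range.
        System.out.println(overflows(Integer.MAX_VALUE + 1L));
    }
}
```

Any `long` above `Integer.MAX_VALUE` (2,147,483,647) that reaches such a call fails this way, which would be consistent with a per-term count growing past the `int` range.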
   
   More context from the reporter:
   
   > Looking deeper into this, I think we overflowed a term frequency field.
   > Looking at some statistics, in a previous release we had 1,288,526,281
   > occurrences of a certain field; this would be larger now. Each of these
   > would have had a limited set of values. But crucially, nearly all of them
   > would have had the term "positional" or "non-positional" added to the document.
   > 
   > There is no good reason to do this today, we should just turn this into
   > a boolean field and update the UI. I will do this and report back.
   > 
   > Do you think a patch adding a try/catch for a more informative log
   > message would be appreciated by the community, e.g. mentioning the
   > field name in the exception?
   
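One possible shape for the try/catch the reporter proposes is sketched below. It is an assumption about how such a patch could look, not the change that was made in Lucene: `FieldContextDemo`, `writeValues`, and `writeField` are hypothetical stand-ins, and the message wording is invented for illustration.

```java
// Sketch under stated assumptions: none of these names exist in Lucene.
final class FieldContextDemo {
    /** Hypothetical stand-in for a write path that narrows long values to int. */
    static void writeValues(long[] values) {
        for (long v : values) {
            Math.toIntExact(v); // throws ArithmeticException on overflow
        }
    }

    /** Re-throws the overflow with the field name in the message, keeping the original as the cause. */
    static void writeField(String fieldName, long[] values) {
        try {
            writeValues(values);
        } catch (ArithmeticException e) {
            ArithmeticException wrapped = new ArithmeticException(
                "integer overflow while writing postings for field \"" + fieldName + "\"");
            // ArithmeticException has no (String, Throwable) constructor, so attach the cause explicitly.
            wrapped.initCause(e);
            throw wrapped;
        }
    }
}
```

Keeping the original exception as the cause preserves the full stack trace while the outer message tells the user which field to look at.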
   > The index that had an issue when merging into one segment definitely
   > contained the word "positional" more than 1 billion times. I hope to be
   > able to give a closer number once re-indexing has finished with a "work-around".
   > 
   > Of course the "work-around" is to just fix this correctly by not having
   > that word so often in the index and definitely not as docs, freqs and
   > postings.
   > 
   > For background information.
   > 
   > The use case was to find a set of documents that were either
   > "positional" or "non-positional". This was present in the first check-in
   > of our code 18 years ago! Since then our data has grown a bit ;) The
   > code was using Lucene 1.4.3 at that time. Users would search using this
   > as what would now be a facet, `type:positional`. I changed this to a
   > field with only `IndexOptions.DOCS`, which is called 'positional' and is
   > searched as `positional:yes`, rewriting the previous query syntax behind
   > the scenes so as not to break any user tools.
   
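The behind-the-scenes query rewrite the reporter describes could look roughly like the sketch below. It is a guess at the approach, not the reporter's code: `LegacyQueryRewriter` is a hypothetical name, and mapping `type:non-positional` to `positional:no` is an assumption, since the report only mentions `positional:yes`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch under stated assumptions: a plain string rewrite applied before query parsing.
final class LegacyQueryRewriter {
    // Matches the legacy clauses; \b keeps "subtype:positional" from matching.
    private static final Pattern LEGACY =
        Pattern.compile("\\btype:(non-positional|positional)\\b");

    /** Rewrites legacy type:... clauses to the new boolean-style field. */
    static String rewrite(String query) {
        Matcher m = LEGACY.matcher(query);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // Assumption: the negative case is stored as positional:no.
            String replacement =
                m.group(1).equals("positional") ? "positional:yes" : "positional:no";
            m.appendReplacement(out, replacement);
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Doing the rewrite on the raw query string keeps existing user tools working without changes, at the cost of not understanding quoting or escaping in the query syntax.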
   
   
   ### Version and environment details
   
   _No response_

