iamsanjay opened a new issue, #13373: URL: https://github.com/apache/lucene/issues/13373
### Description

As discussed on the mailing list, `DataOutput.writeGroupVInts` throws an integer overflow exception. The goal is to find the root cause and to improve the exception message.

```
Exception in thread "Lucene Merge Thread #202" org.apache.lucene.index.MergePolicy$MergeException: java.lang.ArithmeticException: integer overflow
	at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:735)
	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:727)
Caused by: java.lang.ArithmeticException: integer overflow
	at java.base/java.lang.Math.toIntExact(Math.java:1135)
	at org.apache.lucene.store.DataOutput.writeGroupVInts(DataOutput.java:354)
	at org.apache.lucene.codecs.lucene99.Lucene99PostingsWriter.finishTerm(Lucene99PostingsWriter.java:379)
	at org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:173)
	at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.write(Lucene90BlockTreeTermsWriter.java:1097)
	at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter.write(Lucene90BlockTreeTermsWriter.java:398)
	at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:95)
	at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:205)
	at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:209)
	at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:298)
	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:137)
	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5252)
	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4740)
	at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:6541)
	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:639)
	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:700)
```

More context from the reporter:

> Looking deeper into this. I think we overflowed a term frequency field.
> Looking at some statistics, in a previous release we had 1,288,526,281
> of a certain field; this would be larger now. Each of these would have
> had a limited set of values. But crucially, nearly all of them would have
> had the term "positional" or "non-positional" added to the document.
>
> There is no good reason to do this today; we should just turn this into
> a boolean field and update the UI. I will do this and report back.
>
> Do you think that a patch adding a try/catch for a more informative log
> message would be appreciated by the community? E.g. mentioning the field
> name in the exception?

> The index that had an issue when merging into one segment definitely had
> the word "positional" in it more than 1 billion times. I hope to be able
> to give a closer number once re-indexing has finished with a "work-around".
>
> Of course the "work-around" is to just fix this correctly by not having
> that word so often in the index, and definitely not as docs, freqs and
> postings.
>
> For background information:
>
> The use case was to find a set of documents that were either
> "positional" or "non-positional". This was present in the first check-in
> of our code 18 years ago! Since then our data has grown a bit ;) The
> code was using Lucene 1.4.3 at that time. Users would search using this
> as what would now be a facet, `type:positional`. I changed this to a
> field with only IndexOptions.DOCS, which is called 'positional' and is
> searched as `positional:yes`, rewriting the previous query syntax behind
> the scenes so as not to break any user tools.

### Version and environment details

_No response_

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
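On the reporter's question about a more informative message: the stack trace shows the exception coming from `Math.toIntExact`, whose message is only "integer overflow". A minimal sketch of the idea follows, assuming the caller knows the field name at the point of conversion. The class `OverflowContext`, the helper `toIntExactWithContext`, and the field name `"type"` are hypothetical and not part of Lucene; which value actually overflowed inside `writeGroupVInts` is not established in this issue.

```java
final class OverflowContext {
    private OverflowContext() {}

    /**
     * Hypothetical helper (not a Lucene API): narrows a long to an int and,
     * if the value does not fit, rethrows with the field name and the value.
     */
    public static int toIntExactWithContext(long value, String fieldName) {
        try {
            return Math.toIntExact(value);
        } catch (ArithmeticException e) {
            ArithmeticException wrapped = new ArithmeticException(
                "integer overflow for field \"" + fieldName + "\": value "
                    + value + " does not fit in an int");
            wrapped.initCause(e); // keep the original exception as the cause
            throw wrapped;
        }
    }

    public static void main(String[] args) {
        // A value that fits is returned unchanged.
        System.out.println(toIntExactWithContext(42L, "type"));
        try {
            // One past Integer.MAX_VALUE triggers the overflow path.
            toIntExactWithContext(Integer.MAX_VALUE + 1L, "type");
        } catch (ArithmeticException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the exception type as `ArithmeticException` and attaching the original as the cause means existing handlers would still match, while the message now names the field and the offending value.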