mikemccand commented on pull request #2080: URL: https://github.com/apache/lucene-solr/pull/2080#issuecomment-745375768
> > Hmm, but I think sumTotalTermFreq, which is the per-field sum of totalTermFreq across all terms in that field, could overflow long even today, in an adversarial case. And it would not be detected by Lucene...
>
> I don't think so. I like to think of this as "number of tokens" in the corpus. Because each doc is limited to Integer.MAX_VALUE and there can only be Integer.MAX_VALUE docs, sumTotalTermFreq can't overflow. And totalTermFreq is <= sumTotalTermFreq (it would be equal in a degraded case where all your documents only have a single word repeated many times).

Ahh you're right ... no more than `Integer.MAX_VALUE` tokens in one document, OK.

> > How about decoupling these two problems? First, let's fix the aggregation of totalTermFreq and sumTotalTermFreq to explicitly catch any overflow instead of just doing the dangerous += today:
> > https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/PushPostingsWriterBase.java#L142
> > and
> > https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/blocktree/BlockTreeTermsWriter.java#L915?
> > I.e. switch these accumulations to Math.addExact. This will explicitly catch long overflow for either of these stats.
>
> I don't think this is correct. You wouldn't trip this until after merge, far after you've already overflowed the values and caused broken search results (assuming you have more than one segment).

Hrmph, also correct, boo. Alright, I guess there is nothing we can fix here ... applications simply must not create > `Integer.MAX_VALUE` term frequencies in one doc/field.
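
For reference, here is a minimal sketch of the two points discussed above: why the token bound keeps sumTotalTermFreq within a long, and how a plain += differs from Math.addExact when a sum does overflow. The class and method names (OverflowSketch, maxTokens, uncheckedSum, checkedSum) are hypothetical and not Lucene code, and the input values are artificial. It only shows the mechanics of the check; per the discussion above, in Lucene such a check would not trip until after merge, so it is not a fix.

```java
import java.math.BigInteger;

class OverflowSketch {

    // Upper bound on the number of tokens under the limits discussed above:
    // at most Integer.MAX_VALUE docs, each with at most Integer.MAX_VALUE tokens.
    public static BigInteger maxTokens() {
        BigInteger max = BigInteger.valueOf(Integer.MAX_VALUE);
        return max.multiply(max);
    }

    // Plain += wraps around silently when the sum exceeds Long.MAX_VALUE.
    public static long uncheckedSum(long[] values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    // Math.addExact throws ArithmeticException instead of wrapping.
    public static long checkedSum(long[] values) {
        long sum = 0;
        for (long v : values) {
            sum = Math.addExact(sum, v);
        }
        return sum;
    }

    public static void main(String[] args) {
        // The bound fits in a long, so the sum cannot overflow
        // as long as the per-document limit holds.
        boolean fits = maxTokens().compareTo(BigInteger.valueOf(Long.MAX_VALUE)) < 0;
        System.out.println("bound fits in long: " + fits);

        // Artificial values chosen only to force an overflow.
        long[] stats = {Long.MAX_VALUE, 1L};
        System.out.println("unchecked sum: " + uncheckedSum(stats));
        try {
            checkedSum(stats);
        } catch (ArithmeticException e) {
            System.out.println("checked sum threw ArithmeticException");
        }
    }
}
```

The unchecked variant returns a negative number for the artificial input, which is the kind of silently broken statistic the thread is worried about; the checked variant fails loudly instead.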