mikemccand commented on pull request #2080:
URL: https://github.com/apache/lucene-solr/pull/2080#issuecomment-745375768


   > > > Hmm, but I think sumTotalTermFreq, which is the per-field sum of totalTermFreq across all terms in that field, could overflow long even today, in an adversarial case. And it would not be detected by Lucene...
   > 
   > I don't think so. I like to think of this as "number of tokens" in the corpus. Because each doc is limited to Integer.MAX_VALUE and there can only be Integer.MAX_VALUE docs, sumTotalTermFreq can't overflow. And totalTermFreq is <= sumTotalTermFreq (it would be equal in a degenerate case where all your documents only have a single word repeated many times).
   
   Ahh you're right ... no more than `Integer.MAX_VALUE` tokens in one document, OK.
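
   The bound in the quoted argument can be checked with a little arithmetic. The following is a minimal sketch, not Lucene code: the class and method names are hypothetical, and it assumes only the two limits stated above (at most `Integer.MAX_VALUE` docs, at most `Integer.MAX_VALUE` tokens per doc).

```java
// Hypothetical helper, not part of Lucene: checks that the worst-case
// token count implied by the two limits still fits in a signed 64-bit long.
final class TokenBound {
    // Assumed limits taken from the discussion above.
    static final long MAX_DOCS = Integer.MAX_VALUE;
    static final long MAX_TOKENS_PER_DOC = Integer.MAX_VALUE;

    // Math.multiplyExact throws ArithmeticException if the product overflows long.
    static long worstCaseTokens() {
        return Math.multiplyExact(MAX_DOCS, MAX_TOKENS_PER_DOC);
    }

    public static void main(String[] args) {
        long worst = worstCaseTokens();
        System.out.println("worst case tokens = " + worst);
        System.out.println("fits in long      = " + (worst < Long.MAX_VALUE));
    }
}
```

   The product is (2^31 - 1)^2, which is a little under 2^62, so it stays below `Long.MAX_VALUE` (2^63 - 1) as long as both limits are respected.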
   
   > > How about decoupling these two problems? First, let's fix the aggregation of totalTermFreq and sumTotalTermFreq to explicitly catch any overflow instead of just doing the dangerous += today: https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/PushPostingsWriterBase.java#L142 and https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/blocktree/BlockTreeTermsWriter.java#L915? I.e. switch these accumulations to Math.addExact. This will explicitly catch long overflow for either of these stats.
   > 
   > I don't think this is correct. You wouldn't trip this until after merge, far after you've already overflowed the values and caused broken search results (assuming you have more than one segment).
   
   Hrmph, also correct, boo.
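
   To illustrate the difference between a plain `+=` and `Math.addExact`, and why a check inside a single segment's writer comes too late, here is a minimal sketch. It is not Lucene code: the class, the method names, and the per-segment values are hypothetical. Each value fits in a long on its own, so a checked sum within one segment would pass, yet combining the two overflows.

```java
// Hypothetical illustration, not part of Lucene.
final class OverflowCheck {
    // Plain addition: silently wraps around on overflow.
    static long uncheckedSum(long a, long b) {
        return a + b;
    }

    // Math.addExact throws ArithmeticException instead of wrapping.
    static boolean checkedSumOverflows(long a, long b) {
        try {
            Math.addExact(a, b);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Assumed per-segment statistics; each one is a valid long by itself.
        long segmentA = Long.MAX_VALUE / 2 + 1;
        long segmentB = Long.MAX_VALUE / 2 + 1;

        // Summing across segments with plain + wraps to a negative value.
        System.out.println("unchecked = " + uncheckedSum(segmentA, segmentB));

        // The checked sum only fails once the two values are combined,
        // which for a writer-side check means at merge time.
        System.out.println("overflow detected = " + checkedSumOverflows(segmentA, segmentB));
    }
}
```

   In this sketch the unchecked sum wraps to `Long.MIN_VALUE`, while the checked version reports the overflow only at the point where the two segment values are added together.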
   
   Alright, I guess there is nothing we can fix here ... applications simply must not create more than `Integer.MAX_VALUE` term frequencies in one doc/field.

