jpountz commented on PR #14080:
URL: https://github.com/apache/lucene/pull/14080#issuecomment-2556958224

   Nightly benchmarks picked up this change, and the bump is pretty cool. :) 
https://benchmarks.mikemccandless.com/CountAndHighHigh.html I pushed an 
annotation.
   
   > It makes me wonder if I should try reviving a dense posting encoding I had 
played around with a while ago where very-high-frequency terms would be encoded 
in the index using a bitset. If we had that we could use those directly.
   
   Yes, this sounds like it could be interesting indeed! I wonder if we should 
make the decision on a per-block basis rather than for the whole postings, in 
order to benefit postings that are only dense on some specific ranges of the 
doc ID space (which can happen when using index sorting or recursive graph 
bisection).
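
To make the per-block idea concrete, here is a minimal sketch of how such a decision could be made. It is not Lucene's postings format: the class name, the `choose` heuristic and the `bitsPerDelta` parameter are all hypothetical and only illustrate comparing the cost of a bitset over a block's doc ID range against the cost of delta-coding the same block.

```java
import java.util.BitSet;

// Hypothetical sketch, not Lucene code: pick an encoding for one block of postings.
final class BlockEncodingSketch {
    enum Encoding { BITSET, DELTAS }

    // Assumed heuristic: a bitset needs one bit per doc ID in the block's range,
    // while delta coding needs roughly bitsPerDelta bits per doc in the block.
    static Encoding choose(int[] sortedDocIds, int bitsPerDelta) {
        int first = sortedDocIds[0];
        int last = sortedDocIds[sortedDocIds.length - 1];
        long bitsetBits = (long) last - first + 1;
        long deltaBits = (long) sortedDocIds.length * bitsPerDelta;
        return bitsetBits <= deltaBits ? Encoding.BITSET : Encoding.DELTAS;
    }

    // Encode a block as a bitset whose bit i stands for doc ID (first + i).
    static BitSet toBitSet(int[] sortedDocIds) {
        int first = sortedDocIds[0];
        int last = sortedDocIds[sortedDocIds.length - 1];
        BitSet bits = new BitSet(last - first + 1);
        for (int doc : sortedDocIds) {
            bits.set(doc - first);
        }
        return bits;
    }
}
```

With a rule like this, a block covering a dense range of the doc ID space would get the bitset while sparse blocks of the same postings list keep delta coding.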
   
   Another idea that crossed my mind would consist of storing doc IDs as deltas 
from the first doc ID in the block in a short[] to further take advantage of 
SIMD (when applicable).
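
As a rough illustration of the short[] idea, the sketch below stores each doc ID as an unsigned 16-bit offset from the first doc ID of its block. The class and method names are hypothetical, and the fallback (returning null when the block spans more than 65535 doc IDs) is an assumption made for the example. The code is scalar; whether a loop like `decode` actually gets vectorized depends on the JIT and hardware.

```java
// Hypothetical sketch, not Lucene code: doc IDs as 16-bit offsets from the block's first doc ID.
final class ShortDeltaBlockSketch {
    // Returns the offsets, or null if the block's range does not fit in 16 unsigned bits.
    static short[] encode(int[] sortedDocIds) {
        int first = sortedDocIds[0];
        int last = sortedDocIds[sortedDocIds.length - 1];
        if ((long) last - first > 0xFFFF) {
            return null; // assumed fallback: caller uses another encoding
        }
        short[] offsets = new short[sortedDocIds.length];
        for (int i = 0; i < sortedDocIds.length; i++) {
            offsets[i] = (short) (sortedDocIds[i] - first);
        }
        return offsets;
    }

    // Rebuilds absolute doc IDs; the mask reads each short as an unsigned value.
    static int[] decode(int first, short[] offsets) {
        int[] docIds = new int[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            docIds[i] = first + (offsets[i] & 0xFFFF);
        }
        return docIds;
    }
}
```

Because every offset is relative to the same base rather than to the previous doc, decoding has no dependency between consecutive elements, which is what makes this layout a candidate for SIMD.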
   
   > we don't really have such terms in luceneutil
   
   Wouldn't stop words qualify? E.g. I see that "the", "of" and "not" appear in 
77%, 78% and 28% of documents respectively.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

