jpountz commented on PR #14080:
URL: https://github.com/apache/lucene/pull/14080#issuecomment-2556958224

Nightly benchmarks picked up this change, the bump is pretty cool. :) https://benchmarks.mikemccandless.com/CountAndHighHigh.html I pushed an annotation.

> It makes me wonder if I should try reviving a dense posting encoding I had played around with a while ago where very-high-frequency terms would be encoded in the index using a bitset. If we had that we could use those directly.

Yes, this sounds like it could be interesting indeed! I wonder if we should make the decision on a per-block basis rather than for the whole postings, in order to benefit postings that are only dense on some specific ranges of the doc ID space (which can happen when using index sorting or recursive graph bisection). Another idea that crossed my mind would consist of storing doc IDs as deltas from the first doc ID in the block in a short[] to further take advantage of SIMD (when applicable).

> we don't really have such terms in luceneutil

Wouldn't stop words qualify? E.g. I see that "the", "of" and "not" appear in 77%, 78% and 28% of documents respectively.
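
To make the per-block idea a bit more concrete, here is a rough sketch of what choosing an encoding per block could look like. This is not Lucene code: the class `BlockEncodingSketch`, the `Encoding` enum, the helper methods and the size-based thresholds are all hypothetical, and the block size of 128 docs is an assumption made for the example. It only compares the bit cost of a bit set (one bit per doc ID in the block's range) with 16-bit deltas from the first doc ID in the block, and falls back to plain ints when the range does not fit in 16 bits.

```java
import java.util.BitSet;

// Hypothetical sketch, not a Lucene API: picks an encoding for one block of
// strictly increasing doc IDs, based only on how dense the block is.
final class BlockEncodingSketch {

  enum Encoding { BITSET, SHORT_DELTAS, INT_DOC_IDS }

  /** Assumed rule: take the bit set when it is no larger than 16 bits per doc. */
  static Encoding choose(int[] docs) {
    // Number of doc IDs covered by the block, first and last included.
    long range = (long) docs[docs.length - 1] - docs[0] + 1;
    if (range <= 16L * docs.length) {
      return Encoding.BITSET; // one bit per doc ID in the range
    }
    if (range <= (1 << 16)) {
      return Encoding.SHORT_DELTAS; // every delta from docs[0] fits in 16 bits
    }
    return Encoding.INT_DOC_IDS;
  }

  /** Bit i is set when doc ID docs[0] + i is present in the block. */
  static BitSet toBitSet(int[] docs) {
    BitSet bits = new BitSet(docs[docs.length - 1] - docs[0] + 1);
    for (int doc : docs) {
      bits.set(doc - docs[0]);
    }
    return bits;
  }

  /** Stores each doc ID as an unsigned 16-bit delta from the first doc ID. */
  static short[] toShortDeltas(int[] docs) {
    short[] deltas = new short[docs.length];
    for (int i = 0; i < docs.length; i++) {
      deltas[i] = (short) (docs[i] - docs[0]);
    }
    return deltas;
  }

  /** Inverse of toShortDeltas; the shorts are read back as unsigned values. */
  static int[] fromShortDeltas(int base, short[] deltas) {
    int[] docs = new int[deltas.length];
    for (int i = 0; i < deltas.length; i++) {
      docs[i] = base + Short.toUnsignedInt(deltas[i]);
    }
    return docs;
  }

  /** Test helper: count doc IDs starting at first, spaced by step. */
  static int[] block(int first, int step, int count) {
    int[] docs = new int[count];
    for (int i = 0; i < count; i++) {
      docs[i] = first + i * step;
    }
    return docs;
  }

  public static void main(String[] args) {
    System.out.println(choose(block(1000, 1, 128)));    // fully dense block
    System.out.println(choose(block(1000, 500, 128)));  // range fits in 16 bits
    System.out.println(choose(block(1000, 1000, 128))); // range too wide for shorts
  }
}
```

Under these assumptions, a term that matches 77% of documents would spread a block of 128 postings over roughly 166 doc IDs on average, which the rule above would encode as a bit set. A sparse term could still get a bit set for the few blocks where index sorting or recursive graph bisection clusters its matches.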