jfboeuf opened a new pull request, #11900: URL: https://github.com/apache/lucene/pull/11900
BloomFilteringPostingsFormat currently relies on a bloom filter with one hash function (k=1). For a target false positive probability of 10%, 1 is never the optimal value for k. Using the best value for k would either: * achieve a much better fpp with the same bitset size, or * achieve the same fpp with a reduced size (half of what is currently used. From tests: * targeting a better false positive probability doesn't bring significant enough better performance for the increased size; * Targeting a smaller size by degrading the false positive probability comes with a significant performance hit. As consequence, a target false positive probability of about 10% seems to be a good trade-off. I slightly raised this value (to 0.1023f) so the size of newly allocated bloom filters is always half the size of what they used to be. The effective false positive probability varies from significantly better in most cases to slightly worse in rare cases. [This graph ](https://drive.google.com/file/d/1RgofprJ0GyYaDQUZD59Pdp_b2gfmJASA/view?usp=sharing) compares both size and effective false positive probability of the current and proposed implementations. Overall performance remains comparable (slightly but not significantly better); the reduced size and the improved false positive probability compensate for the cost of having additional hashes. You can find in branch bloomPerfBench the class BloomBench I used to check for performance. In addition, the implementation of the bitset is based on a long array, so picking up a size lower than 64 bits is pointless. API change: * HashFunction.hash(BytesRef) returns a long: more accuracy with a 64bits hash useful to derivate additional hashes from the original one. The proposed implementation remains compatible with existing/persisted bloom filters. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org