gsmiller opened a new pull request, #14614: URL: https://github.com/apache/lucene/pull/14614
### Description

Estimating bloom filter cardinality should account for the number of hash functions used. This method appears to assume a single hash function, which isn't correct.

Note: The API surface area of `FuzzySet` in general may also be due for a cleanup. This method doesn't appear to have any direct usage today, so it's debatable whether or not we should deprecate it. I'm in favor of keeping it, but would probably suggest deprecating a few other methods (including the version of this method I kept around that assumes one hash function). I'll publish a separate PR where we can discuss this, but I think it's worth fixing this bug for now.

### Testing

I verified that this method dramatically over-estimates bloom filter cardinality in its current state (by a factor approximately equal to the hash count) and verified that this change corrects it. This was ad hoc testing related to something else I'm working on.

I notice there are no unit tests for `FuzzySet`, so I didn't add any explicit test cases for now. I can create some tests for `FuzzySet` as part of this PR if folks have a strong opinion. I can also add testing as a separate task when looking into cleaning up the API surface area...
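
For reference, the textbook cardinality estimate for a bloom filter with `m` bits, `k` hash functions, and `X` set bits is `n ≈ -(m / k) * ln(1 - X / m)`. Dropping the `1 / k` term inflates the result by roughly the hash count, which matches the over-estimate described above. The following is a minimal sketch of that formula only. The class and method names are hypothetical and are not `FuzzySet`'s actual API, and the random bit positions merely stand in for real hash functions.

```java
import java.util.BitSet;
import java.util.SplittableRandom;

/** Hypothetical helper for illustration; not part of Lucene's FuzzySet. */
final class BloomCardinalitySketch {

  /**
   * Estimates the number of distinct values added to a bloom filter,
   * using n ~= -(m / k) * ln(1 - X / m).
   *
   * @param numBits total number of bits in the filter (m)
   * @param numSetBits number of bits currently set (X)
   * @param hashCount number of hash functions applied per value (k)
   */
  static long estimate(int numBits, int numSetBits, int hashCount) {
    if (numBits <= 0 || hashCount <= 0 || numSetBits < 0 || numSetBits > numBits) {
      throw new IllegalArgumentException("invalid bloom filter parameters");
    }
    if (numSetBits == numBits) {
      // A fully saturated filter carries no cardinality information.
      return Long.MAX_VALUE;
    }
    double saturation = (double) numSetBits / numBits;
    // This is the estimate under the assumption of a single hash function.
    double singleHashEstimate = -numBits * Math.log(1.0 - saturation);
    // Each value sets up to hashCount bits, so scale the estimate down by k.
    return Math.round(singleHashEstimate / hashCount);
  }

  public static void main(String[] args) {
    int numBits = 1 << 20;
    int hashCount = 4;
    int inserted = 50_000;

    // Simulate insertions: each value sets hashCount pseudo-random bits.
    BitSet bits = new BitSet(numBits);
    SplittableRandom random = new SplittableRandom(42);
    for (int i = 0; i < inserted; i++) {
      for (int h = 0; h < hashCount; h++) {
        bits.set(random.nextInt(numBits));
      }
    }

    int setBits = bits.cardinality();
    System.out.println("values inserted:      " + inserted);
    System.out.println("k-aware estimate:     " + estimate(numBits, setBits, hashCount));
    System.out.println("single-hash estimate: " + estimate(numBits, setBits, 1));
  }
}
```

Running `main` should show the k-aware estimate landing close to the number of inserted values, while the single-hash estimate comes out about `hashCount` times larger.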