gsmiller opened a new pull request, #14614:
URL: https://github.com/apache/lucene/pull/14614

   ### Description
   
   Estimating Bloom filter cardinality should account for the number of hash 
functions used. This method appears to assume that only one hash function is 
used, which isn't correct.
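
For context, the standard estimate for the number of distinct values in a Bloom filter with `m` bits, `k` hash functions, and `X` set bits is `n ≈ -(m / k) * ln(1 - X / m)`. The sketch below is only an illustration of that formula and is not the `FuzzySet` code; the class and method names are hypothetical. It shows how dropping the `1 / k` term inflates the estimate by a factor of `k`.

```java
public final class BloomCardinalitySketch {

  /** Estimate that implicitly assumes one hash function: n ≈ -m * ln(1 - X/m). */
  static double estimateSingleHash(long numBits, long setBits) {
    double saturation = (double) setBits / numBits;
    return -numBits * Math.log(1.0 - saturation);
  }

  /** Estimate that accounts for the hash count: n ≈ -(m/k) * ln(1 - X/m). */
  static double estimate(long numBits, long setBits, int hashCount) {
    return estimateSingleHash(numBits, setBits) / hashCount;
  }

  public static void main(String[] args) {
    // Hypothetical numbers: 1M bits, 4 hash functions, 39,211 bits set.
    long numBits = 1_000_000L;
    long setBits = 39_211L;
    int hashCount = 4;

    System.out.println("single-hash estimate: " + estimateSingleHash(numBits, setBits));
    System.out.println("hash-aware estimate:  " + estimate(numBits, setBits, hashCount));
  }
}
```

With these inputs the hash-aware estimate lands close to 10,000, while the single-hash estimate is four times larger.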
   
   Note: It also looks like the API surface area of `FuzzySet` in general may 
be due for a cleanup. This method doesn't appear to have any direct usage 
today, so it's debatable whether or not we should deprecate it. I'm in favor 
of keeping it, but would probably suggest deprecating a few other methods 
(including the version of this method I kept around that assumes one hash 
function). I'll publish a separate PR where we can discuss this, but I think 
it's worth fixing this bug for now.
   
   ### Testing
   
   I verified that this method dramatically over-estimates Bloom filter 
cardinality in its current state (by a factor approximately equal to the hash 
count) and verified that this change corrects it. This came out of ad hoc 
testing related to something else I'm working on. I notice there are no unit 
tests for `FuzzySet`, so I didn't add any explicit test cases for now. I can 
create some tests for `FuzzySet` as part of this if folks have a strong opinion 
on it. I can also add testing as a separate task when looking into cleaning up 
the API surface area.
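
A check along these lines can be sketched with an idealized simulation. The code below is an assumption-laden illustration, not the ad hoc test described above and not `FuzzySet` itself: it uses `java.util.Random` positions in place of real hash functions, and the class name, method names, and parameters are all hypothetical. It inserts a known number of values, then compares the estimate with and without the hash count.

```java
import java.util.BitSet;
import java.util.Random;

public final class BloomOverestimateDemo {

  /** n ≈ -(m/k) * ln(1 - X/m); pass hashCount = 1 to mimic the single-hash assumption. */
  static double estimate(int numBits, int setBits, int hashCount) {
    double saturation = (double) setBits / numBits;
    return -((double) numBits / hashCount) * Math.log(1.0 - saturation);
  }

  /** Simulates inserting numValues items, each setting hashCount uniformly random bits. */
  static int simulateSetBits(int numBits, int numValues, int hashCount, long seed) {
    Random random = new Random(seed);
    BitSet bits = new BitSet(numBits);
    for (int i = 0; i < numValues; i++) {
      for (int h = 0; h < hashCount; h++) {
        bits.set(random.nextInt(numBits));
      }
    }
    return bits.cardinality();
  }

  public static void main(String[] args) {
    int numBits = 1 << 20;
    int numValues = 50_000;
    int hashCount = 4;

    int setBits = simulateSetBits(numBits, numValues, hashCount, 42L);
    double naive = estimate(numBits, setBits, 1);
    double corrected = estimate(numBits, setBits, hashCount);

    System.out.println("actual values:        " + numValues);
    System.out.println("single-hash estimate: " + naive);
    System.out.println("hash-aware estimate:  " + corrected);
    System.out.println("over-estimate factor: " + naive / numValues);
  }
}
```

In this idealized setting the over-estimate factor should come out close to the hash count, and the hash-aware estimate close to the number of inserted values.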


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


