shubhamvishu commented on PR #12857: URL: https://github.com/apache/lucene/pull/12857#issuecomment-1836604882

@kaivalnp We could use `acceptDocs.cardinality()` when it is a `BitSetIterator` to get an upper bound, but that bound may include deleted docs, so it could still change the decision of whether to go for exact search or not. We don't know how many of those set bits are live; we only know the number of deletes in the segment, not the intersection of the two. One thing we could try is a heuristic that applies a penalty to the cost based on the segment's delete ratio (`ctx.reader().numDeletedDocs() / ctx.reader().maxDoc()`). For example, if 10% of the segment is deleted we could decrease the cost by 10%, or maybe 5%. This might help in cases where we currently miss falling back to exact search, though it would need thorough benchmarking to see what works best (see the sketch at the end of this comment).

On a separate note, I'm wondering whether there are use cases where we don't need to know this cost upfront and could go straight to approximate search. Currently this optimization only kicks in when the iterator is a `BitSetIterator`, but if we could skip the cost step, or estimate the cost with some other heuristic/approximation, we could make the filter check fully lazy using `DISI#advance(docid)` for those use cases.

@msokolov @benwtrent Maybe you could share your thoughts on this?
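Just to make the delete-ratio idea concrete, here is a rough, hypothetical sketch (not part of this PR, and `estimateFilterCost` is a made-up name): take the `BitSetIterator` cardinality as the upper bound and discount it by the segment's delete ratio before comparing it against the visit limit that decides between approximate and exact search. The discount factor is exactly the knob that would need benchmarking.

```java
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BitSet;
import org.apache.lucene.util.BitSetIterator;

final class FilterCostHeuristic {

  /**
   * Estimates how many filtered docs are live in this segment.
   * When the iterator is a BitSetIterator we know the exact number of set bits,
   * but some of them may point at deleted docs, so we discount the count by the
   * segment's delete ratio (numDeletedDocs / maxDoc), assuming deletes are spread
   * roughly uniformly over the filter.
   */
  static long estimateFilterCost(LeafReaderContext ctx, DocIdSetIterator acceptIterator) {
    long cost;
    if (acceptIterator instanceof BitSetIterator bitSetIterator) {
      BitSet bits = bitSetIterator.getBitSet();
      cost = bits.cardinality(); // upper bound: may still include deleted docs
    } else {
      cost = acceptIterator.cost(); // fall back to the iterator's own estimate
    }
    int maxDoc = ctx.reader().maxDoc();
    int numDeleted = ctx.reader().numDeletedDocs();
    if (numDeleted > 0 && maxDoc > 0) {
      double deleteRatio = (double) numDeleted / maxDoc;
      // Discount the cost proportionally to the delete ratio; the 1.0 multiplier
      // is the penalty knob that would need thorough benchmarking.
      cost = (long) (cost * (1.0 - deleteRatio));
    }
    return cost;
  }
}
```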
@kaivalnp We could use the `acceptDocs.cardinality()` when its a `BitSetIterator` to get the upper bound which might have some deletes but that would still change the decision sometimes of whether to go for exact search or not. Since we don't know how many of those docs are live but we do know the num of deletes in the segment(we don't know the intersections of these two). One thing that might be tried is to come up with some heuristic that adds some penalty to the cost based on the num of deletes in the segment (i.e. `ctx.reader().numDeletedDocs()/ctx.reader().maxDoc()`). Like maybe if there are 10% deletes we could for eg decrease the cost by 10% or maybe 5%. This might help in cases where we miss falling back to exact search. Though this would need some thorough benchmarking to see what works best. On separate note, I'm thinking if there is some use case where we don't require to know this cost upfront and directly go for approximate search only for instance. Currently, this optimization only kicks in when the iterator is of `BitSetIterator` but if its possible to ignore this cost step or get this cost by some other heuristic/approximation then we could completely make it completely lazily evaluated using `DISI#advance(docid)` for those use cases. @msokolov @benwtrent Maybe you could share your thoughts on this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org