[GitHub] [lucene] jpountz commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

GitBox Tue, 22 Feb 2022 00:09:30 -0800


jpountz commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1047526553



   In my opinion the API as it is today isn't bad. The only thing we might want 
to change is to make `DocIdSetBuilder#grow` take a long instead of an int.
   
   Maybe it's a javadocs issue because `DocIdSetBuilder#grow` says that it 
returns "a `BulkAdder` object that can be used to add up to `numDocs` 
documents", which might suggest that `numDocs` is the number of unique 
documents contributed, when in fact this number is simply an upper bound of the 
number of times that you may call `BulkAdder#add` on the returned `BulkAdder` 
object.
   
   > I'm still a bit confused about why we need to grow(long) on a bitset that 
can only hold Integer.MAX_VALUE elements.
   
   This doesn't have anything to do with the `long counter` that you looked at.
   
   The point of `BulkAdder#add` is to call it every time we find a matching 
(docID, value) pair, and the number of matching pairs may be larger than 
`Integer#MAX_VALUE` (e.g. a range over a multi-valued field that matches all 
docs but one), hence the long. This is the same reason why e.g. 
`SortedSetDocValues#nextOrd` returns a long.
   
   > in the sparse/buffer case, wouldn't a much simpler estimation simply be 
the length of int array?
   
   This is already the case today, see the `else` block in 
`DocIdSetBuilder#build`. The cost estimation logic only happens in the dense 
case when a `FixedBitSet` is used to hold the set of matching docs.
   
   FWIW we could change the estimation logic to perform a popCount over a 
subset of the `FixedBitSet` and scale it according to the size of the bitset or 
something along these lines, if we think that it would be better than tracking 
this counter and dividing it by the number of values per doc.
   
   > I'm also confused why we have this sorted array buffer case instead of 
using SparseFixedBitSet
   
   `SparseFixedBitSet` is the right choice for the sparse case when you need 
something that implements the `BitSet` API. Here we only need to produce a 
`DocIdSet` and buffering doc IDs into an array and sorting them using radix 
sort proved to be faster than accumulating doc IDs into a `SparseFixedBitSet`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] jpountz commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Reply via email to