gf2121 commented on PR #13221:
URL: https://github.com/apache/lucene/pull/13221#issuecomment-2022184017

   > A downside is that this approach may be less memory-efficient, since we 
store competitive docs as integers, never as a bit set like today. But we may 
be able to work around it by just starting dynamic pruning a bit later, when we 
have fewer docs to store.
   
   +1 for the memory concern. I'd limit the `doc num threshold` to `maxDoc >> 5` 
   so that it never allocates more than maxDoc bits of memory.
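   To make that bound concrete, here is a small sketch of the arithmetic (the 
   helper name is illustrative, not from the patch): each doc stored as an `int` 
   costs 32 bits, so capping the stored-doc count at `maxDoc >> 5` keeps the 
   total at or below `maxDoc` bits, i.e. no worse than a bitset over the whole 
   segment.
   
   ```java
   // Sketch only: `docNumThreshold` is an illustrative name, not from the patch.
   public class ThresholdMemoryBound {
     // Cap the number of docs kept as ints so that 32 bits per stored doc
     // never exceeds the maxDoc bits a whole-segment bitset would need.
     static int docNumThreshold(int maxDoc) {
       return maxDoc >>> 5; // maxDoc / 32
     }
   
     public static void main(String[] args) {
       int maxDoc = 1 << 20; // e.g. a segment with ~1M docs
       long intBits = (long) docNumThreshold(maxDoc) * Integer.SIZE;
       // Storing up to the threshold as ints costs at most maxDoc bits.
       System.out.println(intBits <= maxDoc); // prints "true"
     }
   }
   ```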
   
   > It's a bit surprising to me that we are configuring the disjunction by the 
number of docs on each clause (MIN_BLOCK_DISI_LENGTH) as this could create 
disjunctions with many many clauses (which is something I don't like about 
TermOrdValComparator). Maybe we could configure it via a target number of 
clauses instead, e.g. something like min(128, leadCost / 512)?
   
   More disjunction clauses mean we can update the competitive iterator more 
   often, but they also add overhead in `CompetitiveIterator#advance`. I like 
   the idea of configuring by the number of disjunction clauses; will do.
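   A minimal sketch of the suggested heuristic (the method name and the floor 
   of one clause are my assumptions, not part of the PR):
   
   ```java
   // Sketch of the suggested `min(128, leadCost / 512)` clause-count heuristic.
   public class ClauseCountHeuristic {
     static long targetClauseCount(long leadCost) {
       // Cap at 128 clauses so disjunctions never grow unbounded; the floor of
       // one clause is an added assumption to keep the result usable.
       return Math.min(128L, Math.max(1L, leadCost / 512L));
     }
   
     public static void main(String[] args) {
       System.out.println(targetClauseCount(1_000_000L)); // large leads cap at 128
       System.out.println(targetClauseCount(10_000L));    // 10_000 / 512 = 19
     }
   }
   ```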
   
   > I wonder how this approach compares with the current one in the worst-case 
scenario when documents get collected in competitive order, e.g. increasing 
order for a descending sort. I don't have a good intuition for whether it would 
perform better or worse, do you know?
   
   In the worst case we cannot skip any docs in the first segment, so I think 
   the overhead can be considered from the following perspectives:
   
   * **Update bottom overhead:** The current logic needs to estimate the point 
count each update round, while this patch computes the threshold value only 
once. This patch wins.
   
   * **CompetitiveIterator iteration overhead:** The current CompetitiveIterator 
builds an array/bitset iterator, so it has a cheap advance, while this patch 
builds a disjunction of small arrays. Main wins.
   
   * **CompetitiveIterator building overhead:** 
       - For low/medium cardinality fields, this patch can win because we are 
building small DISIs, so we can skip the sort for a leaf block when docs are 
in order.
       - For high cardinality fields, if the current logic kicks in the bitset 
case, then main wins because building a bitset can be cheaper than sorting.
       - For high cardinality fields, if the current logic does not kick in the 
bitset case, then they are equal since both use the O(n) LSBRadixSorter to 
sort these docs.
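   To illustrate why that sort is O(n): a toy LSB radix sort over 8-bit digits, 
   a simplified take on the idea behind Lucene's LSBRadixSorter (this is not 
   the actual implementation). It runs a fixed number of counting passes, so 
   the cost is linear regardless of doc order:
   
   ```java
   import java.util.Arrays;
   
   // Toy LSB radix sort -- a simplified illustration of the idea behind
   // Lucene's LSBRadixSorter, not the actual implementation.
   public class LsbRadixSortSketch {
     static void sort(int[] a) {
       int[] src = a;
       int[] dst = new int[a.length];
       for (int shift = 0; shift < 32; shift += 8) {
         int[] count = new int[257];
         for (int v : src) count[((v >>> shift) & 0xFF) + 1]++;    // histogram
         for (int i = 1; i < 257; i++) count[i] += count[i - 1];   // prefix sums
         for (int v : src) dst[count[(v >>> shift) & 0xFF]++] = v; // stable scatter
         int[] tmp = src; src = dst; dst = tmp;
       }
       // Four passes (an even number), so the sorted data ends up back in `a`.
     }
   
     public static void main(String[] args) {
       int[] docs = {42, 7, 1024, 7, 0, 999_999};
       sort(docs); // cost is O(n) regardless of input order
       System.out.println(Arrays.toString(docs));
     }
   }
   ```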
       
   In general, for the worst case, I think this patch may speed up medium 
cardinality fields but slow down high cardinality fields that kick in the 
bitset case. I'll play with luceneutil to verify this guess.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

