gf2121 commented on PR #13221: URL: https://github.com/apache/lucene/pull/13221#issuecomment-2022184017
> A downside is that this approach may be less memory-efficient, since we store competitive docs as integers, never as a bit set like today. But we may be able to work around it by just starting dynamic pruning a bit later, when we have fewer docs to store.

+1 for the memory concern. I'd limit the doc-count threshold to `maxDoc >> 5` so that we never allocate more than `maxDoc` bits of memory.

> It's a bit surprising to me that we are configuring the disjunction by the number of docs on each clause (MIN_BLOCK_DISI_LENGTH) as this could create disjunctions with many many clauses (which is something I don't like about TermOrdValComparator). Maybe we could configure it via a target number of clauses instead, e.g. something like min(128, leadCost / 512)?

More disjunction clauses mean we can update the competitive iterator more often, but they also add overhead to `CompetitiveIterator#advance`. I like the idea of configuring by the number of disjunction clauses, will do.

> I wonder how this approach compares with the current one in the worst-case scenario when documents get collected in competitive order, e.g. increasing order for a descending sort. I don't have a good intuition for whether it would perform better or worse, do you know?

In the worst case we cannot skip any docs in the first segment, so I think the overhead can be considered from the following perspectives:

* **Update-bottom overhead:** the current logic needs to estimate the point count each round, while this patch computes the threshold value once. This patch wins.
* **CompetitiveIterator iteration overhead:** the current `CompetitiveIterator` builds an array/bitset iterator, so it has a cheap `advance`, while this patch builds a disjunction of small arrays. Main wins.
* **CompetitiveIterator building overhead:**
  - For low/medium-cardinality fields, this patch can win because we build small DISIs, so we can skip the sort for a leaf block whose docs are already in order.
  - For high-cardinality fields, if the current logic kicks in the bitset case, then Main wins because building a bitset can be cheaper than sorting.
  - For high-cardinality fields, if the current logic does not kick in the bitset case, then they are equal, since we use an O(n) `LSBRadixSorter` to sort these docs.

In general, for the worst case, I think this patch may speed up medium-cardinality fields but slow down high-cardinality fields that kick in the bitset case. I'll play with luceneutil to verify the guess.
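To make the two heuristics above concrete, here is a minimal sketch. The class and method names (`PruningThresholds`, `maxStoredDocs`, `targetClauseCount`) are hypothetical and not the PR's actual code; they only illustrate the arithmetic being proposed:

```java
// Hypothetical helper illustrating the proposed thresholds; names are
// assumptions for illustration, not code from the PR.
public final class PruningThresholds {

  /**
   * Competitive docs are stored as ints (32 bits each), while a bit set
   * costs maxDoc bits total. Capping the number of stored docs at
   * maxDoc >> 5 (i.e. maxDoc / 32) keeps the int storage at or below
   * maxDoc bits, so this approach never uses more memory than the bit set
   * it replaces.
   */
  public static int maxStoredDocs(int maxDoc) {
    return maxDoc >> 5;
  }

  /**
   * The suggested clause-count heuristic: target roughly one disjunction
   * clause per 512 docs of the lead cost, capped at 128 clauses.
   */
  public static long targetClauseCount(long leadCost) {
    return Math.min(128L, leadCost / 512L);
  }
}
```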
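A minimal sketch of the "skip the sort when docs are already in order" idea for leaf blocks. `DocBlockSorter`, `isSorted`, and `sortIfNeeded` are hypothetical names, and `Arrays.sort` stands in for Lucene's O(n) `LSBRadixSorter`; the point is only the up-front sortedness check:

```java
import java.util.Arrays;

// Sketch of the sorted-check optimization: docs collected in order
// (the worst case for pruning) let us skip the sort entirely.
public final class DocBlockSorter {

  /** Returns true if docs[0..len) is already in non-decreasing order. */
  static boolean isSorted(int[] docs, int len) {
    for (int i = 1; i < len; i++) {
      if (docs[i] < docs[i - 1]) {
        return false;
      }
    }
    return true;
  }

  /** Sorts the first len docs only when they are out of order. */
  static void sortIfNeeded(int[] docs, int len) {
    if (!isSorted(docs, len)) {
      Arrays.sort(docs, 0, len); // Lucene would use LSBRadixSorter here
    }
  }
}
```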