Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-17 Thread via GitHub
dsmiley commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2730909459 BTW I don't have plans to explore this further. Anyone should feel free to take over. Or abandon if nobody cares -- I admit it's very unusual to even have a top level disjunction, let

Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-17 Thread via GitHub
jpountz commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2729988278 The current approach is probably not the fastest indeed. We should add a task to nightly benchmarks if we want to optimize this. Something like a disjunction of phrase queries (possibly

Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-17 Thread via GitHub
dsmiley commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2729965328 An aside: `org.apache.lucene.search.DisjunctionScorer.TwoPhase#matches` looks kind of sad, in that each matches() call is going to build a priority queue of "unverified matches" (DisiWr

Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-16 Thread via GitHub
dsmiley commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2728182572 I could imagine improving BooleanScorer so that the TPI clauses are separated and converted to a filter around the collector to try to match docs *not* collected (i.e. test for docs inbe

Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-16 Thread via GitHub
jpountz commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727625240 > If one or more DISI has a high cost (irrespective of TPIs), thus matching many docs, I could see avoiding BS1 as well. I imagine that your idea is that if most of the cost comes

Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-16 Thread via GitHub
jpountz commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727629320 In case you missed it, `BooleanScorer` had optimizations recently that make it hard to beat by `DisjunctionScorer` when clauses are `PostingsEnum`s: - `DocIdSetIterator#intoBitSet` he

Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-16 Thread via GitHub
dsmiley commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727528269 If one or more DISI has a high cost (irrespective of TPIs), thus matching many docs, I could see avoiding BS1 as well. An aside, if we are going to refer to these as BS1 vs BS2, th

Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-16 Thread via GitHub
jpountz commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727502419 BS2 uses a heap to merge multiple `DocIdSetIterator`s. Unfortunately, reordering this heap on every call to `nextDoc()` or `advance(int)` is not completely free and BS1's approach of loa
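
To make the heap cost concrete, here is a minimal, self-contained sketch that does not use Lucene at all. Plain sorted `int[]` arrays stand in for `DocIdSetIterator`s, and the class and method names (`DisjunctionMergeSketch`, `mergeWithHeap`, `mergeWithBitSet`) are hypothetical. The second method assumes that BS1's approach is to load each clause's matches into a bit set one window of doc IDs at a time; the window size is a parameter here, not Lucene's actual value.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.PriorityQueue;

final class DisjunctionMergeSketch {
  /** Cursor over one clause's sorted doc IDs (hypothetical stand-in for a DocIdSetIterator). */
  static final class Cursor {
    final int[] docs;
    int pos = 0;
    Cursor(int[] docs) { this.docs = docs; }
    int doc() { return docs[pos]; }
    boolean next() { return ++pos < docs.length; }
  }

  /** BS2-style merge: a heap ordered by current doc, reordered every time a cursor moves. */
  static List<Integer> mergeWithHeap(int[][] clauses) {
    PriorityQueue<Cursor> heap =
        new PriorityQueue<>((a, b) -> Integer.compare(a.doc(), b.doc()));
    for (int[] docs : clauses) {
      if (docs.length > 0) heap.add(new Cursor(docs));
    }
    List<Integer> out = new ArrayList<>();
    while (!heap.isEmpty()) {
      int doc = heap.peek().doc();
      out.add(doc);
      // Advance every cursor positioned on `doc`; each poll/add pays O(log n).
      while (!heap.isEmpty() && heap.peek().doc() == doc) {
        Cursor c = heap.poll();
        if (c.next()) heap.add(c);
      }
    }
    return out;
  }

  /** Assumed BS1-style merge: each clause sets bits for one window, then the bits are read in order. */
  static List<Integer> mergeWithBitSet(int[][] clauses, int maxDoc, int windowSize) {
    List<Integer> out = new ArrayList<>();
    int[] pos = new int[clauses.length]; // read position per clause
    BitSet window = new BitSet(windowSize);
    for (int base = 0; base < maxDoc; base += windowSize) {
      window.clear();
      int end = Math.min(base + windowSize, maxDoc);
      for (int i = 0; i < clauses.length; i++) {
        while (pos[i] < clauses[i].length && clauses[i][pos[i]] < end) {
          window.set(clauses[i][pos[i]++] - base);
        }
      }
      for (int bit = window.nextSetBit(0); bit >= 0; bit = window.nextSetBit(bit + 1)) {
        out.add(base + bit);
      }
    }
    return out;
  }
}
```

Both methods return the same sorted union of doc IDs. The difference is where the work goes: the heap version pays a reordering cost per advanced cursor, while the windowed version only sets bits and then scans them.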

Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-16 Thread via GitHub
dsmiley commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727499162 Thanks for your confirmation of the problem. The collect-per-clause is surprising to me; like what would benefit from that algorithm? Wouldn't that _only_ be in fact _needed_ if scores

Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-16 Thread via GitHub
jpountz commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727461724 +1 let's use `DisjunctionSumScorer` (which already supports two-phase iteration) when one of the clauses exposes a non-null two-phase iterator?
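
The selection rule proposed in this comment could look roughly like the sketch below. It is not the PR's code: `ScorerChoiceSketch`, `Clause`, and `Choice` are hypothetical names, and the `Clause` interface only mirrors the convention of Lucene's `Scorer#twoPhaseIterator()`, which returns null when a scorer has no two-phase iterator.

```java
import java.util.List;

final class ScorerChoiceSketch {
  /** Hypothetical stand-in for a clause scorer; only the two-phase accessor matters here. */
  interface Clause {
    /** Returns null when the clause has no two-phase iterator. */
    Object twoPhaseIterator();
  }

  /** Hypothetical labels for the two scoring strategies discussed in the thread. */
  enum Choice { BOOLEAN_SCORER, DISJUNCTION_SUM_SCORER }

  /** Falls back to the doc-at-a-time disjunction as soon as any clause is two-phase. */
  static Choice choose(List<Clause> clauses) {
    for (Clause clause : clauses) {
      if (clause.twoPhaseIterator() != null) {
        return Choice.DISJUNCTION_SUM_SCORER;
      }
    }
    return Choice.BOOLEAN_SCORER;
  }
}
```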

[PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-15 Thread via GitHub
dsmiley opened a new pull request, #14357: URL: https://github.com/apache/lucene/pull/14357 Showing a performance problem here in BooleanScorer (used for disjunctions -- "OR"). BS will score all its clauses independently, overlapping the same documents, some of which might be expensive wit
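
The overlap problem can be illustrated with a toy model that does not depend on Lucene. Everything in it is an assumption made for illustration: `OverlapCostSketch` and its `Clause` class are hypothetical, a `BitSet` plays the role of a clause's approximation, and an `IntPredicate` plays the role of an expensive verification step such as a two-phase `matches()` check. The doc-at-a-time variant assumes scores are not needed, so a doc already matched by an exact clause never has to be verified.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

final class OverlapCostSketch {
  /** Hypothetical clause: a cheap approximation plus an optional costly verification. */
  static final class Clause {
    final BitSet approximation;
    final IntPredicate verifier; // null means the approximation is already exact
    int verifyCalls = 0;         // counts how often the costly step ran

    Clause(BitSet approximation, IntPredicate verifier) {
      this.approximation = approximation;
      this.verifier = verifier;
    }

    boolean isTwoPhase() { return verifier != null; }

    boolean matches(int doc) {
      if (!approximation.get(doc)) return false;
      if (verifier == null) return true;
      verifyCalls++;
      return verifier.test(doc);
    }
  }

  /** Clause-at-a-time: each clause verifies all of its candidates, even docs another clause matched. */
  static BitSet collectPerClause(Clause[] clauses, int maxDoc) {
    BitSet hits = new BitSet(maxDoc);
    for (Clause c : clauses) {
      for (int doc = c.approximation.nextSetBit(0); doc >= 0;
          doc = c.approximation.nextSetBit(doc + 1)) {
        if (c.matches(doc)) hits.set(doc);
      }
    }
    return hits;
  }

  /** Doc-at-a-time: try exact clauses first; verify two-phase clauses only if still unmatched. */
  static BitSet collectPerDoc(Clause[] clauses, int maxDoc) {
    BitSet hits = new BitSet(maxDoc);
    for (int doc = 0; doc < maxDoc; doc++) {
      boolean matched = false;
      for (Clause c : clauses) {
        if (!c.isTwoPhase() && c.matches(doc)) { matched = true; break; }
      }
      if (!matched) {
        for (Clause c : clauses) {
          if (c.isTwoPhase() && c.matches(doc)) { matched = true; break; }
        }
      }
      if (matched) hits.set(doc);
    }
    return hits;
  }
}
```

With an exact clause covering docs 0 to 9 and a two-phase clause whose approximation covers docs 5 to 14, both methods collect the same hits, but the clause-at-a-time version runs the verifier 10 times while the doc-at-a-time version runs it only 5 times (for docs 10 to 14).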