[GitHub] [lucene] jpountz opened a new pull request, #12444: Add BS1 optimization to MaxScoreBulkScorer.

2023-07-17 Thread via GitHub
jpountz opened a new pull request, #12444: URL: https://github.com/apache/lucene/pull/12444 Lucene's scorers that can dynamically prune on score provide great speedups when they manage to skip many hits. Unfortunately, there are also cases when they cannot skip hits efficiently, one example

[GitHub] [lucene] jpountz commented on pull request #12444: Add BS1 optimization to MaxScoreBulkScorer.

2023-07-17 Thread via GitHub
jpountz commented on PR #12444: URL: https://github.com/apache/lucene/pull/12444#issuecomment-1637514621 I played with the following tasks file to evaluate the impact of this change: ``` OrHigh2: several following OrHigh3: several following publisher OrHigh4: several followin

[GitHub] [lucene] jpountz commented on pull request #12444: Add BS1 optimization to MaxScoreBulkScorer.

2023-07-17 Thread via GitHub
jpountz commented on PR #12444: URL: https://github.com/apache/lucene/pull/12444#issuecomment-1637792504 Here is the usual set of queries, still on wikimedium10m. Sparser disjunctive queries like `Fuzzy1`, `Fuzzy2` and `OrHighLow` can get a slowdown when the majority of clauses have very fe

[GitHub] [lucene] jpountz commented on pull request #12444: Add BS1 optimization to MaxScoreBulkScorer.

2023-07-17 Thread via GitHub
jpountz commented on PR #12444: URL: https://github.com/apache/lucene/pull/12444#issuecomment-1637854931 Here is a table similar to the one above, but with low-cardinality clauses instead of high-cardinality clauses, to show how the overhead of the bitset manifests: ``` OrLow2: riv

[GitHub] [lucene] jpountz commented on issue #12439: Switch from MAXSCORE to BS1 with high numbers of clauses

2023-07-17 Thread via GitHub
jpountz commented on issue #12439: URL: https://github.com/apache/lucene/issues/12439#issuecomment-1638155116 The above idea actually works quite well, see #12444. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the UR

[GitHub] [lucene] epotyom opened a new pull request, #12445: Expression: add a set of duplicate variables

2023-07-17 Thread via GitHub
epotyom opened a new pull request, #12445: URL: https://github.com/apache/lucene/pull/12445 Keep a set of Expression variables that are used more than once. A Lucene application can then use this set to decide whether the corresponding DoubleValuesSource can benefit from caching.
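The idea of tracking variables that are used more than once can be illustrated with a minimal, standalone sketch. It does not use the Lucene expressions API and is not the code from the pull request: the class name `DuplicateVariables`, the method `find`, and the sample variable names are all hypothetical, and the input is assumed to be the flat list of variable references found in an expression.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: illustrates the idea only.
public class DuplicateVariables {

    /**
     * Returns the variable names that occur more than once in the given
     * list of references, in the order in which each repeat is first seen.
     */
    public static Set<String> find(List<String> variableReferences) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String name : variableReferences) {
            // Set.add returns false when the name was already present,
            // which means this reference is a repeat.
            if (!seen.add(name)) {
                duplicates.add(name);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Assumed example: an expression that references "popularity" twice.
        List<String> refs = List.of("popularity", "_score", "popularity");
        System.out.println(find(refs));
    }
}
```

An application could consult such a set and wrap only the sources of repeated variables in a cache, leaving single-use variables untouched.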

[GitHub] [lucene] jpountz opened a new pull request, #12446: Enable rank-unsafe optimizations for MAXSCORE/WAND.

2023-07-17 Thread via GitHub
jpountz opened a new pull request, #12446: URL: https://github.com/apache/lucene/pull/12446 Both MAXSCORE and WAND can easily be tuned to perform rank-unsafe optimizations, by skipping doc IDs that are unlikely to make it to the top-k. The main challenge is how to expose this kind of optimi

[GitHub] [lucene] shubhamvishu commented on pull request #12183: Make some heavy query rewrites concurrent

2023-07-17 Thread via GitHub
shubhamvishu commented on PR #12183: URL: https://github.com/apache/lucene/pull/12183#issuecomment-1638374134 > I'm starting to believe that we should fix the executor to run tasks in the current thread if called from a thread of the pool instead of fixing our collectors in the testing fram

[GitHub] [lucene] shubhamvishu commented on issue #12394: Add the ability to compute vector similarity scores with the new ValuesSource API

2023-07-17 Thread via GitHub
shubhamvishu commented on issue #12394: URL: https://github.com/apache/lucene/issues/12394#issuecomment-1638379241 I see, thanks for clarifying @jpountz

[GitHub] [lucene] jpountz commented on pull request #12446: Enable rank-unsafe optimizations for MAXSCORE/WAND.

2023-07-17 Thread via GitHub
jpountz commented on PR #12446: URL: https://github.com/apache/lucene/pull/12446#issuecomment-1638400035 As an example, with this PR and calling `searcher.setMaxEvaluatedHitRatio(.001f)`, the query `be (+mostly +interview)` goes from 7.0ms to 2.7ms while still returning the same top 100 hit

[GitHub] [lucene] benwtrent commented on pull request #12434: Add ParentJoin KNN support

2023-07-17 Thread via GitHub
benwtrent commented on PR #12434: URL: https://github.com/apache/lucene/pull/12434#issuecomment-1638442919 @jpountz my original benchmarks were flawed. There was a bug in my testing. Nested is actually 80% slower than the current search times (that is, it takes 1.8x as long). I am investigating the c

[GitHub] [lucene] ChrisHegarty commented on pull request #12417: forutil add vectorized and scalar code

2023-07-17 Thread via GitHub
ChrisHegarty commented on PR #12417: URL: https://github.com/apache/lucene/pull/12417#issuecomment-1638836015 Here's where I'm at, after spending the best part of the last three days hacking in this area - I'm on the fence about whether or not this is worth it. The current code and fo

[GitHub] [lucene] tang-hi commented on pull request #12417: forutil add vectorized and scalar code

2023-07-17 Thread via GitHub
tang-hi commented on PR #12417: URL: https://github.com/apache/lucene/pull/12417#issuecomment-1639151537 I also currently believe that it may not be a good time to vectorize it. Although vectorized code combined with lazy compute does improve performance, we currently cannot achieve scalar