[GitHub] [lucene] Tjianke commented on issue #11707: Re-evaluate different ways to encode postings [LUCENE-10672]
Tjianke commented on issue #11707: URL: https://github.com/apache/lucene/issues/11707#issuecomment-1449598942 Lucene community has the good tradition of incorporating academic results. Recent studies show many efficient algorithms like [Partitioned Elias-Fano](http://groups.di.unipi.it/~ottavian/files/elias_fano_sigir14.pdf). Here's a great [paper ](https://arxiv.org/abs/1908.10598)comparing the space-time trade-off of various index compression. I'm interested in evaluating current codec and implementing state-of-art compressions if viable. But I'm new to Lucene, which aspects (index size/query speed/memory/CPU) do we need to evaluate? Is there any mature benchmark to exploit, I noticed a LuceneBenchmark repo though. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] kaivalnp commented on a diff in pull request #12160: Concurrent rewrite for KnnVectorQuery
kaivalnp commented on code in PR #12160: URL: https://github.com/apache/lucene/pull/12160#discussion_r1121404545 ## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ## @@ -73,17 +77,48 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException { .build(); Query rewritten = indexSearcher.rewrite(booleanQuery); filterWeight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f); +} else { + filterWeight = null; } -for (LeafReaderContext ctx : reader.leaves()) { - TopDocs results = searchLeaf(ctx, filterWeight); - if (ctx.docBase > 0) { -for (ScoreDoc scoreDoc : results.scoreDocs) { - scoreDoc.doc += ctx.docBase; -} - } - perLeafResults[ctx.ord] = results; -} +List> tasks = +reader.leaves().stream() +.map( +ctx -> +new FutureTask<>( +() -> { + try { +TopDocs results = searchLeaf(ctx, filterWeight); +if (ctx.docBase > 0) { + for (ScoreDoc scoreDoc : results.scoreDocs) { +scoreDoc.doc += ctx.docBase; + } +} +return results; + } catch (IOException e) { +throw new UncheckedIOException(e); + } +})) +.toList(); + +Executor executor = Objects.requireNonNullElse(indexSearcher.getExecutor(), Runnable::run); +SliceExecutor sliceExecutor = new SliceExecutor(executor); +sliceExecutor.invokeAll(tasks); Review Comment: For the logic duplication, it just updated the doc ids (by adding `ctx.docBase` to get index-level doc ids): and I put it in a separate function > I think the minor if statement is worth it. It creates fewer objects and is a simpler function. It might be more readable if you broke the results gathering into individual private methods. [Here](https://github.com/apache/lucene/compare/main...kaivalnp:lucene:concurrent-knn-reduce-overhead) are the sample changes, please let me know if these look good: and I'll commit it in this PR Note that I had to wrap the sequential execution in a `try - catch`, and wrap exceptions in a `RuntimeException` for consistency with exceptions thrown during parallel execution (also to pass test cases) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] kaivalnp commented on a diff in pull request #12160: Concurrent rewrite for KnnVectorQuery
kaivalnp commented on code in PR #12160: URL: https://github.com/apache/lucene/pull/12160#discussion_r1120780564 ## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ## @@ -73,17 +77,48 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException { .build(); Query rewritten = indexSearcher.rewrite(booleanQuery); filterWeight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f); +} else { + filterWeight = null; } -for (LeafReaderContext ctx : reader.leaves()) { - TopDocs results = searchLeaf(ctx, filterWeight); - if (ctx.docBase > 0) { -for (ScoreDoc scoreDoc : results.scoreDocs) { - scoreDoc.doc += ctx.docBase; -} - } - perLeafResults[ctx.ord] = results; -} +List> tasks = +reader.leaves().stream() +.map( +ctx -> +new FutureTask<>( +() -> { + try { +TopDocs results = searchLeaf(ctx, filterWeight); +if (ctx.docBase > 0) { + for (ScoreDoc scoreDoc : results.scoreDocs) { +scoreDoc.doc += ctx.docBase; + } +} +return results; + } catch (IOException e) { +throw new UncheckedIOException(e); + } +})) +.toList(); + +Executor executor = Objects.requireNonNullElse(indexSearcher.getExecutor(), Runnable::run); +SliceExecutor sliceExecutor = new SliceExecutor(executor); +sliceExecutor.invokeAll(tasks); Review Comment: > We shouldn't bother with any parallelism if indexSearcher.getExecutor() == null || reader.leaves().size() <= 1. Its a simple if branch that allows us to remove all the overhead associated with parallel rewrites when no parallelism can be achieved. I would totally agree with you here! We shouldn't add overhead for non-concurrent executions If I understand correctly, you are suggesting to add an `if` block with the condition: - When it evaluates to `true`, we want to perform search [as before](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L78-L86) - `else` we perform the search [as per this PR](https://github.com/kaivalnp/lucene/blob/concurrent-knn/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L84-L120) I have tried to implement the changes [here](https://github.com/apache/lucene/compare/main...kaivalnp:lucene:concurrent-knn-reduce-overhead). I ran some benchmarks for these (with the executor as `null`): ### enwiki (topK = 100, segment count = 10, executor = null) | recall | Sequential | SliceExecutor | ReduceOverhead | nDoc | fanout | maxConn | beamWidth | |-|-|-|-|-|-|-|-| | 0.995 | 0.95 | 0.96 | 0.95 | 1 | 0 | 16 | 32 | | 0.998 | 1.26 | 1.30 | 1.29 | 1 | 50 | 16 | 32 | | 0.998 | 1.05 | 1.07 | 1.07 | 1 | 0 | 16 | 64 | | 0.999 | 1.41 | 1.43 | 1.43 | 1 | 50 | 16 | 64 | | 0.995 | 0.98 | 0.99 | 0.98 | 1 | 0 | 32 | 32 | | 0.998 | 1.31 | 1.33 | 1.34 | 1 | 50 | 32 | 32 | | 0.998 | 0.99 | 1.01 | 1.01 | 1 | 0 | 32 | 64 | | 0.999 | 1.33 | 1.36 | 1.36 | 1 | 50 | 32 | 64 | | 0.987 | 1.70 | 1.70 | 1.71 | 10 | 0 | 16 | 32 | | 0.992 | 2.30 | 2.30 | 2.31 | 10 | 50 | 16 | 32 | | 0.993 | 1.92 | 1.89 | 1.94 | 10 | 0 | 16 | 64 | | 0.996 | 2.63 | 2.65 | 2.64 | 10 | 50 | 16 | 64 | | 0.987 | 1.73 | 1.70 | 1.74 | 10 | 0 | 32 | 32 | | 0.992 | 2.34 | 2.30 | 2.37 | 10 | 50 | 32 | 32 | | 0.994 | 1.96 | 1.92 | 1.98 | 10 | 0 | 32 | 64 | | 0.997 | 2.66 | 2.61 | 2.69 | 10 | 50 | 32 | 64 | | 0.971 | 2.72 | 2.70 | 2.74 | 100 | 0 | 16 | 32 | | 0.982 | 3.77 | 3.79 | 3.78 | 100 | 50 | 16 | 32 | | 0.985 | 3.13 | 3.19 | 3.19 | 100 | 0 | 16 | 64 | | 0.991 | 4.34 | 4.37 | 4.36 | 100 | 50 | 16 | 64 | | 0.973 | 2.86 | 2.94 | 2.94 | 100 | 0 | 32 | 32 | | 0.983 | 3.94 | 3.98 | 3.97 | 100 | 50 | 32 | 32 | | 0.986 | 3.38 | 3.37 | 3.38 | 100 | 0 | 32 | 64 | | 0.992 | 4.63 | 4.66 | 4.67 | 100 | 50 | 32 | 64 | ### enwiki (topK = 100, segment count = 5, executor = null) | recall | Sequential | SliceExecutor | ReduceOverhead | nDoc | fanout | maxConn | beamWidth | |-|-|-|-|-|-|-|-| | 0.991 | 0.59 | 0.61 | 0.59 | 1 | 0 | 16 | 32 | | 0.996 | 0.82 | 0.83 | 0.81 | 1 | 50 | 16 | 32 | | 0.997 | 0.61 | 0.62 | 0.60 | 1 | 0 | 16 | 64 | | 0.999 | 0.88 | 0.88 | 0.86 | 1 | 50 | 16 | 64
[GitHub] [lucene] dantuzi commented on a diff in pull request #12169: Introduced the Word2VecSynonymFilter
dantuzi commented on code in PR #12169: URL: https://github.com/apache/lucene/pull/12169#discussion_r1121445530 ## lucene/test-framework/src/java/org/apache/lucene/tests/analysis/BaseTokenStreamTestCase.java: ## @@ -221,6 +223,12 @@ public static void assertTokenStreamContents( flagsAtt = ts.getAttribute(FlagsAttribute.class); } +BoostAttribute boostAtt = null; Review Comment: Hi @rmuir, thank you for notice it. Actually the BoostAttribute is already used in the DelimitedBoostTokenFilter: https://github.com/apache/lucene/blob/475fbd0bdde31c6a2ae62c59505cf9e8becd50e4/lucene/analysis/common/src/java/org/apache/lucene/analysis/boost/DelimitedBoostTokenFilter.java#L39 How about if we do some refactoring (and update the documentation) to use the BoostAttribute in both cases? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dantuzi commented on pull request #12169: Introduced the Word2VecSynonymFilter
dantuzi commented on PR #12169: URL: https://github.com/apache/lucene/pull/12169#issuecomment-1449791310 @rmuir we did some tests at both query and index time. We tried to index some documents using the following CustomAnalyzer which includes our Word2VecSynonymFilter and we verified the content of the index using Luke. ``` Analyzer analyzer = CustomAnalyzer.builder() .withTokenizer(ClassicTokenizerFactory.NAME) .addTokenFilter(Word2VecSynonymFilterFactory.NAME, "model", "wiki-ita-w2v-model.zip") .addTokenFilter(FlattenGraphFilterFactory.NAME) .addTokenFilter(LowerCaseFilterFactory.NAME) .build(); ``` Please find some screenshots in our presentation from slide 32 https://www.slideshare.net/SeaseLtd/word2vec-model-to-generate-synonyms-on-the-fly-in-apache-lucenepdf -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #12042: Implement MMapDirectory with Java 20 Project Panama Preview API
uschindler commented on PR #12042: URL: https://github.com/apache/lucene/pull/12042#issuecomment-1449822469 Hi @mbien, This is why the PR is currently in draft status. We build and test it already with a local install. It is enough to set an env variable. Lucene always runs Gradle with Java 17. You can still test against later version by passing RUNTIME_JAVA_HOME. This PR is just not yet merged to not being too much hassle for contributors/committers. When it does not Autodownload it's too hard to use. Once GA release is available on Adoptium we will merge this. I have another solution in my dev folder that allows to compile without jdk 19 or 20 downloaded or installed at all, but it is a bit hacky (it ships with a signatures jar file that's compiled against). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #12169: Introduced the Word2VecSynonymFilter
rmuir commented on PR #12169: URL: https://github.com/apache/lucene/pull/12169#issuecomment-1449913809 I think you misunderstand the question. What happens to `BoostAttribute` at index-time? absolutely nothing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on a diff in pull request #12169: Introduced the Word2VecSynonymFilter
rmuir commented on code in PR #12169: URL: https://github.com/apache/lucene/pull/12169#discussion_r1121546026 ## lucene/test-framework/src/java/org/apache/lucene/tests/analysis/BaseTokenStreamTestCase.java: ## @@ -221,6 +223,12 @@ public static void assertTokenStreamContents( flagsAtt = ts.getAttribute(FlagsAttribute.class); } +BoostAttribute boostAtt = null; Review Comment: This is wrong, wrong wrong. Remove any use of boostattribute outside of MultiTermQuery. It documents *not to do this*. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #12169: Introduced the Word2VecSynonymFilter
rmuir commented on PR #12169: URL: https://github.com/apache/lucene/pull/12169#issuecomment-1449923913 From what I can tell, this probably shouldnt be an analyzer at all. Seems it only works at query-time and will simply do the wrong thing at index-time. The attempted boost manipulation by an *analyzer* is what gives you the hint that the design is wrong. I think what it is doing is more like query expansion, where it formulates query directly. Then there is no need to mess around with any boostattribute or anything like that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #12156: Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery
gsmiller commented on PR #12156: URL: https://github.com/apache/lucene/pull/12156#issuecomment-1450145244 Thanks @uschindler. > I am not fully sure what default rewrite method is best here. The nice thing is it's easy to control now (bitset rewrite, boolean scoring, doc values post-filtering, etc.). Based on the benchmark wins in #12055 for other multi-term queries, I thought it would be good to use the same rewrite by default, at least for now. We can change the default easily if we learn more. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller merged pull request #12156: Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery
gsmiller merged PR #12156: URL: https://github.com/apache/lucene/pull/12156 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on issue #11707: Re-evaluate different ways to encode postings [LUCENE-10672]
gsmiller commented on issue #11707: URL: https://github.com/apache/lucene/issues/11707#issuecomment-1450169614 @Tjianke the [luceneutil](https://github.com/mikemccand/luceneutil) benchmarks are a great place to start. These power the [nightly benchmarks](https://home.apache.org/~mikemccand/lucenebench/) that continuously run. For changes to such a core part of Lucene, the existing benchmark tasks will likely capture performance changes well (I would think). They're primarily focused on throughput/query-cost, but they can create separate indexes for your baseline/candidate code branches and you can manually diff the index sizes. For something like this, running microbenchmarks may also be useful. We did this the last time we made a change to the postings format. There's some history in #10889. I can try to jog my memory on that issue and answer questions if you have any :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller merged pull request #12173: Deprecate TermInSetQuery#getTermData
gsmiller merged PR #12173: URL: https://github.com/apache/lucene/pull/12173 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller opened a new issue, #12174: MultiTermQuery#CONSTANT_SCORE_BLENDED_REWRITE should provide a custom BulkScorer
gsmiller opened a new issue, #12174: URL: https://github.com/apache/lucene/issues/12174 ### Description This rewrite method (implemented in `MultiTermQueryConstantScoreBlendedWrapper`) relies on `DefaultBulkScorer` when there are more than 16 terms (with 16 or fewer, a `BooleanQuery` gets created which has custom bulk scoring logic already). This is inefficient since the bulk scoring will require managing 17 postings on a heap. A simple improvement would be to always fully rewrite to a bitset filter when bulk scoring and wrap that with a `DefaultBulkScorer` (like `MultiTermQuery#CONSTANT_SCORE_REWRITE`). This can probably be done pretty simply in the abstract base class (`AbstractMultiTermQueryConstantScoreWrapper`). There may be other ways to do this—possibly relying on the logic already in `BooleanScorer` to make better use of memory—but we can probably start simple here as well. I'm not sure how frequently users would actually have a top-level term-in-set query they want to score, but it would be nice to make this better. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller opened a new pull request, #12175: Remove SortedSetDocValuesSetQuery in favor of TermInSetQuery with DocValuesRewriteMethod
gsmiller opened a new pull request, #12175: URL: https://github.com/apache/lucene/pull/12175 ### Description Now that `TermInSetQuery` extends `MultiTermQuery` (#12156), we can leverage other `RewriteMethod`s to change the query execution behavior. Because of this, we can use `DocValuesRewriteMethod` in place of `SortedSetDocValuesSetQuery`, which lets us remove some mostly redundant code and simplify. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] Trey314159 commented on pull request #12172: Add Romanian stopwords with s&t with comma
Trey314159 commented on PR #12172: URL: https://github.com/apache/lucene/pull/12172#issuecomment-1450316155 _Good catch!_ I didn't consider that the stemmer might also be of a vintage to only use the older orthography. I've contacted the Snowball mailing list (message not yet accepted) to suggest making the stemmer aware of both variants of each character. A character filter or token filter to convert the comma forms to cedilla forms seems like a reasonable hack for now, and I'll add that to our local implementation right away. I have to admit that it chafes a little to convert everything to the "wrong" form, but the internal representation is just an internal representation, I guess, as long as everything is consistent about it. Feel free to decline this pull request if that makes more sense to you, though I think it might be a good bit of future proofing for each stage of the process to handle both forms, so as not to be reliant on a counterintuitive character mapping. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on a diff in pull request #12160: Concurrent rewrite for KnnVectorQuery
benwtrent commented on code in PR #12160: URL: https://github.com/apache/lucene/pull/12160#discussion_r1121925554 ## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ## @@ -73,17 +77,48 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException { .build(); Query rewritten = indexSearcher.rewrite(booleanQuery); filterWeight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f); +} else { + filterWeight = null; } -for (LeafReaderContext ctx : reader.leaves()) { - TopDocs results = searchLeaf(ctx, filterWeight); - if (ctx.docBase > 0) { -for (ScoreDoc scoreDoc : results.scoreDocs) { - scoreDoc.doc += ctx.docBase; -} - } - perLeafResults[ctx.ord] = results; -} +List> tasks = +reader.leaves().stream() +.map( +ctx -> +new FutureTask<>( +() -> { + try { +TopDocs results = searchLeaf(ctx, filterWeight); +if (ctx.docBase > 0) { + for (ScoreDoc scoreDoc : results.scoreDocs) { +scoreDoc.doc += ctx.docBase; + } +} +return results; + } catch (IOException e) { +throw new UncheckedIOException(e); + } +})) +.toList(); + +Executor executor = Objects.requireNonNullElse(indexSearcher.getExecutor(), Runnable::run); +SliceExecutor sliceExecutor = new SliceExecutor(executor); +sliceExecutor.invokeAll(tasks); Review Comment: > For the logic duplication I wouldn't worry about that. That makes things even more difficult to reason about. I would much rather have a method that takes in the filter weight and leaf contexts and one that takes the same parameters but with an added Executor. One called where `indexSearcher.getExecutor() == null` and the other when the executor is provided. Two methods like this: ``` private TopDocs[] gatherLeafResults( List leafReaderContexts, Weight filterWeight) throws IOException { TopDocs[] perLeafResults = new TopDocs[leafReaderContexts.size()]; for (LeafReaderContext ctx : leafReaderContexts) { TopDocs results = searchLeaf(ctx, filterWeight); if (ctx.docBase > 0) { for (ScoreDoc scoreDoc : results.scoreDocs) { scoreDoc.doc += ctx.docBase; } } perLeafResults[ctx.ord] = results; } return perLeafResults; } private TopDocs[] gatherLeafResults( List leafReaderContexts, Weight filterWeight, Executor executor) { List> tasks = leafReaderContexts.stream() .map( ctx -> new FutureTask<>( () -> { TopDocs results = searchLeaf(ctx, filterWeight); if (ctx.docBase > 0) { for (ScoreDoc scoreDoc : results.scoreDocs) { scoreDoc.doc += ctx.docBase; } } return results; })) .toList(); SliceExecutor sliceExecutor = new SliceExecutor(executor); sliceExecutor.invokeAll(tasks); return tasks.stream() .map( task -> { try { return task.get(); } catch (ExecutionException e) { throw new RuntimeException(e.getCause()); } catch (InterruptedException e) { throw new ThreadInterruptedException(e); } }) .toArray(TopDocs[]::new); } ``` ## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ## @@ -73,17 +77,48 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException { .build(); Query rewritten = indexSearcher.rewrite(booleanQuery); filterWeight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f); +} else { + filterWeight = null; } -for (LeafReaderContext ctx : reader.leaves()) { - TopDocs results = searchLeaf(ctx, filterWeight); - if (ctx.docBase > 0) { -for (ScoreDoc scoreDoc : results.scoreDocs) { - scoreDoc.doc += ctx.docBase; -} - } - perLeafResults[ctx.ord] = results; -} +List> tasks = +reader.leaves().stream() +.map( +ctx -> +new FutureTask<>( +
[GitHub] [lucene] gsmiller merged pull request #12175: Remove SortedSetDocValuesSetQuery in favor of TermInSetQuery with DocValuesRewriteMethod
gsmiller merged PR #12175: URL: https://github.com/apache/lucene/pull/12175 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir opened a new issue, #12176: TermInSetQuery could use (variant of) DaciukMihov/Terms.intersect() for faster intersection
rmuir opened a new issue, #12176: URL: https://github.com/apache/lucene/issues/12176 ### Description TermInSetQuery currently "ping-pong" intersects a sorted list against the term dictionary. Instead of sorted-list, it could possibly use Daciuk Mihov Automaton, which can be built in linear time. Then query could leverage `Terms.intersect` (e.g. TermInSetQuery could be an AutomatonQuery subclass). This should give faster intersection of the terms, which is usually the heavy part of this query. For example BlockTree terms dictionary has a very efficient `Terms.intersect` that makes use of the underlying structure. The annoying part: `DaciukMihovAutomatonBuilder` currently requires unicode strings and makes a UTF-32 automaton, which would then be converted to UTF-8 (binary) automaton via `UTF32ToUTF8`. But I think `TermInSetQuery` may allow arbitrary non-unicode binary strings? In order to support arbitrarily binary terms (and to avoid conversions), the DaciukMihov code would have to modified, to support construction of a binary automaton directly. Probably this is actually simpler? This is just an idea to get more performance, it hasn't been tested. feel free to close the issue if it doesnt work out. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #12172: Add Romanian stopwords with s&t with comma
rmuir commented on PR #12172: URL: https://github.com/apache/lucene/pull/12172#issuecomment-1450458707 I think we can merge this stopword list change anyway. But I think a filter may be worthwhile as a separate PR? It has the advantage of making the terms conflate regardless of which variant of the character is being used. After reading up on the history of these characters, I think we should treat them "the same" for Romanian always. It will allow queries/documents to match in all cases. So I don't think of it as a hack, as the "different characters" may impact words outside of just stopwords and suffixes that are stemmed, too. I'd recommend doing it as a TokenFilter. Many lucene analyzers for other languages have such normalizers as tokenfilters to deal with similar issues. CharFilter is not needed in this case as it won't impact tokenization, since StandardTokenizer etc will tokenize them all the same way (same unicode wordbreak properties). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #12172: Add Romanian stopwords with s&t with comma
rmuir commented on PR #12172: URL: https://github.com/apache/lucene/pull/12172#issuecomment-1450468166 > I've contacted the Snowball mailing list (message not yet accepted) fwiw I'm subscribed that list and haven't seen a message in 10 years. I think they are just using github issues/PRs now? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #12172: Add Romanian stopwords with s&t with comma
rmuir commented on PR #12172: URL: https://github.com/apache/lucene/pull/12172#issuecomment-1450554452 > I have to admit that it chafes a little to convert everything to the "wrong" form, but the internal representation is just an internal representation, I guess, as long as everything is consistent about it. It is a good point, representation isn't always internal if we consider external libraries and other things that lucene can integrate with for language processing. As another datapoint, I investigated Romanian hunspell dictionary (as a user could substitute HunspellStemFilter for SnowballFilter if they want dictionary-based stemming, and because I was curious) They are using `ș` and `ț` , but have directives at the end to handle differences during suggestions phase: ``` MAP sşș MAP tţț ``` So I think the most ideal situation would be to both fix snowball and then map cedilla to "correct" forms with a TokenFilter? I realize this doesn't solve your problem immediately, but if PR gets accepted to snowball, I can just bump our git revision to the new one and we'll have the new stemmer :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] kashkambath opened a new issue, #12178: TermAutomatonQuery explain() should return relevant explain output instead of null
kashkambath opened a new issue, #12178: URL: https://github.com/apache/lucene/issues/12178 ### Description Hi! This is my first time posting a GitHub issue for Apache Lucene. Please let me know if you need anything further. https://github.com/apache/lucene/blob/569533bd76a115e239d8c621d2540462bc12d891/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/TermAutomatonQuery.java#L443-L447 We should return relevant explain output here instead of returning `null` for `TermAutomatonQuery`. This would be helpful for users running document explain on synonym queries leveraging `TermAutomatonQuery`. Otherwise, certain queries that can nest other queries could throw a `NullPointerException` if their weight's `explain()` calls `TermAutomatonQuery#explain()`. For example, if a `BooleanQuery` nests a `TermAutomatonQuery`, then `BooleanWeight` throws an NPE here when running `explain()`: https://github.com/apache/lucene/blob/569533bd76a115e239d8c621d2540462bc12d891/lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java#L75 ### Version and environment details _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] kaivalnp commented on a diff in pull request #12160: Concurrent rewrite for KnnVectorQuery
kaivalnp commented on code in PR #12160: URL: https://github.com/apache/lucene/pull/12160#discussion_r1122648348 ## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ## @@ -73,17 +77,48 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException { .build(); Query rewritten = indexSearcher.rewrite(booleanQuery); filterWeight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f); +} else { + filterWeight = null; } -for (LeafReaderContext ctx : reader.leaves()) { - TopDocs results = searchLeaf(ctx, filterWeight); - if (ctx.docBase > 0) { -for (ScoreDoc scoreDoc : results.scoreDocs) { - scoreDoc.doc += ctx.docBase; -} - } - perLeafResults[ctx.ord] = results; -} +List> tasks = +reader.leaves().stream() +.map( +ctx -> +new FutureTask<>( +() -> { + try { +TopDocs results = searchLeaf(ctx, filterWeight); +if (ctx.docBase > 0) { + for (ScoreDoc scoreDoc : results.scoreDocs) { +scoreDoc.doc += ctx.docBase; + } +} +return results; + } catch (IOException e) { +throw new UncheckedIOException(e); + } +})) +.toList(); + +Executor executor = Objects.requireNonNullElse(indexSearcher.getExecutor(), Runnable::run); +SliceExecutor sliceExecutor = new SliceExecutor(executor); +sliceExecutor.invokeAll(tasks); Review Comment: Thanks for the input! This was really helpful in reducing overhead for non-concurrent search, and improving readability! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org