[GitHub] [lucene] Tjianke commented on issue #11707: Re-evaluate different ways to encode postings [LUCENE-10672]

2023-03-01 Thread via GitHub


Tjianke commented on issue #11707:
URL: https://github.com/apache/lucene/issues/11707#issuecomment-1449598942

   The Lucene community has a good tradition of incorporating academic results. Recent studies describe many efficient encodings, such as [Partitioned Elias-Fano](http://groups.di.unipi.it/~ottavian/files/elias_fano_sigir14.pdf). Here's a great [paper](https://arxiv.org/abs/1908.10598) comparing the space-time trade-offs of various index compression schemes. I'm interested in evaluating the current codec and implementing state-of-the-art compression schemes if viable. But I'm new to Lucene: which aspects (index size/query speed/memory/CPU) do we need to evaluate? Is there any mature benchmark to use? I noticed a LuceneBenchmark repo, though.
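To make the encoding concrete, here is a minimal, self-contained sketch of plain (non-partitioned) Elias-Fano over a sorted doc-id list. This is not Lucene's code: the class name `EliasFano`, the choice of low-bit width, and the linear-scan `get` (a real implementation would use a select structure on the upper bits) are all illustrative assumptions.

```java
import java.util.BitSet;

// Sketch of plain Elias-Fano: each value's low l bits are stored explicitly,
// and the high parts are gap-coded in unary in a bitset (bit high(i) + i is set).
public class EliasFano {
  private final int lowBitCount;
  private final int[] lowBits;    // low l bits of each value (one int per value, for clarity)
  private final BitSet upperBits; // unary-coded high parts

  // Assumes 'sorted' is non-empty and strictly increasing, with values in [0, upperBound).
  public EliasFano(int[] sorted, int upperBound) {
    int n = sorted.length;
    // l = floor(log2(upperBound / n)) low bits per element (0 when upperBound <= n)
    int l = Math.max(0, 31 - Integer.numberOfLeadingZeros(Math.max(1, upperBound / n)));
    this.lowBitCount = l;
    this.lowBits = new int[n];
    this.upperBits = new BitSet();
    for (int i = 0; i < n; i++) {
      lowBits[i] = sorted[i] & ((1 << l) - 1);
      upperBits.set((sorted[i] >>> l) + i);
    }
  }

  // Recover the i-th value: select the (i+1)-th set bit, subtract i to get the
  // high part, then reattach the stored low bits.
  public int get(int i) {
    int pos = -1;
    for (int seen = 0; seen <= i; seen++) {
      pos = upperBits.nextSetBit(pos + 1);
    }
    int high = pos - i;
    return (high << lowBitCount) | lowBits[i];
  }

  public static void main(String[] args) {
    int[] docs = {3, 4, 7, 13, 14, 15, 21, 43};
    EliasFano ef = new EliasFano(docs, 44);
    System.out.println(ef.get(3)); // prints 13
  }
}
```

The partitioned variant in the linked paper splits the list into blocks and picks an encoding per block; the per-element structure above is the building block it optimizes.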


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] kaivalnp commented on a diff in pull request #12160: Concurrent rewrite for KnnVectorQuery

2023-03-01 Thread via GitHub


kaivalnp commented on code in PR #12160:
URL: https://github.com/apache/lucene/pull/12160#discussion_r1121404545


##
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##
@@ -73,17 +77,48 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
              .build();
      Query rewritten = indexSearcher.rewrite(booleanQuery);
      filterWeight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f);
+    } else {
+      filterWeight = null;
     }
 
-    for (LeafReaderContext ctx : reader.leaves()) {
-      TopDocs results = searchLeaf(ctx, filterWeight);
-      if (ctx.docBase > 0) {
-        for (ScoreDoc scoreDoc : results.scoreDocs) {
-          scoreDoc.doc += ctx.docBase;
-        }
-      }
-      perLeafResults[ctx.ord] = results;
-    }
+    List<FutureTask<TopDocs>> tasks =
+        reader.leaves().stream()
+            .map(
+                ctx ->
+                    new FutureTask<>(
+                        () -> {
+                          try {
+                            TopDocs results = searchLeaf(ctx, filterWeight);
+                            if (ctx.docBase > 0) {
+                              for (ScoreDoc scoreDoc : results.scoreDocs) {
+                                scoreDoc.doc += ctx.docBase;
+                              }
+                            }
+                            return results;
+                          } catch (IOException e) {
+                            throw new UncheckedIOException(e);
+                          }
+                        }))
+            .toList();
+
+    Executor executor = Objects.requireNonNullElse(indexSearcher.getExecutor(), Runnable::run);
+    SliceExecutor sliceExecutor = new SliceExecutor(executor);
+    sliceExecutor.invokeAll(tasks);

Review Comment:
   For the logic duplication: it just updates the doc ids (by adding `ctx.docBase` to get index-level doc ids), and I put it in a separate function.
   
   > I think the minor if statement is worth it. It creates fewer objects and is a simpler function. It might be more readable if you broke the results gathering into individual private methods.
   
   [Here](https://github.com/apache/lucene/compare/main...kaivalnp:lucene:concurrent-knn-reduce-overhead) are the sample changes; please let me know if they look good, and I'll commit them in this PR.
   
   Note that I had to wrap the sequential execution in a `try`-`catch`, and wrap exceptions in a `RuntimeException` for consistency with exceptions thrown during parallel execution (and also to pass test cases).
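The docBase shift being discussed can be sketched in plain Java. `DocBaseShift` and `toGlobal` are hypothetical stand-ins for the logic inside `AbstractKnnVectorQuery`, not Lucene API: each segment (leaf) numbers its docs from 0, so leaf-local hits become index-level ids by adding the leaf's `docBase`.

```java
import java.util.Arrays;

// Stand-in for the per-leaf doc-id adjustment: index-level id = leaf-local id + docBase.
public class DocBaseShift {
  public static int[] toGlobal(int[] leafDocs, int docBase) {
    int[] global = new int[leafDocs.length];
    for (int i = 0; i < leafDocs.length; i++) {
      global[i] = leafDocs[i] + docBase; // shift into the index-wide doc-id space
    }
    return global;
  }

  public static void main(String[] args) {
    // A leaf whose docs start at index-level id 100:
    System.out.println(Arrays.toString(toGlobal(new int[] {0, 5, 9}, 100))); // prints [100, 105, 109]
  }
}
```

The `ctx.docBase > 0` check in the PR simply skips this loop for the first leaf, whose base is 0.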






[GitHub] [lucene] kaivalnp commented on a diff in pull request #12160: Concurrent rewrite for KnnVectorQuery

2023-03-01 Thread via GitHub


kaivalnp commented on code in PR #12160:
URL: https://github.com/apache/lucene/pull/12160#discussion_r1120780564


##
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##
@@ -73,17 +77,48 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
              .build();
      Query rewritten = indexSearcher.rewrite(booleanQuery);
      filterWeight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f);
+    } else {
+      filterWeight = null;
     }
 
-    for (LeafReaderContext ctx : reader.leaves()) {
-      TopDocs results = searchLeaf(ctx, filterWeight);
-      if (ctx.docBase > 0) {
-        for (ScoreDoc scoreDoc : results.scoreDocs) {
-          scoreDoc.doc += ctx.docBase;
-        }
-      }
-      perLeafResults[ctx.ord] = results;
-    }
+    List<FutureTask<TopDocs>> tasks =
+        reader.leaves().stream()
+            .map(
+                ctx ->
+                    new FutureTask<>(
+                        () -> {
+                          try {
+                            TopDocs results = searchLeaf(ctx, filterWeight);
+                            if (ctx.docBase > 0) {
+                              for (ScoreDoc scoreDoc : results.scoreDocs) {
+                                scoreDoc.doc += ctx.docBase;
+                              }
+                            }
+                            return results;
+                          } catch (IOException e) {
+                            throw new UncheckedIOException(e);
+                          }
+                        }))
+            .toList();
+
+    Executor executor = Objects.requireNonNullElse(indexSearcher.getExecutor(), Runnable::run);
+    SliceExecutor sliceExecutor = new SliceExecutor(executor);
+    sliceExecutor.invokeAll(tasks);

Review Comment:
   > We shouldn't bother with any parallelism if indexSearcher.getExecutor() == 
null || reader.leaves().size() <= 1. Its a simple if branch that allows us to 
remove all the overhead associated with parallel rewrites when no parallelism 
can be achieved.
   
   I would totally agree with you here! We shouldn't add overhead for non-concurrent executions.
   If I understand correctly, you are suggesting adding an `if` block with this condition:
   - When it evaluates to `true`, we perform the search [as before](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L78-L86)
   - `else`, we perform the search [as per this PR](https://github.com/kaivalnp/lucene/blob/concurrent-knn/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L84-L120)
   
   I have tried to implement the changes [here](https://github.com/apache/lucene/compare/main...kaivalnp:lucene:concurrent-knn-reduce-overhead). I ran some benchmarks for these (with the executor as `null`):
   
   ### enwiki (topK = 100, segment count = 10, executor = null)
   | recall | Sequential | SliceExecutor | ReduceOverhead | nDoc | fanout | maxConn | beamWidth |
   |-|-|-|-|-|-|-|-|
   | 0.995 |  0.95 |  0.96 |  0.95 |   1 |  0 | 16 | 32 |
   | 0.998 |  1.26 |  1.30 |  1.29 |   1 | 50 | 16 | 32 |
   | 0.998 |  1.05 |  1.07 |  1.07 |   1 |  0 | 16 | 64 |
   | 0.999 |  1.41 |  1.43 |  1.43 |   1 | 50 | 16 | 64 |
   | 0.995 |  0.98 |  0.99 |  0.98 |   1 |  0 | 32 | 32 |
   | 0.998 |  1.31 |  1.33 |  1.34 |   1 | 50 | 32 | 32 |
   | 0.998 |  0.99 |  1.01 |  1.01 |   1 |  0 | 32 | 64 |
   | 0.999 |  1.33 |  1.36 |  1.36 |   1 | 50 | 32 | 64 |
   | 0.987 |  1.70 |  1.70 |  1.71 |  10 |  0 | 16 | 32 |
   | 0.992 |  2.30 |  2.30 |  2.31 |  10 | 50 | 16 | 32 |
   | 0.993 |  1.92 |  1.89 |  1.94 |  10 |  0 | 16 | 64 |
   | 0.996 |  2.63 |  2.65 |  2.64 |  10 | 50 | 16 | 64 |
   | 0.987 |  1.73 |  1.70 |  1.74 |  10 |  0 | 32 | 32 |
   | 0.992 |  2.34 |  2.30 |  2.37 |  10 | 50 | 32 | 32 |
   | 0.994 |  1.96 |  1.92 |  1.98 |  10 |  0 | 32 | 64 |
   | 0.997 |  2.66 |  2.61 |  2.69 |  10 | 50 | 32 | 64 |
   | 0.971 |  2.72 |  2.70 |  2.74 | 100 |  0 | 16 | 32 |
   | 0.982 |  3.77 |  3.79 |  3.78 | 100 | 50 | 16 | 32 |
   | 0.985 |  3.13 |  3.19 |  3.19 | 100 |  0 | 16 | 64 |
   | 0.991 |  4.34 |  4.37 |  4.36 | 100 | 50 | 16 | 64 |
   | 0.973 |  2.86 |  2.94 |  2.94 | 100 |  0 | 32 | 32 |
   | 0.983 |  3.94 |  3.98 |  3.97 | 100 | 50 | 32 | 32 |
   | 0.986 |  3.38 |  3.37 |  3.38 | 100 |  0 | 32 | 64 |
   | 0.992 |  4.63 |  4.66 |  4.67 | 100 | 50 | 32 | 64 |
   
   ### enwiki (topK = 100, segment count = 5, executor = null)
   | recall | Sequential | SliceExecutor | ReduceOverhead | nDoc | fanout | maxConn | beamWidth |
   |-|-|-|-|-|-|-|-|
   | 0.991 |  0.59 |  0.61 |  0.59 |   1 |  0 | 16 | 32 |
   | 0.996 |  0.82 |  0.83 |  0.81 |   1 | 50 | 16 | 32 |
   | 0.997 |  0.61 |  0.62 |  0.60 |   1 |  0 | 16 | 64 |
   | 0.999 |  0.88 |  0.88 |  0.86 |   1 | 50 | 16 | 64 |

[GitHub] [lucene] dantuzi commented on a diff in pull request #12169: Introduced the Word2VecSynonymFilter

2023-03-01 Thread via GitHub


dantuzi commented on code in PR #12169:
URL: https://github.com/apache/lucene/pull/12169#discussion_r1121445530


##
lucene/test-framework/src/java/org/apache/lucene/tests/analysis/BaseTokenStreamTestCase.java:
##
@@ -221,6 +223,12 @@ public static void assertTokenStreamContents(
   flagsAtt = ts.getAttribute(FlagsAttribute.class);
 }
 
+BoostAttribute boostAtt = null;

Review Comment:
   Hi @rmuir, thank you for noticing it.
   Actually, the `BoostAttribute` is already used in `DelimitedBoostTokenFilter`: https://github.com/apache/lucene/blob/475fbd0bdde31c6a2ae62c59505cf9e8becd50e4/lucene/analysis/common/src/java/org/apache/lucene/analysis/boost/DelimitedBoostTokenFilter.java#L39
   How about doing some refactoring (and updating the documentation) to use the `BoostAttribute` in both cases?






[GitHub] [lucene] dantuzi commented on pull request #12169: Introduced the Word2VecSynonymFilter

2023-03-01 Thread via GitHub


dantuzi commented on PR #12169:
URL: https://github.com/apache/lucene/pull/12169#issuecomment-1449791310

   @rmuir we did some tests at both query and index time.
   We tried to index some documents using the following CustomAnalyzer, which includes our Word2VecSynonymFilter, and we verified the content of the index using Luke.
   
   ```java
   Analyzer analyzer = CustomAnalyzer.builder()
       .withTokenizer(ClassicTokenizerFactory.NAME)
       .addTokenFilter(Word2VecSynonymFilterFactory.NAME, "model", "wiki-ita-w2v-model.zip")
       .addTokenFilter(FlattenGraphFilterFactory.NAME)
       .addTokenFilter(LowerCaseFilterFactory.NAME)
       .build();
   ```
   
   Please find some screenshots in our presentation, from slide 32:
   https://www.slideshare.net/SeaseLtd/word2vec-model-to-generate-synonyms-on-the-fly-in-apache-lucenepdf





[GitHub] [lucene] uschindler commented on pull request #12042: Implement MMapDirectory with Java 20 Project Panama Preview API

2023-03-01 Thread via GitHub


uschindler commented on PR #12042:
URL: https://github.com/apache/lucene/pull/12042#issuecomment-1449822469

   Hi @mbien,
   This is why the PR is currently in draft status. We already build and test it with a local install; it is enough to set an env variable. Lucene always runs Gradle with Java 17, but you can still test against a later version by passing RUNTIME_JAVA_HOME.
   
   This PR is just not yet merged so that it isn't too much hassle for contributors/committers: when the JDK does not auto-download, the PR is too hard to use.
   
   Once a GA release is available on Adoptium, we will merge this.
   
   I have another solution in my dev folder that allows compiling without JDK 19 or 20 downloaded or installed at all, but it is a bit hacky (it ships with a signatures jar file that is compiled against).





[GitHub] [lucene] rmuir commented on pull request #12169: Introduced the Word2VecSynonymFilter

2023-03-01 Thread via GitHub


rmuir commented on PR #12169:
URL: https://github.com/apache/lucene/pull/12169#issuecomment-1449913809

   I think you misunderstand the question. What happens to `BoostAttribute` at index-time? Absolutely nothing.





[GitHub] [lucene] rmuir commented on a diff in pull request #12169: Introduced the Word2VecSynonymFilter

2023-03-01 Thread via GitHub


rmuir commented on code in PR #12169:
URL: https://github.com/apache/lucene/pull/12169#discussion_r1121546026


##
lucene/test-framework/src/java/org/apache/lucene/tests/analysis/BaseTokenStreamTestCase.java:
##
@@ -221,6 +223,12 @@ public static void assertTokenStreamContents(
   flagsAtt = ts.getAttribute(FlagsAttribute.class);
 }
 
+BoostAttribute boostAtt = null;

Review Comment:
   This is wrong, wrong, wrong. Remove any use of `BoostAttribute` outside of `MultiTermQuery`. Its documentation says *not to do this*.






[GitHub] [lucene] rmuir commented on pull request #12169: Introduced the Word2VecSynonymFilter

2023-03-01 Thread via GitHub


rmuir commented on PR #12169:
URL: https://github.com/apache/lucene/pull/12169#issuecomment-1449923913

   From what I can tell, this probably shouldn't be an analyzer at all. It seems it only works at query-time and will simply do the wrong thing at index-time. The attempted boost manipulation by an *analyzer* is what gives you the hint that the design is wrong.
   
   I think what it is doing is more like query expansion, where it formulates the query directly. Then there is no need to mess around with any `BoostAttribute` or anything like that.





[GitHub] [lucene] gsmiller commented on pull request #12156: Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery

2023-03-01 Thread via GitHub


gsmiller commented on PR #12156:
URL: https://github.com/apache/lucene/pull/12156#issuecomment-1450145244

   Thanks @uschindler. 
   
   > I am not fully sure what default rewrite method is best here.
   
   The nice thing is it's easy to control now (bitset rewrite, boolean scoring, 
doc values post-filtering, etc.). Based on the benchmark wins in #12055 for 
other multi-term queries, I thought it would be good to use the same rewrite by 
default, at least for now. We can change the default easily if we learn more.





[GitHub] [lucene] gsmiller merged pull request #12156: Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery

2023-03-01 Thread via GitHub


gsmiller merged PR #12156:
URL: https://github.com/apache/lucene/pull/12156





[GitHub] [lucene] gsmiller commented on issue #11707: Re-evaluate different ways to encode postings [LUCENE-10672]

2023-03-01 Thread via GitHub


gsmiller commented on issue #11707:
URL: https://github.com/apache/lucene/issues/11707#issuecomment-1450169614

   @Tjianke the [luceneutil](https://github.com/mikemccand/luceneutil) benchmarks are a great place to start. These power the [nightly benchmarks](https://home.apache.org/~mikemccand/lucenebench/) that run continuously. For changes to such a core part of Lucene, the existing benchmark tasks will likely capture performance changes well (I would think). They're primarily focused on throughput/query cost, but they can create separate indexes for your baseline/candidate code branches, and you can manually diff the index sizes.
   
   For something like this, running microbenchmarks may also be useful. We did 
this the last time we made a change to the postings format. There's some 
history in #10889. I can try to jog my memory on that issue and answer 
questions if you have any :)





[GitHub] [lucene] gsmiller merged pull request #12173: Deprecate TermInSetQuery#getTermData

2023-03-01 Thread via GitHub


gsmiller merged PR #12173:
URL: https://github.com/apache/lucene/pull/12173





[GitHub] [lucene] gsmiller opened a new issue, #12174: MultiTermQuery#CONSTANT_SCORE_BLENDED_REWRITE should provide a custom BulkScorer

2023-03-01 Thread via GitHub


gsmiller opened a new issue, #12174:
URL: https://github.com/apache/lucene/issues/12174

   ### Description
   
   This rewrite method (implemented in `MultiTermQueryConstantScoreBlendedWrapper`) relies on `DefaultBulkScorer` when there are more than 16 terms (with 16 or fewer, a `BooleanQuery` gets created, which already has custom bulk-scoring logic). This is inefficient since the bulk scoring will require managing 17 or more postings on a heap. A simple improvement would be to always fully rewrite to a bitset filter when bulk scoring and wrap that with a `DefaultBulkScorer` (like `MultiTermQuery#CONSTANT_SCORE_REWRITE`). This can probably be done pretty simply in the abstract base class (`AbstractMultiTermQueryConstantScoreWrapper`).
   
   There may be other ways to do this (possibly relying on the logic already in `BooleanScorer` to make better use of memory), but we can probably start simple here as well. I'm not sure how frequently users would actually have a top-level term-in-set query they want to score, but it would be nice to make this better.
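The heap-vs-bitset distinction above can be sketched with plain sorted `int[]` arrays standing in for postings enums. `TermSetUnion` and both method names are invented for this sketch (this is not the proposed Lucene implementation): the heap keeps one cursor per list and pays a log-k adjustment per advanced doc, while the bitset pre-materializes the union once, which is what the proposed rewrite-to-bitset-filter does.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.PriorityQueue;

public class TermSetUnion {
  // Heap-based union: one cursor {listIndex, position} per postings list,
  // ordered by the doc id the cursor points at.
  public static int[] unionWithHeap(int[][] postings) {
    PriorityQueue<int[]> heap =
        new PriorityQueue<>((a, b) -> Integer.compare(postings[a[0]][a[1]], postings[b[0]][b[1]]));
    for (int i = 0; i < postings.length; i++) {
      if (postings[i].length > 0) heap.add(new int[] {i, 0});
    }
    List<Integer> out = new ArrayList<>();
    int last = -1;
    while (!heap.isEmpty()) {
      int[] cur = heap.poll();
      int doc = postings[cur[0]][cur[1]];
      if (doc != last) { // de-duplicate docs matched by several terms
        out.add(doc);
        last = doc;
      }
      if (++cur[1] < postings[cur[0]].length) heap.add(cur); // advance this cursor
    }
    return out.stream().mapToInt(Integer::intValue).toArray();
  }

  // Bitset union: mark every doc once up front, then iterate set bits in order.
  public static int[] unionWithBitSet(int[][] postings) {
    BitSet bits = new BitSet();
    for (int[] list : postings) {
      for (int doc : list) bits.set(doc);
    }
    return bits.stream().toArray();
  }

  public static void main(String[] args) {
    int[][] p = {{1, 4, 9}, {4, 7}, {2, 9, 12}};
    System.out.println(java.util.Arrays.toString(unionWithBitSet(p))); // prints [1, 2, 4, 7, 9, 12]
  }
}
```

Both produce the same doc-id stream; the bitset trades a one-time materialization cost (and memory proportional to maxDoc) for constant-time per-doc iteration.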





[GitHub] [lucene] gsmiller opened a new pull request, #12175: Remove SortedSetDocValuesSetQuery in favor of TermInSetQuery with DocValuesRewriteMethod

2023-03-01 Thread via GitHub


gsmiller opened a new pull request, #12175:
URL: https://github.com/apache/lucene/pull/12175

   ### Description
   
   Now that `TermInSetQuery` extends `MultiTermQuery` (#12156), we can leverage 
other `RewriteMethod`s to change the query execution behavior. Because of this, 
we can use `DocValuesRewriteMethod` in place of `SortedSetDocValuesSetQuery`, 
which lets us remove some mostly redundant code and simplify.





[GitHub] [lucene] Trey314159 commented on pull request #12172: Add Romanian stopwords with s&t with comma

2023-03-01 Thread via GitHub


Trey314159 commented on PR #12172:
URL: https://github.com/apache/lucene/pull/12172#issuecomment-1450316155

   _Good catch!_ I didn't consider that the stemmer might also be of a vintage 
to only use the older orthography. I've contacted the Snowball mailing list 
(message not yet accepted) to suggest making the stemmer aware of both variants 
of each character. A character filter or token filter to convert the comma 
forms to cedilla forms seems like a reasonable hack for now, and I'll add that 
to our local implementation right away. I have to admit that it chafes a little 
to convert everything to the "wrong" form, but the internal representation is 
just an internal representation, I guess, as long as everything is consistent 
about it.
   
   Feel free to decline this pull request if that makes more sense to you, though I think it might be a good bit of future-proofing for each stage of the process to handle both forms, so as not to be reliant on a counterintuitive character mapping.





[GitHub] [lucene] benwtrent commented on a diff in pull request #12160: Concurrent rewrite for KnnVectorQuery

2023-03-01 Thread via GitHub


benwtrent commented on code in PR #12160:
URL: https://github.com/apache/lucene/pull/12160#discussion_r1121925554


##
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##
@@ -73,17 +77,48 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
              .build();
      Query rewritten = indexSearcher.rewrite(booleanQuery);
      filterWeight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f);
+    } else {
+      filterWeight = null;
     }
 
-    for (LeafReaderContext ctx : reader.leaves()) {
-      TopDocs results = searchLeaf(ctx, filterWeight);
-      if (ctx.docBase > 0) {
-        for (ScoreDoc scoreDoc : results.scoreDocs) {
-          scoreDoc.doc += ctx.docBase;
-        }
-      }
-      perLeafResults[ctx.ord] = results;
-    }
+    List<FutureTask<TopDocs>> tasks =
+        reader.leaves().stream()
+            .map(
+                ctx ->
+                    new FutureTask<>(
+                        () -> {
+                          try {
+                            TopDocs results = searchLeaf(ctx, filterWeight);
+                            if (ctx.docBase > 0) {
+                              for (ScoreDoc scoreDoc : results.scoreDocs) {
+                                scoreDoc.doc += ctx.docBase;
+                              }
+                            }
+                            return results;
+                          } catch (IOException e) {
+                            throw new UncheckedIOException(e);
+                          }
+                        }))
+            .toList();
+
+    Executor executor = Objects.requireNonNullElse(indexSearcher.getExecutor(), Runnable::run);
+    SliceExecutor sliceExecutor = new SliceExecutor(executor);
+    sliceExecutor.invokeAll(tasks);

Review Comment:
   > For the logic duplication
   
   I wouldn't worry about that. That makes things even more difficult to reason about. I would much rather have a method that takes in the filter weight and leaf contexts, and one that takes the same parameters but with an added `Executor`.
   
   One called when `indexSearcher.getExecutor() == null` and the other when the executor is provided.
   Two methods like this:
   ```java
  private TopDocs[] gatherLeafResults(
      List<LeafReaderContext> leafReaderContexts, Weight filterWeight) throws IOException {
    TopDocs[] perLeafResults = new TopDocs[leafReaderContexts.size()];
    for (LeafReaderContext ctx : leafReaderContexts) {
      TopDocs results = searchLeaf(ctx, filterWeight);
      if (ctx.docBase > 0) {
        for (ScoreDoc scoreDoc : results.scoreDocs) {
          scoreDoc.doc += ctx.docBase;
        }
      }
      perLeafResults[ctx.ord] = results;
    }
    return perLeafResults;
  }

  private TopDocs[] gatherLeafResults(
      List<LeafReaderContext> leafReaderContexts, Weight filterWeight, Executor executor) {
    List<FutureTask<TopDocs>> tasks =
        leafReaderContexts.stream()
            .map(
                ctx ->
                    new FutureTask<>(
                        () -> {
                          TopDocs results = searchLeaf(ctx, filterWeight);
                          if (ctx.docBase > 0) {
                            for (ScoreDoc scoreDoc : results.scoreDocs) {
                              scoreDoc.doc += ctx.docBase;
                            }
                          }
                          return results;
                        }))
            .toList();

    SliceExecutor sliceExecutor = new SliceExecutor(executor);
    sliceExecutor.invokeAll(tasks);

    return tasks.stream()
        .map(
            task -> {
              try {
                return task.get();
              } catch (ExecutionException e) {
                throw new RuntimeException(e.getCause());
              } catch (InterruptedException e) {
                throw new ThreadInterruptedException(e);
              }
            })
        .toArray(TopDocs[]::new);
  }
   ```



##
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##
@@ -73,17 +77,48 @@ public Query rewrite(IndexSearcher indexSearcher) throws 
IOException {
   .build();
   Query rewritten = indexSearcher.rewrite(booleanQuery);
   filterWeight = indexSearcher.createWeight(rewritten, 
ScoreMode.COMPLETE_NO_SCORES, 1f);
+} else {
+  filterWeight = null;
 }
 
-for (LeafReaderContext ctx : reader.leaves()) {
-  TopDocs results = searchLeaf(ctx, filterWeight);
-  if (ctx.docBase > 0) {
-for (ScoreDoc scoreDoc : results.scoreDocs) {
-  scoreDoc.doc += ctx.docBase;
-}
-  }
-  perLeafResults[ctx.ord] = results;
-}
+List> tasks =
+reader.leaves().stream()
+.map(
+ctx ->
+new FutureTask<>(
+  

[GitHub] [lucene] gsmiller merged pull request #12175: Remove SortedSetDocValuesSetQuery in favor of TermInSetQuery with DocValuesRewriteMethod

2023-03-01 Thread via GitHub


gsmiller merged PR #12175:
URL: https://github.com/apache/lucene/pull/12175





[GitHub] [lucene] rmuir opened a new issue, #12176: TermInSetQuery could use (variant of) DaciukMihov/Terms.intersect() for faster intersection

2023-03-01 Thread via GitHub


rmuir opened a new issue, #12176:
URL: https://github.com/apache/lucene/issues/12176

   ### Description
   
   TermInSetQuery currently "ping-pong" intersects a sorted list against the term dictionary.
   
   Instead of a sorted list, it could possibly use a Daciuk-Mihov automaton, which can be built in linear time. Then the query could leverage `Terms.intersect` (e.g. TermInSetQuery could be an AutomatonQuery subclass).
   
   This should give faster intersection of the terms, which is usually the heavy part of this query. For example, the BlockTree terms dictionary has a very efficient `Terms.intersect` that makes use of the underlying structure.
   
   The annoying part: `DaciukMihovAutomatonBuilder` currently requires Unicode strings and builds a UTF-32 automaton, which would then be converted to a UTF-8 (binary) automaton via `UTF32ToUTF8`. But I think `TermInSetQuery` may allow arbitrary non-Unicode binary strings?
   
   In order to support arbitrary binary terms (and to avoid conversions), the Daciuk-Mihov code would have to be modified to support construction of a binary automaton directly. Probably this is actually simpler?
   
   This is just an idea to get more performance; it hasn't been tested. Feel free to close the issue if it doesn't work out.
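The "ping-pong" baseline can be sketched with two plain sorted `String` arrays standing in for the query's terms and the index's term dictionary. This is only an illustration (the class name and the simple merge loop are invented for the sketch; the real query alternates `TermsEnum.seekCeil`-style seeks, and the proposal above would replace the query-side list with an automaton that `Terms.intersect` can walk):

```java
import java.util.ArrayList;
import java.util.List;

// Two sorted sequences alternately "seek" to each other's next candidate:
// whichever side is behind advances, and equal terms are emitted as hits.
public class PingPongIntersect {
  public static List<String> intersect(String[] queryTerms, String[] dictionary) {
    List<String> hits = new ArrayList<>();
    int q = 0, d = 0;
    while (q < queryTerms.length && d < dictionary.length) {
      int cmp = queryTerms[q].compareTo(dictionary[d]);
      if (cmp == 0) {
        hits.add(queryTerms[q]); // term exists in the dictionary
        q++;
        d++;
      } else if (cmp < 0) {
        q++; // dictionary is ahead: advance the query-term side
      } else {
        d++; // query term is ahead: advance the dictionary side
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    System.out.println(
        intersect(
            new String[] {"apple", "fig", "kiwi"},
            new String[] {"apple", "banana", "kiwi", "pear"})); // prints [apple, kiwi]
  }
}
```

The automaton variant wins when the dictionary can skip whole blocks of terms that no path in the automaton can match, rather than advancing one comparison at a time.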





[GitHub] [lucene] rmuir commented on pull request #12172: Add Romanian stopwords with s&t with comma

2023-03-01 Thread via GitHub


rmuir commented on PR #12172:
URL: https://github.com/apache/lucene/pull/12172#issuecomment-1450458707

   I think we can merge this stopword list change anyway. But I think a filter may be worthwhile as a separate PR?
   
   It has the advantage of making the terms conflate regardless of which variant of the character is being used. After reading up on the history of these characters, I think we should always treat them "the same" for Romanian. It will allow queries/documents to match in all cases. So I don't think of it as a hack, as the "different characters" may impact words beyond just stopwords and the suffixes that are stemmed, too.
   
   I'd recommend doing it as a TokenFilter. Many Lucene analyzers for other languages have such normalizers as token filters to deal with similar issues. A CharFilter is not needed in this case, as it won't impact tokenization: StandardTokenizer etc. will tokenize them all the same way (same Unicode word-break properties).





[GitHub] [lucene] rmuir commented on pull request #12172: Add Romanian stopwords with s&t with comma

2023-03-01 Thread via GitHub


rmuir commented on PR #12172:
URL: https://github.com/apache/lucene/pull/12172#issuecomment-1450468166

   > I've contacted the Snowball mailing list (message not yet accepted)
   
   fwiw I'm subscribed to that list and haven't seen a message in 10 years. I 
think they are just using github issues/PRs now?
   





[GitHub] [lucene] rmuir commented on pull request #12172: Add Romanian stopwords with s&t with comma

2023-03-01 Thread via GitHub


rmuir commented on PR #12172:
URL: https://github.com/apache/lucene/pull/12172#issuecomment-1450554452

   > I have to admit that it chafes a little to convert everything to the 
"wrong" form, but the internal representation is just an internal 
representation, I guess, as long as everything is consistent about it.
   
   It is a good point; the representation isn't always internal if we consider 
external libraries and other things that lucene can integrate with for language 
processing.
   
   As another datapoint, I investigated Romanian hunspell dictionary (as a user 
could substitute HunspellStemFilter for SnowballFilter if they want 
dictionary-based stemming, and because I was curious)
   
   They are using `ș` and `ț`, but have directives at the end to handle 
differences during the suggestions phase:
   ```
   MAP sşș
   MAP tţț
   ```
   
   So I think the most ideal situation would be to both fix snowball and then 
map cedilla to "correct" forms with a TokenFilter? I realize this doesn't solve 
your problem immediately, but if PR gets accepted to snowball, I can just bump 
our git revision to the new one and we'll have the new stemmer :)





[GitHub] [lucene] kashkambath opened a new issue, #12178: TermAutomatonQuery explain() should return relevant explain output instead of null

2023-03-01 Thread via GitHub


kashkambath opened a new issue, #12178:
URL: https://github.com/apache/lucene/issues/12178

   ### Description
   
   Hi! This is my first time posting a GitHub issue for Apache Lucene. Please 
let me know if you need anything further.
   
   
https://github.com/apache/lucene/blob/569533bd76a115e239d8c621d2540462bc12d891/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/TermAutomatonQuery.java#L443-L447
   
   We should return relevant explain output here instead of returning `null` 
for `TermAutomatonQuery`. This would be helpful for users running document 
explain on synonym queries leveraging `TermAutomatonQuery`. Otherwise, certain 
queries that can nest other queries could throw a `NullPointerException` if 
their weight's `explain()` calls `TermAutomatonQuery#explain()`. 
   
   For example, if a `BooleanQuery` nests a `TermAutomatonQuery`, then 
`BooleanWeight` throws an NPE here when running `explain()`:
   
   
https://github.com/apache/lucene/blob/569533bd76a115e239d8c621d2540462bc12d891/lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java#L75
   
   ### Version and environment details
   
   _No response_
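
   To illustrate why the `null` return breaks nesting callers (using 
hypothetical, simplified stand-ins for Lucene's `Explanation`/`Weight` types, 
not the real API), compare a weight whose explain returns `null` with one that 
always returns a non-null no-match explanation:
   ```java
// Simplified stand-ins showing why a null sub-explanation causes an
// NPE in a BooleanWeight-style caller, which dereferences the result
// of each sub-weight's explain() unconditionally.
class ExplainNullDemo {

  record Explanation(boolean isMatch, float value, String description) {
    static Explanation match(float value, String desc) {
      return new Explanation(true, value, desc);
    }
    static Explanation noMatch(String desc) {
      return new Explanation(false, 0f, desc);
    }
  }

  // What the weight effectively does today: return null.
  static Explanation explainReturningNull() {
    return null;
  }

  // What it could do instead: always return a non-null explanation.
  static Explanation explainSafe(boolean matched, float score) {
    return matched
        ? Explanation.match(score, "TermAutomatonQuery match")
        : Explanation.noMatch("no matching terms");
  }

  // A caller in the style of BooleanWeight#explain: it invokes
  // isMatch() on the sub-explanation, so null blows up with an NPE.
  static boolean subQueryMatches(Explanation sub) {
    return sub.isMatch();
  }
}
   ```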





[GitHub] [lucene] kaivalnp commented on a diff in pull request #12160: Concurrent rewrite for KnnVectorQuery

2023-03-01 Thread via GitHub


kaivalnp commented on code in PR #12160:
URL: https://github.com/apache/lucene/pull/12160#discussion_r1122648348


##
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##
@@ -73,17 +77,48 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
               .build();
       Query rewritten = indexSearcher.rewrite(booleanQuery);
       filterWeight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f);
+    } else {
+      filterWeight = null;
     }
 
-    for (LeafReaderContext ctx : reader.leaves()) {
-      TopDocs results = searchLeaf(ctx, filterWeight);
-      if (ctx.docBase > 0) {
-        for (ScoreDoc scoreDoc : results.scoreDocs) {
-          scoreDoc.doc += ctx.docBase;
-        }
-      }
-      perLeafResults[ctx.ord] = results;
-    }
+    List<FutureTask<TopDocs>> tasks =
+        reader.leaves().stream()
+            .map(
+                ctx ->
+                    new FutureTask<>(
+                        () -> {
+                          try {
+                            TopDocs results = searchLeaf(ctx, filterWeight);
+                            if (ctx.docBase > 0) {
+                              for (ScoreDoc scoreDoc : results.scoreDocs) {
+                                scoreDoc.doc += ctx.docBase;
+                              }
+                            }
+                            return results;
+                          } catch (IOException e) {
+                            throw new UncheckedIOException(e);
+                          }
+                        }))
+            .toList();
+
+    Executor executor = Objects.requireNonNullElse(indexSearcher.getExecutor(), Runnable::run);
+    SliceExecutor sliceExecutor = new SliceExecutor(executor);
+    sliceExecutor.invokeAll(tasks);

Review Comment:
   Thanks for the input!
   This was really helpful in reducing overhead for non-concurrent search, and 
improving readability!
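
   The fan-out pattern in the diff above — wrap per-leaf work in `FutureTask`s 
and fall back to same-thread execution when no executor is configured — can be 
sketched with plain `java.util.concurrent` types (the task bodies below are 
hypothetical stand-ins for `searchLeaf`, not the actual Lucene code):
   ```java
import java.util.List;
import java.util.Objects;
import java.util.concurrent.Executor;
import java.util.concurrent.FutureTask;
import java.util.stream.IntStream;

// Each unit of work becomes a FutureTask, and Runnable::run serves as a
// same-thread Executor when none is configured, so the sequential path
// pays no thread-pool overhead.
class PerLeafFanOut {

  static int[] runAll(Executor configured, int numLeaves) throws Exception {
    // Mirrors Objects.requireNonNullElse(indexSearcher.getExecutor(), Runnable::run)
    Executor executor = Objects.requireNonNullElse(configured, Runnable::run);

    List<FutureTask<Integer>> tasks =
        IntStream.range(0, numLeaves)
            .mapToObj(ord -> new FutureTask<>(() -> ord * ord)) // stand-in for searchLeaf
            .toList();

    tasks.forEach(executor::execute); // analogue of SliceExecutor#invokeAll

    int[] perLeafResults = new int[numLeaves];
    for (int i = 0; i < numLeaves; i++) {
      perLeafResults[i] = tasks.get(i).get(); // blocks until the task completed
    }
    return perLeafResults;
  }
}
   ```
   With `Runnable::run` the tasks execute inline on the calling thread, which 
is why the null-executor case adds essentially no overhead over the original 
sequential loop.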


