[GitHub] [lucene] jpountz merged pull request #12064: Create new KnnByteVectorField and KnnVectorsReader#getByteVectorValues(String)

2023-01-11 Thread GitBox


jpountz merged PR #12064:
URL: https://github.com/apache/lucene/pull/12064


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #12064: Create new KnnByteVectorField and KnnVectorsReader#getByteVectorValues(String)

2023-01-11 Thread GitBox


jpountz commented on PR #12064:
URL: https://github.com/apache/lucene/pull/12064#issuecomment-1378387639

   @benwtrent Would you mind working on a backport PR, since there are a few 
conflicts that need resolving?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] LuXugang opened a new issue, #12074: Enhance XXXField#newRangeQuery

2023-01-11 Thread GitBox


LuXugang opened a new issue, #12074:
URL: https://github.com/apache/lucene/issues/12074

   ### Description
   
   Since dim of XXXField's point value is always `1`, should we introduce 
`IndexSortSortedNumericDocValuesRangeQuery` to `IntFiled#newRangeQuery` and 
`LongField#newRangeQuery` ? This idea is also mentioned by @jpountz in 
[LUCENE-10446](https://issues.apache.org/jira/browse/LUCENE-10446).
   
   like :
   
   ```java
   public static Query newRangeQuery(String field, int lowerValue, int 
upperValue) {
   PointRangeQuery.checkArgs(field, lowerValue, upperValue);
   Query fallbackQuery = IntPoint.newRangeQuery(field, lowerValue, 
upperValue);
   return new IndexOrDocValuesQuery(
   new IndexSortSortedNumericDocValuesRangeQuery(field, lowerValue, 
upperValue, fallbackQuery),
   SortedNumericDocValuesField.newSlowRangeQuery(field, lowerValue, 
upperValue));
 }
   ```
   
   If above is possible, we should also need to move 
`IndexSortSortedNumericDocValuesRangeQuery` from sandbox to core module?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz merged pull request #12052: Cut over Lucene Demo from LongPoint to LongField.

2023-01-11 Thread GitBox


jpountz merged PR #12052:
URL: https://github.com/apache/lucene/pull/12052


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz merged pull request #12070: Never throttle creation of compound files.

2023-01-11 Thread GitBox


jpountz merged PR #12070:
URL: https://github.com/apache/lucene/pull/12070


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz closed issue #12068: Is it right to throttle the creation of compound files?

2023-01-11 Thread GitBox


jpountz closed issue #12068: Is it right to throttle the creation of compound 
files?
URL: https://github.com/apache/lucene/issues/12068


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #12053: Allow reusing indexed binary fields.

2023-01-11 Thread GitBox


jpountz commented on PR #12053:
URL: https://github.com/apache/lucene/pull/12053#issuecomment-1378448862

   @rmuir Do you have thoughts on this change?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #12053: Allow reusing indexed binary fields.

2023-01-11 Thread GitBox


rmuir commented on PR #12053:
URL: https://github.com/apache/lucene/pull/12053#issuecomment-1378455423

   @jpountz I will try to review this today. Sorry for the delay. I haven't 
written java code in years, i'm crazy busy at work, and i try to give more time 
to `.document` api all contributing to my slowness. I had in mind that i 
wanted to checkout this branch and "poke" too


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #12053: Allow reusing indexed binary fields.

2023-01-11 Thread GitBox


jpountz commented on PR #12053:
URL: https://github.com/apache/lucene/pull/12053#issuecomment-1378457531

   Thank you, and no worries at all about the delay, I just wanted to check if 
it was still on your mind since you said you were interested in looking into it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent opened a new pull request, #12075: Create new KnnByteVectorField and KnnVectorsReader#getByteVectorValues(String) (#12064)

2023-01-11 Thread GitBox


benwtrent opened a new pull request, #12075:
URL: https://github.com/apache/lucene/pull/12075

   Backport of #12064
   
   This completes the refactoring as described in: 
https://github.com/apache/lucene/issues/11963
   
   This commit:
- splits out `ByteVectorValues` from `VectorValues`.
- Adds `getByteVectorValues(String field)` to `KnnVectorsReader`
- Adds a new `KnnByteVectorField` and disallows `BytesRef` values in the 
`KnnVectorField`
- No longer allows `ByteVectorValues` to be read from a `KnnVectorField`.
   
   These refactors are difficult to split up any further.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on pull request #12075: Create new KnnByteVectorField and KnnVectorsReader#getByteVectorValues(String) (#12064)

2023-01-11 Thread GitBox


benwtrent commented on PR #12075:
URL: https://github.com/apache/lucene/pull/12075#issuecomment-1378802761

   @jpountz backport :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] sherman opened a new issue, #12076: The question about MultiRangeQuery.

2023-01-11 Thread GitBox


sherman opened a new issue, #12076:
URL: https://github.com/apache/lucene/issues/12076

   Hi!
   
   AFAICS, the family of multiple ranges queries (MultiRangeQuery) were added 
to the sandbox module recently.
   
   In our index we have a few indexed fields with type long (aren't ranges, 
it's just long ids, dimension = 1).
   
   We have about 16% of search queries which include a predicate like: where 
field in (id1, id2, id3), and the length of argument list is over 40 elements.
   
   Currently, our parser maps such predicates as a list of separate 
PointRangeQuery. When I found MultiRangeQuery, I immediately started the tests 
to see the way how to optimize such queries.
   
   The first results were impressive, because I saw the difference in the 
amount of data were read from mapped files.
   
   The list of PointRangeQuery
   ```
   16:39:23.587 [ProfilingSearchThread] INFO  
com.ozon.grpc.BlockingRequestExecutor  - Query statistics [null]. Seeks: 
[2363], Bytes were read: [78198], Shorts were read: [1266], Ints were read: 
[48],  Longs were read: [9821], Floats were read: [0], Total bytes were read: 
[159490]
   16:39:27.030 [ProfilingSearchThread] INFO  
com.ozon.grpc.BlockingRequestExecutor  - Query statistics [null]. Seeks: 
[2363], Bytes were read: [78198], Shorts were read: [1266], Ints were read: 
[48],  Longs were read: [9821], Floats were read: [0], Total bytes were read: 
[159490]
   16:39:35.890 [ProfilingSearchThread] INFO  
com.ozon.grpc.BlockingRequestExecutor  - Query statistics [null]. Seeks: 
[2363], Bytes were read: [78198], Shorts were read: [1266], Ints were read: 
[48],  Longs were read: [9821], Floats were read: [0], Total bytes were read: 
[159490]
   ```
   
   A single MultiRangeQuery
   ```
   16:49:02.372 [ProfilingSearchThread] INFO  
com.ozon.grpc.BlockingRequestExecutor  - Query statistics [null]. Seeks: [505], 
Bytes were read: [61917], Shorts were read: [1266], Ints were read: [48],  
Longs were read: [9053], Floats were read: [0], Total bytes were read: [137065]
   16:49:03.794 [ProfilingSearchThread] INFO  
com.ozon.grpc.BlockingRequestExecutor  - Query statistics [null]. Seeks: [505], 
Bytes were read: [61917], Shorts were read: [1266], Ints were read: [48],  
Longs were read: [9053], Floats were read: [0], Total bytes were read: [137065]
   16:49:05.582 [ProfilingSearchThread] INFO  
com.ozon.grpc.BlockingRequestExecutor  - Query statistics [null]. Seeks: [505], 
Bytes were read: [61917], Shorts were read: [1266], Ints were read: [48],  
Longs were read: [9053], Floats were read: [0], Total bytes were read: [137065]
   ``` 
   
   The numbers above said in case of MultiRangeQuery we have a reduction of 
seek and read long operations, as expected.
   
   Surprisingly, MultiRangeQuery is slower in terms of wall clock time.
   I compared the timings for the same queries and I see the significant 
slowdown in the build score stage.
   
   The list of PointRangeQuery
   ```
   "timings": {
   "score": "50297",
   "build_scorer_count": "2",
   "match_count": "29",
   "create_weight": "562343",
   "next_doc": "252004",
   "match": "12136",
   "create_weight_count": "1",
   "next_doc_count": "30",
   "score_count": "29",
   "build_scorer": "2884027",
   "advance": "0",
   "advance_count": "0"
  },
  "statistics": {
   "match_avg": "418",
   "next_doc_avg": "8400",
   "build_scorer_avg": "1442013",
   "score_avg": "1734"
  },
  "total": "3760898"
   }
   ```
   
   A single MultiRangeQuery
   ```
   timings": {
   "score": "7793",
   "build_scorer_count": "2",
   "match_count": "0",
   "create_weight": "7762",
   "next_doc": "12713",
   "match": "0",
   "create_weight_count": "1",
   "next_doc_count": "30",
   "score_count": "29",
   "build_scorer": "5283654",
   "advance": "0",
   "advance_count": "0"
  },
  "statistics": {
   "next_doc_avg": "423",
   "build_scorer_avg": "2641827",
   "score_avg": "268"
  },
  "total": "5311984"
   }
   ```
   
   As you can see, build_scorer is pretty "fat" in MultiRangeQuery. I guess, 
the most of time was spent for a building an interval tree in the method: 
`MultiRangeQuery.scorerSupplier()`
   
   Is it expected behaviour?
   
   And what the case of usage for MultiRangeQuery (a number of ranges reduction 
through the merges is probably one of, but it's not mine)?
   
   Is any known problems in the interval tree implementation?
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-

[GitHub] [lucene] jpountz merged pull request #12075: Create new KnnByteVectorField and KnnVectorsReader#getByteVectorValues(String) (#12064)

2023-01-11 Thread GitBox


jpountz merged PR #12075:
URL: https://github.com/apache/lucene/pull/12075


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #12075: Create new KnnByteVectorField and KnnVectorsReader#getByteVectorValues(String) (#12064)

2023-01-11 Thread GitBox


jpountz commented on PR #12075:
URL: https://github.com/apache/lucene/pull/12075#issuecomment-1378896877

   Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check

2023-01-11 Thread GitBox


jpountz commented on PR #12073:
URL: https://github.com/apache/lucene/pull/12073#issuecomment-1378939790

   Do these queries actually use `ReqExclScorer`? I would have expected them to 
use `ReqExclBulkScorer`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check

2023-01-11 Thread GitBox


rmuir commented on PR #12073:
URL: https://github.com/apache/lucene/pull/12073#issuecomment-1378947355

   I cant keep up with the conditions where bulk is unsuitable. But just 
generally, my concern is to implement a best-case 1-2% speedup and create a 
large worst-case slowdown somewhere else.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check

2023-01-11 Thread GitBox


jpountz commented on PR #12073:
URL: https://github.com/apache/lucene/pull/12073#issuecomment-1378951278

   Sorry my comment was not targeted at your comment but at understanding why 
we're seeing a speedup with this change, since I wouldn't expect this scorer to 
be used (though my understanding could be outdated).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna commented on a diff in pull request #12072: Fix exponential runtime for Boolean#rewrite

2023-01-11 Thread GitBox


javanna commented on code in PR #12072:
URL: https://github.com/apache/lucene/pull/12072#discussion_r1067144911


##
lucene/core/src/test/org/apache/lucene/search/TestBooleanRewrites.java:
##
@@ -322,6 +323,45 @@ public void testMatchAllMustNot() throws IOException {
 assertEquals(new MatchNoDocsQuery(), searcher.rewrite(bq2));
   }
 
+  public void testDeeplyNestedBooleanRewrite() throws IOException {
+IndexSearcher searcher = newSearcher(new MultiReader());
+
+Directory dir = newDirectory();
+(new RandomIndexWriter(random(), dir)).close();
+IndexReader r = DirectoryReader.open(dir);
+Function termQueryFunction =
+(i) -> new TermQuery(new Term("layer[" + i + "]", "foo"));
+int depth = TestUtil.nextInt(random(), 10, 30);
+TermQuery tq = termQueryFunction.apply(depth);
+Query expectedQuery = new BooleanQuery.Builder().add(tq, 
Occur.FILTER).build();

Review Comment:
   Would it make sense to write a similar test for boolean queries that do need 
scoring? I am worried that the wrapping we do in the main rewrite has some 
impact too but we don't exercise it as much with this test.



##
lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java:
##
@@ -203,9 +203,17 @@ BooleanQuery rewriteNoScoring(IndexSearcher indexSearcher) 
throws IOException {
 
 for (BooleanClause clause : clauses) {
   Query query = clause.getQuery();
-  Query rewritten = new ConstantScoreQuery(query).rewrite(indexSearcher);
-  if (rewritten instanceof ConstantScoreQuery) {
-rewritten = ((ConstantScoreQuery) rewritten).getQuery();
+  Query rewritten;
+  // If the clause query is a boolean query, we shouldn't call 
ConstantScoreQuery#rewrite as
+  // this causes
+  // exponential growth of runtime.
+  if (query instanceof BooleanQuery booleanQuery) {
+rewritten = booleanQuery.rewriteNoScoring(indexSearcher);

Review Comment:
   I would make the same change to the main rewrite method. That will reduce 
further the amount of time we spend on needlessly rewriting boolean queries.



##
lucene/core/src/test/org/apache/lucene/search/TestBooleanRewrites.java:
##
@@ -322,6 +323,45 @@ public void testMatchAllMustNot() throws IOException {
 assertEquals(new MatchNoDocsQuery(), searcher.rewrite(bq2));
   }
 
+  public void testDeeplyNestedBooleanRewrite() throws IOException {
+IndexSearcher searcher = newSearcher(new MultiReader());
+
+Directory dir = newDirectory();
+(new RandomIndexWriter(random(), dir)).close();
+IndexReader r = DirectoryReader.open(dir);
+Function termQueryFunction =
+(i) -> new TermQuery(new Term("layer[" + i + "]", "foo"));
+int depth = TestUtil.nextInt(random(), 10, 30);
+TermQuery tq = termQueryFunction.apply(depth);
+Query expectedQuery = new BooleanQuery.Builder().add(tq, 
Occur.FILTER).build();
+Query deepBuilder = new BooleanQuery.Builder().add(tq, Occur.MUST).build();
+for (int i = depth; i > 0; i--) {
+  tq = termQueryFunction.apply(i);
+  // Do this to accurately set setMinimumNumberShouldMatch to the number 
of should clauses.
+  // This makes setting expectation for rewrite much easier.
+  boolean useShoulds = random().nextBoolean();

Review Comment:
   I wonder if we should have a separate test with should clauses, just to 
simplify the test a bit.



##
lucene/core/src/test/org/apache/lucene/search/TestBooleanRewrites.java:
##
@@ -322,6 +323,45 @@ public void testMatchAllMustNot() throws IOException {
 assertEquals(new MatchNoDocsQuery(), searcher.rewrite(bq2));
   }
 
+  public void testDeeplyNestedBooleanRewrite() throws IOException {
+IndexSearcher searcher = newSearcher(new MultiReader());
+
+Directory dir = newDirectory();
+(new RandomIndexWriter(random(), dir)).close();
+IndexReader r = DirectoryReader.open(dir);
+Function termQueryFunction =
+(i) -> new TermQuery(new Term("layer[" + i + "]", "foo"));
+int depth = TestUtil.nextInt(random(), 10, 30);
+TermQuery tq = termQueryFunction.apply(depth);
+Query expectedQuery = new BooleanQuery.Builder().add(tq, 
Occur.FILTER).build();
+Query deepBuilder = new BooleanQuery.Builder().add(tq, Occur.MUST).build();
+for (int i = depth; i > 0; i--) {
+  tq = termQueryFunction.apply(i);
+  // Do this to accurately set setMinimumNumberShouldMatch to the number 
of should clauses.
+  // This makes setting expectation for rewrite much easier.
+  boolean useShoulds = random().nextBoolean();
+  BooleanQuery.Builder bq =
+  new BooleanQuery.Builder()
+  .setMinimumNumberShouldMatch(useShoulds ? 2 : 0)
+  .add(tq, useShoulds ? Occur.SHOULD : Occur.MUST)
+  .add(deepBuilder, useShoulds ? Occur.SHOULD : Occur.MUST);
+  deepBuilder = bq.build();
+  BooleanQuery.Builder expectedBq = new BooleanQuery.Builder().add(tq, 
Occur.FILT

[GitHub] [lucene] rmuir commented on issue #12076: The question about MultiRangeQuery.

2023-01-11 Thread GitBox


rmuir commented on issue #12076:
URL: https://github.com/apache/lucene/issues/12076#issuecomment-1378993896

   > In our index we have a few indexed fields with type long (aren't ranges, 
it's just long ids, dimension = 1).
   > 
   > We have about 16% of search queries which include a predicate like: where 
field in (id1, id2, id3), and the length of argument list is over 40 elements.
   
   This sounds like you want to use `LongPoint.newSetQuery()` (e.g. 
PointInSetQuery)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12072: Fix exponential runtime for Boolean#rewrite

2023-01-11 Thread GitBox


benwtrent commented on code in PR #12072:
URL: https://github.com/apache/lucene/pull/12072#discussion_r1067160561


##
lucene/core/src/test/org/apache/lucene/search/TestBooleanRewrites.java:
##
@@ -322,6 +323,45 @@ public void testMatchAllMustNot() throws IOException {
 assertEquals(new MatchNoDocsQuery(), searcher.rewrite(bq2));
   }
 
+  public void testDeeplyNestedBooleanRewrite() throws IOException {
+IndexSearcher searcher = newSearcher(new MultiReader());
+
+Directory dir = newDirectory();
+(new RandomIndexWriter(random(), dir)).close();
+IndexReader r = DirectoryReader.open(dir);
+Function termQueryFunction =
+(i) -> new TermQuery(new Term("layer[" + i + "]", "foo"));
+int depth = TestUtil.nextInt(random(), 10, 30);
+TermQuery tq = termQueryFunction.apply(depth);
+Query expectedQuery = new BooleanQuery.Builder().add(tq, 
Occur.FILTER).build();

Review Comment:
   The key issue for the bug was when there was no scoring, and thus fell into 
the `rewriteNoScoring` case from the CSQ.
   
   All my testing indicates that runtime doesn't exponentially grow when we 
need scoring as we don't use the CSQ redirect.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12072: Fix exponential runtime for Boolean#rewrite

2023-01-11 Thread GitBox


benwtrent commented on code in PR #12072:
URL: https://github.com/apache/lucene/pull/12072#discussion_r1067162137


##
lucene/core/src/test/org/apache/lucene/search/TestBooleanRewrites.java:
##
@@ -322,6 +323,45 @@ public void testMatchAllMustNot() throws IOException {
 assertEquals(new MatchNoDocsQuery(), searcher.rewrite(bq2));
   }
 
+  public void testDeeplyNestedBooleanRewrite() throws IOException {
+IndexSearcher searcher = newSearcher(new MultiReader());
+
+Directory dir = newDirectory();
+(new RandomIndexWriter(random(), dir)).close();
+IndexReader r = DirectoryReader.open(dir);
+Function termQueryFunction =
+(i) -> new TermQuery(new Term("layer[" + i + "]", "foo"));
+int depth = TestUtil.nextInt(random(), 10, 30);
+TermQuery tq = termQueryFunction.apply(depth);
+Query expectedQuery = new BooleanQuery.Builder().add(tq, 
Occur.FILTER).build();
+Query deepBuilder = new BooleanQuery.Builder().add(tq, Occur.MUST).build();
+for (int i = depth; i > 0; i--) {
+  tq = termQueryFunction.apply(i);
+  // Do this to accurately set setMinimumNumberShouldMatch to the number 
of should clauses.
+  // This makes setting expectation for rewrite much easier.
+  boolean useShoulds = random().nextBoolean();
+  BooleanQuery.Builder bq =
+  new BooleanQuery.Builder()
+  .setMinimumNumberShouldMatch(useShoulds ? 2 : 0)
+  .add(tq, useShoulds ? Occur.SHOULD : Occur.MUST)
+  .add(deepBuilder, useShoulds ? Occur.SHOULD : Occur.MUST);
+  deepBuilder = bq.build();
+  BooleanQuery.Builder expectedBq = new BooleanQuery.Builder().add(tq, 
Occur.FILTER);
+  if (i == depth - 1) {
+expectedBq.add(termQueryFunction.apply(depth), Occur.FILTER);
+  } else {
+expectedBq.add(expectedQuery, Occur.FILTER);
+  }
+  expectedQuery = expectedBq.build();
+}
+BooleanQuery bq = new BooleanQuery.Builder().add(deepBuilder, 
Occur.FILTER).build();
+expectedQuery = new BoostQuery(new ConstantScoreQuery(expectedQuery), 
0.0f);
+Query rewritten = searcher.rewrite(bq);
+r.close();
+dir.close();
+assertEquals(expectedQuery, rewritten);

Review Comment:
   I wrote this test originally to compare my change with the original 
implementation. So, I wrote the test to pass the original optimization. Then I 
verified my change didn't break anything. 
   
   I did use it for checking runtime as well, but that was secondary to making 
sure I didn't break anything.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna commented on pull request #12072: Fix exponential runtime for Boolean#rewrite

2023-01-11 Thread GitBox


javanna commented on PR #12072:
URL: https://github.com/apache/lucene/pull/12072#issuecomment-1379007346

   I did some additional testing to understand the impact of the regression and 
in light of that I view this as a bug rather than a performance regression, 
because we end up performing many needless rewrite steps that waste resources 
and  end up making rewrite take tens of seconds depending on the depth of a 
boolean query.
   
   I added some logging to the rewrite methods involved to more clearly show 
the issue. The following is the output from a boolean query with depth 3.
   
   ```
   IndexSearcher searcher = newSearcher(new MultiReader());
   Directory dir = newDirectory();
   (new RandomIndexWriter(random(), dir)).close();
   IndexReader r = DirectoryReader.open(dir);
   
   int depth = 3;
   BooleanQuery.Builder bq = new BooleanQuery.Builder().add(new 
TermQuery(new Term("field", "value")), Occur.MUST);
   for (int i = 0; i < depth; i++) {
 bq = new BooleanQuery.Builder().add(bq.build(), Occur.MUST).add(new 
TermQuery(new Term("depth" + (depth - i), "value")), Occur.MUST);
   }
   BooleanQuery booleanQuery = new BooleanQuery.Builder().add(bq.build(), 
Occur.FILTER).build();
   Query rewritten = searcher.rewrite(booleanQuery);
   System.out.println("final rewritten query: " + rewritten);
   r.close();
   dir.close();
   ```
   
   Output before the fix (87 lines):
   ```
   boolean query rewrite: #(+(+(+(+field:value) +depth3:value) +depth2:value) 
+depth1:value) -> (ConstantScore(+(+(+(+field:value) +depth3:value) 
+depth2:value) +depth1:value))^0.0
   -- top level rewrite round 
completed
   + constant score query rewriting inner query: +(+(+(+field:value) 
+depth3:value) +depth2:value) +depth1:value
   boolean query rewrite: +field:value -> field:value
   boolean query rewrite: +(+field:value) +depth3:value -> +field:value 
+depth3:value
   boolean query rewrite: +(+(+field:value) +depth3:value) +depth2:value -> 
+(+field:value +depth3:value) +depth2:value
   boolean query rewrite: +(+(+(+field:value) +depth3:value) +depth2:value) 
+depth1:value -> +(+(+field:value +depth3:value) +depth2:value) +depth1:value
   + constant score query rewriting inner boolean query no scoring: 
+(+(+field:value +depth3:value) +depth2:value) +depth1:value
   + Artificially wrapping into constant score query to rewrite
   + constant score query rewriting inner query: +(+field:value +depth3:value) 
+depth2:value
   boolean query rewrite: +field:value +depth3:value -> +field:value 
+depth3:value
   boolean query rewrite: +(+field:value +depth3:value) +depth2:value -> 
+(+field:value +depth3:value) +depth2:value
   + constant score query rewriting inner boolean query no scoring: 
+(+field:value +depth3:value) +depth2:value
   + Artificially wrapping into constant score query to rewrite
   + constant score query rewriting inner query: +field:value +depth3:value
   boolean query rewrite: +field:value +depth3:value -> +field:value 
+depth3:value
   + constant score query rewriting inner boolean query no scoring: 
+field:value +depth3:value
   + Artificially wrapping into constant score query to rewrite
   + constant score query rewriting inner query: field:value
   + Artificially wrapping into constant score query to rewrite
   + constant score query rewriting inner query: depth3:value
   boolean query rewrite no scoring: +field:value +depth3:value -> #field:value 
#depth3:value
   + Artificially wrapping into constant score query to rewrite
   + constant score query rewriting inner query: depth2:value
   boolean query rewrite no scoring: +(+field:value +depth3:value) 
+depth2:value -> #(#field:value #depth3:value) #depth2:value
   + Artificially wrapping into constant score query to rewrite
   + constant score query rewriting inner query: depth1:value
   boolean query rewrite no scoring: +(+(+field:value +depth3:value) 
+depth2:value) +depth1:value -> #(#(#field:value #depth3:value) #depth2:value) 
#depth1:value
   -- top level rewrite round 
completed
   + constant score query rewriting inner query: #(#(#field:value 
#depth3:value) #depth2:value) #depth1:value
   + constant score query rewriting inner query: #(#field:value #depth3:value) 
#depth2:value
   + constant score query rewriting inner query: #field:value #depth3:value
   + constant score query rewriting inner query: field:value
   + constant score query rewriting inner query: depth3:value
   boolean query rewrite: #field:value #depth3:value -> #field:value 
#depth3:value
   + constant score query rewriting inner boolean query no scoring: 
#field:value #depth3:value
   + Artificially wrapping into constant score query to rewrite
   + constant score query rewriting inner query: field:value
   + Artificially wrapping into constant score query to rewrite
   + constant score query rewriting inner query: depth3:v

[GitHub] [lucene] benwtrent commented on a diff in pull request #12072: Fix exponential runtime for Boolean#rewrite

2023-01-11 Thread GitBox


benwtrent commented on code in PR #12072:
URL: https://github.com/apache/lucene/pull/12072#discussion_r1067175218


##
lucene/core/src/test/org/apache/lucene/search/TestBooleanRewrites.java:
##
@@ -322,6 +323,45 @@ public void testMatchAllMustNot() throws IOException {
 assertEquals(new MatchNoDocsQuery(), searcher.rewrite(bq2));
   }
 
+  public void testDeeplyNestedBooleanRewrite() throws IOException {
+IndexSearcher searcher = newSearcher(new MultiReader());
+
+Directory dir = newDirectory();
+(new RandomIndexWriter(random(), dir)).close();
+IndexReader r = DirectoryReader.open(dir);
+Function termQueryFunction =
+(i) -> new TermQuery(new Term("layer[" + i + "]", "foo"));
+int depth = TestUtil.nextInt(random(), 10, 30);
+TermQuery tq = termQueryFunction.apply(depth);
+Query expectedQuery = new BooleanQuery.Builder().add(tq, 
Occur.FILTER).build();

Review Comment:
   Also, the wrapping (I think you mean in the CSQ) we do in the main rewrite 
is ONLY in the non-scoring case, when `BooleanClause.Occur#FILTER` or 
`BooleanClause.Occur#MUST_NOT`. So, only `BooleanQuery` objects that are under 
those non-scoring clauses would be wrapped. I am not sure what we would test in 
this case...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] sherman closed issue #12076: The question about MultiRangeQuery.

2023-01-11 Thread GitBox


sherman closed issue #12076: The question about MultiRangeQuery.
URL: https://github.com/apache/lucene/issues/12076


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] sherman commented on issue #12076: The question about MultiRangeQuery.

2023-01-11 Thread GitBox


sherman commented on issue #12076:
URL: https://github.com/apache/lucene/issues/12076#issuecomment-1379062301

   @rmuir Awesome! That's what I really need. Thanks a lot!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov commented on pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-11 Thread GitBox


msokolov commented on PR #12050:
URL: https://github.com/apache/lucene/pull/12050#issuecomment-1379112305

   > To support this functionality, a couple of changes to current graph 
construction process needed to be made. OnHeapHnswGraph had to support out of 
order insertion. This is because the mapped ordinals of the nodes in the graph 
used for initialization are not necessarily the first X ordinals in the new 
graph.
   
   I'm having trouble wrapping my head around this. When we start merging some 
field, each segment seg has a graph with ordinals in [0,seg.size]. Why can't we 
preserve the ordinals from the largest segment, and then let the others fall 
where they may?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov commented on issue #12071: Can we better take advantage of compact strings?

2023-01-11 Thread GitBox


msokolov commented on issue #12071:
URL: https://github.com/apache/lucene/issues/12071#issuecomment-1379144309

   I wonder if we could update `UnicodeUtil` to use `getBytes` internally? It 
could check the size of the byte array, and if it is equal to the length of the 
string, then just return it. Otherwise it could scan the string for the 
replacement char and update it in place before returning, in order to maintain 
backward compatibility. Not sure if it would give back the gains in those cases 
though?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #12072: Fix exponential runtime for Boolean#rewrite

2023-01-11 Thread GitBox


jpountz commented on code in PR #12072:
URL: https://github.com/apache/lucene/pull/12072#discussion_r1067223779


##
lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java:
##
@@ -203,9 +203,17 @@ BooleanQuery rewriteNoScoring(IndexSearcher indexSearcher) 
throws IOException {
 
 for (BooleanClause clause : clauses) {
   Query query = clause.getQuery();
-  Query rewritten = new ConstantScoreQuery(query).rewrite(indexSearcher);
-  if (rewritten instanceof ConstantScoreQuery) {
-rewritten = ((ConstantScoreQuery) rewritten).getQuery();
+  Query rewritten;
+  // If the clause query is a boolean query, we shouldn't call 
ConstantScoreQuery#rewrite as
+  // this causes
+  // exponential growth of runtime.
+  if (query instanceof BooleanQuery booleanQuery) {
+rewritten = booleanQuery.rewriteNoScoring(indexSearcher);
+  } else {
+rewritten = new ConstantScoreQuery(query).rewrite(indexSearcher);
+if (rewritten instanceof ConstantScoreQuery constantScoreQuery) {
+  rewritten = constantScoreQuery.getQuery();
+}

Review Comment:
   I'm contemplating replacing your fix with this one which is similar:
   
   ```patch
   diff --git a/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java 
b/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java
   index 0354280eb01..be323bf0e4c 100644
   --- a/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java
   +++ b/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java
   @@ -203,7 +203,13 @@ public class BooleanQuery extends Query implements 
Iterable {

for (BooleanClause clause : clauses) {
  Query query = clause.getQuery();
   -  Query rewritten = new 
ConstantScoreQuery(query).rewrite(indexSearcher);
   +  // NOTE: rewritingNoScoring() should not call rewrite(), otherwise 
this
   +  // method could run in exponential time with the depth of the query as
   +  // every new level would rewrite 2x more than its parent level.
   +  Query rewritten = query;
   +  if (rewritten instanceof BoostQuery) {
   +rewritten = ((BoostQuery) query).getQuery();
   +  }
  if (rewritten instanceof ConstantScoreQuery) {
rewritten = ((ConstantScoreQuery) rewritten).getQuery();
  }
   
   ```
   
   I think it would be a bit more robust, as it would keep working if there are 
`ConstantScoreQuery` wrappers between the inner levels of `BooleanQuery`s.



##
lucene/core/src/test/org/apache/lucene/search/TestBooleanRewrites.java:
##
@@ -322,6 +323,45 @@ public void testMatchAllMustNot() throws IOException {
 assertEquals(new MatchNoDocsQuery(), searcher.rewrite(bq2));
   }
 
+  public void testDeeplyNestedBooleanRewrite() throws IOException {
+IndexSearcher searcher = newSearcher(new MultiReader());
+
+Directory dir = newDirectory();
+(new RandomIndexWriter(random(), dir)).close();
+IndexReader r = DirectoryReader.open(dir);
+Function termQueryFunction =
+(i) -> new TermQuery(new Term("layer[" + i + "]", "foo"));
+int depth = TestUtil.nextInt(random(), 10, 30);
+TermQuery tq = termQueryFunction.apply(depth);
+Query expectedQuery = new BooleanQuery.Builder().add(tq, 
Occur.FILTER).build();
+Query deepBuilder = new BooleanQuery.Builder().add(tq, Occur.MUST).build();
+for (int i = depth; i > 0; i--) {
+  tq = termQueryFunction.apply(i);
+  // Do this to accurately set setMinimumNumberShouldMatch to the number 
of should clauses.
+  // This makes setting expectation for rewrite much easier.
+  boolean useShoulds = random().nextBoolean();
+  BooleanQuery.Builder bq =
+  new BooleanQuery.Builder()
+  .setMinimumNumberShouldMatch(useShoulds ? 2 : 0)
+  .add(tq, useShoulds ? Occur.SHOULD : Occur.MUST)
+  .add(deepBuilder, useShoulds ? Occur.SHOULD : Occur.MUST);
+  deepBuilder = bq.build();
+  BooleanQuery.Builder expectedBq = new BooleanQuery.Builder().add(tq, 
Occur.FILTER);
+  if (i == depth - 1) {
+expectedBq.add(termQueryFunction.apply(depth), Occur.FILTER);
+  } else {
+expectedBq.add(expectedQuery, Occur.FILTER);
+  }
+  expectedQuery = expectedBq.build();
+}
+BooleanQuery bq = new BooleanQuery.Builder().add(deepBuilder, 
Occur.FILTER).build();
+expectedQuery = new BoostQuery(new ConstantScoreQuery(expectedQuery), 
0.0f);
+Query rewritten = searcher.rewrite(bq);
+r.close();
+dir.close();
+assertEquals(expectedQuery, rewritten);

Review Comment:
   I wonder if we could improve the test by e.g. having a special query 
implementation that would count the number of times that rewrite() gets called 
on the innermost query of this tree of queries, and assert that rewrite() has 
only been called once.



-- 
This is an automated message from the Apache Git Service.
To respond to th

[GitHub] [lucene] jpountz commented on issue #12074: Enhance XXXField#newRangeQuery

2023-01-11 Thread GitBox


jpountz commented on issue #12074:
URL: https://github.com/apache/lucene/issues/12074#issuecomment-1379151763

   +1


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] tang-hi commented on issue #11902: Customization of Edit distance costs for different operations

2023-01-11 Thread GitBox


tang-hi commented on issue #11902:
URL: https://github.com/apache/lucene/issues/11902#issuecomment-1379206167

   Lucene does not calculate the Levenshtein distance one by one. Instead, it 
precompiles the Levenshtein automaton based on your output, and then finds 
terms that meet the distance requirements. The state transitions of the 
Levenshtein automaton are also already hard-coded in the code.This is also why 
the maximum edit distance supported by Lucene is 2.
   I think it would be difficult to support custom distances with different 
operation costs, because this would involve reworking the code related to the 
Levenshtein automaton.
   
   If you're interested, you can check out the following resources.
   
[Blog](https://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jmazanec15 commented on pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-11 Thread GitBox


jmazanec15 commented on PR #12050:
URL: https://github.com/apache/lucene/pull/12050#issuecomment-1379247267

   @msokolov The main reason I did not do this was to avoid having to modify 
the ordering of the vectors from the MergedVectorValues. I believe that the 
ordinals in the graph map to the positioning in the the vector values, so they 
need to be synchronized. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on pull request #12072: Fix exponential runtime for Boolean#rewrite

2023-01-11 Thread GitBox


benwtrent commented on PR #12072:
URL: https://github.com/apache/lucene/pull/12072#issuecomment-1379274249

   @jpountz applied your suggestions
   
   @javanna I added a rewrite count check and split the should & must test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12072: Fix exponential runtime for Boolean#rewrite

2023-01-11 Thread GitBox


benwtrent commented on code in PR #12072:
URL: https://github.com/apache/lucene/pull/12072#discussion_r1067299935


##
lucene/core/src/test/org/apache/lucene/search/TestBooleanRewrites.java:
##
@@ -322,6 +323,45 @@ public void testMatchAllMustNot() throws IOException {
 assertEquals(new MatchNoDocsQuery(), searcher.rewrite(bq2));
   }
 
+  public void testDeeplyNestedBooleanRewrite() throws IOException {
+IndexSearcher searcher = newSearcher(new MultiReader());
+
+Directory dir = newDirectory();
+(new RandomIndexWriter(random(), dir)).close();
+IndexReader r = DirectoryReader.open(dir);
+Function termQueryFunction =
+(i) -> new TermQuery(new Term("layer[" + i + "]", "foo"));
+int depth = TestUtil.nextInt(random(), 10, 30);
+TermQuery tq = termQueryFunction.apply(depth);
+Query expectedQuery = new BooleanQuery.Builder().add(tq, 
Occur.FILTER).build();
+Query deepBuilder = new BooleanQuery.Builder().add(tq, Occur.MUST).build();
+for (int i = depth; i > 0; i--) {
+  tq = termQueryFunction.apply(i);
+  // Do this to accurately set setMinimumNumberShouldMatch to the number 
of should clauses.
+  // This makes setting expectation for rewrite much easier.
+  boolean useShoulds = random().nextBoolean();
+  BooleanQuery.Builder bq =
+  new BooleanQuery.Builder()
+  .setMinimumNumberShouldMatch(useShoulds ? 2 : 0)
+  .add(tq, useShoulds ? Occur.SHOULD : Occur.MUST)
+  .add(deepBuilder, useShoulds ? Occur.SHOULD : Occur.MUST);
+  deepBuilder = bq.build();
+  BooleanQuery.Builder expectedBq = new BooleanQuery.Builder().add(tq, 
Occur.FILTER);
+  if (i == depth - 1) {
+expectedBq.add(termQueryFunction.apply(depth), Occur.FILTER);
+  } else {
+expectedBq.add(expectedQuery, Occur.FILTER);
+  }
+  expectedQuery = expectedBq.build();
+}
+BooleanQuery bq = new BooleanQuery.Builder().add(deepBuilder, 
Occur.FILTER).build();
+expectedQuery = new BoostQuery(new ConstantScoreQuery(expectedQuery), 
0.0f);
+Query rewritten = searcher.rewrite(bq);
+r.close();
+dir.close();
+assertEquals(expectedQuery, rewritten);

Review Comment:
   added verification. It only rewrites once with `MUST`, but `SHOULD` adds 
some because of how `SHOULD`s change on rewrites.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on issue #12071: Can we better take advantage of compact strings?

2023-01-11 Thread GitBox


rmuir commented on issue #12071:
URL: https://github.com/apache/lucene/issues/12071#issuecomment-1379311821

   there seems to be some confusion, the purpose of unicodeutil is not to 
allocate String in the first place. It doesnt use String hence getBytes is not 
really relevant.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on issue #12071: Can we better take advantage of compact strings?

2023-01-11 Thread GitBox


rmuir commented on issue #12071:
URL: https://github.com/apache/lucene/issues/12071#issuecomment-1379313710

   Nor does it allocate stuff. that's a problem with String.getBytes is that it 
forces allocation too.
   
   Sorry, I don't see anything here. If you want to speed up UnicodeUtil 
conversions, pressure openjdk to release their vector api


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih merged pull request #12051: Fix wrong assertion in TestBooleanQuery.testQueryMatchesCount

2023-01-11 Thread GitBox


zhaih merged PR #12051:
URL: https://github.com/apache/lucene/pull/12051


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check

2023-01-11 Thread GitBox


gsmiller commented on PR #12073:
URL: https://github.com/apache/lucene/pull/12073#issuecomment-1379399579

   @jpountz:


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check

2023-01-11 Thread GitBox


gsmiller commented on PR #12073:
URL: https://github.com/apache/lucene/pull/12073#issuecomment-1379404470

   @jpountz:
   > understanding why we're seeing a speedup with this change
   Good question. I was not familiar with `ReqExclBulkScorer`, but after taking 
a bit of a look, I have the same question now. I'll have to spend more time 
looking at these benchmark tasks to understand what might be going on here. 
Thank you for pointing this out.
   
   @rmuir:
   Thanks for the feedback! You raise some good questions here. My intuition 
was that reducing the size of the first phase match set would be beneficial in 
general. The reasoning being that, 1) it reduces the number of "candidates" 
that other potential conjunction clauses need to evaluate, and 2) it reduces 
the number of candidates that need to go to a second-phase (particularly if 
there are 2nd phase checks occurring _before_ the exclusion check). But I hear 
your point about term queries being a best-case scenario here. If the exclusion 
clauses are more complex, it's more difficult to reason about.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] Brain2000 commented on issue #10458: Ordered intervals can give inaccurate hits on interleaved terms [LUCENE-9418]

2023-01-11 Thread GitBox


Brain2000 commented on issue #10458:
URL: https://github.com/apache/lucene/issues/10458#issuecomment-1379477576

   @romseygeek Two years later, I pinpointed the source of the issue above. It 
was getting worse where records were just being omitted without any rhyme or 
reason.
   
   It looks like there's a problem in the TopFieldCollector.java where it calls 
"compareBottom" without calling "setBottom" first.
   I believe this is an issue if getLeafComparer( ) is called more than once.
   
   I found a workaround is to rescope the variable storing the bottom slot set 
from calling "setBottom( )", to outside of the LeafFieldComparator class to 
make it persistent (just like the setTopValue( ) is called outside of the 
LeafFieldComparator class).
   
   Making the bottom slot number persistent *should* also be safe, since the 
documentation states that setBottom will always be called before compareBottom.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna commented on a diff in pull request #12072: Fix exponential runtime for Boolean#rewrite

2023-01-11 Thread GitBox


javanna commented on code in PR #12072:
URL: https://github.com/apache/lucene/pull/12072#discussion_r1067459267


##
lucene/core/src/test/org/apache/lucene/search/TestBooleanRewrites.java:
##
@@ -322,6 +323,45 @@ public void testMatchAllMustNot() throws IOException {
 assertEquals(new MatchNoDocsQuery(), searcher.rewrite(bq2));
   }
 
+  public void testDeeplyNestedBooleanRewrite() throws IOException {
+IndexSearcher searcher = newSearcher(new MultiReader());
+
+Directory dir = newDirectory();
+(new RandomIndexWriter(random(), dir)).close();
+IndexReader r = DirectoryReader.open(dir);
+Function termQueryFunction =
+(i) -> new TermQuery(new Term("layer[" + i + "]", "foo"));
+int depth = TestUtil.nextInt(random(), 10, 30);
+TermQuery tq = termQueryFunction.apply(depth);
+Query expectedQuery = new BooleanQuery.Builder().add(tq, 
Occur.FILTER).build();
+Query deepBuilder = new BooleanQuery.Builder().add(tq, Occur.MUST).build();
+for (int i = depth; i > 0; i--) {
+  tq = termQueryFunction.apply(i);
+  // Do this to accurately set setMinimumNumberShouldMatch to the number 
of should clauses.
+  // This makes setting expectation for rewrite much easier.
+  boolean useShoulds = random().nextBoolean();
+  BooleanQuery.Builder bq =
+  new BooleanQuery.Builder()
+  .setMinimumNumberShouldMatch(useShoulds ? 2 : 0)
+  .add(tq, useShoulds ? Occur.SHOULD : Occur.MUST)
+  .add(deepBuilder, useShoulds ? Occur.SHOULD : Occur.MUST);
+  deepBuilder = bq.build();
+  BooleanQuery.Builder expectedBq = new BooleanQuery.Builder().add(tq, 
Occur.FILTER);
+  if (i == depth - 1) {
+expectedBq.add(termQueryFunction.apply(depth), Occur.FILTER);
+  } else {
+expectedBq.add(expectedQuery, Occur.FILTER);
+  }
+  expectedQuery = expectedBq.build();
+}
+BooleanQuery bq = new BooleanQuery.Builder().add(deepBuilder, 
Occur.FILTER).build();
+expectedQuery = new BoostQuery(new ConstantScoreQuery(expectedQuery), 
0.0f);
+Query rewritten = searcher.rewrite(bq);
+r.close();
+dir.close();
+assertEquals(expectedQuery, rewritten);

Review Comment:
   Thanks, this is what I was looking for but I was not sure how to achieve :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna commented on a diff in pull request #12072: Fix exponential runtime for Boolean#rewrite

2023-01-11 Thread GitBox


javanna commented on code in PR #12072:
URL: https://github.com/apache/lucene/pull/12072#discussion_r1067468924


##
lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java:
##
@@ -203,9 +203,18 @@ BooleanQuery rewriteNoScoring(IndexSearcher indexSearcher) 
throws IOException {
 
 for (BooleanClause clause : clauses) {
   Query query = clause.getQuery();
-  Query rewritten = new ConstantScoreQuery(query).rewrite(indexSearcher);
-  if (rewritten instanceof ConstantScoreQuery) {
-rewritten = ((ConstantScoreQuery) rewritten).getQuery();
+  // NOTE: rewritingNoScoring() should not call rewrite(), otherwise this
+  // method could run in exponential time with the depth of the query as
+  // every new level would rewrite 2x more than its parent level.
+  Query rewritten = query;
+  if (query instanceof BoostQuery) {
+rewritten = ((BoostQuery) query).getQuery();
+  }
+  if (query instanceof ConstantScoreQuery) {
+rewritten = ((ConstantScoreQuery) query).getQuery();
+  }
+  if (query instanceof BooleanQuery) {
+rewritten = ((BooleanQuery) query).rewriteNoScoring(indexSearcher);

Review Comment:
   I believe that we can completely remove the check for boolean query (like 
Adrien suggested). If we unwrap what's in the boost query the rewriteNoScoring 
will be taken care of at the next round of rewrite if the inner query is a 
constant score query. 
   
   That allows us to remove the IndexSearcher argument from this method.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna commented on a diff in pull request #12072: Fix exponential runtime for Boolean#rewrite

2023-01-11 Thread GitBox


javanna commented on code in PR #12072:
URL: https://github.com/apache/lucene/pull/12072#discussion_r1067469765


##
lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java:
##
@@ -203,9 +203,18 @@ BooleanQuery rewriteNoScoring(IndexSearcher indexSearcher) 
throws IOException {
 
 for (BooleanClause clause : clauses) {
   Query query = clause.getQuery();
-  Query rewritten = new ConstantScoreQuery(query).rewrite(indexSearcher);
-  if (rewritten instanceof ConstantScoreQuery) {
-rewritten = ((ConstantScoreQuery) rewritten).getQuery();
+  // NOTE: rewritingNoScoring() should not call rewrite(), otherwise this
+  // method could run in exponential time with the depth of the query as
+  // every new level would rewrite 2x more than its parent level.
+  Query rewritten = query;
+  if (query instanceof BoostQuery) {
+rewritten = ((BoostQuery) query).getQuery();
+  }
+  if (query instanceof ConstantScoreQuery) {
+rewritten = ((ConstantScoreQuery) query).getQuery();
+  }
+  if (query instanceof BooleanQuery) {
+rewritten = ((BooleanQuery) query).rewriteNoScoring(indexSearcher);

Review Comment:
   I would make the same change to the main rewrite method. it simplifies 
things further quite a bit, even if fixing rewriteNoScoring is already fixing 
the exponential runtime issue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna commented on a diff in pull request #12072: Fix exponential runtime for Boolean#rewrite

2023-01-11 Thread GitBox


javanna commented on code in PR #12072:
URL: https://github.com/apache/lucene/pull/12072#discussion_r1067470681


##
lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java:
##
@@ -203,9 +203,17 @@ BooleanQuery rewriteNoScoring(IndexSearcher indexSearcher) 
throws IOException {
 
 for (BooleanClause clause : clauses) {
   Query query = clause.getQuery();
-  Query rewritten = new ConstantScoreQuery(query).rewrite(indexSearcher);
-  if (rewritten instanceof ConstantScoreQuery) {
-rewritten = ((ConstantScoreQuery) rewritten).getQuery();
+  Query rewritten;
+  // If the clause query is a boolean query, we shouldn't call 
ConstantScoreQuery#rewrite as
+  // this causes
+  // exponential growth of runtime.
+  if (query instanceof BooleanQuery booleanQuery) {
+rewritten = booleanQuery.rewriteNoScoring(indexSearcher);
+  } else {
+rewritten = new ConstantScoreQuery(query).rewrite(indexSearcher);
+if (rewritten instanceof ConstantScoreQuery constantScoreQuery) {
+  rewritten = constantScoreQuery.getQuery();
+}

Review Comment:
   It seems like what Adrien is suggested removes the need to artificially wrap 
which makes me happy :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12072: Fix exponential runtime for Boolean#rewrite

2023-01-11 Thread GitBox


benwtrent commented on code in PR #12072:
URL: https://github.com/apache/lucene/pull/12072#discussion_r1067472865


##
lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java:
##
@@ -203,9 +203,18 @@ BooleanQuery rewriteNoScoring(IndexSearcher indexSearcher) 
throws IOException {
 
 for (BooleanClause clause : clauses) {
   Query query = clause.getQuery();
-  Query rewritten = new ConstantScoreQuery(query).rewrite(indexSearcher);
-  if (rewritten instanceof ConstantScoreQuery) {
-rewritten = ((ConstantScoreQuery) rewritten).getQuery();
+  // NOTE: rewritingNoScoring() should not call rewrite(), otherwise this
+  // method could run in exponential time with the depth of the query as
+  // every new level would rewrite 2x more than its parent level.
+  Query rewritten = query;
+  if (query instanceof BoostQuery) {
+rewritten = ((BoostQuery) query).getQuery();
+  }
+  if (query instanceof ConstantScoreQuery) {
+rewritten = ((ConstantScoreQuery) query).getQuery();
+  }
+  if (query instanceof BooleanQuery) {
+rewritten = ((BooleanQuery) query).rewriteNoScoring(indexSearcher);

Review Comment:
   Without checking for `BooleanQuery`, the number of rewrites on the innermost 
query are no longer `1` in the `MUST` clause case. 
   
   Removing the recursive `rewriteNoScoring` here, makes the total number of 
rewrites on the innermost query to equal the depth in the `MUST` clause, and 
2*depth in the `SHOULD` clause. 
   
   @jpountz what say you?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna commented on a diff in pull request #12072: Fix exponential runtime for Boolean#rewrite

2023-01-11 Thread GitBox


javanna commented on code in PR #12072:
URL: https://github.com/apache/lucene/pull/12072#discussion_r1067473812


##
lucene/core/src/test/org/apache/lucene/search/TestBooleanRewrites.java:
##
@@ -322,6 +323,45 @@ public void testMatchAllMustNot() throws IOException {
 assertEquals(new MatchNoDocsQuery(), searcher.rewrite(bq2));
   }
 
+  public void testDeeplyNestedBooleanRewrite() throws IOException {
+IndexSearcher searcher = newSearcher(new MultiReader());
+
+Directory dir = newDirectory();
+(new RandomIndexWriter(random(), dir)).close();
+IndexReader r = DirectoryReader.open(dir);
+Function termQueryFunction =
+(i) -> new TermQuery(new Term("layer[" + i + "]", "foo"));
+int depth = TestUtil.nextInt(random(), 10, 30);
+TermQuery tq = termQueryFunction.apply(depth);
+Query expectedQuery = new BooleanQuery.Builder().add(tq, 
Occur.FILTER).build();

Review Comment:
   right, I see what you mean. we are good then.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12072: Fix exponential runtime for Boolean#rewrite

2023-01-11 Thread GitBox


benwtrent commented on code in PR #12072:
URL: https://github.com/apache/lucene/pull/12072#discussion_r1067491908


##
lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java:
##
@@ -203,9 +203,18 @@ BooleanQuery rewriteNoScoring(IndexSearcher indexSearcher) 
throws IOException {
 
 for (BooleanClause clause : clauses) {
   Query query = clause.getQuery();
-  Query rewritten = new ConstantScoreQuery(query).rewrite(indexSearcher);
-  if (rewritten instanceof ConstantScoreQuery) {
-rewritten = ((ConstantScoreQuery) rewritten).getQuery();
+  // NOTE: rewritingNoScoring() should not call rewrite(), otherwise this
+  // method could run in exponential time with the depth of the query as
+  // every new level would rewrite 2x more than its parent level.
+  Query rewritten = query;
+  if (query instanceof BoostQuery) {
+rewritten = ((BoostQuery) query).getQuery();
+  }
+  if (query instanceof ConstantScoreQuery) {
+rewritten = ((ConstantScoreQuery) query).getQuery();
+  }
+  if (query instanceof BooleanQuery) {
+rewritten = ((BooleanQuery) query).rewriteNoScoring(indexSearcher);

Review Comment:
   > I would make the same change to the main rewrite method. it simplifies 
things further quite a bit, even if fixing rewriteNoScoring is already fixing 
the exponential runtime issue.
   
   I am not sure where you would want to make this change? I suppose I am 
dense...
   
   I am guessing right here: 
https://github.com/apache/lucene/blob/7e0b142afeed676c4fc6dc40bbc7fcd277df5c34/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java#L287
   
   The whole point is to wrap that query as a ConstantScore query. I guess you 
don't want it wrapped if its boolean and short circuit that to be a 
rewriteNoScoring. I was trying something along these lines, but then the inner 
query never got rewritten in the `MUST` case...which was troubling.
   
   But, we can already remove the IndexSearch from the method, which I am doing 
now in a following commit.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna commented on a diff in pull request #12072: Fix exponential runtime for Boolean#rewrite

2023-01-11 Thread GitBox


javanna commented on code in PR #12072:
URL: https://github.com/apache/lucene/pull/12072#discussion_r1067505060


##
lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java:
##
@@ -203,9 +203,18 @@ BooleanQuery rewriteNoScoring(IndexSearcher indexSearcher) 
throws IOException {
 
 for (BooleanClause clause : clauses) {
   Query query = clause.getQuery();
-  Query rewritten = new ConstantScoreQuery(query).rewrite(indexSearcher);
-  if (rewritten instanceof ConstantScoreQuery) {
-rewritten = ((ConstantScoreQuery) rewritten).getQuery();
+  // NOTE: rewritingNoScoring() should not call rewrite(), otherwise this
+  // method could run in exponential time with the depth of the query as
+  // every new level would rewrite 2x more than its parent level.
+  Query rewritten = query;
+  if (query instanceof BoostQuery) {
+rewritten = ((BoostQuery) query).getQuery();
+  }
+  if (query instanceof ConstantScoreQuery) {
+rewritten = ((ConstantScoreQuery) query).getQuery();
+  }
+  if (query instanceof BooleanQuery) {
+rewritten = ((BooleanQuery) query).rewriteNoScoring(indexSearcher);

Review Comment:
   > I am guessing right here:
   
   yes that is what I had in mind. I was thinking the logic is exactly the same 
as what we have in rewriteNoScoring and we should be able to align the two. I 
have observed even less intermediate rewriting in my tests with that additional 
change. If it complicates things it's not required, I was assuming it does not.
   
   > But, we can already remove the IndexSearch from the method
   
   That's great. It basically enforces that we don't call rewrite (the note you 
added)  cause we no longer have a searcher to provide to it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] hossman opened a new issue, #12077: WordBreakSpellChecker.maxEvaluations usage in generateBreakUpSuggestions() makes no sense

2023-01-11 Thread GitBox


hossman opened a new issue, #12077:
URL: https://github.com/apache/lucene/issues/12077

   ### Description
   
   `WordBreakSpellChecker` has a `maxEvaluations` config option (default: 1000) 
which is suppose to be the "maximum number of word combinations to evaluate" 
but the way this setting is used in `generateBreakUpSuggestions()` makes no 
sense.
   
   The crux of this method is to iteratively loop over each character in the 
input to consider if splitting at that point in the string produces valid 
suggestion words on the left & right side of the split.  If the left side is a 
valid suggestion, then regardless of whether the right side was, recursively 
process the right side.
   
   As it's looping, it compares `maxEvaluations` to a `totalEvaluations` 
counter which is incremented on each iteration, as well as being passed down in 
each recursive call, breaking out of the loop if `totalEvaluations >= 
maxEvaluations`.
   
   But it *also* maintains a `thisTimeEvaluations` counter, which is only 
incremented based on the loop iterations, and this is what the method returns 
-- meaning that even if a nested recursive call exceeds `maxEvaluations`, the 
return value it gives back to it's caller is a much smaller number (bounded by 
the length of the string), and the caller will proceed to do more iterations 
(and more recursive calls)
   
   It's such a weird implementation choice (to explicitly maintain these two 
variables) that I'm sure my explanation above is hard to make sense of (it 
seems like it must have been intentional -- but i can't understand why?) and I 
confused myself 3 times trying to write it.
   
   The only way i found to really wrap my head around how "wrong" this is was 
by adding some sysout messages to the methods showing just how much iterating & 
recursing it does even when `maxEvaluations`.
   
   As a result of this behavior, it's actually possible for 
`WordBreakSpellChecker.suggestWordBreaks(...)` to return significantly more 
suggestions then the setting of `maxEvaluations`
   
   ---
   
   I'm attaching:
   * 
[WordBreakSpellChecker.nocommit.sysout.patch.txt](https://github.com/apache/lucene/files/10396330/WordBreakSpellChecker.nocommit.sysout.patch.txt)
 with the sysout logging i mentioned tp help visualize the recursion, and a 
trivial test case showing that even with `maxEvaluations=100` calling 
`suggestWordBreaks(...)` (on a 20 character input string) can generate 500+ 
suggested parsings.
   * 
[OUTPUT-org.apache.lucene.search.spell.TestWordBreakSpellChecker.txt](https://github.com/apache/lucene/files/10396366/OUTPUT-org.apache.lucene.search.spell.TestWordBreakSpellChecker.txt)
 with the output of the nocommit sysout println's from the above mentioned test 
-- which shows `generateSuggestWord()` (the method that looks up the `docFreq` 
to see if a "word" is valid) was called 35013 times!
   * 
[WordBreakSpellChecker.fix.patch.txt](https://github.com/apache/lucene/files/10396636/WordBreakSpellChecker.fix.patch.txt)
 with my suggested fix (and the same test which now passes)
   
   
   
   ### Version and environment details
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] hossman commented on issue #12077: WordBreakSpellChecker.maxEvaluations usage in generateBreakUpSuggestions() makes no sense

2023-01-11 Thread GitBox


hossman commented on issue #12077:
URL: https://github.com/apache/lucene/issues/12077#issuecomment-1379633112

   FWIW: It also seems strange to me that this method is essentially doing a 
"depth first" walk of the possible splits, given that it's working a character 
at a time and the only possible `BreakSuggestionSortMethod` values start with 
`NUM_CHANGES_THEN_...`.
   
   it seems like we could get "better" results, with lower values of 
`maxEvaluations`, and less recursion (even if `maxChanges` is very large) if 
the logic was something like:
   
   * init a BitSet the same length as our input
   * loop over each character postion (`i`)
 * break if `totalEvaluations >= maxEvaluations` otherwise increment 
`totalEvaluations`
 * if `leftWord` is "valid" suggestion, record `i` in our bitset
 * if `rightWord` is also a "valid" suggestion, offer this left+right combo 
to our `suggestions` queue
   * if `numberBreaks` has not yet exceeded `maxChanges`:
 * loop over each set bit (`i`) in our BitSet:
   * break if `totalEvaluations >= maxEvaluations` otherwise increment 
`totalEvaluations`
   * recursively parse the portion of our input to the "right" of `i` (in 
the context of a new prefix using the portion to the "left" of `i` and an 
incremented `numberBreaks`)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] LuXugang opened a new pull request, #12078: Enhance XXXField#newRangeQuery

2023-01-11 Thread GitBox


LuXugang opened a new pull request, #12078:
URL: https://github.com/apache/lucene/pull/12078

   Introduce `IndexSortSortedNumericDocValuesRangeQuery` to 
`IntFiled#newRangeQuery` and `LongField#newRangeQuery`.
   
   See more discussion https://github.com/apache/lucene/issues/12074 .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-11 Thread GitBox


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1067742826


##
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java:
##
@@ -94,36 +93,83 @@ public int size() {
   }
 
   /**
-   * Add node on the given level
+   * Add node on the given level. Nodes can be inserted out of order, but it 
requires that the nodes

Review Comment:
   That makes sense. This shouldnt be a big problem.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org