[jira] [Commented] (LUCENE-10078) Enable merge-on-refresh by default?

2022-06-13 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553416#comment-17553416
 ] 

ASF subversion and git services commented on LUCENE-10078:
--

Commit d850a22a511676989f29ce3ef011fcf56be71c17 in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d850a22a511 ]

LUCENE-10078: Fix TestIndexWriterExceptions' expectations regarding merges on 
full flushes.


> Enable merge-on-refresh by default?
> ---
>
> Key: LUCENE-10078
> URL: https://issues.apache.org/jira/browse/LUCENE-10078
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is a spinoff from the discussion in LUCENE-10073.
> The newish merge-on-refresh ([crazy origin 
> story|https://blog.mikemccandless.com/2021/03/open-source-collaboration-or-how-we.html])
>  feature is a powerful way to reduce searched segment counts, especially 
> helpful for applications using many indexing threads.  Such usage will write 
> many tiny segments on each refresh, which could quickly be merged up during 
> the {{refresh}} operation.
> We would have to implement a default for {{findFullFlushMerges}} 
> (LUCENE-10064 is open for this), and then we would need 
> {{IndexWriterConfig.getMaxFullFlushMergeWaitMillis}} to default to a non-zero 
> value (this issue).
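A sketch of the configuration this would change (hedged: {{analyzer}} and {{mergeOnRefreshPolicy}} are placeholders, the wait value is illustrative, and a merge policy implementing {{findFullFlushMerges}} is assumed):

```java
// Hedged sketch: opting in to merge-on-refresh explicitly today.
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
// Allow up to 100 ms for small merges triggered by a full flush (refresh).
iwc.setMaxFullFlushMergeWaitMillis(100);
// Requires a MergePolicy whose findFullFlushMerges actually returns merges;
// LUCENE-10064 tracks providing a default implementation.
iwc.setMergePolicy(mergeOnRefreshPolicy);
```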



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10078) Enable merge-on-refresh by default?

2022-06-13 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553417#comment-17553417
 ] 

ASF subversion and git services commented on LUCENE-10078:
--

Commit 376195e0e7bd9052d09b3d4ca8b824dc8dbd5ce9 in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=376195e0e7b ]

LUCENE-10078: Fix TestIndexWriterExceptions' expectations regarding merges on 
full flushes.


> Enable merge-on-refresh by default?
> ---
>
> Key: LUCENE-10078
> URL: https://issues.apache.org/jira/browse/LUCENE-10078
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is a spinoff from the discussion in LUCENE-10073.
> The newish merge-on-refresh ([crazy origin 
> story|https://blog.mikemccandless.com/2021/03/open-source-collaboration-or-how-we.html])
>  feature is a powerful way to reduce searched segment counts, especially 
> helpful for applications using many indexing threads.  Such usage will write 
> many tiny segments on each refresh, which could quickly be merged up during 
> the {{refresh}} operation.
> We would have to implement a default for {{findFullFlushMerges}} 
> (LUCENE-10064 is open for this), and then we would need 
> {{IndexWriterConfig.getMaxFullFlushMergeWaitMillis}} to default to a non-zero 
> value (this issue).






[jira] [Updated] (LUCENE-10611) KnnVectorQuery throwing Heap Error for Restrictive Filters

2022-06-13 Thread Kaival Parikh (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaival Parikh updated LUCENE-10611:
---
Description: 
The HNSW graph search does not consider that visitedLimit may be reached in the 
upper levels of graph search itself

This occurs when the pre-filter is too restrictive (and its count sets the 
visitedLimit). So instead of switching over to exactSearch, it tries to [pop 
from an empty 
heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90]
 and throws an error

 

To reproduce this error, we can increase the numDocs 
[here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500]
 to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached 
faster)

 

Stacktrace:
The heap is empty
java.lang.IllegalStateException: The heap is empty
at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
at 
org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
at 
org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
at 
org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
at 
org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:235)
at 
org.apache.lucene.search.KnnVectorQuery.approximateSearch(KnnVectorQuery.java:159)

  was:
The HNSW graph search does not consider that visitedLimit may be reached in the 
upper levels of graph search itself

This occurs when the pre-filter is too restrictive (and its count sets the 
visitedLimit). So instead of switching over to exactSearch, it tries to [pop 
from an empty 
heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90]
 and throws an error

 

To reproduce this error, we can increase the numDocs 
[here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500]
 to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached 
faster)

 

Stacktrace:
`The heap is empty
java.lang.IllegalStateException: The heap is empty
at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
at 
org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
at 
org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
at 
org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
at 
org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:235)
at 
org.apache.lucene.search.KnnVectorQuery.approximateSearch(KnnVectorQuery.java:159)`


> KnnVectorQuery throwing Heap Error for Restrictive Filters
> --
>
> Key: LUCENE-10611
> URL: https://issues.apache.org/jira/browse/LUCENE-10611
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Kaival Parikh
>Priority: Minor
>
> The HNSW graph search does not consider that visitedLimit may be reached in 
> the upper levels of graph search itself
> This occurs when the pre-filter is too restrictive (and its count sets the 
> visitedLimit). So instead of switching over to exactSearch, it tries to [pop 
> from an empty 
> heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90]
>  and throws an error
>  
> To reproduce this error, we can increase the numDocs 
> [here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500]
>  to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached 
> faster)
>  
> Stacktrace:
> The heap is empty
> java.lang.IllegalStateException: The heap is empty
> at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
> at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
> at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
> at 
> org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
> at 
> org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
> at 
> org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
> at 
> org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:235)
> at 
> org.apache.luc

[jira] [Created] (LUCENE-10611) KnnVectorQuery throwing Heap Error for Restrictive Filters

2022-06-13 Thread Kaival Parikh (Jira)
Kaival Parikh created LUCENE-10611:
--

 Summary: KnnVectorQuery throwing Heap Error for Restrictive Filters
 Key: LUCENE-10611
 URL: https://issues.apache.org/jira/browse/LUCENE-10611
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Kaival Parikh


The HNSW graph search does not consider that visitedLimit may be reached in the 
upper levels of graph search itself

This occurs when the pre-filter is too restrictive (and its count sets the 
visitedLimit). So instead of switching over to exactSearch, it tries to [pop 
from an empty 
heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90]
 and throws an error

 

To reproduce this error, we can increase the numDocs 
[here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500]
 to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached 
faster)

 

Stacktrace:
The heap is empty
java.lang.IllegalStateException: The heap is empty
at 
__randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
at 
org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
at 
org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
at 
org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
at 
org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:235)
at 
org.apache.lucene.search.KnnVectorQuery.approximateSearch(KnnVectorQuery.java:159)






[jira] [Updated] (LUCENE-10611) KnnVectorQuery throwing Heap Error for Restrictive Filters

2022-06-13 Thread Kaival Parikh (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaival Parikh updated LUCENE-10611:
---
Description: 
The HNSW graph search does not consider that visitedLimit may be reached in the 
upper levels of graph search itself

This occurs when the pre-filter is too restrictive (and its count sets the 
visitedLimit). So instead of switching over to exactSearch, it tries to [pop 
from an empty 
heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90]
 and throws an error

 

To reproduce this error, we can increase the numDocs 
[here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500]
 to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached 
faster)

 

Stacktrace:
`The heap is empty
java.lang.IllegalStateException: The heap is empty
at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
at 
org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
at 
org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
at 
org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
at 
org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:235)
at 
org.apache.lucene.search.KnnVectorQuery.approximateSearch(KnnVectorQuery.java:159)`

  was:
The HNSW graph search does not consider that visitedLimit may be reached in the 
upper levels of graph search itself

This occurs when the pre-filter is too restrictive (and its count sets the 
visitedLimit). So instead of switching over to exactSearch, it tries to [pop 
from an empty 
heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90]
 and throws an error

 

To reproduce this error, we can increase the numDocs 
[here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500]
 to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached 
faster)

 

Stacktrace:
The heap is empty
java.lang.IllegalStateException: The heap is empty
at 
__randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
at 
org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
at 
org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
at 
org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
at 
org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:235)
at 
org.apache.lucene.search.KnnVectorQuery.approximateSearch(KnnVectorQuery.java:159)


> KnnVectorQuery throwing Heap Error for Restrictive Filters
> --
>
> Key: LUCENE-10611
> URL: https://issues.apache.org/jira/browse/LUCENE-10611
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Kaival Parikh
>Priority: Minor
>
> The HNSW graph search does not consider that visitedLimit may be reached in 
> the upper levels of graph search itself
> This occurs when the pre-filter is too restrictive (and its count sets the 
> visitedLimit). So instead of switching over to exactSearch, it tries to [pop 
> from an empty 
> heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90]
>  and throws an error
>  
> To reproduce this error, we can increase the numDocs 
> [here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500]
>  to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached 
> faster)
>  
> Stacktrace:
> `The heap is empty
> java.lang.IllegalStateException: The heap is empty
> at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
> at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
> at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
> at 
> org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
> at 
> org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
> at 
> org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
> at 
> org.apache.lucene.index.CodecReader

[jira] [Commented] (LUCENE-10611) KnnVectorQuery throwing Heap Error for Restrictive Filters

2022-06-13 Thread Kaival Parikh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553467#comment-17553467
 ] 

Kaival Parikh commented on LUCENE-10611:


As a fix, we can check whether the results are incomplete after [this 
line|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L89],
 and return the results accordingly.
{code:java}
diff --git 
a/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java 
b/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
index b1a2436166f..7c641f077ee 100644
--- a/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
+++ b/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
@@ -87,6 +87,9 @@ public final class HnswGraphSearcher {
     int numVisited = 0;
     for (int level = graph.numLevels() - 1; level >= 1; level--) {
       results = graphSearcher.searchLevel(query, 1, level, eps, vectors, 
graph, null, visitedLimit);
+      if (results.incomplete()) {
+        return results;
+      }
       eps[0] = results.pop();
 
       numVisited += results.visitedCount();{code}
 

Alternatively, we could avoid enforcing the limit in the higher levels by 
passing Integer.MAX_VALUE as the limit 
[here|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L89]
 (and not updating the counts 
[here|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L92-L93]),
 but we might then end up visiting more nodes than desired
{code:java}
diff --git 
a/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java 
b/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
index b1a2436166f..0101cbd7690 100644
--- a/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
+++ b/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
@@ -86,11 +86,8 @@ public final class HnswGraphSearcher {
     int[] eps = new int[] {graph.entryNode()};
     int numVisited = 0;
     for (int level = graph.numLevels() - 1; level >= 1; level--) {
-      results = graphSearcher.searchLevel(query, 1, level, eps, vectors, 
graph, null, visitedLimit);
+      results = graphSearcher.searchLevel(query, 1, level, eps, vectors, 
graph, null, Integer.MAX_VALUE);
       eps[0] = results.pop();
-
-      numVisited += results.visitedCount();
-      visitedLimit -= results.visitedCount();
     }
     results =
         graphSearcher.searchLevel(query, topK, 0, eps, vectors, graph, 
acceptOrds, visitedLimit); {code}
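The failure mode both patches target can be reproduced in miniature without Lucene (all names below are invented for illustration): once the upper levels exhaust the shared visited budget, a level search comes back empty, and popping an entry point unconditionally is what blows up.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class VisitedLimitDemo {
  // Toy stand-in for searchLevel: visits up to 'budget' nodes and returns
  // candidates; returns an empty queue when the budget runs out before any
  // candidate is collected (the "incomplete" case).
  static Deque<Integer> searchLevel(int candidates, int budget) {
    Deque<Integer> results = new ArrayDeque<>();
    if (budget >= candidates) {
      for (int i = 0; i < candidates; i++) {
        results.push(i);
      }
    }
    return results;
  }

  static String search(int upperLevels, int visitedLimit) {
    int budget = visitedLimit;
    for (int level = upperLevels; level >= 1; level--) {
      Deque<Integer> results = searchLevel(5, budget);
      if (results.isEmpty()) {
        return "incomplete"; // the proposed guard: bail out instead of popping
      }
      results.pop();         // safe here: the queue is guaranteed non-empty
      budget -= 5;           // upper levels consume the shared budget
    }
    return "complete";
  }

  public static void main(String[] args) {
    // A restrictive pre-filter sets a small limit: upper levels exhaust it.
    System.out.println(search(3, 7));
    System.out.println(search(3, 100));
  }
}
```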

> KnnVectorQuery throwing Heap Error for Restrictive Filters
> --
>
> Key: LUCENE-10611
> URL: https://issues.apache.org/jira/browse/LUCENE-10611
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Kaival Parikh
>Priority: Minor
>
> The HNSW graph search does not consider that visitedLimit may be reached in 
> the upper levels of graph search itself
> This occurs when the pre-filter is too restrictive (and its count sets the 
> visitedLimit). So instead of switching over to exactSearch, it tries to [pop 
> from an empty 
> heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90]
>  and throws an error
>  
> To reproduce this error, we can increase the numDocs 
> [here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500]
>  to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached 
> faster)
>  
> Stacktrace:
> The heap is empty
> java.lang.IllegalStateException: The heap is empty
> at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
> at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
> at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
> at 
> org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
> at 
> org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
> at 
> org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
> at 
> org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:235)
> at 
> org.apache.lucene.search.KnnVectorQuery.approximateSearch(KnnVectorQuery.java:159)






[GitHub] [lucene] jpountz commented on a diff in pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection

2022-06-13 Thread GitBox


jpountz commented on code in PR #951:
URL: https://github.com/apache/lucene/pull/951#discussion_r895488795


##
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##
@@ -92,20 +91,40 @@ public KnnVectorQuery(String field, float[] target, int k, 
Query filter) {
   public Query rewrite(IndexReader reader) throws IOException {
 TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
 
-BitSetCollector filterCollector = null;
+Weight filterWeight = null;
 if (filter != null) {
-  filterCollector = new BitSetCollector(reader.leaves().size());
   IndexSearcher indexSearcher = new IndexSearcher(reader);
   BooleanQuery booleanQuery =
   new BooleanQuery.Builder()
   .add(filter, BooleanClause.Occur.FILTER)
   .add(new FieldExistsQuery(field), BooleanClause.Occur.FILTER)
   .build();
-  indexSearcher.search(booleanQuery, filterCollector);
+  Query rewritten = indexSearcher.rewrite(booleanQuery);
+  filterWeight = indexSearcher.createWeight(rewritten, 
ScoreMode.COMPLETE_NO_SCORES, 1f);
 }
 
 for (LeafReaderContext ctx : reader.leaves()) {
-  TopDocs results = searchLeaf(ctx, filterCollector);
+  Bits acceptDocs;
+  int cost;
+  if (filterWeight != null) {
+Scorer scorer = filterWeight.scorer(ctx);
+if (scorer != null) {
+  DocIdSetIterator iterator = scorer.iterator();
+  if (iterator instanceof BitSetIterator) {
+acceptDocs = ((BitSetIterator) iterator).getBitSet();
+  } else {
+acceptDocs = BitSet.of(iterator, ctx.reader().maxDoc());
+  }

Review Comment:
   Do we need to apply live docs here? `Scorer#iterator` returns an iterator 
over all matches, including deleted documents.
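The concern can be illustrated with `java.util.BitSet` as a stand-in (illustrative only, not the PR's code): taking `acceptDocs` straight from the filter iterator would keep deleted documents set unless it is intersected with live docs.

```java
import java.util.BitSet;

public class LiveDocsDemo {
  public static void main(String[] args) {
    BitSet filterMatches = new BitSet();
    filterMatches.set(0, 8);     // docs 0..7 match the filter query
    BitSet liveDocs = new BitSet();
    liveDocs.set(0, 8);
    liveDocs.clear(3);           // doc 3 has been deleted
    filterMatches.and(liveDocs); // acceptDocs must exclude deleted docs
    System.out.println(filterMatches.get(3));        // deleted doc excluded
    System.out.println(filterMatches.cardinality()); // 7 live matches remain
  }
}
```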



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10611) KnnVectorQuery throwing Heap Error for Restrictive Filters

2022-06-13 Thread Kaival Parikh (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaival Parikh updated LUCENE-10611:
---
Description: 
The HNSW graph search does not consider that visitedLimit may be reached in the 
upper levels of graph search itself

This occurs when the pre-filter is too restrictive (and its count sets the 
visitedLimit). So instead of switching over to exactSearch, it tries to [pop 
from an empty 
heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90]
 and throws an error

 

To reproduce this error, we can increase the numDocs 
[here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500]
 to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached 
faster)

 

Stacktrace:
{code:java}
The heap is empty
java.lang.IllegalStateException: The heap is empty
at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
at 
org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
at 
org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
at 
org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
at 
org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:235)
at 
org.apache.lucene.search.KnnVectorQuery.approximateSearch(KnnVectorQuery.java:159)
 {code}

  was:
The HNSW graph search does not consider that visitedLimit may be reached in the 
upper levels of graph search itself

This occurs when the pre-filter is too restrictive (and its count sets the 
visitedLimit). So instead of switching over to exactSearch, it tries to [pop 
from an empty 
heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90]
 and throws an error

 

To reproduce this error, we can increase the numDocs 
[here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500]
 to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached 
faster)

 

Stacktrace:
The heap is empty
java.lang.IllegalStateException: The heap is empty
at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
at 
org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
at 
org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
at 
org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
at 
org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:235)
at 
org.apache.lucene.search.KnnVectorQuery.approximateSearch(KnnVectorQuery.java:159)


> KnnVectorQuery throwing Heap Error for Restrictive Filters
> --
>
> Key: LUCENE-10611
> URL: https://issues.apache.org/jira/browse/LUCENE-10611
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Kaival Parikh
>Priority: Minor
>
> The HNSW graph search does not consider that visitedLimit may be reached in 
> the upper levels of graph search itself
> This occurs when the pre-filter is too restrictive (and its count sets the 
> visitedLimit). So instead of switching over to exactSearch, it tries to [pop 
> from an empty 
> heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90]
>  and throws an error
>  
> To reproduce this error, we can increase the numDocs 
> [here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500]
>  to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached 
> faster)
>  
> Stacktrace:
> {code:java}
> The heap is empty
> java.lang.IllegalStateException: The heap is empty
> at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
> at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
> at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
> at 
> org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
> at 
> org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
> at 
> org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
> at 
> org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.

[GitHub] [lucene] jpountz commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…

2022-06-13 Thread GitBox


jpountz commented on PR #954:
URL: https://github.com/apache/lucene/pull/954#issuecomment-1153675730

   @gsmiller I wonder if you could also test whether there is a speed-up if we 
removed the checks the codec has to do to make sure it returns 
`NO_MORE_ORDS` when values for a doc are exhausted. E.g. 
`Lucene90DocValuesProducer#getSortedSet` looks like this today:
   
   ```java
 @Override
 public SortedSetDocValues getSortedSet(FieldInfo field) throws IOException 
{
   SortedSetEntry entry = sortedSets.get(field.name);
   if (entry.singleValueEntry != null) {
 return DocValues.singleton(getSorted(entry.singleValueEntry));
   }
   
   final SortedNumericDocValues ords = getSortedNumeric(entry.ordsEntry);
   return new BaseSortedSetDocValues(entry, data) {
   
 int i = 0;
 int count = 0;
 boolean set = false;
   
 @Override
 public long nextOrd() throws IOException {
   if (set == false) {
 set = true;
 i = 0;
 count = ords.docValueCount();
   }
   if (i++ == count) {
 return NO_MORE_ORDS;
   }
   return ords.nextValue();
 }
   
 @Override
 public long docValueCount() {
   return ords.docValueCount();
 }
   
 @Override
 public boolean advanceExact(int target) throws IOException {
   set = false;
   return ords.advanceExact(target);
 }
   
 @Override
 public int docID() {
   return ords.docID();
 }
   
 @Override
 public int nextDoc() throws IOException {
   set = false;
   return ords.nextDoc();
 }
   
 @Override
 public int advance(int target) throws IOException {
   set = false;
   return ords.advance(target);
 }
   
 @Override
 public long cost() {
   return ords.cost();
 }
   };
 }
   ```
   
   but if we moved everything to this new iteration model, we wouldn't have to 
check whether the caller is visiting more values than expected; that would 
simply be undefined behavior, so we could remove `i`, `count`, and `set`, and 
`nextOrd()` could delegate directly to `ords.nextValue()`.
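A toy contrast of the two iteration models (plain Java, no Lucene dependency; `Ords` is an invented stand-in for `SortedNumericDocValues` over one document's ordinals):

```java
public class NextOrdDemo {
  static final long NO_MORE_ORDS = -1;

  // Invented stand-in for the per-document ordinal stream.
  static class Ords {
    final long[] values;
    int pos = 0;
    Ords(long[] v) { values = v; }
    int docValueCount() { return values.length; }
    long nextValue() { return values[pos++]; }
  }

  // Today's guarded model: tracks how many values were read so a caller
  // that over-iterates gets a sentinel instead of undefined behavior.
  static long guardedNextOrd(Ords ords, int[] state) {
    if (state[0]++ == ords.docValueCount()) {
      return NO_MORE_ORDS;
    }
    return ords.nextValue();
  }

  // Proposed model: delegate directly; callers read exactly
  // docValueCount() values, so the bookkeeping disappears.
  static long directNextOrd(Ords ords) {
    return ords.nextValue();
  }

  public static void main(String[] args) {
    Ords a = new Ords(new long[] {3, 7});
    int[] state = new int[] {0};
    System.out.println(guardedNextOrd(a, state)); // 3
    System.out.println(guardedNextOrd(a, state)); // 7
    System.out.println(guardedNextOrd(a, state)); // -1 (NO_MORE_ORDS)

    Ords b = new Ords(new long[] {3, 7});
    long sum = 0;
    for (int i = 0; i < b.docValueCount(); i++) {
      sum += directNextOrd(b);
    }
    System.out.println(sum); // 10
  }
}
```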





[GitHub] [lucene] jpountz commented on a diff in pull request #950: LUCENE-10608: Implement Weight#count on pure conjunctions.

2022-06-13 Thread GitBox


jpountz commented on code in PR #950:
URL: https://github.com/apache/lucene/pull/950#discussion_r895526857


##
lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java:
##
@@ -344,6 +344,45 @@ public BulkScorer bulkScorer(LeafReaderContext context) 
throws IOException {
 }
   }
 
+  @Override
+  public int count(LeafReaderContext context) throws IOException {
+// Implement counting for pure conjunctions in the case when one clause 
doesn't match any docs,
+// or all clauses but one match all docs.
+if (weightedClauses.isEmpty()) {
+  return 0;
+}
+for (WeightedBooleanClause weightedClause : weightedClauses) {
+  switch (weightedClause.clause.getOccur()) {
+case FILTER:
+case MUST:
+  break;
+case MUST_NOT:

Review Comment:
   You are right. I think there are a few more cases we could optimize, like 
pure disjunctions of term queries on a single-valued field where we could just 
sum up counts. I was thinking of introducing these optimizations one at a time 
to keep changes easy to review and test.
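The single-valued-field disjunction idea can be sketched without Lucene (the data below is invented; a real implementation would read per-term counts from the terms dictionary): on a single-valued field each document holds exactly one term, so term postings are disjoint and the disjunction's count is the sum of the clauses' counts.

```java
import java.util.List;
import java.util.Map;

public class DisjunctionCountDemo {
  // Toy per-term doc counts on a single-valued field. Because postings are
  // disjoint, a pure disjunction never double-counts a document.
  static int count(Map<String, Integer> termDocCounts, List<String> terms) {
    return terms.stream()
        .mapToInt(t -> termDocCounts.getOrDefault(t, 0))
        .sum();
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = Map.of("red", 4, "green", 2, "blue", 5);
    System.out.println(count(counts, List.of("red", "blue"))); // 9
  }
}
```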






[GitHub] [lucene] jpountz commented on a diff in pull request #950: LUCENE-10608: Implement Weight#count on pure conjunctions.

2022-06-13 Thread GitBox


jpountz commented on code in PR #950:
URL: https://github.com/apache/lucene/pull/950#discussion_r895527360


##
lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java:
##
@@ -344,6 +344,45 @@ public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
     }
   }
 
+  @Override
+  public int count(LeafReaderContext context) throws IOException {
+    // Implement counting for pure conjunctions in the case when one clause doesn't match any docs,
+    // or all clauses but one match all docs.
+    if (weightedClauses.isEmpty()) {
+      return 0;
+    }
+    for (WeightedBooleanClause weightedClause : weightedClauses) {
+      switch (weightedClause.clause.getOccur()) {
+        case FILTER:
+        case MUST:
+          break;
+        case MUST_NOT:
+        case SHOULD:

Review Comment:
   It should already be handled by the fact that `BooleanQuery#rewrite` 
rewrites single-clause queries.






[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-06-13 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553494#comment-17553494
 ] 

Adrien Grand commented on LUCENE-10480:
---

Good question. Looking at your BlockMaxMaxScoreScorer, it looks like it also has 
potential for being specialized in the 2-clauses case by having two sub scorers 
and tracking during document collection whether the scorer that produces lower 
scores is optional or required. I didn't have concrete plans in mind when 
opening the issue, I was just observing that we pay significant overhead for 
supporting arbitrary numbers of clauses when disjunctions often have only two 
clauses.
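Stripped of block-max scoring, the 2-clause specialization the issue alludes to can be sketched as a toy model over two sorted doc-id arrays -- not WANDScorer, just an illustration of why exactly two clauses need no linked list or priority queues:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of a 2-clause disjunction: with exactly two sorted doc-id
 * streams, the next matching doc is simply the min of the two current docs,
 * and only the stream(s) positioned on that doc advance. Illustrative only.
 */
public class TwoClauseDisjunction {
  public static List<Integer> matchingDocs(int[] a, int[] b) {
    List<Integer> out = new ArrayList<>();
    int i = 0, j = 0;
    while (i < a.length || j < b.length) {
      int da = i < a.length ? a[i] : Integer.MAX_VALUE; // exhausted = NO_MORE_DOCS
      int db = j < b.length ? b[j] : Integer.MAX_VALUE;
      int doc = Math.min(da, db);
      out.add(doc); // collect(doc); in a real scorer both clauses could contribute score here
      if (da == doc) i++; // advance whichever stream(s) sit on this doc
      if (db == doc) j++;
    }
    return out;
  }
}
```

A real specialization would additionally track which of the two scorers produces lower max scores, as discussed above, but the doc-id bookkeeping stays this simple.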

> Specialize 2-clauses disjunctions
> -
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>
> WANDScorer is nice, but it also has lots of overhead to maintain its 
> invariants: one linked list for the current candidates, one priority queue of 
> scorers that are behind, another one for scorers that are ahead. All this 
> could be simplified in the 2-clauses case, which feels worth specializing for 
> as it's very common that end users enter queries that only have two terms?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10612) Add parameters for HNSW codec in Lucene93Codec

2022-06-13 Thread Elia Porciani (Jira)
Elia Porciani created LUCENE-10612:
--

 Summary: Add parameters for HNSW codec in Lucene93Codec
 Key: LUCENE-10612
 URL: https://issues.apache.org/jira/browse/LUCENE-10612
 Project: Lucene - Core
  Issue Type: Task
  Components: core/codecs
Reporter: Elia Porciani


Currently, it is possible to specify only the compression mode for stored 
fields in the LuceneXXCodec constructors.

With the introduction of HNSW graph, and the LuceneXXHnswCodecFormat, 
LuceneXXCodec should provide an easy way to specify custom parameters for HNSW 
graph layout:

* maxConn
* beamWidth







[jira] [Commented] (LUCENE-10612) Add parameters for HNSW codec in Lucene93Codec

2022-06-13 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553499#comment-17553499
 ] 

Adrien Grand commented on LUCENE-10612:
---

We have been rejecting such requests in the past due to the impact it has on 
backward compatibility, as the default codec has strong backward compatibility 
guarantees, and we need to make sure that the compatibility guarantees hold for 
every combination of options.

Stored fields are indeed an exception because it was hard to come up with 
values that would work well enough for everyone. But it was done in a way that 
has a very small surface, e.g. it doesn't expose the algorithm that is used 
under the hood or the size of blocks, or the DEFLATE compression level, it's 
only two options with opaque implementation details. On the other hand maxConn 
and beamWidth are specific implementation details of HNSW that can take a large 
range of values. And even with only two possible options, we still set the bar 
pretty high for configurability of the default codec, e.g. there was an option 
for doc values at some point that we ended up removing.

Would it work for you to override `Lucene93Codec#getKnnVectorsFormatForField`? 
The caveat is that it is customizing file formats, so it puts you on your own 
regarding backward compatibility.
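The suggested override can be modeled without any Lucene dependency. In the sketch below, the class and method names only mirror the real API (`Lucene93Codec#getKnnVectorsFormatForField` returning a parameterized HNSW format); everything here is a stand-in, with 16/100 matching Lucene's documented maxConn/beamWidth defaults as of 9.x:

```java
/**
 * Dependency-free sketch of the per-field override pattern suggested above.
 * Names mirror the Lucene API, but this is a toy model, not Lucene code.
 */
public class CustomKnnCodecSketch {
  /** Stand-in for an HNSW vectors format configured with graph parameters. */
  public record KnnVectorsFormat(int maxConn, int beamWidth) {}

  public static class DefaultCodec {
    // The defaults the codec ships with (16/100 in real Lucene 9.x).
    public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
      return new KnnVectorsFormat(16, 100);
    }
  }

  /** Overriding the hook opts the user out of the default codec's bwc guarantees. */
  public static class TunedCodec extends DefaultCodec {
    @Override
    public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
      return new KnnVectorsFormat(32, 200); // custom parameters, caller's responsibility
    }
  }
}
```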

> Add parameters for HNSW codec in Lucene93Codec
> --
>
> Key: LUCENE-10612
> URL: https://issues.apache.org/jira/browse/LUCENE-10612
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs
>Reporter: Elia Porciani
>Priority: Major
>
> Currently, it is possible to specify only the compression mode for stored 
> fields in the LuceneXXCodec constructors.
> With the introduction of HNSW graph, and the LuceneXXHnswCodecFormat, 
> LuceneXXCodec should provide an easy way to specify custom parameters for 
> HNSW graph layout:
> * maxConn
> * beamWidth






[jira] [Commented] (LUCENE-10612) Add parameters for HNSW codec in Lucene93Codec

2022-06-13 Thread Elia Porciani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553518#comment-17553518
 ] 

Elia Porciani commented on LUCENE-10612:


Actually, the change I'm proposing is to make it possible to specify the 
parameters for HNSW without needing to know which HNSW codec is used 
underneath.

For instance, in Solr this is done in the way you mentioned, but there is an 
explicit call to *Lucene91HnswVectorsFormat*, and for this reason Solr 
cannot be agnostic about the codec version used in Lucene for HNSW.


 

> Add parameters for HNSW codec in Lucene93Codec
> --
>
> Key: LUCENE-10612
> URL: https://issues.apache.org/jira/browse/LUCENE-10612
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs
>Reporter: Elia Porciani
>Priority: Major
>
> Currently, it is possible to specify only the compression mode for stored 
> fields in the LuceneXXCodec constructors.
> With the introduction of HNSW graph, and the LuceneXXHnswCodecFormat, 
> LuceneXXCodec should provide an easy way to specify custom parameters for 
> HNSW graph layout:
> * maxConn
> * beamWidth






[jira] [Commented] (LUCENE-8193) Deprecate LowercaseTokenizer

2022-06-13 Thread Andras Salamon (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553541#comment-17553541
 ] 

Andras Salamon commented on LUCENE-8193:


This looks like a duplicate of LUCENE-8498

> Deprecate LowercaseTokenizer
> 
>
> Key: LUCENE-8193
> URL: https://issues.apache.org/jira/browse/LUCENE-8193
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/analysis
>Reporter: Tim Allison
>Priority: Minor
>
> On LUCENE-8186, discussion favored deprecating and eventually removing 
> LowercaseTokenizer.






[GitHub] [lucene] eliaporciani opened a new pull request, #955: LUCENE-10612: Introduced Lucene93CodecParameters for Lucene93Codec

2022-06-13 Thread GitBox


eliaporciani opened a new pull request, #955:
URL: https://github.com/apache/lucene/pull/955

   
   https://issues.apache.org/jira/browse/LUCENE-10612
   
   # Description
Lucene93Codec should provide a way to supply custom parameters to 
HnswVectorsFormat.
   
   # Solution
   To pass the various parameters to Lucene93Codec, I wrap them up in a 
Lucene93CodecParameters class. This should provide a cleaner and easier way to 
pass custom parameters.





[jira] [Commented] (LUCENE-10612) Add parameters for HNSW codec in Lucene93Codec

2022-06-13 Thread Elia Porciani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553593#comment-17553593
 ] 

Elia Porciani commented on LUCENE-10612:


However, I understand the concern about backward compatibility. I don't think 
it is harmful to have custom HNSW parameters at this time, but things might be 
different in future releases.

Even if we decide not to move forward, I have created this PR for making the 
proposal clearer: [https://github.com/apache/lucene/pull/955.]

> Add parameters for HNSW codec in Lucene93Codec
> --
>
> Key: LUCENE-10612
> URL: https://issues.apache.org/jira/browse/LUCENE-10612
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs
>Reporter: Elia Porciani
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, it is possible to specify only the compression mode for stored 
> fields in the LuceneXXCodec constructors.
> With the introduction of HNSW graph, and the LuceneXXHnswCodecFormat, 
> LuceneXXCodec should provide an easy way to specify custom parameters for 
> HNSW graph layout:
> * maxConn
> * beamWidth






[jira] [Comment Edited] (LUCENE-10612) Add parameters for HNSW codec in Lucene93Codec

2022-06-13 Thread Elia Porciani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553593#comment-17553593
 ] 

Elia Porciani edited comment on LUCENE-10612 at 6/13/22 1:54 PM:
-

However, I understand the concern about backward compatibility. I don't think 
at this time is harmful to have custom HNSW parameters but things might be 
different in future releases.

Even if we decide not to move forward, I have created this PR for making the 
proposal clearer: 
[https://github.com/apache/lucene/pull/955|https://github.com/apache/lucene/pull/955.]


was (Author: JIRAUSER280197):
However, I understand the concern about backward compatibility. I don't think 
at this time is harmful to have custom HNSW parameters but things might be 
different in future releases.

Even if we decide not to move forward, I have created this PR for making the 
proposal clearer: [https://github.com/apache/lucene/pull/955.]

> Add parameters for HNSW codec in Lucene93Codec
> --
>
> Key: LUCENE-10612
> URL: https://issues.apache.org/jira/browse/LUCENE-10612
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs
>Reporter: Elia Porciani
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, it is possible to specify only the compression mode for stored 
> fields in the LuceneXXCodec constructors.
> With the introduction of HNSW graph, and the LuceneXXHnswCodecFormat, 
> LuceneXXCodec should provide an easy way to specify custom parameters for 
> HNSW graph layout:
> * maxConn
> * beamWidth






[jira] [Comment Edited] (LUCENE-10612) Add parameters for HNSW codec in Lucene93Codec

2022-06-13 Thread Elia Porciani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553593#comment-17553593
 ] 

Elia Porciani edited comment on LUCENE-10612 at 6/13/22 1:54 PM:
-

However, I understand the concern about backward compatibility. I don't think 
at this time is harmful to have custom HNSW parameters but things might be 
different in future releases.

Even if we decide not to move forward, I have created this PR for making the 
proposal clearer: [https://github.com/apache/lucene/pull/955.]


was (Author: JIRAUSER280197):
However, I understand the concern about backward compatibility. I don't think 
at this time is harmful to have custom HNSW parameters but things might be 
different in future releases.

Even if we decide not to move forward, I have created this PR for making the 
proposal clearer: [https://github.com/apache/lucene/pull/955.]

> Add parameters for HNSW codec in Lucene93Codec
> --
>
> Key: LUCENE-10612
> URL: https://issues.apache.org/jira/browse/LUCENE-10612
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs
>Reporter: Elia Porciani
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, it is possible to specify only the compression mode for stored 
> fields in the LuceneXXCodec constructors.
> With the introduction of HNSW graph, and the LuceneXXHnswCodecFormat, 
> LuceneXXCodec should provide an easy way to specify custom parameters for 
> HNSW graph layout:
> * maxConn
> * beamWidth






[GitHub] [lucene-solr] anshumg opened a new pull request, #2663: SOLR-16218: Fix bug in in-place update when failOnVersionConflicts=false

2022-06-13 Thread GitBox


anshumg opened a new pull request, #2663:
URL: https://github.com/apache/lucene-solr/pull/2663

   Added more people to CHANGES to include folks who contributed to reviewing 
this fix. Will update the CHANGES in main and 9x too.





[GitHub] [lucene] msokolov commented on a diff in pull request #927: LUCENE-10151: Adding Timeout Support to IndexSearcher

2022-06-13 Thread GitBox


msokolov commented on code in PR #927:
URL: https://github.com/apache/lucene/pull/927#discussion_r895817658


##
build.gradle:
##
@@ -183,3 +183,5 @@ apply from: file('gradle/hacks/turbocharge-jvm-opts.gradle')
 apply from: file('gradle/hacks/dummy-outputs.gradle')
 
 apply from: file('gradle/pylucene/pylucene.gradle')
+sourceCompatibility = JavaVersion.VERSION_17

Review Comment:
   why did we need to add this?



##
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##
@@ -532,6 +536,11 @@ public TopDocs reduce(Collection collectors) throws IOExce
 return search(query, manager);
   }
 
+  public void setTimeout(boolean isTimeoutEnabled, QueryTimeout queryTimeout) throws IOException {

Review Comment:
   Could we use `queryTimeout==null` or a sentinel value `QueryTime.NONE` to 
indicate no timeout is enabled? It would save a redundant parameter and member 
variable. Actually I see QueryTimeout has a timeoutEnabled() method, so could 
we define NONE to return false and just check that in our branches instead of 
this separate boolean flag?
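The sentinel idea in this review comment can be sketched without Lucene: define a `NONE` instance whose `isTimeoutEnabled()` returns false, and branch on it instead of carrying a separate boolean flag. This is a hypothetical model, not the actual `org.apache.lucene.index.QueryTimeout`:

```java
/** Toy model of the QueryTimeout.NONE sentinel suggested above (not Lucene code). */
public class TimeoutSentinel {
  public interface QueryTimeout {
    boolean shouldExit();
    boolean isTimeoutEnabled();
  }

  /** Sentinel meaning "no timeout": never exits, reports itself disabled. */
  public static final QueryTimeout NONE = new QueryTimeout() {
    @Override public boolean shouldExit() { return false; }
    @Override public boolean isTimeoutEnabled() { return false; }
  };

  // The caller branches on the sentinel rather than on an extra boolean parameter.
  public static String chooseScorer(QueryTimeout timeout) {
    return timeout.isTimeoutEnabled() ? "TimeLimitingBulkScorer" : "BulkScorer";
  }
}
```

With this shape, `setTimeout(QueryTimeout)` needs only one argument, and `NONE` (or `null` normalized to `NONE`) encodes "timeouts disabled".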



##
lucene/core/src/java/org/apache/lucene/search/TimeLimitingBulkScorer.java:
##
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import org.apache.lucene.index.QueryTimeout;
+import org.apache.lucene.util.Bits;
+
+/**
+ * The {@link TimeLimitingCollector} is used to timeout search requests that take longer than the
+ * maximum allowed search time limit. After this time is exceeded, the search thread is stopped by
+ * throwing a {@link TimeLimitingCollector.TimeExceededException}.
+ *
+ * @see org.apache.lucene.index.ExitableDirectoryReader
+ */
+public class TimeLimitingBulkScorer extends BulkScorer {
+
+  static final int INTERVAL = 100;

Review Comment:
   please add a comment for this constant - what is it used for? Actually we 
should describe the algorithm here; namely that we score chunks of documents at 
a time so as to avoid the cost of checking the timeout for every document we 
score
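The chunking strategy described in this comment can be sketched independently of Lucene -- check the timeout once per `INTERVAL` documents rather than per document. This is an illustrative model, not the actual `TimeLimitingBulkScorer`:

```java
import java.util.function.BooleanSupplier;

/**
 * Sketch of chunked timeout checking: score INTERVAL docs at a time and
 * consult the (possibly expensive) timeout only between chunks.
 * Illustrative only, not the code under review.
 */
public class ChunkedScoring {
  static final int INTERVAL = 100; // docs scored between two timeout checks

  /** Returns how many docs were scored before the timeout tripped. */
  public static int score(int maxDoc, BooleanSupplier shouldExit) {
    int doc = 0;
    while (doc < maxDoc) {
      if (shouldExit.getAsBoolean()) {
        break; // one check per chunk, not per document
      }
      int upTo = Math.min(doc + INTERVAL, maxDoc);
      for (; doc < upTo; doc++) {
        // collect(doc) would go here
      }
    }
    return doc;
  }
}
```

The trade-off is granularity: a timeout can overshoot by up to `INTERVAL - 1` documents in exchange for far fewer clock reads.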



##
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##
@@ -766,18 +778,29 @@ protected void search(List<LeafReaderContext> leaves, Weight weight, Collector c
       }
       BulkScorer scorer = weight.bulkScorer(ctx);
       if (scorer != null) {
-        try {
-          scorer.score(leafCollector, ctx.reader().getLiveDocs());
-        } catch (
-            @SuppressWarnings("unused")
-            CollectionTerminatedException e) {
-          // collection was terminated prematurely
-          // continue with the following leaf
+        if (isTimeoutEnabled) {
+          TimeLimitingBulkScorer timeLimitingBulkScorer =
+              new TimeLimitingBulkScorer(scorer, queryTimeout);
+          try {
+            timeLimitingBulkScorer.score(leafCollector, ctx.reader().getLiveDocs());
+          } catch (
+              @SuppressWarnings("unused")
+              TimeLimitingBulkScorer.TimeExceededException e) {
+            partialResult = true;

Review Comment:
   I wonder if we should use this as a way to provide some information to the 
caller, for example how much time elapsed when the timeout occurred? The 
exception could pass that back? On the other hand, then every QueryTimeout 
might have to track that, and for some of them (eg counting-based) the time 
isn't really the most important dimension.



##
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##
@@ -555,6 +564,9 @@ public void search(Query query, Collector results) throws IOException {
 search(leafContexts, createWeight(query, results.scoreMode(), 1), results);
   }
 
+  public boolean isAborted() {

Review Comment:
   How about `timedOut()` ? It will be more symmetric with the 
methods/variables using timeout in their names.



##
lucene/core/src/java/org/apache/lucene/index/ExitableDirectoryReader.java:
##
@@ -82,8 +81,8 @@ public PointValues getPointValues(String field) throws 
IOException {
 return null;
   }
   return (queryTimeout.isTimeoutEnabled())
-  ? new ExitablePointValues(pointValues, queryTim

[GitHub] [lucene-solr] anshumg merged pull request #2663: SOLR-16218: Fix bug in in-place update when failOnVersionConflicts=false

2022-06-13 Thread GitBox


anshumg merged PR #2663:
URL: https://github.com/apache/lucene-solr/pull/2663





[jira] [Commented] (LUCENE-10611) KnnVectorQuery throwing Heap Error for Restrictive Filters

2022-06-13 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553675#comment-17553675
 ] 

Julie Tibshirani commented on LUCENE-10611:
---

Thanks for catching this [~kaivalnp]! Your first suggestion makes sense to me. 
Would you like to open a PR with the fix (plus a test like the one you 
mentioned?) 

> KnnVectorQuery throwing Heap Error for Restrictive Filters
> --
>
> Key: LUCENE-10611
> URL: https://issues.apache.org/jira/browse/LUCENE-10611
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Kaival Parikh
>Priority: Minor
>
> The HNSW graph search does not consider that visitedLimit may be reached in 
> the upper levels of graph search itself
> This occurs when the pre-filter is too restrictive (and its count sets the 
> visitedLimit). So instead of switching over to exactSearch, it tries to [pop 
> from an empty 
> heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90]
>  and throws an error
>  
> To reproduce this error, we can +increase the numDocs 
> [here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500]
>  to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached 
> faster)
>  
> Stacktrace:
> {code:java}
> The heap is empty
> java.lang.IllegalStateException: The heap is empty
> at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
> at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
> at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
> at 
> org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
> at 
> org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
> at 
> org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
> at 
> org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:235)
> at 
> org.apache.lucene.search.KnnVectorQuery.approximateSearch(KnnVectorQuery.java:159)
>  {code}
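The fix direction suggested in the issue above -- report incomplete approximate results instead of popping an empty heap, and let the caller fall back to exact search -- can be sketched with a dependency-free model. All names and the 1-D "nearest value" stand-in are illustrative, not the KnnVectorQuery/HnswGraphSearcher code:

```java
/**
 * Toy model of the approximate-search-with-budget fallback: the approximate
 * path signals "visited limit hit" by returning null, and the caller falls
 * back to an exact scan rather than failing. Illustrative only.
 */
public class KnnFallbackSketch {
  /** Exact scan over all data: always correct, cost O(n). */
  public static int exactNearest(int[] data, int query) {
    int best = data[0];
    for (int v : data) {
      if (Math.abs(v - query) < Math.abs(best - query)) {
        best = v;
      }
    }
    return best;
  }

  /** "Approximate" stand-in: returns null when the visit budget is exhausted. */
  public static Integer approximateNearest(int[] data, int query, int visitedLimit) {
    if (data.length > visitedLimit) {
      return null; // budget too small: results incomplete, do NOT pop a possibly-empty heap
    }
    return exactNearest(data, query);
  }

  /** Caller-side fallback, mirroring the suggested fix. */
  public static int search(int[] data, int query, int visitedLimit) {
    Integer approx = approximateNearest(data, query, visitedLimit);
    return approx != null ? approx : exactNearest(data, query);
  }
}
```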






[GitHub] [lucene] gsmiller commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-13 Thread GitBox


gsmiller commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1154142817

   > Anyway, let's benchmark it, but with the analysis above, I also agree we 
should actually start with the long[] API, and replace it with a byte[] one 
only if actually performs better.
   
   +1 to starting with `long[]` and then benchmarking a `byte[]` version when 
time permits.
   
   > If I understand your change correctly, then it creates a new long[] in 
each call to matches() right? I see two main problems here
   
   Yeah, good callouts. I put this together pretty quickly as a sketched out 
idea, and didn't think super deeply about it. I was going for an approach that 
would let users extend the long-based API as the common approach, but allow 
extending the byte-based API if they really care about performance (but maybe 
it's not even more performant... TBD!).
   
   At this point, I'm convinced we should go with the long-based API for the 
initial version. Let's get this functionality shipped  and then we can 
benchmark, optimize, etc.





[GitHub] [lucene] gsmiller closed pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-13 Thread GitBox


gsmiller closed pull request #841: LUCENE-10274: Add hyperrectangle faceting 
capabilities
URL: https://github.com/apache/lucene/pull/841





[GitHub] [lucene] gsmiller commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-13 Thread GitBox


gsmiller commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1154143249

   Ah, sorry... I accidentally hit the "close" button! My bad. Reopened.





[jira] [Commented] (LUCENE-10527) Use bigger maxConn for last layer in HNSW

2022-06-13 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553699#comment-17553699
 ] 

Adrien Grand commented on LUCENE-10527:
---

I pushed an annotation to nightly benchmarks for the above performance change. 
It should show up in the coming days.

> Use bigger maxConn for last layer in HNSW
> -
>
> Key: LUCENE-10527
> URL: https://issues.apache.org/jira/browse/LUCENE-10527
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Assignee: Mayya Sharipova
>Priority: Minor
> Fix For: 9.2
>
> Attachments: Screen Shot 2022-05-18 at 4.26.14 PM.png, Screen Shot 
> 2022-05-18 at 4.26.24 PM.png, Screen Shot 2022-05-18 at 4.27.37 PM.png, 
> image-2022-04-20-14-53-58-484.png
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> Recently I was rereading the HNSW paper 
> ([https://arxiv.org/pdf/1603.09320.pdf)] and noticed that they suggest using 
> a different maxConn for the upper layers vs. the bottom one (which contains 
> the full neighborhood graph). Specifically, they suggest using maxConn=M for 
> upper layers and maxConn=2*M for the bottom. This differs from what we do, 
> which is to use maxConn=M for all layers.
> I tried updating our logic using a hacky patch, and noticed an improvement in 
> latency for higher recall values (which is consistent with the paper's 
> observation):
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=100
> !image-2022-04-20-14-53-58-484.png|width=400,height=367!
> As we'd expect, indexing becomes a bit slower:
> {code:java}
> Baseline: Indexed 1183514 documents in 733s 
> Candidate: Indexed 1183514 documents in 948s{code}
> When we benchmarked Lucene HNSW against hnswlib in LUCENE-9937, we noticed a 
> big difference in recall for the same settings of M and efConstruction. (Even 
> adding graph layers in LUCENE-10054 didn't really affect recall.) With this 
> change, the recall is now very similar:
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=100
> {code:java}
> k    Approach                                            Recall        QPS
> 10   luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.563    4410.499
> 50   luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.798    1956.280
> 100  luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.862    1209.734
> 500  luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.958     341.428
> 800  luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.974     230.396
> 1000 luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.980     188.757
> 10   hnswlib ({'M': 32, 'efConstruction': 100})          0.552   16745.433
> 50   hnswlib ({'M': 32, 'efConstruction': 100})          0.794    5738.468
> 100  hnswlib ({'M': 32, 'efConstruction': 100})          0.860    3336.386
> 500  hnswlib ({'M': 32, 'efConstruction': 100})          0.956     832.982
> 800  hnswlib ({'M': 32, 'efConstruction': 100})          0.973     541.097
> 1000 hnswlib ({'M': 32, 'efConstruction': 100})          0.979     442.163
> {code}
> I think it'd be nice update to maxConn so that we faithfully implement the 
> paper's algorithm. This is probably least surprising for users, and I don't 
> see a strong reason to take a different approach from the paper? Let me know 
> what you think!
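The layer-dependent connectivity the issue describes reduces to a one-line rule (a sketch of the paper's suggestion as discussed above, where `m` is the configured M):

```java
/** Per-layer connectivity per the HNSW paper: maxConn = M on upper layers, 2*M on the bottom layer. */
public class HnswMaxConn {
  public static int maxConn(int m, int level) {
    // level 0 holds the full neighborhood graph, so it gets twice the connections
    return level == 0 ? 2 * m : m;
  }
}
```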






[jira] [Commented] (LUCENE-10266) Move nearest-neighbor search on points to core?

2022-06-13 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553702#comment-17553702
 ] 

ASF subversion and git services commented on LUCENE-10266:
--

Commit fcd98fd3370b36e01f35510214cbd3628b25f0f8 in lucene's branch 
refs/heads/main from Rushabh Shah
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=fcd98fd3370 ]

LUCENE-10266 Move nearest-neighbor search on points to core (#897)

Co-authored-by: Rushabh Shah 

> Move nearest-neighbor search on points to core?
> ---
>
> Key: LUCENE-10266
> URL: https://issues.apache.org/jira/browse/LUCENE-10266
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Now that the Points' public API supports running nearest-neighbor 
> search, should we move it to core via helper methods on {{LatLonPoint}} and 
> {{XYPoint}}?






[GitHub] [lucene] jpountz commented on pull request #897: LUCENE-10266 Move nearest-neighbor search on points to core

2022-06-13 Thread GitBox


jpountz commented on PR #897:
URL: https://github.com/apache/lucene/pull/897#issuecomment-1154147639

   Thanks @shahrs87 !





[GitHub] [lucene] jpountz merged pull request #897: LUCENE-10266 Move nearest-neighbor search on points to core

2022-06-13 Thread GitBox


jpountz merged PR #897:
URL: https://github.com/apache/lucene/pull/897





[jira] [Resolved] (LUCENE-10266) Move nearest-neighbor search on points to core?

2022-06-13 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10266.
---
Fix Version/s: 10.0 (main)
   Resolution: Fixed

> Move nearest-neighbor search on points to core?
> ---
>
> Key: LUCENE-10266
> URL: https://issues.apache.org/jira/browse/LUCENE-10266
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 10.0 (main)
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Now that the Points' public API supports running nearest-neighbor 
> search, should we move it to core via helper methods on {{LatLonPoint}} and 
> {{XYPoint}}?






[jira] [Commented] (LUCENE-10078) Enable merge-on-refresh by default?

2022-06-13 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553697#comment-17553697
 ] 

Adrien Grand commented on LUCENE-10078:
---

As expected, this slowed down refresh latency a bit (see 
http://people.apache.org/~mikemccand/lucenebench/nrt.html). I pushed an 
annotation that should show up in the coming days.

> Enable merge-on-refresh by default?
> ---
>
> Key: LUCENE-10078
> URL: https://issues.apache.org/jira/browse/LUCENE-10078
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is a spinoff from the discussion in LUCENE-10073.
> The newish merge-on-refresh ([crazy origin 
> story|https://blog.mikemccandless.com/2021/03/open-source-collaboration-or-how-we.html])
>  feature is a powerful way to reduce searched segment counts, especially 
> helpful for applications using many indexing threads.  Such usage will write 
> many tiny segments on each refresh, which could quickly be merged up during 
> the {{refresh}} operation.
> We would have to implement a default for {{findFullFlushMerges}} 
> (LUCENE-10064 is open for this), and then we would need 
> {{IndexWriterConfig.getMaxFullFlushMergeWaitMillis}} a non-zero value (this 
> issue).






[GitHub] [lucene] gsmiller commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…

2022-06-13 Thread GitBox


gsmiller commented on PR #954:
URL: https://github.com/apache/lucene/pull/954#issuecomment-1154150491

   @jpountz +1 to testing this. Good call! Since I only tackled a subset of the 
code accessing `NO_MORE_DOCS`, I think we'll have to wait before we can clean 
this up and test, right?





[GitHub] [lucene] jpountz commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…

2022-06-13 Thread GitBox


jpountz commented on PR #954:
URL: https://github.com/apache/lucene/pull/954#issuecomment-1154152474

   We'll have to wait to clean up, indeed; plus there may be lots of users doing 
old-style iteration, so we'll need to deprecate and maybe only clean this up in 
10.0 or 11.0. But I'm curious about the sort of speedup this would yield. :)





[GitHub] [lucene] shahrs87 commented on pull request #897: LUCENE-10266 Move nearest-neighbor search on points to core

2022-06-13 Thread GitBox


shahrs87 commented on PR #897:
URL: https://github.com/apache/lucene/pull/897#issuecomment-1154177806

   Thank you @jpountz for the review and merge. :)





[jira] [Created] (LUCENE-10613) Clean up outdated NOTICE.txt information concerning morfologik

2022-06-13 Thread Dawid Weiss (Jira)
Dawid Weiss created LUCENE-10613:


 Summary: Clean up outdated NOTICE.txt information concerning 
morfologik
 Key: LUCENE-10613
 URL: https://issues.apache.org/jira/browse/LUCENE-10613
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Dawid Weiss









[jira] [Updated] (LUCENE-10613) Clean up outdated NOTICE.txt information concerning morfologik

2022-06-13 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated LUCENE-10613:
-
Fix Version/s: 9.3

> Clean up outdated NOTICE.txt information concerning morfologik
> --
>
> Key: LUCENE-10613
> URL: https://issues.apache.org/jira/browse/LUCENE-10613
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Trivial
> Fix For: 9.3
>
>
> It's been pointed out to me that NOTICE.txt contains information about 
> licensing terms that are outdated with regard to what Lucene uses nowadays. 
> It's a trivial update.






[jira] [Updated] (LUCENE-10613) Clean up outdated NOTICE.txt information concerning morfologik

2022-06-13 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated LUCENE-10613:
-
Description: It's been pointed out to me that NOTICE.txt contains 
information about licensing terms that are outdated with regard to what Lucene 
uses nowadays. It's a trivial update.

> Clean up outdated NOTICE.txt information concerning morfologik
> --
>
> Key: LUCENE-10613
> URL: https://issues.apache.org/jira/browse/LUCENE-10613
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Trivial
>
> It's been pointed out to me that NOTICE.txt contains information about 
> licensing terms that are outdated with regard to what Lucene uses nowadays. 
> It's a trivial update.






[GitHub] [lucene] Deepika0510 commented on a diff in pull request #927: LUCENE-10151: Adding Timeout Support to IndexSearcher

2022-06-13 Thread GitBox


Deepika0510 commented on code in PR #927:
URL: https://github.com/apache/lucene/pull/927#discussion_r895985779


##
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##
@@ -766,18 +778,29 @@ protected void search(List<LeafReaderContext> leaves, Weight weight, Collector collector)
       }
       BulkScorer scorer = weight.bulkScorer(ctx);
       if (scorer != null) {
-        try {
-          scorer.score(leafCollector, ctx.reader().getLiveDocs());
-        } catch (
-            @SuppressWarnings("unused")
-            CollectionTerminatedException e) {
-          // collection was terminated prematurely
-          // continue with the following leaf
+        if (isTimeoutEnabled) {
+          TimeLimitingBulkScorer timeLimitingBulkScorer =
+              new TimeLimitingBulkScorer(scorer, queryTimeout);
+          try {
+            timeLimitingBulkScorer.score(leafCollector, ctx.reader().getLiveDocs());
+          } catch (
+              @SuppressWarnings("unused")
+              TimeLimitingBulkScorer.TimeExceededException e) {
+            partialResult = true;

Review Comment:
   I have used counting QueryTimeout in the test. So, should we consider 
providing additional information to the user? And if yes then what all should 
we consider adding to it?






[jira] [Commented] (LUCENE-10613) Clean up outdated NOTICE.txt information concerning morfologik

2022-06-13 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553741#comment-17553741
 ] 

ASF subversion and git services commented on LUCENE-10613:
--

Commit 67816a9508a21ec7d43f6dbbc951b28bc3de in lucene's branch 
refs/heads/branch_9x from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=67816a9508a ]

LUCENE-10613: Clean up outdated NOTICE.txt information concerning morfologik


> Clean up outdated NOTICE.txt information concerning morfologik
> --
>
> Key: LUCENE-10613
> URL: https://issues.apache.org/jira/browse/LUCENE-10613
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Trivial
> Fix For: 9.3
>
>
> It's been pointed out to me that NOTICE.txt contains information about 
> licensing terms that are outdated with regard to what Lucene uses nowadays. 
> It's a trivial update.






[jira] [Resolved] (LUCENE-10613) Clean up outdated NOTICE.txt information concerning morfologik

2022-06-13 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss resolved LUCENE-10613.
--
  Assignee: Dawid Weiss
Resolution: Fixed

> Clean up outdated NOTICE.txt information concerning morfologik
> --
>
> Key: LUCENE-10613
> URL: https://issues.apache.org/jira/browse/LUCENE-10613
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
> Fix For: 9.3
>
>
> It's been pointed out to me that NOTICE.txt contains information about 
> licensing terms that are outdated with regard to what Lucene uses nowadays. 
> It's a trivial update.






[jira] [Commented] (LUCENE-10613) Clean up outdated NOTICE.txt information concerning morfologik

2022-06-13 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553740#comment-17553740
 ] 

ASF subversion and git services commented on LUCENE-10613:
--

Commit 76d418676e86d03dbedd73f917bfedec1d9b3d8c in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=76d418676e8 ]

LUCENE-10613: Clean up outdated NOTICE.txt information concerning morfologik


> Clean up outdated NOTICE.txt information concerning morfologik
> --
>
> Key: LUCENE-10613
> URL: https://issues.apache.org/jira/browse/LUCENE-10613
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Trivial
> Fix For: 9.3
>
>
> It's been pointed out to me that NOTICE.txt contains information about 
> licensing terms that are outdated with regard to what Lucene uses nowadays. 
> It's a trivial update.






[GitHub] [lucene] dweiss commented on a diff in pull request #927: LUCENE-10151: Adding Timeout Support to IndexSearcher

2022-06-13 Thread GitBox


dweiss commented on code in PR #927:
URL: https://github.com/apache/lucene/pull/927#discussion_r895987477


##
build.gradle:
##
@@ -183,3 +183,5 @@ apply from: file('gradle/hacks/turbocharge-jvm-opts.gradle')
 apply from: file('gradle/hacks/dummy-outputs.gradle')
 
 apply from: file('gradle/pylucene/pylucene.gradle')
+sourceCompatibility = JavaVersion.VERSION_17

Review Comment:
   IntelliJ sometimes adds such things on its own... Please revert this change 
- it's likely to crash badly with other things concerning source compatibility.






[GitHub] [lucene] shahrs87 commented on a diff in pull request #907: LUCENE-10357 Ghost fields and postings/points

2022-06-13 Thread GitBox


shahrs87 commented on code in PR #907:
URL: https://github.com/apache/lucene/pull/907#discussion_r896006974


##
lucene/codecs/src/java/org/apache/lucene/codecs/bloom/BloomFilteringPostingsFormat.java:
##
@@ -200,8 +200,8 @@ public Terms terms(String field) throws IOException {
       return delegateFieldsProducer.terms(field);
     } else {
       Terms result = delegateFieldsProducer.terms(field);
-      if (result == null) {
-        return null;
+      if (result == null || result == Terms.EMPTY) {

Review Comment:
   Yes, this test case is failing even with this patch: 
   `TestMemoryIndexAgainstDirectory#testRandomQueries` 
   Reproducible by: `gradlew :lucene:memory:test --tests 
"org.apache.lucene.index.memory.TestMemoryIndexAgainstDirectory.testRandomQueries"
 -Ptests.jvms=8 -Ptests.jvmargs=-XX:TieredStopAtLevel=1 
-Ptests.seed=B19145C39C34BD03 -Ptests.gui=false -Ptests.file.encoding=UTF-8`
   
   The underlying reader it is using is MemoryIndex#MemoryIndexReader 
[here](https://github.com/apache/lucene/blob/main/lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java#L1405)
   This is the relevant snippet.
   ```
    if (info == null || info.numTokens <= 0) {
      return null;
    }
   ```
   
   Below is the text I copied from LUCENE-10357 description.
   
   > I fear that this could be a source of bugs, as a caller could be tempted 
to assume that he would get non-null terms on a FieldInfo that has IndexOptions 
that are not NONE. Should we introduce a contract that FieldsProducer (resp. 
PointsReader) must return a non-null instance when postings (resp. points) are 
indexed?
   
   I don't know in which places I need to do the null check. From the above 
description, it looks like only the FieldsProducer-related classes. From my 
limited understanding, this doesn't look like a FieldsProducer. @jpountz please 
advise.






[GitHub] [lucene] jtibshirani commented on a diff in pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection

2022-06-13 Thread GitBox


jtibshirani commented on code in PR #951:
URL: https://github.com/apache/lucene/pull/951#discussion_r896012687


##
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##
@@ -92,20 +91,40 @@ public KnnVectorQuery(String field, float[] target, int k, Query filter) {
   public Query rewrite(IndexReader reader) throws IOException {
     TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
 
-    BitSetCollector filterCollector = null;
+    Weight filterWeight = null;
     if (filter != null) {
-      filterCollector = new BitSetCollector(reader.leaves().size());
       IndexSearcher indexSearcher = new IndexSearcher(reader);
       BooleanQuery booleanQuery =
           new BooleanQuery.Builder()
               .add(filter, BooleanClause.Occur.FILTER)
               .add(new FieldExistsQuery(field), BooleanClause.Occur.FILTER)
               .build();
-      indexSearcher.search(booleanQuery, filterCollector);
+      Query rewritten = indexSearcher.rewrite(booleanQuery);
+      filterWeight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f);
     }
 
     for (LeafReaderContext ctx : reader.leaves()) {
-      TopDocs results = searchLeaf(ctx, filterCollector);
+      Bits acceptDocs;
+      int cost;
+      if (filterWeight != null) {
+        Scorer scorer = filterWeight.scorer(ctx);
+        if (scorer != null) {
+          DocIdSetIterator iterator = scorer.iterator();
+          if (iterator instanceof BitSetIterator) {
+            acceptDocs = ((BitSetIterator) iterator).getBitSet();
+          } else {
+            acceptDocs = BitSet.of(iterator, ctx.reader().maxDoc());
+          }
+          cost = (int) iterator.cost();

Review Comment:
   This changes the meaning of `cost` (which is directly used as 
`visitedLimit`). Before we were using the exact number of matches, whereas now 
we ask the iterator for a cost estimation. These cost estimates are sometimes 
very imprecise, and I worry it could make the query performance unpredictable 
and harder to understand.
   
   I wonder if we could convert everything to a `BitSet` and then use the 
actual cardinality. Hopefully we could do this while still keeping the nice 
performance improvement?
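The distinction the review draws can be sketched with a JDK-only analogue (`java.util.BitSet` standing in for Lucene's internal bit set; the doc IDs and the estimate value below are made up for illustration): an iterator's cost is only an upper-bound estimate, while materializing the matches into a bit set yields the exact match count via `cardinality()`.

```java
import java.util.BitSet;

public class CardinalityVsCost {
  public static void main(String[] args) {
    // Pretend these doc IDs matched the filter query.
    int[] matchingDocs = {3, 17, 42, 99};
    // A typical over-estimate an iterator might report as its cost().
    int costEstimate = 128;

    // Materialize the matches into a bit set, as the review suggests.
    BitSet acceptDocs = new BitSet(1024);
    for (int doc : matchingDocs) {
      acceptDocs.set(doc);
    }

    // cardinality() counts the set bits exactly, unlike the estimate.
    System.out.println("estimate=" + costEstimate + " exact=" + acceptDocs.cardinality());
  }
}
```

Using the exact cardinality as the `visitedLimit` keeps the query's behavior predictable, at the price of always materializing the filter matches up front.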



##
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##
@@ -121,35 +140,15 @@ public Query rewrite(IndexReader reader) throws IOException {
     return createRewrittenQuery(reader, topK);
   }
 
-  private TopDocs searchLeaf(LeafReaderContext ctx, BitSetCollector filterCollector)
-      throws IOException {
-
-    if (filterCollector == null) {
-      Bits acceptDocs = ctx.reader().getLiveDocs();
-      return approximateSearch(ctx, acceptDocs, Integer.MAX_VALUE);
+  private TopDocs searchLeaf(LeafReaderContext ctx, Bits acceptDocs, int cost) throws IOException {
+    TopDocs results = approximateSearch(ctx, acceptDocs, cost);

Review Comment:
   The new logic here drops this check -- could we make sure to keep it?
   ```
    if (filterIterator.cost() <= k) {
      // If there are <= k possible matches, short-circuit and perform exact search, since HNSW
      // must always visit at least k documents
      return exactSearch(ctx, filterIterator);
    }
   ```






[GitHub] [lucene] gsmiller commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…

2022-06-13 Thread GitBox


gsmiller commented on PR #954:
URL: https://github.com/apache/lucene/pull/954#issuecomment-1154550479

   @jpountz:
   > plus there may be lots of users doing old-style iteration so we'll need to 
deprecate and maybe only clean this up in 10.0 or 11.0
   
   Right, makes sense.
   
   > But I'm curious about the sort of speedup that this would yield
   
   Me too. I hacked up a version of this change on another branch 
([here](https://github.com/gsmiller/lucene/commit/3768665ae6c173014e8288c46c27afa517c90ede))
 that lets calling code explicitly ask for a "fast" version of SSDV that doesn't 
do the ordinal check, and then relied on this new code path for loading SSDV 
within the faceting module. I didn't observe a significant change with our 
benchmark tooling, but I wonder how much we actually exercise these multi-value 
cases within our faceting benchmarks. I think it will be more interesting to 
test once we've migrated more use-cases to this new SSDV iteration style.
   
   You may have had a completely different thought in mind for testing this, so 
please let me know if I'm missing the mark here. Thanks!





[GitHub] [lucene] jtibshirani opened a new pull request, #956: Make sure KnnVectorQuery applies search boost

2022-06-13 Thread GitBox


jtibshirani opened a new pull request, #956:
URL: https://github.com/apache/lucene/pull/956

   Before, the rewritten query DocAndScoreQuery ignored the boost.





[GitHub] [lucene] jtibshirani commented on pull request #956: Make sure KnnVectorQuery applies search boost

2022-06-13 Thread GitBox


jtibshirani commented on PR #956:
URL: https://github.com/apache/lucene/pull/956#issuecomment-1154554441

   Thank you @mocobeta for leading the change to allow GitHub issues in 
CHANGES.txt -- this was very convenient to fix compared to before.





[jira] [Created] (LUCENE-10614) Properly support getTopChildren in RangeFacetCounts

2022-06-13 Thread Greg Miller (Jira)
Greg Miller created LUCENE-10614:


 Summary: Properly support getTopChildren in RangeFacetCounts
 Key: LUCENE-10614
 URL: https://issues.apache.org/jira/browse/LUCENE-10614
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Affects Versions: 10.0 (main)
Reporter: Greg Miller


As mentioned in LUCENE-10538, {{RangeFacetCounts}} is not implementing 
{{getTopChildren}}. Instead of returning "top" ranges, it returns all 
user-provided ranges in the order the user specified them when instantiating. 
This is probably more useful functionality, but it would be nice to support 
{{getTopChildren}} as well.

LUCENE-10550 is introducing the concept of {{getAllChildren}}, so once that 
lands, we can replace the current implementation of {{getTopChildren}} with an 
actual "top children" implementation and direct users to {{getAllChildren}} if 
they want to maintain the current behavior.






[GitHub] [lucene] gsmiller commented on a diff in pull request #914: LUCENE-10550: Add getAllChildren functionality to facets

2022-06-13 Thread GitBox


gsmiller commented on code in PR #914:
URL: https://github.com/apache/lucene/pull/914#discussion_r896240300


##
lucene/facet/src/java/org/apache/lucene/facet/LongValueFacetCounts.java:
##
@@ -346,6 +346,43 @@ private void increment(long value) {
 }
   }
 
+  @Override
+  public FacetResult getAllChildren(String dim, String... path) throws IOException {
+    if (dim.equals(field) == false) {
+      throw new IllegalArgumentException(
+          "invalid dim \"" + dim + "\"; should be \"" + field + "\"");
+    }
+    if (path.length != 0) {
+      throw new IllegalArgumentException("path.length should be 0");
+    }

Review Comment:
   minor: There's some common validation logic between this and 
`getTopChildren` you could factor out into a common helper method if you 
thought it made sense.



##
lucene/facet/src/java/org/apache/lucene/facet/LongValueFacetCounts.java:
##
@@ -346,6 +346,43 @@ private void increment(long value) {
 }
   }
 
+  @Override
+  public FacetResult getAllChildren(String dim, String... path) throws IOException {
+    if (dim.equals(field) == false) {
+      throw new IllegalArgumentException(
+          "invalid dim \"" + dim + "\"; should be \"" + field + "\"");
+    }
+    if (path.length != 0) {
+      throw new IllegalArgumentException("path.length should be 0");
+    }
+
+    List<LabelAndValue> labelValues = new ArrayList<>();
+    boolean countsAdded = false;
+    if (hashCounts.size() != 0) {
+      for (LongIntCursor c : hashCounts) {
+        int count = c.value;
+        if (count != 0) {
+          if (countsAdded == false && c.key >= counts.length) {
+            countsAdded = true;
+            appendCounts(labelValues);
+          }
+          labelValues.add(new LabelAndValue(Long.toString(c.key), count));
+        }
+      }
+    }
+
+    if (countsAdded == false) {
+      appendCounts(labelValues);
+    }
+
+    return new FacetResult(
+        field,
+        new String[0],
+        totCount,
+        labelValues.toArray(new LabelAndValue[0]),
+        labelValues.size());

Review Comment:
   It looks like you're trying to maintain value-sort-order, sort of like what 
`getAllChildrenSortByValue` is doing, but since we don't make any ordering 
guarantees, I think we can simplify this a little bit. What do you think of 
doing something like this?
   
   ```suggestion
    List<LabelAndValue> labelValues = new ArrayList<>();
    for (int i = 0; i < counts.length; i++) {
      if (counts[i] != 0) {
        labelValues.add(new LabelAndValue(Long.toString(i), counts[i]));
      }
    }

    if (hashCounts.size() != 0) {
      for (LongIntCursor c : hashCounts) {
        int count = c.value;
        if (count != 0) {
          labelValues.add(new LabelAndValue(Long.toString(c.key), c.value));
        }
      }
    }

    return new FacetResult(
        field,
        new String[0],
        totCount,
        labelValues.toArray(new LabelAndValue[0]),
        labelValues.size());
   ```



##
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FloatTaxonomyFacets.java:
##
@@ -101,6 +102,63 @@ public Number getSpecificValue(String dim, String... path) throws IOException {
     return values[ord];
   }
 
+  @Override
+  public FacetResult getAllChildren(String dim, String... path) throws IOException {

Review Comment:
   Please see my feedback on `IntTaxonomyFacets`. It should all apply here as 
well. Thanks!



##
lucene/facet/src/java/org/apache/lucene/facet/sortedset/AbstractSortedSetDocValueFacetCounts.java:
##
@@ -72,6 +72,40 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws IOException {
     return createFacetResult(topChildrenForPath, dim, path);
   }
 
+  @Override
+  public FacetResult getAllChildren(String dim, String... path) throws IOException {
+    FacetsConfig.DimConfig dimConfig = stateConfig.getDimConfig(dim);
+
+    if (dimConfig.hierarchical) {
+      int pathOrd = (int) dv.lookupTerm(new BytesRef(FacetsConfig.pathToString(dim, path)));
+      if (pathOrd < 0) {
+        // path was never indexed
+        return null;
+      }
+      SortedSetDocValuesReaderState.DimTree dimTree = state.getDimTree(dim);

Review Comment:
   As a general note, I think we usually prefer to _not_ fully qualify internal 
class names unless necessary (and prefer to import them directly instead). For 
example, we're already doing `import 
org.apache.lucene.facet.sortedset.SortedSetDocValuesReaderState.DimTree;`, so 
you can just say `DimTree` here instead of 
`SortedSetDocValuesReaderState.DimTree`. I'm betting your IDE is setup with 
this style, but something to just keep an eye on.



##
lucene/facet/src/java/org/apache/lucene/facet/sortedset/AbstractSortedSetDocValueFacetCounts.java:
##
@@ -72,6 +72,40 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws IOException {
     return createFacetResult(topChildrenForPath, dim, path);

[GitHub] [lucene] mdmarshmallow commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-13 Thread GitBox


mdmarshmallow commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1154629173

   I agree with Greg, we should not let benchmarking block releasing this. I 
pushed a commit to remove the `byte[]` matches API.





[GitHub] [lucene] shaie commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-13 Thread GitBox


shaie commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1154701271

   I pushed some more cleanups and minor refactoring.





[GitHub] [lucene] LuXugang merged pull request #942: LUCENE-10598: Use count to record docValueCount similar to SortedNumericDocValues did

2022-06-13 Thread GitBox


LuXugang merged PR #942:
URL: https://github.com/apache/lucene/pull/942





[jira] [Commented] (LUCENE-10598) SortedSetDocValues#docValueCount() should be always greater than zero

2022-06-13 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553895#comment-17553895
 ] 

ASF subversion and git services commented on LUCENE-10598:
--

Commit 7504b0a258d3c3209110e6476072b6ca6a2e82ff in lucene's branch 
refs/heads/main from Lu Xugang
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=7504b0a258d ]

LUCENE-10598: Use count to record docValueCount similar to 
SortedNumericDocValues did (#942)



> SortedSetDocValues#docValueCount() should be always greater than zero
> -
>
> Key: LUCENE-10598
> URL: https://issues.apache.org/jira/browse/LUCENE-10598
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Lu Xugang
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> This test fails:
> {code:java}
>   public void testDocValueCount() throws IOException {
>     try (Directory d = newDirectory()) {
>       try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) {
>         for (int j = 0; j < 1; j++) {
>           Document doc = new Document();
>           doc.add(new SortedSetDocValuesField("field", new BytesRef("a")));
>           doc.add(new SortedSetDocValuesField("field", new BytesRef("a")));
>           doc.add(new SortedSetDocValuesField("field", new BytesRef("b")));
>           w.addDocument(doc);
>         }
>       }
>       try (IndexReader reader = DirectoryReader.open(d)) {
>         assertEquals(1, reader.leaves().size());
>         for (LeafReaderContext leaf : reader.leaves()) {
>           SortedSetDocValues docValues = leaf.reader().getSortedSetDocValues("field");
>           for (int doc1 = docValues.nextDoc(); doc1 != DocIdSetIterator.NO_MORE_DOCS; doc1 = docValues.nextDoc()) {
>             assert docValues.docValueCount() > 0;
>           }
>         }
>       }
>     }
>   }
> {code}
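For context on why the assertion should hold: SortedSetDocValues deduplicates a document's values into sorted ordinals, so a document indexed with the values {a, a, b} has a docValueCount of 2, and any document the iterator positions on has at least one value. The following is a minimal plain-Java model of that deduplication, not Lucene code; the class and method names are illustrative only:

```java
import java.util.TreeSet;

// Illustrative model only: mimics how SortedSetDocValues collapses one
// document's values into distinct sorted ordinals. Not the Lucene API.
public class DocValueCountSketch {

    // Returns the number of distinct ordinals for one document's values.
    static int docValueCount(String[] values) {
        TreeSet<String> ords = new TreeSet<>();
        for (String v : values) {
            ords.add(v); // duplicate values collapse to a single ordinal
        }
        return ords.size();
    }

    public static void main(String[] args) {
        // Mirrors the Jira test above: values {a, a, b} on one document.
        int count = docValueCount(new String[] {"a", "a", "b"});
        System.out.println(count); // prints 2: "a" deduplicates, count is never 0
    }
}
```

Under this model, a count of 0 is only possible for a document with no values at all, which the iterator should skip; that matches the contract the fix enforces.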



--
This message was sent by Atlassian Jira
(v8.20.7#820007)




[jira] [Created] (LUCENE-10615) Add license information for SmartChineseAnalyzer to NOTICE.txt

2022-06-13 Thread Jan Dornseifer (Jira)
Jan Dornseifer created LUCENE-10615:
---

 Summary: Add license information for SmartChineseAnalyzer to 
NOTICE.txt
 Key: LUCENE-10615
 URL: https://issues.apache.org/jira/browse/LUCENE-10615
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Jan Dornseifer


The Lucene NOTICE file contains the statement

The SmartChineseAnalyzer source code (smartcn) was
provided by Xiaoping Gao and copyright 2009 by 
[www.imdict.net.|http://www.imdict.net./]

without providing license information. Can this information be supplemented, or 
is it outdated?

We are using Apache Lucene v8.4.1 and are currently undergoing a license audit 
of our software, in which third-party FOSS components are also checked. This 
statement came to our attention during the audit. I would be very grateful for 
any information.


