[jira] [Commented] (LUCENE-10078) Enable merge-on-refresh by default?
[ https://issues.apache.org/jira/browse/LUCENE-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553416#comment-17553416 ]

ASF subversion and git services commented on LUCENE-10078:
-----------------------------------------------------------

Commit d850a22a511676989f29ce3ef011fcf56be71c17 in lucene's branch refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d850a22a511 ]

LUCENE-10078: Fix TestIndexWriterExceptions' expectations regarding merges on full flushes.

> Enable merge-on-refresh by default?
> -----------------------------------
>
> Key: LUCENE-10078
> URL: https://issues.apache.org/jira/browse/LUCENE-10078
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Priority: Major
> Fix For: 9.3
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> This is a spinoff from the discussion in LUCENE-10073.
> The newish merge-on-refresh ([crazy origin story|https://blog.mikemccandless.com/2021/03/open-source-collaboration-or-how-we.html]) feature is a powerful way to reduce searched segment counts, and is especially helpful for applications using many indexing threads. Such usage will write many tiny segments on each refresh, which could quickly be merged up during the {{refresh}} operation.
> We would have to implement a default for {{findFullFlushMerges}} (LUCENE-10064 is open for this), and then we would need {{IndexWriterConfig.getMaxFullFlushMergeWaitMillis}} to return a non-zero value (this issue).
[jira] [Commented] (LUCENE-10078) Enable merge-on-refresh by default?
[ https://issues.apache.org/jira/browse/LUCENE-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553417#comment-17553417 ]

ASF subversion and git services commented on LUCENE-10078:
-----------------------------------------------------------

Commit 376195e0e7bd9052d09b3d4ca8b824dc8dbd5ce9 in lucene's branch refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=376195e0e7b ]

LUCENE-10078: Fix TestIndexWriterExceptions' expectations regarding merges on full flushes.

> Enable merge-on-refresh by default?
> -----------------------------------
>
> Key: LUCENE-10078
> URL: https://issues.apache.org/jira/browse/LUCENE-10078
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Priority: Major
> Fix For: 9.3
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> This is a spinoff from the discussion in LUCENE-10073.
> The newish merge-on-refresh ([crazy origin story|https://blog.mikemccandless.com/2021/03/open-source-collaboration-or-how-we.html]) feature is a powerful way to reduce searched segment counts, and is especially helpful for applications using many indexing threads. Such usage will write many tiny segments on each refresh, which could quickly be merged up during the {{refresh}} operation.
> We would have to implement a default for {{findFullFlushMerges}} (LUCENE-10064 is open for this), and then we would need {{IndexWriterConfig.getMaxFullFlushMergeWaitMillis}} to return a non-zero value (this issue).
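For context, opting in to merge-on-refresh today takes two pieces: a merge policy that implements {{findFullFlushMerges}}, and a non-zero {{IndexWriterConfig#maxFullFlushMergeWaitMillis}}. A minimal sketch of the wiring (not taken from the issue; the no-op policy body only marks where segment selection would go):

{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.*;

IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
// wait up to 500ms for merges kicked off by a full flush (refresh/commit)
iwc.setMaxFullFlushMergeWaitMillis(500);
iwc.setMergePolicy(
    new FilterMergePolicy(new TieredMergePolicy()) {
      @Override
      public MergeSpecification findFullFlushMerges(
          MergeTrigger mergeTrigger, SegmentInfos segmentInfos, MergeContext mergeContext)
          throws IOException {
        // a real implementation would select groups of tiny just-flushed segments here;
        // returning null means no merge-on-refresh
        return null;
      }
    });
{code}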
[jira] [Updated] (LUCENE-10611) KnnVectorQuery throwing Heap Error for Restrictive Filters
[ https://issues.apache.org/jira/browse/LUCENE-10611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kaival Parikh updated LUCENE-10611:
-----------------------------------
Description:
The HNSW graph search does not consider that visitedLimit may be reached in the upper levels of graph search itself.

This occurs when the pre-filter is too restrictive (and its count sets the visitedLimit). So instead of switching over to exactSearch, it tries to [pop from an empty heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90] and throws an error.

To reproduce this error, we can increase the numDocs [here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500] to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached faster).

Stacktrace:
The heap is empty
java.lang.IllegalStateException: The heap is empty
  at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
  at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
  at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
  at org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
  at org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
  at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
  at org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:235)
  at org.apache.lucene.search.KnnVectorQuery.approximateSearch(KnnVectorQuery.java:159)

was:
The HNSW graph search does not consider that visitedLimit may be reached in the upper levels of graph search itself.

This occurs when the pre-filter is too restrictive (and its count sets the visitedLimit). So instead of switching over to exactSearch, it tries to [pop from an empty heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90] and throws an error.

To reproduce this error, we can increase the numDocs [here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500] to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached faster).

Stacktrace:
`The heap is empty
java.lang.IllegalStateException: The heap is empty
  at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
  at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
  at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
  at org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
  at org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
  at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
  at org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:235)
  at org.apache.lucene.search.KnnVectorQuery.approximateSearch(KnnVectorQuery.java:159)`

> KnnVectorQuery throwing Heap Error for Restrictive Filters
> ----------------------------------------------------------
>
> Key: LUCENE-10611
> URL: https://issues.apache.org/jira/browse/LUCENE-10611
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Kaival Parikh
> Priority: Minor
>
> The HNSW graph search does not consider that visitedLimit may be reached in the upper levels of graph search itself.
> This occurs when the pre-filter is too restrictive (and its count sets the visitedLimit).
> So instead of switching over to exactSearch, it tries to [pop from an empty heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90] and throws an error.
>
> To reproduce this error, we can increase the numDocs [here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500] to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached faster).
>
> Stacktrace:
> The heap is empty
> java.lang.IllegalStateException: The heap is empty
> at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
> at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
> at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
> at org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
> at org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
> at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
> at org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:235)
> at org.apache.luc
[jira] [Created] (LUCENE-10611) KnnVectorQuery throwing Heap Error for Restrictive Filters
Kaival Parikh created LUCENE-10611:
--------------------------------------

             Summary: KnnVectorQuery throwing Heap Error for Restrictive Filters
                 Key: LUCENE-10611
                 URL: https://issues.apache.org/jira/browse/LUCENE-10611
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Kaival Parikh

The HNSW graph search does not consider that visitedLimit may be reached in the upper levels of graph search itself.

This occurs when the pre-filter is too restrictive (and its count sets the visitedLimit). So instead of switching over to exactSearch, it tries to [pop from an empty heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90] and throws an error.

To reproduce this error, we can increase the numDocs [here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500] to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached faster).

Stacktrace:
The heap is empty
java.lang.IllegalStateException: The heap is empty
  at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
  at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
  at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
  at org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
  at org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
  at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
  at org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:235)
  at org.apache.lucene.search.KnnVectorQuery.approximateSearch(KnnVectorQuery.java:159)
[jira] [Updated] (LUCENE-10611) KnnVectorQuery throwing Heap Error for Restrictive Filters
[ https://issues.apache.org/jira/browse/LUCENE-10611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kaival Parikh updated LUCENE-10611:
-----------------------------------
Description:
The HNSW graph search does not consider that visitedLimit may be reached in the upper levels of graph search itself.

This occurs when the pre-filter is too restrictive (and its count sets the visitedLimit). So instead of switching over to exactSearch, it tries to [pop from an empty heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90] and throws an error.

To reproduce this error, we can increase the numDocs [here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500] to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached faster).

Stacktrace:
`The heap is empty
java.lang.IllegalStateException: The heap is empty
  at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
  at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
  at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
  at org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
  at org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
  at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
  at org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:235)
  at org.apache.lucene.search.KnnVectorQuery.approximateSearch(KnnVectorQuery.java:159)`

was:
The HNSW graph search does not consider that visitedLimit may be reached in the upper levels of graph search itself.

This occurs when the pre-filter is too restrictive (and its count sets the visitedLimit). So instead of switching over to exactSearch, it tries to [pop from an empty heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90] and throws an error.

To reproduce this error, we can increase the numDocs [here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500] to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached faster).

Stacktrace:
The heap is empty
java.lang.IllegalStateException: The heap is empty
  at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
  at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
  at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
  at org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
  at org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
  at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
  at org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:235)
  at org.apache.lucene.search.KnnVectorQuery.approximateSearch(KnnVectorQuery.java:159)

> KnnVectorQuery throwing Heap Error for Restrictive Filters
> ----------------------------------------------------------
>
> Key: LUCENE-10611
> URL: https://issues.apache.org/jira/browse/LUCENE-10611
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Kaival Parikh
> Priority: Minor
>
> The HNSW graph search does not consider that visitedLimit may be reached in the upper levels of graph search itself.
> This occurs when the pre-filter is too restrictive (and its count sets the visitedLimit).
> So instead of switching over to exactSearch, it tries to [pop from an empty heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90] and throws an error.
>
> To reproduce this error, we can increase the numDocs [here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500] to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached faster).
>
> Stacktrace:
> `The heap is empty
> java.lang.IllegalStateException: The heap is empty
> at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
> at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
> at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
> at org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
> at org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
> at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
> at org.apache.lucene.index.CodecReader
[jira] [Commented] (LUCENE-10611) KnnVectorQuery throwing Heap Error for Restrictive Filters
[ https://issues.apache.org/jira/browse/LUCENE-10611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553467#comment-17553467 ]

Kaival Parikh commented on LUCENE-10611:
-----------------------------------------

As a fix, we can check if results are incomplete after [this line|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L89], and return results accordingly:

{code:java}
diff --git a/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java b/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
index b1a2436166f..7c641f077ee 100644
--- a/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
+++ b/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
@@ -87,6 +87,9 @@ public final class HnswGraphSearcher {
     int numVisited = 0;
     for (int level = graph.numLevels() - 1; level >= 1; level--) {
       results = graphSearcher.searchLevel(query, 1, level, eps, vectors, graph, null, visitedLimit);
+      if (results.incomplete()) {
+        return results;
+      }
       eps[0] = results.pop();
 
       numVisited += results.visitedCount();
{code}

Alternatively, we do not enforce limits in higher levels by setting the limit as Integer.MAX_VALUE [here|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L89] (also not updating the counts [here|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L92-L93]), but we might end up visiting more nodes than desired:

{code:java}
diff --git a/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java b/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
index b1a2436166f..0101cbd7690 100644
--- a/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
+++ b/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
@@ -86,11 +86,8 @@ public final class HnswGraphSearcher {
     int[] eps = new int[] {graph.entryNode()};
     int numVisited = 0;
     for (int level = graph.numLevels() - 1; level >= 1; level--) {
-      results = graphSearcher.searchLevel(query, 1, level, eps, vectors, graph, null, visitedLimit);
+      results = graphSearcher.searchLevel(query, 1, level, eps, vectors, graph, null, Integer.MAX_VALUE);
       eps[0] = results.pop();
-
-      numVisited += results.visitedCount();
-      visitedLimit -= results.visitedCount();
     }
     results = graphSearcher.searchLevel(query, topK, 0, eps, vectors, graph, acceptOrds, visitedLimit);
{code}

> KnnVectorQuery throwing Heap Error for Restrictive Filters
> ----------------------------------------------------------
>
> Key: LUCENE-10611
> URL: https://issues.apache.org/jira/browse/LUCENE-10611
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Kaival Parikh
> Priority: Minor
>
> The HNSW graph search does not consider that visitedLimit may be reached in the upper levels of graph search itself.
> This occurs when the pre-filter is too restrictive (and its count sets the visitedLimit).
> So instead of switching over to exactSearch, it tries to [pop from an empty heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90] and throws an error.
>
> To reproduce this error, we can increase the numDocs [here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500] to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached faster).
>
> Stacktrace:
> The heap is empty
> java.lang.IllegalStateException: The heap is empty
> at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
> at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
> at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
> at org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
> at org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
> at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
> at org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:235)
> at org.apache.lucene.search.KnnVectorQuery.approximateSearch(KnnVectorQuery.java:159)
[GitHub] [lucene] jpountz commented on a diff in pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection
jpountz commented on code in PR #951:
URL: https://github.com/apache/lucene/pull/951#discussion_r895488795

##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -92,20 +91,40 @@ public KnnVectorQuery(String field, float[] target, int k, Query filter) {
   public Query rewrite(IndexReader reader) throws IOException {
     TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
-    BitSetCollector filterCollector = null;
+    Weight filterWeight = null;
     if (filter != null) {
-      filterCollector = new BitSetCollector(reader.leaves().size());
       IndexSearcher indexSearcher = new IndexSearcher(reader);
       BooleanQuery booleanQuery =
           new BooleanQuery.Builder()
               .add(filter, BooleanClause.Occur.FILTER)
               .add(new FieldExistsQuery(field), BooleanClause.Occur.FILTER)
               .build();
-      indexSearcher.search(booleanQuery, filterCollector);
+      Query rewritten = indexSearcher.rewrite(booleanQuery);
+      filterWeight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f);
     }
 
     for (LeafReaderContext ctx : reader.leaves()) {
-      TopDocs results = searchLeaf(ctx, filterCollector);
+      Bits acceptDocs;
+      int cost;
+      if (filterWeight != null) {
+        Scorer scorer = filterWeight.scorer(ctx);
+        if (scorer != null) {
+          DocIdSetIterator iterator = scorer.iterator();
+          if (iterator instanceof BitSetIterator) {
+            acceptDocs = ((BitSetIterator) iterator).getBitSet();
+          } else {
+            acceptDocs = BitSet.of(iterator, ctx.reader().maxDoc());
+          }

Review Comment:
Do we need to apply live docs here? `Scorer#iterator` returns an iterator over all matches, including deleted documents.
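If live docs do need to be applied, one possible shape, sketched here rather than taken from the PR, is to filter the iterator against `ctx.reader().getLiveDocs()` while materializing the bit set:

```java
// A sketch only (variable names follow the PR snippet above): intersect the
// filter's matches with live docs while building the accept-docs bit set.
Bits liveDocs = ctx.reader().getLiveDocs();
DocIdSetIterator iterator = scorer.iterator();
FixedBitSet bits = new FixedBitSet(ctx.reader().maxDoc());
for (int doc = iterator.nextDoc();
    doc != DocIdSetIterator.NO_MORE_DOCS;
    doc = iterator.nextDoc()) {
  if (liveDocs == null || liveDocs.get(doc)) {
    bits.set(doc); // keep only non-deleted matches
  }
}
Bits acceptDocs = bits;
```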
[jira] [Updated] (LUCENE-10611) KnnVectorQuery throwing Heap Error for Restrictive Filters
[ https://issues.apache.org/jira/browse/LUCENE-10611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kaival Parikh updated LUCENE-10611:
-----------------------------------
Description:
The HNSW graph search does not consider that visitedLimit may be reached in the upper levels of graph search itself.

This occurs when the pre-filter is too restrictive (and its count sets the visitedLimit). So instead of switching over to exactSearch, it tries to [pop from an empty heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90] and throws an error.

To reproduce this error, we can increase the numDocs [here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500] to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached faster).

Stacktrace:
{code:java}
The heap is empty
java.lang.IllegalStateException: The heap is empty
  at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
  at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
  at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
  at org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
  at org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
  at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
  at org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:235)
  at org.apache.lucene.search.KnnVectorQuery.approximateSearch(KnnVectorQuery.java:159)
{code}

was:
The HNSW graph search does not consider that visitedLimit may be reached in the upper levels of graph search itself.

This occurs when the pre-filter is too restrictive (and its count sets the visitedLimit). So instead of switching over to exactSearch, it tries to [pop from an empty heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90] and throws an error.

To reproduce this error, we can increase the numDocs [here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500] to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached faster).

Stacktrace:
The heap is empty
java.lang.IllegalStateException: The heap is empty
  at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
  at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
  at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
  at org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
  at org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
  at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
  at org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:235)
  at org.apache.lucene.search.KnnVectorQuery.approximateSearch(KnnVectorQuery.java:159)

> KnnVectorQuery throwing Heap Error for Restrictive Filters
> ----------------------------------------------------------
>
> Key: LUCENE-10611
> URL: https://issues.apache.org/jira/browse/LUCENE-10611
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Kaival Parikh
> Priority: Minor
>
> The HNSW graph search does not consider that visitedLimit may be reached in the upper levels of graph search itself.
> This occurs when the pre-filter is too restrictive (and its count sets the visitedLimit).
> So instead of switching over to exactSearch, it tries to [pop from an empty heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90] and throws an error.
>
> To reproduce this error, we can increase the numDocs [here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500] to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached faster).
>
> Stacktrace:
> {code:java}
> The heap is empty
> java.lang.IllegalStateException: The heap is empty
> at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
> at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
> at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
> at org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
> at org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
> at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
> at org.apache.lucene.index.CodecReader
[GitHub] [lucene] jpountz commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…
jpountz commented on PR #954:
URL: https://github.com/apache/lucene/pull/954#issuecomment-1153675730

@gsmiller I wonder if you could also test if there is a speed up if we remove the checks that the codec has to do in order to make sure to return `NO_MORE_ORDS` when values for a doc are exhausted. E.g. `Lucene90DocValuesProducer#getSortedSet` looks like this today:

```java
  @Override
  public SortedSetDocValues getSortedSet(FieldInfo field) throws IOException {
    SortedSetEntry entry = sortedSets.get(field.name);
    if (entry.singleValueEntry != null) {
      return DocValues.singleton(getSorted(entry.singleValueEntry));
    }

    final SortedNumericDocValues ords = getSortedNumeric(entry.ordsEntry);
    return new BaseSortedSetDocValues(entry, data) {

      int i = 0;
      int count = 0;
      boolean set = false;

      @Override
      public long nextOrd() throws IOException {
        if (set == false) {
          set = true;
          i = 0;
          count = ords.docValueCount();
        }
        if (i++ == count) {
          return NO_MORE_ORDS;
        }
        return ords.nextValue();
      }

      @Override
      public long docValueCount() {
        return ords.docValueCount();
      }

      @Override
      public boolean advanceExact(int target) throws IOException {
        set = false;
        return ords.advanceExact(target);
      }

      @Override
      public int docID() {
        return ords.docID();
      }

      @Override
      public int nextDoc() throws IOException {
        set = false;
        return ords.nextDoc();
      }

      @Override
      public int advance(int target) throws IOException {
        set = false;
        return ords.advance(target);
      }

      @Override
      public long cost() {
        return ords.cost();
      }
    };
  }
```

but if we moved everything to this new iteration model, we wouldn't have to check if the caller is visiting more values than expected; it would just lead to undefined behavior. We could remove `i`, `count`, and `set`, and `nextOrd()` could delegate directly to `ords.nextValue()`.
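On the consuming side, the proposed contract would roughly look like the following sketch (the field name is hypothetical, and `docValueCount()` follows the signature quoted above):

```java
// Sketch of the caller-driven iteration model: the caller reads exactly
// docValueCount() ords per document, so the codec never needs to detect
// exhaustion or return NO_MORE_ORDS itself.
SortedSetDocValues values = leafReader.getSortedSetDocValues("field"); // hypothetical field
for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) {
  long count = values.docValueCount();
  for (long i = 0; i < count; i++) {
    long ord = values.nextOrd(); // could then delegate straight to ords.nextValue()
    // consume ord ...
  }
}
```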
[GitHub] [lucene] jpountz commented on a diff in pull request #950: LUCENE-10608: Implement Weight#count on pure conjunctions.
jpountz commented on code in PR #950:
URL: https://github.com/apache/lucene/pull/950#discussion_r895526857

##########
lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java:
##########
@@ -344,6 +344,45 @@ public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
     }
   }
 
+  @Override
+  public int count(LeafReaderContext context) throws IOException {
+    // Implement counting for pure conjunctions in the case when one clause doesn't match any docs,
+    // or all clauses but one match all docs.
+    if (weightedClauses.isEmpty()) {
+      return 0;
+    }
+    for (WeightedBooleanClause weightedClause : weightedClauses) {
+      switch (weightedClause.clause.getOccur()) {
+        case FILTER:
+        case MUST:
+          break;
+        case MUST_NOT:

Review Comment:
You are right. I think there are a few more cases we could optimize, like pure disjunctions of term queries on a single-valued field where we could just sum up counts. I was thinking of introducing these optimizations one at a time to keep changes easy to review and test.
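The "sum up counts" idea for pure disjunctions could look roughly like this sketch (not the actual follow-up change; `subWeights` is a hypothetical list of clause weights, and the clauses are assumed to never match the same document):

```java
// Sketch: counting a pure disjunction of term queries on a single-valued field,
// where no two clauses can match the same document, by summing per-clause counts.
int total = 0;
for (Weight subWeight : subWeights) {
  int count = subWeight.count(context);
  if (count == -1) {
    return -1; // this clause cannot be counted cheaply; give up
  }
  total += count;
}
return total;
```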
[GitHub] [lucene] jpountz commented on a diff in pull request #950: LUCENE-10608: Implement Weight#count on pure conjunctions.
jpountz commented on code in PR #950:
URL: https://github.com/apache/lucene/pull/950#discussion_r895527360

##########
lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java:
##########
@@ -344,6 +344,45 @@ public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
     }
   }
 
+  @Override
+  public int count(LeafReaderContext context) throws IOException {
+    // Implement counting for pure conjunctions in the case when one clause doesn't match any docs,
+    // or all clauses but one match all docs.
+    if (weightedClauses.isEmpty()) {
+      return 0;
+    }
+    for (WeightedBooleanClause weightedClause : weightedClauses) {
+      switch (weightedClause.clause.getOccur()) {
+        case FILTER:
+        case MUST:
+          break;
+        case MUST_NOT:
+        case SHOULD:

Review Comment:
It should already be handled by the fact that `BooleanQuery#rewrite` rewrites single-clause queries.
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553494#comment-17553494 ]

Adrien Grand commented on LUCENE-10480:
----------------------------------------

Good question. Looking at your BlockMaxMaxScoreScorer, it looks like it also has potential for being specialized in the 2-clause case, by having two sub scorers and tracking during document collection whether the scorer that produces lower scores is optional or required.

I didn't have concrete plans in mind when opening the issue; I was just observing that we pay significant overhead for supporting arbitrary numbers of clauses, when disjunctions often have only two clauses.

> Specialize 2-clauses disjunctions
> ---------------------------------
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Adrien Grand
> Priority: Minor
>
> WANDScorer is nice, but it also has lots of overhead to maintain its invariants: one linked list for the current candidates, one priority queue of scorers that are behind, another one for scorers that are ahead. All this could be simplified in the 2-clause case, which feels worth specializing for, as it's very common that end users enter queries that only have two terms.
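To make the potential saving concrete, here is a toy sketch (not WANDScorer, and ignoring scoring): with exactly two clauses, advancing the disjunction needs only a min() over two iterators instead of a linked list plus two priority queues.

{code:java}
// Toy sketch: doc IDs start at -1; each call advances whichever sub-iterator(s)
// sit on the current doc, and the next match is simply the minimum.
static int nextDoc(DocIdSetIterator a, DocIdSetIterator b) throws IOException {
  int current = Math.min(a.docID(), b.docID());
  if (a.docID() == current) {
    a.nextDoc();
  }
  if (b.docID() == current) {
    b.nextDoc();
  }
  return Math.min(a.docID(), b.docID()); // NO_MORE_DOCS once both are exhausted
}
{code}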
[jira] [Created] (LUCENE-10612) Add parameters for HNSW codec in Lucene93Codec
Elia Porciani created LUCENE-10612:
--------------------------------------

             Summary: Add parameters for HNSW codec in Lucene93Codec
                 Key: LUCENE-10612
                 URL: https://issues.apache.org/jira/browse/LUCENE-10612
             Project: Lucene - Core
          Issue Type: Task
          Components: core/codecs
            Reporter: Elia Porciani

Currently, it is possible to specify only the compression mode for stored fields in the LuceneXXCodec constructors.

With the introduction of the HNSW graph and LuceneXXHnswVectorsFormat, LuceneXXCodec should provide an easy way to specify custom parameters for the HNSW graph layout:
* maxConn
* beamWidth
[jira] [Commented] (LUCENE-10612) Add parameters for HNSW codec in Lucene93Codec
[ https://issues.apache.org/jira/browse/LUCENE-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553499#comment-17553499 ]

Adrien Grand commented on LUCENE-10612:
----------------------------------------

We have been rejecting such requests in the past due to the impact on backward compatibility: the default codec has strong backward-compatibility guarantees, and we need to make sure that those guarantees hold for every combination of options.

Stored fields are indeed an exception, because it was hard to come up with values that would work well enough for everyone. But it was done in a way that has a very small surface, e.g. it doesn't expose the algorithm that is used under the hood, the size of blocks, or the DEFLATE compression level; it's only two options with opaque implementation details. On the other hand, maxConn and beamWidth are specific implementation details of HNSW that can take a large range of values. And even with only two possible options, we still set the bar pretty high for configurability of the default codec, e.g. there was an option for doc values at some point that we ended up removing.

Would it work for you to override `Lucene93Codec#getKnnVectorsFormatForField`? The caveat is that it is customizing file formats, so it puts you on your own regarding backward compatibility.

> Add parameters for HNSW codec in Lucene93Codec
> ----------------------------------------------
>
> Key: LUCENE-10612
> URL: https://issues.apache.org/jira/browse/LUCENE-10612
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs
> Reporter: Elia Porciani
> Priority: Major
>
> Currently, it is possible to specify only the compression mode for stored fields in the LuceneXXCodec constructors.
> With the introduction of the HNSW graph and LuceneXXHnswVectorsFormat, LuceneXXCodec should provide an easy way to specify custom parameters for the HNSW graph layout:
> * maxConn
> * beamWidth
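The suggested workaround looks roughly like this sketch (the values 16 and 100 are illustrative, not recommendations, and the analyzer is assumed):

{code:java}
Codec codec =
    new Lucene93Codec() {
      @Override
      public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
        // per-field choice of maxConn/beamWidth, bypassing the default format
        return new Lucene93HnswVectorsFormat(/* maxConn */ 16, /* beamWidth */ 100);
      }
    };
IndexWriterConfig config = new IndexWriterConfig(analyzer).setCodec(codec);
{code}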
[jira] [Commented] (LUCENE-10612) Add parameters for HNSW codec in Lucene93Codec
[ https://issues.apache.org/jira/browse/LUCENE-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553518#comment-17553518 ]

Elia Porciani commented on LUCENE-10612:
-----------------------------------------

Actually, the change I'm proposing is to make it possible to specify the parameters for HNSW without the need to know which HNSW codec is used underneath. For instance, in Solr this is done in the way you mentioned, but there is an explicit reference to *Lucene91HnswVectorsFormat*, and for this reason Solr cannot be agnostic about the codec version used in Lucene for HNSW.

> Add parameters for HNSW codec in Lucene93Codec
> ----------------------------------------------
>
> Key: LUCENE-10612
> URL: https://issues.apache.org/jira/browse/LUCENE-10612
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs
> Reporter: Elia Porciani
> Priority: Major
>
> Currently, it is possible to specify only the compression mode for stored fields in the LuceneXXCodec constructors.
> With the introduction of the HNSW graph and LuceneXXHnswVectorsFormat, LuceneXXCodec should provide an easy way to specify custom parameters for the HNSW graph layout:
> * maxConn
> * beamWidth
[jira] [Commented] (LUCENE-8193) Deprecate LowercaseTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553541#comment-17553541 ]

Andras Salamon commented on LUCENE-8193:
-----------------------------------------

This looks like a duplicate of LUCENE-8498.

> Deprecate LowercaseTokenizer
> ----------------------------
>
> Key: LUCENE-8193
> URL: https://issues.apache.org/jira/browse/LUCENE-8193
> Project: Lucene - Core
> Issue Type: Task
> Components: modules/analysis
> Reporter: Tim Allison
> Priority: Minor
>
> On LUCENE-8186, discussion favored deprecating and eventually removing LowercaseTokenizer.
[GitHub] [lucene] eliaporciani opened a new pull request, #955: LUCENE-10612: Introduced Lucene93CodecParameters for Lucene93Codec
eliaporciani opened a new pull request, #955:
URL: https://github.com/apache/lucene/pull/955

https://issues.apache.org/jira/browse/LUCENE-10612

# Description

Lucene93Codec should provide a way to pass custom parameters to HnswVectorsFormat.

# Solution

To provide the various parameters to Lucene93Codec, I wrap them in a Lucene93CodecParameters class. This should provide a cleaner and easier way to pass custom parameters.
[jira] [Commented] (LUCENE-10612) Add parameters for HNSW codec in Lucene93Codec
[ https://issues.apache.org/jira/browse/LUCENE-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553593#comment-17553593 ]

Elia Porciani commented on LUCENE-10612:
-----------------------------------------

However, I understand the concern about backward compatibility. I don't think it is harmful at this time to have custom HNSW parameters, but things might be different in future releases.

Even if we decide not to move forward, I have created this PR to make the proposal clearer: [https://github.com/apache/lucene/pull/955.]

> Add parameters for HNSW codec in Lucene93Codec
> ----------------------------------------------
>
> Key: LUCENE-10612
> URL: https://issues.apache.org/jira/browse/LUCENE-10612
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs
> Reporter: Elia Porciani
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Currently, it is possible to specify only the compression mode for stored fields in the LuceneXXCodec constructors.
> With the introduction of the HNSW graph and LuceneXXHnswVectorsFormat, LuceneXXCodec should provide an easy way to specify custom parameters for the HNSW graph layout:
> * maxConn
> * beamWidth
[jira] [Comment Edited] (LUCENE-10612) Add parameters for HNSW codec in Lucene93Codec
[ https://issues.apache.org/jira/browse/LUCENE-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553593#comment-17553593 ]

Elia Porciani edited comment on LUCENE-10612 at 6/13/22 1:54 PM:
------------------------------------------------------------------

However, I understand the concern about backward compatibility. I don't think it is harmful at this time to have custom HNSW parameters, but things might be different in future releases.

Even if we decide not to move forward, I have created this PR to make the proposal clearer: [https://github.com/apache/lucene/pull/955|https://github.com/apache/lucene/pull/955.]

was (Author: JIRAUSER280197):
However, I understand the concern about backward compatibility. I don't think it is harmful at this time to have custom HNSW parameters, but things might be different in future releases.

Even if we decide not to move forward, I have created this PR to make the proposal clearer: [https://github.com/apache/lucene/pull/955.]

> Add parameters for HNSW codec in Lucene93Codec
> ----------------------------------------------
>
> Key: LUCENE-10612
> URL: https://issues.apache.org/jira/browse/LUCENE-10612
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs
> Reporter: Elia Porciani
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Currently, it is possible to specify only the compression mode for stored fields in the LuceneXXCodec constructors.
> With the introduction of the HNSW graph and LuceneXXHnswVectorsFormat, LuceneXXCodec should provide an easy way to specify custom parameters for the HNSW graph layout:
> * maxConn
> * beamWidth
[jira] [Comment Edited] (LUCENE-10612) Add parameters for HNSW codec in Lucene93Codec
[ https://issues.apache.org/jira/browse/LUCENE-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553593#comment-17553593 ]

Elia Porciani edited comment on LUCENE-10612 at 6/13/22 1:54 PM:
------------------------------------------------------------------

However, I understand the concern about backward compatibility. I don't think it is harmful at this time to have custom HNSW parameters, but things might be different in future releases.

Even if we decide not to move forward, I have created this PR to make the proposal clearer: [https://github.com/apache/lucene/pull/955.]

was (Author: JIRAUSER280197):
However, I understand the concern about backward compatibility. I don't think it is harmful at this time to have custom HNSW parameters, but things might be different in future releases.

Even if we decide not to move forward, I have created this PR to make the proposal clearer: [https://github.com/apache/lucene/pull/955.]

> Add parameters for HNSW codec in Lucene93Codec
> ----------------------------------------------
>
> Key: LUCENE-10612
> URL: https://issues.apache.org/jira/browse/LUCENE-10612
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs
> Reporter: Elia Porciani
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Currently, it is possible to specify only the compression mode for stored fields in the LuceneXXCodec constructors.
> With the introduction of the HNSW graph and LuceneXXHnswVectorsFormat, LuceneXXCodec should provide an easy way to specify custom parameters for the HNSW graph layout:
> * maxConn
> * beamWidth
[GitHub] [lucene-solr] anshumg opened a new pull request, #2663: SOLR-16218: Fix bug in in-place update when failOnVersionConflicts=false
anshumg opened a new pull request, #2663:
URL: https://github.com/apache/lucene-solr/pull/2663

Added more people to CHANGES to include folks who contributed to reviewing this fix. Will update the CHANGES in main and 9x too.
[GitHub] [lucene] msokolov commented on a diff in pull request #927: LUCENE-10151: Adding Timeout Support to IndexSearcher
msokolov commented on code in PR #927:
URL: https://github.com/apache/lucene/pull/927#discussion_r895817658

##########
build.gradle:
##########
@@ -183,3 +183,5 @@ apply from: file('gradle/hacks/turbocharge-jvm-opts.gradle')
 apply from: file('gradle/hacks/dummy-outputs.gradle')
 
 apply from: file('gradle/pylucene/pylucene.gradle')
+
+sourceCompatibility = JavaVersion.VERSION_17

Review Comment:
why did we need to add this?

##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -532,6 +536,11 @@ public TopDocs reduce(Collection collectors) throws IOExce
     return search(query, manager);
   }
 
+  public void setTimeout(boolean isTimeoutEnabled, QueryTimeout queryTimeout) throws IOException {

Review Comment:
Could we use `queryTimeout == null` or a sentinel value `QueryTimeout.NONE` to indicate no timeout is enabled? It would save a redundant parameter and member variable. Actually, I see QueryTimeout has a timeoutEnabled() method, so could we define NONE to return false and just check that in our branches instead of this separate boolean flag?

##########
lucene/core/src/java/org/apache/lucene/search/TimeLimitingBulkScorer.java:
##########
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import org.apache.lucene.index.QueryTimeout;
+import org.apache.lucene.util.Bits;
+
+/**
+ * The {@link TimeLimitingCollector} is used to timeout search requests that take longer than the
+ * maximum allowed search time limit. After this time is exceeded, the search thread is stopped by
+ * throwing a {@link TimeLimitingCollector.TimeExceededException}.
+ *
+ * @see org.apache.lucene.index.ExitableDirectoryReader
+ */
+public class TimeLimitingBulkScorer extends BulkScorer {
+
+  static final int INTERVAL = 100;

Review Comment:
please add a comment for this constant - what is it used for?
Actually we should describe the algorithm here; namely that we score chunks of documents at a time so as to avoid the cost of checking the timeout for every document we score.

##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -766,18 +778,29 @@ protected void search(List leaves, Weight weight, Collector c
       }
       BulkScorer scorer = weight.bulkScorer(ctx);
       if (scorer != null) {
-        try {
-          scorer.score(leafCollector, ctx.reader().getLiveDocs());
-        } catch (
-            @SuppressWarnings("unused")
-            CollectionTerminatedException e) {
-          // collection was terminated prematurely
-          // continue with the following leaf
+        if (isTimeoutEnabled) {
+          TimeLimitingBulkScorer timeLimitingBulkScorer =
+              new TimeLimitingBulkScorer(scorer, queryTimeout);
+          try {
+            timeLimitingBulkScorer.score(leafCollector, ctx.reader().getLiveDocs());
+          } catch (
+              @SuppressWarnings("unused")
+              TimeLimitingBulkScorer.TimeExceededException e) {
+            partialResult = true;

Review Comment:
I wonder if we should use this as a way to provide some information to the caller, for example how much time elapsed when the timeout occurred? The exception could pass that back? On the other hand, then every QueryTimeout might have to track that, and for some of them (eg counting-based) the time isn't really the most important dimension.

##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -555,6 +564,9 @@ public void search(Query query, Collector results) throws IOException {
     search(leafContexts, createWeight(query, results.scoreMode(), 1), results);
   }
 
+  public boolean isAborted() {

Review Comment:
How about `timedOut()`? It will be more symmetric with the methods/variables using timeout in their names.

##########
lucene/core/src/java/org/apache/lucene/index/ExitableDirectoryReader.java:
##########
@@ -82,8 +81,8 @@ public PointValues getPointValues(String field) throws IOException {
       return null;
     }
     return (queryTimeout.isTimeoutEnabled())
-        ? new ExitablePointValues(pointValues, queryTim
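The chunked-scoring idea under review would look something like this (a sketch of the assumed shape, not the merged code):

```java
// Score INTERVAL documents at a time so the timeout is checked once per chunk
// rather than once per document. The RuntimeException stands in for the PR's
// TimeLimitingBulkScorer.TimeExceededException.
static final int INTERVAL = 100;

void scoreRange(BulkScorer in, LeafCollector collector, Bits acceptDocs,
    QueryTimeout queryTimeout, int min, int max) throws IOException {
  int doc = min;
  while (doc < max) {
    if (queryTimeout.shouldExit()) {
      throw new RuntimeException("query timed out"); // PR: TimeExceededException
    }
    int upTo = (int) Math.min((long) doc + INTERVAL, max);
    doc = in.score(collector, acceptDocs, doc, upTo); // returns the next doc to score
  }
}
```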
[GitHub] [lucene-solr] anshumg merged pull request #2663: SOLR-16218: Fix bug in in-place update when failOnVersionConflicts=false
anshumg merged PR #2663:
URL: https://github.com/apache/lucene-solr/pull/2663
[jira] [Commented] (LUCENE-10611) KnnVectorQuery throwing Heap Error for Restrictive Filters
[ https://issues.apache.org/jira/browse/LUCENE-10611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553675#comment-17553675 ]

Julie Tibshirani commented on LUCENE-10611:
---------------------------------------------

Thanks for catching this [~kaivalnp]! Your first suggestion makes sense to me. Would you like to open a PR with the fix (plus a test like the one you mentioned)?

> KnnVectorQuery throwing Heap Error for Restrictive Filters
> ----------------------------------------------------------
>
> Key: LUCENE-10611
> URL: https://issues.apache.org/jira/browse/LUCENE-10611
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Kaival Parikh
> Priority: Minor
>
> The HNSW graph search does not consider that visitedLimit may be reached in the upper levels of graph search itself.
> This occurs when the pre-filter is too restrictive (and its count sets the visitedLimit). So instead of switching over to exactSearch, it tries to [pop from an empty heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L90] and throws an error.
>
> To reproduce this error, we can increase the numDocs [here|https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L500] to 20,000+ (so that nodes have more neighbors, and visitedLimit is reached faster).
>
> Stacktrace:
> {code:java}
> The heap is empty
> java.lang.IllegalStateException: The heap is empty
> at __randomizedtesting.SeedInfo.seed([D7BC2F56048D9D1A:A1F576DD0E795BBF]:0)
> at org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
> at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:98)
> at org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:90)
> at org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsReader.search(Lucene92HnswVectorsReader.java:236)
> at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:272)
> at org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:235)
> at org.apache.lucene.search.KnnVectorQuery.approximateSearch(KnnVectorQuery.java:159)
> {code}
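The requested regression test might look something like this sketch (the helper, vector generation, and assertions are assumptions, not the eventual PR):

{code:java}
public void testRestrictiveFilterFallsBackToExactSearch() throws IOException {
  int numDocs = 20000; // large enough that upper layers alone can exhaust visitedLimit
  try (Directory dir = newDirectory()) {
    indexDocsWithVectors(dir, numDocs); // hypothetical helper: one vector + id field per doc
    try (IndexReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      Query filter = new TermQuery(new Term("id", "0")); // matches ~1 doc
      KnnVectorQuery query = new KnnVectorQuery("vector", randomVector(), 10, filter);
      // should fall back to exact search instead of throwing "The heap is empty"
      TopDocs results = searcher.search(query, 10);
      assertTrue(results.scoreDocs.length <= 10);
    }
  }
}
{code}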
[GitHub] [lucene] gsmiller commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
gsmiller commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1154142817

> Anyway, let's benchmark it, but with the analysis above, I also agree we should actually start with the long[] API, and replace it with a byte[] one only if actually performs better.

+1 to starting with `long[]` and then benchmarking a `byte[]` version when time permits.

> If I understand your change correctly, then it creates a new long[] in each call to matches() right? I see two main problems here

Yeah, good callouts. I put this together pretty quickly as a sketched out idea, and didn't think super deeply about it. I was going for an approach that would let users extend the long-based API as the common case, but allow extending the byte-based API if they really care about performance (but maybe it's not even more performant... TBD!).

At this point, I'm convinced we should go with the long-based API for the initial version. Let's get this functionality shipped and then we can benchmark, optimize, etc.
[GitHub] [lucene] gsmiller closed pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
gsmiller closed pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
URL: https://github.com/apache/lucene/pull/841
[GitHub] [lucene] gsmiller commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
gsmiller commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1154143249

Ah, sorry... I accidentally hit the "close" button! My bad. Reopened.
[jira] [Commented] (LUCENE-10527) Use bigger maxConn for last layer in HNSW
[ https://issues.apache.org/jira/browse/LUCENE-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553699#comment-17553699 ]

Adrien Grand commented on LUCENE-10527:
----------------------------------------

I pushed an annotation to nightly benchmarks for the above performance change. It should show up in the coming days.

> Use bigger maxConn for last layer in HNSW
> -----------------------------------------
>
> Key: LUCENE-10527
> URL: https://issues.apache.org/jira/browse/LUCENE-10527
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Julie Tibshirani
> Assignee: Mayya Sharipova
> Priority: Minor
> Fix For: 9.2
>
> Attachments: Screen Shot 2022-05-18 at 4.26.14 PM.png, Screen Shot 2022-05-18 at 4.26.24 PM.png, Screen Shot 2022-05-18 at 4.27.37 PM.png, image-2022-04-20-14-53-58-484.png
>
> Time Spent: 4h 40m
> Remaining Estimate: 0h
>
> Recently I was rereading the HNSW paper ([https://arxiv.org/pdf/1603.09320.pdf]) and noticed that they suggest using a different maxConn for the upper layers vs. the bottom one (which contains the full neighborhood graph). Specifically, they suggest using maxConn=M for upper layers and maxConn=2*M for the bottom. This differs from what we do, which is to use maxConn=M for all layers.
> I tried updating our logic using a hacky patch, and noticed an improvement in latency for higher recall values (which is consistent with the paper's observation):
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=100
> !image-2022-04-20-14-53-58-484.png|width=400,height=367!
> As we'd expect, indexing becomes a bit slower:
> {code:java}
> Baseline:  Indexed 1183514 documents in 733s
> Candidate: Indexed 1183514 documents in 948s
> {code}
> When we benchmarked Lucene HNSW against hnswlib in LUCENE-9937, we noticed a big difference in recall for the same settings of M and efConstruction. (Even adding graph layers in LUCENE-10054 didn't really affect recall.) With this change, the recall is now very similar:
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=100
> {code:java}
> k     Approach                                             Recall    QPS
> 10    luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.563     4410.499
> 50    luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.798     1956.280
> 100   luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.862     1209.734
> 500   luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.958      341.428
> 800   luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.974      230.396
> 1000  luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.980      188.757
> 10    hnswlib ({'M': 32, 'efConstruction': 100})           0.552    16745.433
> 50    hnswlib ({'M': 32, 'efConstruction': 100})           0.794     5738.468
> 100   hnswlib ({'M': 32, 'efConstruction': 100})           0.860     3336.386
> 500   hnswlib ({'M': 32, 'efConstruction': 100})           0.956      832.982
> 800   hnswlib ({'M': 32, 'efConstruction': 100})           0.973      541.097
> 1000  hnswlib ({'M': 32, 'efConstruction': 100})           0.979      442.163
> {code}
> I think it'd be a nice update to maxConn so that we faithfully implement the paper's algorithm. This is probably least surprising for users, and I don't see a strong reason to take a different approach from the paper? Let me know what you think!
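Reduced to its core, the change under discussion is a per-layer cap (a sketch; the real patch touches the HNSW graph-building code):

{code:java}
// Bottom layer (level 0) holds the full neighborhood graph, so it keeps 2*M links
// per node; upper layers keep M, following the HNSW paper's recommendation.
static int maxConnOnLevel(int level, int M) {
  return level == 0 ? 2 * M : M;
}
{code}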
[jira] [Commented] (LUCENE-10266) Move nearest-neighbor search on points to core?
[ https://issues.apache.org/jira/browse/LUCENE-10266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553702#comment-17553702 ] ASF subversion and git services commented on LUCENE-10266: -- Commit fcd98fd3370b36e01f35510214cbd3628b25f0f8 in lucene's branch refs/heads/main from Rushabh Shah [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=fcd98fd3370 ] LUCENE-10266 Move nearest-neighbor search on points to core (#897) Co-authored-by: Rushabh Shah

> Move nearest-neighbor search on points to core?
> ---
>
> Key: LUCENE-10266
> URL: https://issues.apache.org/jira/browse/LUCENE-10266
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Adrien Grand
> Priority: Minor
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> Now that the Points' public API supports running nearest-neighbor search, should we move it to core via helper methods on {{LatLonPoint}} and {{XYPoint}}?
[GitHub] [lucene] jpountz commented on pull request #897: LUCENE-10266 Move nearest-neighbor search on points to core
jpountz commented on PR #897: URL: https://github.com/apache/lucene/pull/897#issuecomment-1154147639

Thanks @shahrs87 !
[GitHub] [lucene] jpountz merged pull request #897: LUCENE-10266 Move nearest-neighbor search on points to core
jpountz merged PR #897: URL: https://github.com/apache/lucene/pull/897
[jira] [Resolved] (LUCENE-10266) Move nearest-neighbor search on points to core?
[ https://issues.apache.org/jira/browse/LUCENE-10266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-10266. --- Fix Version/s: 10.0 (main) Resolution: Fixed
[jira] [Commented] (LUCENE-10078) Enable merge-on-refresh by default?
[ https://issues.apache.org/jira/browse/LUCENE-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553697#comment-17553697 ] Adrien Grand commented on LUCENE-10078: --- As expected, this slowed down refresh latency a bit: http://people.apache.org/~mikemccand/lucenebench/nrt.html I pushed an annotation that should show up in the coming days.
[GitHub] [lucene] gsmiller commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…
gsmiller commented on PR #954: URL: https://github.com/apache/lucene/pull/954#issuecomment-1154150491

@jpountz +1 to testing this. Good call! Since I only tackled a subset of the code accessing `NO_MORE_DOCS`, I think we'll have to wait to clean this up and test though, right?
[GitHub] [lucene] jpountz commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…
jpountz commented on PR #954: URL: https://github.com/apache/lucene/pull/954#issuecomment-1154152474

We'll have to wait to clean up indeed, plus there may be lots of users doing old-style iteration, so we'll need to deprecate and maybe only clean this up in 10.0 or 11.0. But I'm curious about the sort of speedup that this would yield. :)
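A minimal sketch of the count-based iteration style under discussion, assuming the `SortedSetDocValues#docValueCount()` accessor from LUCENE-10598; this illustrates the pattern rather than the PR's actual code:

```java
import java.io.IOException;
import java.util.function.LongConsumer;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.search.DocIdSetIterator;

final class SsdvIterationSketch {
  // New-style iteration: docValueCount() bounds the inner loop, so the
  // per-ordinal sentinel check of the old nextOrd()-until-NO_MORE_ORDS
  // style is no longer needed.
  static void visitOrds(LeafReader reader, String field, LongConsumer consumer)
      throws IOException {
    SortedSetDocValues dv = reader.getSortedSetDocValues(field);
    if (dv == null) {
      return; // no doc values for this field in this segment
    }
    while (dv.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
      for (int i = 0; i < dv.docValueCount(); i++) {
        consumer.accept(dv.nextOrd());
      }
    }
  }
}
```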
[GitHub] [lucene] shahrs87 commented on pull request #897: LUCENE-10266 Move nearest-neighbor search on points to core
shahrs87 commented on PR #897: URL: https://github.com/apache/lucene/pull/897#issuecomment-1154177806

Thank you @jpountz for the review and merge. :)
[jira] [Created] (LUCENE-10613) Clean up outdated NOTICE.txt information concerning morfologik
Dawid Weiss created LUCENE-10613: Summary: Clean up outdated NOTICE.txt information concerning morfologik Key: LUCENE-10613 URL: https://issues.apache.org/jira/browse/LUCENE-10613 Project: Lucene - Core Issue Type: Improvement Reporter: Dawid Weiss
[jira] [Updated] (LUCENE-10613) Clean up outdated NOTICE.txt information concerning morfologik
[ https://issues.apache.org/jira/browse/LUCENE-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated LUCENE-10613: - Fix Version/s: 9.3

> Clean up outdated NOTICE.txt information concerning morfologik
> --
>
> Key: LUCENE-10613
> URL: https://issues.apache.org/jira/browse/LUCENE-10613
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Dawid Weiss
> Priority: Trivial
> Fix For: 9.3
>
> It's been pointed out to me that NOTICE.txt contains information about licensing terms that are outdated with regard to what Lucene uses nowadays. It's a trivial update.
[jira] [Updated] (LUCENE-10613) Clean up outdated NOTICE.txt information concerning morfologik
[ https://issues.apache.org/jira/browse/LUCENE-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated LUCENE-10613: - Description: It's been pointed out to me that NOTICE.txt contains information about licensing terms that are outdated with regard to what Lucene uses nowadays. It's a trivial update.
[GitHub] [lucene] Deepika0510 commented on a diff in pull request #927: LUCENE-10151: Adding Timeout Support to IndexSearcher
Deepika0510 commented on code in PR #927: URL: https://github.com/apache/lucene/pull/927#discussion_r895985779

## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ##
@@ -766,18 +778,29 @@ protected void search(List leaves, Weight weight, Collector c
       }
       BulkScorer scorer = weight.bulkScorer(ctx);
       if (scorer != null) {
-        try {
-          scorer.score(leafCollector, ctx.reader().getLiveDocs());
-        } catch (
-            @SuppressWarnings("unused")
-            CollectionTerminatedException e) {
-          // collection was terminated prematurely
-          // continue with the following leaf
+        if (isTimeoutEnabled) {
+          TimeLimitingBulkScorer timeLimitingBulkScorer =
+              new TimeLimitingBulkScorer(scorer, queryTimeout);
+          try {
+            timeLimitingBulkScorer.score(leafCollector, ctx.reader().getLiveDocs());
+          } catch (
+              @SuppressWarnings("unused")
+              TimeLimitingBulkScorer.TimeExceededException e) {
+            partialResult = true;

Review Comment:
I have used a counting QueryTimeout in the test. So, should we consider providing additional information to the user? And if yes, what should we consider adding to it?
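A counting QueryTimeout like the one mentioned in the comment might look roughly like this. It is a test-only sketch that assumes the QueryTimeout interface exposes `shouldExit()` and `isTimeoutEnabled()`, as the diff above suggests; the actual test helper may differ:

```java
import org.apache.lucene.index.QueryTimeout;

// Trips deterministically after a fixed number of shouldExit() calls, making
// partial-result behavior reproducible in tests. Sketch only.
class CountingQueryTimeout implements QueryTimeout {
  private int remainingChecks;

  CountingQueryTimeout(int allowedChecks) {
    this.remainingChecks = allowedChecks;
  }

  @Override
  public boolean shouldExit() {
    // Returns true once the allowed number of checks has been exhausted.
    return --remainingChecks < 0;
  }

  @Override
  public boolean isTimeoutEnabled() {
    return true;
  }
}
```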
[jira] [Commented] (LUCENE-10613) Clean up outdated NOTICE.txt information concerning morfologik
[ https://issues.apache.org/jira/browse/LUCENE-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553741#comment-17553741 ] ASF subversion and git services commented on LUCENE-10613: -- Commit 67816a9508a21ec7d43f6dbbc951b28bc3de in lucene's branch refs/heads/branch_9x from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=67816a9508a ] LUCENE-10613: Clean up outdated NOTICE.txt information concerning morfologik
[jira] [Resolved] (LUCENE-10613) Clean up outdated NOTICE.txt information concerning morfologik
[ https://issues.apache.org/jira/browse/LUCENE-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss resolved LUCENE-10613. -- Assignee: Dawid Weiss Resolution: Fixed
[jira] [Commented] (LUCENE-10613) Clean up outdated NOTICE.txt information concerning morfologik
[ https://issues.apache.org/jira/browse/LUCENE-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553740#comment-17553740 ] ASF subversion and git services commented on LUCENE-10613: -- Commit 76d418676e86d03dbedd73f917bfedec1d9b3d8c in lucene's branch refs/heads/main from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=76d418676e8 ] LUCENE-10613: Clean up outdated NOTICE.txt information concerning morfologik
[GitHub] [lucene] dweiss commented on a diff in pull request #927: LUCENE-10151: Adding Timeout Support to IndexSearcher
dweiss commented on code in PR #927: URL: https://github.com/apache/lucene/pull/927#discussion_r895987477

## build.gradle: ##
@@ -183,3 +183,5 @@
 apply from: file('gradle/hacks/turbocharge-jvm-opts.gradle')
 apply from: file('gradle/hacks/dummy-outputs.gradle')
 apply from: file('gradle/pylucene/pylucene.gradle')
+
+sourceCompatibility = JavaVersion.VERSION_17

Review Comment:
IntelliJ sometimes adds such things on its own... Please revert this change - it's likely to crash badly with other things concerning source compatibility.
[GitHub] [lucene] shahrs87 commented on a diff in pull request #907: LUCENE-10357 Ghost fields and postings/points
shahrs87 commented on code in PR #907: URL: https://github.com/apache/lucene/pull/907#discussion_r896006974

## lucene/codecs/src/java/org/apache/lucene/codecs/bloom/BloomFilteringPostingsFormat.java: ##
@@ -200,8 +200,8 @@ public Terms terms(String field) throws IOException {
       return delegateFieldsProducer.terms(field);
     } else {
       Terms result = delegateFieldsProducer.terms(field);
-      if (result == null) {
-        return null;
+      if (result == null || result == Terms.EMPTY) {

Review Comment:
Yes, this test case is failing even with this patch: `TestMemoryIndexAgainstDirectory#testRandomQueries`

Reproducible by:
`gradlew :lucene:memory:test --tests "org.apache.lucene.index.memory.TestMemoryIndexAgainstDirectory.testRandomQueries" -Ptests.jvms=8 -Ptests.jvmargs=-XX:TieredStopAtLevel=1 -Ptests.seed=B19145C39C34BD03 -Ptests.gui=false -Ptests.file.encoding=UTF-8`

The underlying reader it is using is MemoryIndex#MemoryIndexReader [here](https://github.com/apache/lucene/blob/main/lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java#L1405). This is the relevant snippet:
```
if (info == null || info.numTokens <= 0) {
  return null;
}
```

Below is the text I copied from the LUCENE-10357 description:

> I fear that this could be a source of bugs, as a caller could be tempted to assume that he would get non-null terms on a FieldInfo that has IndexOptions that are not NONE. Should we introduce a contract that FieldsProducer (resp. PointsReader) must return a non-null instance when postings (resp. points) are indexed?

I don't know in which places I need to do the null check. From the above description, it looks like only in FieldsProducer-related classes, but from my limited understanding, this reader doesn't look like a FieldsProducer. @jpountz please advise.
[GitHub] [lucene] jtibshirani commented on a diff in pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection
jtibshirani commented on code in PR #951: URL: https://github.com/apache/lucene/pull/951#discussion_r896012687

## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java: ##
@@ -92,20 +91,40 @@ public KnnVectorQuery(String field, float[] target, int k, Query filter) {
   public Query rewrite(IndexReader reader) throws IOException {
     TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
-    BitSetCollector filterCollector = null;
+    Weight filterWeight = null;
     if (filter != null) {
-      filterCollector = new BitSetCollector(reader.leaves().size());
       IndexSearcher indexSearcher = new IndexSearcher(reader);
       BooleanQuery booleanQuery =
           new BooleanQuery.Builder()
               .add(filter, BooleanClause.Occur.FILTER)
               .add(new FieldExistsQuery(field), BooleanClause.Occur.FILTER)
               .build();
-      indexSearcher.search(booleanQuery, filterCollector);
+      Query rewritten = indexSearcher.rewrite(booleanQuery);
+      filterWeight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f);
     }
     for (LeafReaderContext ctx : reader.leaves()) {
-      TopDocs results = searchLeaf(ctx, filterCollector);
+      Bits acceptDocs;
+      int cost;
+      if (filterWeight != null) {
+        Scorer scorer = filterWeight.scorer(ctx);
+        if (scorer != null) {
+          DocIdSetIterator iterator = scorer.iterator();
+          if (iterator instanceof BitSetIterator) {
+            acceptDocs = ((BitSetIterator) iterator).getBitSet();
+          } else {
+            acceptDocs = BitSet.of(iterator, ctx.reader().maxDoc());
+          }
+          cost = (int) iterator.cost();

Review Comment:
This changes the meaning of `cost` (which is directly used as `visitedLimit`). Before, we were using the exact number of matches, whereas now we ask the iterator for a cost estimate. These cost estimates are sometimes very imprecise, and I worry it could make query performance unpredictable and harder to understand. I wonder if we could convert everything to a `BitSet` and then use the actual cardinality. Hopefully we could do this while still keeping the nice performance improvement?

## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java: ##
@@ -121,35 +140,15 @@ public Query rewrite(IndexReader reader) throws IOException {
     return createRewrittenQuery(reader, topK);
   }
-  private TopDocs searchLeaf(LeafReaderContext ctx, BitSetCollector filterCollector)
-      throws IOException {
-    if (filterCollector == null) {
-      Bits acceptDocs = ctx.reader().getLiveDocs();
-      return approximateSearch(ctx, acceptDocs, Integer.MAX_VALUE);
+  private TopDocs searchLeaf(LeafReaderContext ctx, Bits acceptDocs, int cost) throws IOException {
+    TopDocs results = approximateSearch(ctx, acceptDocs, cost);

Review Comment:
The new logic here drops this check -- could we make sure to keep it?
```
if (filterIterator.cost() <= k) {
  // If there are <= k possible matches, short-circuit and perform exact search, since HNSW
  // must always visit at least k documents
  return exactSearch(ctx, filterIterator);
}
```
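The suggestion in the first comment (materialize the filter into a BitSet and use its exact cardinality as the visited limit) could look roughly like the following sketch; this illustrates the reviewer's idea, not the patch that was ultimately merged:

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BitSet;
import org.apache.lucene.util.BitSetIterator;

final class FilterBitSetSketch {
  // Reuse the backing bits when the iterator is already bit-set based;
  // otherwise materialize it. Either way, bitSet.cardinality() then gives
  // the exact match count to use as visitedLimit, instead of the
  // iterator's (possibly imprecise) cost() estimate.
  static BitSet materialize(DocIdSetIterator iterator, int maxDoc) throws IOException {
    if (iterator instanceof BitSetIterator) {
      return ((BitSetIterator) iterator).getBitSet();
    }
    return BitSet.of(iterator, maxDoc);
  }
}
```

The exact match count would then be `bitSet.cardinality()`, restoring the precise `visitedLimit` semantics the comment asks for.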
[GitHub] [lucene] gsmiller commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…
gsmiller commented on PR #954: URL: https://github.com/apache/lucene/pull/954#issuecomment-1154550479

@jpountz:

> plus there may be lots of users doing old-style iteration, so we'll need to deprecate and maybe only clean this up in 10.0 or 11.0

Right, makes sense.

> But I'm curious about the sort of speedup that this would yield

Me too. I hacked up a version of this change on another branch ([here](https://github.com/gsmiller/lucene/commit/3768665ae6c173014e8288c46c27afa517c90ede)) that lets calling code explicitly ask for a "fast" version of SSDV that doesn't do the ordinal check, and then relied on this new code path for loading SSDV within the faceting module. I didn't observe a significant change with our benchmark tooling, but I wonder how much we actually exercise these multi-value cases within our faceting benchmarks. I think it will be more interesting to test once we've migrated more use-cases to this new SSDV iteration style. You may have had a completely different thought in mind for testing this, so please let me know if I'm missing the mark here. Thanks!
[GitHub] [lucene] jtibshirani opened a new pull request, #956: Make sure KnnVectorQuery applies search boost
jtibshirani opened a new pull request, #956: URL: https://github.com/apache/lucene/pull/956

Before, the rewritten query DocAndScoreQuery ignored the boost.
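For background, a Lucene `Weight` receives the boost through `IndexSearcher#createWeight(Query, ScoreMode, float)`, so a query that rewrites to a precomputed doc/score list has to fold that boost into the scores it returns. A minimal sketch of the pattern, with illustrative names rather than the actual DocAndScoreQuery internals:

```java
// Precomputed per-result scores must be scaled by the boost the Weight was
// created with; returning them unscaled is the kind of bug fixed here.
// Names are illustrative, not the actual patch.
final class BoostedScores {
  private final float[] scores; // scores computed at rewrite time
  private final float boost;    // boost passed to createWeight()

  BoostedScores(float[] scores, float boost) {
    this.scores = scores;
    this.boost = boost;
  }

  float scoreAt(int i) {
    return scores[i] * boost;
  }
}
```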
[GitHub] [lucene] jtibshirani commented on pull request #956: Make sure KnnVectorQuery applies search boost
jtibshirani commented on PR #956: URL: https://github.com/apache/lucene/pull/956#issuecomment-1154554441

Thank you @mocobeta for leading the change to allow GitHub issues in CHANGES.txt -- this was very convenient to fix compared to before.
[jira] [Created] (LUCENE-10614) Properly support getTopChildren in RangeFacetCounts
Greg Miller created LUCENE-10614: Summary: Properly support getTopChildren in RangeFacetCounts Key: LUCENE-10614 URL: https://issues.apache.org/jira/browse/LUCENE-10614 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Affects Versions: 10.0 (main) Reporter: Greg Miller

As mentioned in LUCENE-10538, {{RangeFacetCounts}} is not implementing {{getTopChildren}}. Instead of returning "top" ranges, it returns all user-provided ranges in the order the user specified them when instantiating. This is probably more useful functionality, but it would be nice to support {{getTopChildren}} as well. LUCENE-10550 is introducing the concept of {{getAllChildren}}, so once that lands, we can replace the current implementation of {{getTopChildren}} with an actual "top children" implementation and direct users to {{getAllChildren}} if they want to maintain the current behavior.
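A usage sketch contrasting the two accessors, assuming the {{getAllChildren}} signature proposed in LUCENE-10550 lands as shown in PR #914 below; the field name and ranges are hypothetical:

{code:java}
import java.io.IOException;
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.range.LongRange;
import org.apache.lucene.facet.range.LongRangeFacetCounts;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

final class RangeFacetsSketch {
  static void countRanges(IndexSearcher searcher, Query query) throws IOException {
    FacetsCollector fc = new FacetsCollector();
    FacetsCollector.search(searcher, query, 10, fc);
    Facets facets =
        new LongRangeFacetCounts(
            "price", // hypothetical field
            fc,
            new LongRange("0-10", 0, true, 10, false),
            new LongRange("10-100", 10, true, 100, false));
    // Current behavior, to be exposed as getAllChildren(): every supplied
    // range, in the order the user declared it (assumed API, see LUCENE-10550).
    FacetResult all = facets.getAllChildren("price");
    // Proposed getTopChildren(): an actual top-N ordering by count.
    FacetResult top = facets.getTopChildren(10, "price");
    System.out.println(all + "\n" + top);
  }
}
{code}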
[GitHub] [lucene] gsmiller commented on a diff in pull request #914: LUCENE-10550: Add getAllChildren functionality to facets
gsmiller commented on code in PR #914: URL: https://github.com/apache/lucene/pull/914#discussion_r896240300

## lucene/facet/src/java/org/apache/lucene/facet/LongValueFacetCounts.java: ##
@@ -346,6 +346,43 @@ private void increment(long value) {
     }
   }

+  @Override
+  public FacetResult getAllChildren(String dim, String... path) throws IOException {
+    if (dim.equals(field) == false) {
+      throw new IllegalArgumentException(
+          "invalid dim \"" + dim + "\"; should be \"" + field + "\"");
+    }
+    if (path.length != 0) {
+      throw new IllegalArgumentException("path.length should be 0");
+    }

Review Comment:
minor: There's some common validation logic between this and `getTopChildren` that you could factor out into a common helper method, if you thought it made sense.

## lucene/facet/src/java/org/apache/lucene/facet/LongValueFacetCounts.java: ##
@@ -346,6 +346,43 @@ private void increment(long value) {
     }
   }

+  @Override
+  public FacetResult getAllChildren(String dim, String... path) throws IOException {
+    if (dim.equals(field) == false) {
+      throw new IllegalArgumentException(
+          "invalid dim \"" + dim + "\"; should be \"" + field + "\"");
+    }
+    if (path.length != 0) {
+      throw new IllegalArgumentException("path.length should be 0");
+    }
+
+    List labelValues = new ArrayList<>();
+    boolean countsAdded = false;
+    if (hashCounts.size() != 0) {
+      for (LongIntCursor c : hashCounts) {
+        int count = c.value;
+        if (count != 0) {
+          if (countsAdded == false && c.key >= counts.length) {
+            countsAdded = true;
+            appendCounts(labelValues);
+          }
+          labelValues.add(new LabelAndValue(Long.toString(c.key), count));
+        }
+      }
+    }
+
+    if (countsAdded == false) {
+      appendCounts(labelValues);
+    }
+
+    return new FacetResult(
+        field,
+        new String[0],
+        totCount,
+        labelValues.toArray(new LabelAndValue[0]),
+        labelValues.size());

Review Comment:
It looks like you're trying to maintain value-sort-order, sort of like what `getAllChildrenSortByValue` is doing, but since we don't make any ordering guarantees, I think we can simplify this a little bit. What do you think of doing something like this?

```suggestion
    List labelValues = new ArrayList<>();
    for (int i = 0; i < counts.length; i++) {
      if (counts[i] != 0) {
        labelValues.add(new LabelAndValue(Long.toString(i), counts[i]));
      }
    }
    if (hashCounts.size() != 0) {
      for (LongIntCursor c : hashCounts) {
        int count = c.value;
        if (count != 0) {
          labelValues.add(new LabelAndValue(Long.toString(c.key), c.value));
        }
      }
    }
    return new FacetResult(
        field,
        new String[0],
        totCount,
        labelValues.toArray(new LabelAndValue[0]),
        labelValues.size());
```

## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FloatTaxonomyFacets.java: ##
@@ -101,6 +102,63 @@ public Number getSpecificValue(String dim, String... path) throws IOException {
     return values[ord];
   }

+  @Override
+  public FacetResult getAllChildren(String dim, String... path) throws IOException {

Review Comment:
Please see my feedback on `IntTaxonomyFacets`. It should all apply here as well. Thanks!

## lucene/facet/src/java/org/apache/lucene/facet/sortedset/AbstractSortedSetDocValueFacetCounts.java: ##
@@ -72,6 +72,40 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
     return createFacetResult(topChildrenForPath, dim, path);
   }

+  @Override
+  public FacetResult getAllChildren(String dim, String... path) throws IOException {
+    FacetsConfig.DimConfig dimConfig = stateConfig.getDimConfig(dim);
+
+    if (dimConfig.hierarchical) {
+      int pathOrd = (int) dv.lookupTerm(new BytesRef(FacetsConfig.pathToString(dim, path)));
+      if (pathOrd < 0) {
+        // path was never indexed
+        return null;
+      }
+      SortedSetDocValuesReaderState.DimTree dimTree = state.getDimTree(dim);

Review Comment:
As a general note, I think we usually prefer to _not_ fully qualify internal class names unless necessary (and prefer to import them directly instead). For example, we're already doing `import org.apache.lucene.facet.sortedset.SortedSetDocValuesReaderState.DimTree;`, so you can just say `DimTree` here instead of `SortedSetDocValuesReaderState.DimTree`. I'm betting your IDE is set up with this style, but something to just keep an eye on.
[GitHub] [lucene] mdmarshmallow commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
mdmarshmallow commented on PR #841: URL: https://github.com/apache/lucene/pull/841#issuecomment-1154629173

I agree with Greg, we should not let benchmarking block releasing this. I pushed a commit to remove the `byte[]` matches API.
[GitHub] [lucene] shaie commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
shaie commented on PR #841: URL: https://github.com/apache/lucene/pull/841#issuecomment-1154701271

I pushed some more cleanups and minor refactoring.
[GitHub] [lucene] LuXugang merged pull request #942: LUCENE-10598: Use count to record docValueCount similar to SortedNumericDocValues did
LuXugang merged PR #942: URL: https://github.com/apache/lucene/pull/942
[jira] [Commented] (LUCENE-10598) SortedSetDocValues#docValueCount() should be always greater than zero
[ https://issues.apache.org/jira/browse/LUCENE-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553895#comment-17553895 ] ASF subversion and git services commented on LUCENE-10598: -- Commit 7504b0a258d3c3209110e6476072b6ca6a2e82ff in lucene's branch refs/heads/main from Lu Xugang [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=7504b0a258d ] LUCENE-10598: Use count to record docValueCount similar to SortedNumericDocValues did (#942)

> SortedSetDocValues#docValueCount() should be always greater than zero
> -
>
> Key: LUCENE-10598
> URL: https://issues.apache.org/jira/browse/LUCENE-10598
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Lu Xugang
> Priority: Major
> Time Spent: 1h
> Remaining Estimate: 0h
>
> This test fails:
> {code:java}
> public void testDocValueCount() throws IOException {
>   try (Directory d = newDirectory()) {
>     try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) {
>       for (int j = 0; j < 1; j++) {
>         Document doc = new Document();
>         doc.add(new SortedSetDocValuesField("field", new BytesRef("a")));
>         doc.add(new SortedSetDocValuesField("field", new BytesRef("a")));
>         doc.add(new SortedSetDocValuesField("field", new BytesRef("b")));
>         w.addDocument(doc);
>       }
>     }
>     try (IndexReader reader = DirectoryReader.open(d)) {
>       assertEquals(1, reader.leaves().size());
>       for (LeafReaderContext leaf : reader.leaves()) {
>         SortedSetDocValues docValues = leaf.reader().getSortedSetDocValues("field");
>         for (int doc1 = docValues.nextDoc();
>             doc1 != DocIdSetIterator.NO_MORE_DOCS;
>             doc1 = docValues.nextDoc()) {
>           assert docValues.docValueCount() > 0;
>         }
>       }
>     }
>   }
> }
> {code}
[jira] [Created] (LUCENE-10615) Add license information for SmartChineseAnalyzer to NOTICE.txt
Jan Dornseifer created LUCENE-10615: --- Summary: Add license information for SmartChineseAnalyzer to NOTICE.txt Key: LUCENE-10615 URL: https://issues.apache.org/jira/browse/LUCENE-10615 Project: Lucene - Core Issue Type: Improvement Components: modules/analysis Reporter: Jan Dornseifer

The Lucene NOTICE file contains the statement "The SmartChineseAnalyzer source code (smartcn) was provided by Xiaoping Gao and copyright 2009 by [www.imdict.net|http://www.imdict.net/]." without providing license information. Can this information be supplemented, or is it simply outdated? We are using Apache Lucene v8.4.1. We are currently subject to a license audit of our software, in which third-party FOSS components are also checked for usage. Among other things, this part came to our attention. I would be very grateful for information.