[jira] [Created] (LUCENE-10396) Automatically create sparse indexes for sort fields
Adrien Grand created LUCENE-10396: - Summary: Automatically create sparse indexes for sort fields Key: LUCENE-10396 URL: https://issues.apache.org/jira/browse/LUCENE-10396 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Attachments: sorted_conjunction.png

On Elasticsearch we're more and more leveraging index sorting not as a way to early-terminate sorted queries, but as a way to cluster doc IDs that share similar properties so that queries can take advantage of it. For instance, imagine you're maintaining a catalog of cars for sale, sorted by car type, then fuel type, then price. All cars with the same type, fuel type and similar prices will then be stored in a contiguous range of doc IDs.

Without index sorting, conjunctions across these 3 fields would be almost a worst-case scenario, as every clause might match lots of documents while their intersection might not. With index sorting enabled, however, only a very small number of calls to advance() lead to doc IDs that do not match, because those advance() calls always jump over a large number of doc IDs. I created an example for ApacheCon last year that demonstrates the benefits of index sorting on conjunctions. In both cases the index stores the same data; it just gets a different doc ID ordering thanks to index sorting:

!sorted_conjunction.png!

While index sorting can help improve query efficiency out of the box, there is a lot more we can do by taking advantage of the index sort explicitly. For instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range queries on fields that are primary sort fields by performing a binary search to identify the first and last documents that match the range query.
I would like to introduce [sparse indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields that are used for index sorting, with the goal of improving the runtime of {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O-intensive, and of making it easier and more efficient to leverage index sorting to filter on subsequent sort fields. A simple form of a sparse index could consist of storing every N-th value of the fields that are used for index sorting.

In terms of implementation, sparse indexing should be cheap enough that we wouldn't need to make it configurable and could enable it automatically as soon as index sorting is enabled. It would also get its own file format abstraction.

-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
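The "store every N-th value" idea can be sketched in plain Java, with arrays standing in for doc values and all names hypothetical: sample the index-sorted field at a fixed interval, then binary-search the samples so a lookup only has to scan a window of at most N doc IDs.

```java
import java.util.Arrays;

// Hypothetical sketch of a sparse index over an index-sorted field:
// only every N-th value is stored, so a lookup is narrowed to a small
// window of doc IDs that is then scanned exactly.
public class SparseIndexSketch {
  private final long[] samples; // samples[i] == value of doc (i * interval)
  private final int interval;

  public SparseIndexSketch(long[] sortedValues, int interval) {
    this.interval = interval;
    int n = (sortedValues.length + interval - 1) / interval;
    samples = new long[n];
    for (int i = 0; i < n; i++) {
      samples[i] = sortedValues[i * interval];
    }
  }

  // First doc ID that may hold a value >= target; off by at most `interval` docs.
  public int firstCandidate(long target) {
    int idx = Arrays.binarySearch(samples, target);
    if (idx < 0) {
      idx = -idx - 1; // insertion point
    }
    // The true first match can only lie in the sample block before the insertion point.
    return Math.max(0, (idx - 1) * interval);
  }

  public static void main(String[] args) {
    long[] values = {1, 3, 3, 7, 9, 12, 15, 15, 20, 25}; // doc values in index-sort order
    SparseIndexSketch index = new SparseIndexSketch(values, 4); // samples: 1, 9, 20
    System.out.println(index.firstCandidate(12)); // 4: scan from doc 4; first real match is doc 5
  }
}
```

Because only every N-th value is kept, the structure stays small and cheap to write, which is what would make it reasonable to enable automatically whenever index sorting is on.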
[GitHub] [lucene] javanna commented on pull request #622: LUCENE-10395: Introduce TotalHitCountCollectorManager
javanna commented on pull request #622: URL: https://github.com/apache/lucene/pull/622#issuecomment-1025703099 This should be ready now ;) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10385) Implement Weight#count on IndexSortSortedNumericDocValuesRangeQuery.
[ https://issues.apache.org/jira/browse/LUCENE-10385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484648#comment-17484648 ] Luca Cavanna commented on LUCENE-10385: --- Leaving a note only to say that I would like to take this issue. I will open a PR soon. > Implement Weight#count on IndexSortSortedNumericDocValuesRangeQuery. > > > Key: LUCENE-10385 > URL: https://issues.apache.org/jira/browse/LUCENE-10385 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > > This query can count matches by computing the first and last matching doc IDs > using binary search.
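The counting trick quoted in this issue — derive the match count from the first and last matching doc IDs found by binary search — can be sketched in plain Java. Arrays stand in for the index-sorted doc values; the names are hypothetical, not Lucene's API.

```java
public class RangeCountSketch {
  // First index i with values[i] >= target (classic lower bound on a sorted array).
  public static int lowerBound(long[] values, long target) {
    int lo = 0, hi = values.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (values[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  // Count of docs whose value lies in [min, max], in O(log n) rather than O(n).
  public static int count(long[] sortedValues, long min, long max) {
    return lowerBound(sortedValues, max + 1) - lowerBound(sortedValues, min);
  }

  public static void main(String[] args) {
    long[] values = {1, 3, 3, 7, 9, 12, 15, 15, 20, 25}; // index-sorted field values
    System.out.println(count(values, 3, 15)); // 7: docs with values 3,3,7,9,12,15,15
  }
}
```

This is why a Weight#count implementation based on it runs in sub-linear time rather than visiting every match.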
[GitHub] [lucene] jpountz merged pull request #622: LUCENE-10395: Introduce TotalHitCountCollectorManager
jpountz merged pull request #622: URL: https://github.com/apache/lucene/pull/622
[jira] [Commented] (LUCENE-10395) Add support for TotalHitCountCollectorManager
[ https://issues.apache.org/jira/browse/LUCENE-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484687#comment-17484687 ] ASF subversion and git services commented on LUCENE-10395: -- Commit df12e2b195b576481d8588ae67c4155bbac189fe in lucene's branch refs/heads/main from Luca Cavanna [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=df12e2b ] LUCENE-10395: Introduce TotalHitCountCollectorManager (#622) > Add support for TotalHitCountCollectorManager > - > > Key: LUCENE-10395 > URL: https://issues.apache.org/jira/browse/LUCENE-10395 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Luca Cavanna >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > With LUCENE-10002 we are looking at replacing usages of > IndexSearcher#search(Query, Collector) with the corresponding method that > takes a CollectorManager instead of a Collector. As part of this effort we > would like to add support for a collector manager that encapsulates > TotalHitCountCollector which is commonly used in our tests and possibly by > users. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
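The Collector-to-CollectorManager move behind this change can be illustrated with a plain-Java stand-in (hypothetical names, not Lucene's actual interfaces): the manager creates one hit-counting collector per index slice and merges the per-slice totals in reduce(), which is what lets searches run slices concurrently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the Collector/CollectorManager split:
// one HitCounter per index slice, merged into a total at the end.
public class HitCountManagerSketch {
  public static class HitCounter {
    public int hits;

    public void collect(int doc) {
      hits++;
    }
  }

  // Called once per slice, possibly from different threads.
  public HitCounter newCollector() {
    return new HitCounter();
  }

  // Merges the per-slice results into the final count.
  public int reduce(List<HitCounter> collectors) {
    int total = 0;
    for (HitCounter c : collectors) {
      total += c.hits;
    }
    return total;
  }

  public static void main(String[] args) {
    HitCountManagerSketch manager = new HitCountManagerSketch();
    List<HitCounter> perSlice = new ArrayList<>();
    int[][] slices = {{0, 1, 2}, {5, 8}, {9}}; // matching doc IDs per slice
    for (int[] slice : slices) {
      HitCounter c = manager.newCollector();
      for (int doc : slice) {
        c.collect(doc);
      }
      perSlice.add(c);
    }
    System.out.println(manager.reduce(perSlice)); // 6
  }
}
```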
[GitHub] [lucene] jpountz commented on a change in pull request #541: LUCENE-10315: Speed up BKD leaf block ids codec by a 512 ints ForUtil
jpountz commented on a change in pull request #541: URL: https://github.com/apache/lucene/pull/541#discussion_r795686801

## File path: lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java

@@ -24,89 +24,73 @@
 import org.apache.lucene.util.DocBaseBitSetIterator;
 import org.apache.lucene.util.FixedBitSet;
 
-class DocIdsWriter {
+final class DocIdsWriter {
 
-  private DocIdsWriter() {}
+  private static final byte CONTINUOUS_IDS = (byte) -2;
+  private static final byte BITSET_IDS = (byte) -1;
+  private static final byte DELTA_FOR_UTIL = (byte) 32 + 16;
+  private static final byte BPV_24 = (byte) 32 + 24;
+  private static final byte BPV_32 = (byte) 32;
+  // These signs are legacy, should no longer be used in the writing side.
+  private static final byte LEGACY_DELTA_VINT = (byte) 0;
+  private static final byte LEGACY_BPV_24 = (byte) 24;
 
-  static void writeDocIds(int[] docIds, int start, int count, DataOutput out) throws IOException {
+  private final BKDForUtil forUtil;
+  private final int[] scratch;
+
+  DocIdsWriter(int maxPointsInLeaf) {
+    scratch = new int[maxPointsInLeaf];
+    forUtil = new BKDForUtil(maxPointsInLeaf);
+  }
+
+  void writeDocIds(int[] docIds, int start, int count, DataOutput out) throws IOException {
     // docs can be sorted either when all docs in a block have the same value
     // or when a segment is sorted
-    boolean sorted = true;
     boolean strictlySorted = true;
+    int min = docIds[0];
+    int max = docIds[0];
     for (int i = 1; i < count; ++i) {
       int last = docIds[start + i - 1];
       int current = docIds[start + i];
-      if (last > current) {
-        sorted = strictlySorted = false;
-        break;
-      } else if (last == current) {
+      if (last >= current) {
         strictlySorted = false;
       }
+      min = Math.min(min, current);
+      max = Math.max(max, current);
     }
-    int min2max = docIds[start + count - 1] - docIds[start] + 1;
+    int min2max = max - min + 1;
     if (strictlySorted) {
       if (min2max == count) {
         // continuous ids, typically happens when segment is sorted
-        out.writeByte((byte) -2);
+        out.writeByte(CONTINUOUS_IDS);
         out.writeVInt(docIds[start]);
         return;
       } else if (min2max <= (count << 4)) {
         assert min2max > count : "min2max: " + min2max + ", count: " + count;
         // Only trigger bitset optimization when max - min + 1 <= 16 * count in order to avoid
         // expanding too much storage.
         // A field with lower cardinality will have higher probability to trigger this optimization.
-        out.writeByte((byte) -1);
+        out.writeByte(BITSET_IDS);
         writeIdsAsBitSet(docIds, start, count, out);
         return;
       }
     }
-    if (sorted) {
-      out.writeByte((byte) 0);
-      int previous = 0;
-      for (int i = 0; i < count; ++i) {
-        int doc = docIds[start + i];
-        out.writeVInt(doc - previous);
-        previous = doc;
+
+    if (min2max <= 0xL) {

Review comment:
```suggestion
    if (min2max <= 0x) {
```
[jira] [Commented] (LUCENE-10395) Add support for TotalHitCountCollectorManager
[ https://issues.apache.org/jira/browse/LUCENE-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484693#comment-17484693 ] ASF subversion and git services commented on LUCENE-10395: -- Commit 0c97682cdc1564a0eba9602de138a7750c561e42 in lucene's branch refs/heads/branch_9x from Luca Cavanna [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=0c97682 ] LUCENE-10395: Introduce TotalHitCountCollectorManager (#622) > Add support for TotalHitCountCollectorManager > - > > Key: LUCENE-10395 > URL: https://issues.apache.org/jira/browse/LUCENE-10395 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Luca Cavanna >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > With LUCENE-10002 we are looking at replacing usages of > IndexSearcher#search(Query, Collector) with the corresponding method that > takes a CollectorManager instead of a Collector. As part of this effort we > would like to add support for a collector manager that encapsulates > TotalHitCountCollector which is commonly used in our tests and possibly by > users. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10395) Add support for TotalHitCountCollectorManager
[ https://issues.apache.org/jira/browse/LUCENE-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-10395. --- Fix Version/s: 9.1 Resolution: Fixed > Add support for TotalHitCountCollectorManager > - > > Key: LUCENE-10395 > URL: https://issues.apache.org/jira/browse/LUCENE-10395 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Luca Cavanna >Priority: Minor > Fix For: 9.1 > > Time Spent: 20m > Remaining Estimate: 0h > > With LUCENE-10002 we are looking at replacing usages of > IndexSearcher#search(Query, Collector) with the corresponding method that > takes a CollectorManager instead of a Collector. As part of this effort we > would like to add support for a collector manager that encapsulates > TotalHitCountCollector which is commonly used in our tests and possibly by > users. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10390) Improve Weight#count javadocs
[ https://issues.apache.org/jira/browse/LUCENE-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484703#comment-17484703 ] Luca Cavanna commented on LUCENE-10390: --- [~jpountz] I believe this one can be closed as the corresponding PR was merged. > Improve Weight#count javadocs > - > > Key: LUCENE-10390 > URL: https://issues.apache.org/jira/browse/LUCENE-10390 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Luca Cavanna >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > I started working on LUCENE-10385 and chatting with [~jpountz] we said that > maybe the javadocs around Weight#count could be clarified around when it > should be implemented. It currently mentions that the method should only be > overridden by sub-classes that are able to compute/retrieve the count in O(1) > time. The intention is that computing the count should be fast and have a > much smaller overhead compared to collecting all matches. Possibly the > requirement could be relaxed to sub-linear time instead of strictly linear > time? In fact, in LUCENE-10385 we would use binary search to compute the > count which does not execute in linear time but rather O(log N) which though > seems to be acceptable. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10390) Improve Weight#count javadocs
[ https://issues.apache.org/jira/browse/LUCENE-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-10390. --- Fix Version/s: 9.1 Resolution: Fixed Thanks [~lucacavanna] ! > Improve Weight#count javadocs > - > > Key: LUCENE-10390 > URL: https://issues.apache.org/jira/browse/LUCENE-10390 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Luca Cavanna >Priority: Minor > Fix For: 9.1 > > Time Spent: 20m > Remaining Estimate: 0h > > I started working on LUCENE-10385 and chatting with [~jpountz] we said that > maybe the javadocs around Weight#count could be clarified around when it > should be implemented. It currently mentions that the method should only be > overridden by sub-classes that are able to compute/retrieve the count in O(1) > time. The intention is that computing the count should be fast and have a > much smaller overhead compared to collecting all matches. Possibly the > requirement could be relaxed to sub-linear time instead of strictly linear > time? In fact, in LUCENE-10385 we would use binary search to compute the > count which does not execute in linear time but rather O(log N) which though > seems to be acceptable. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mayya-sharipova merged pull request #616: LUCENE-9573 Add Vectors to TestBackwardsCompatibility
mayya-sharipova merged pull request #616: URL: https://github.com/apache/lucene/pull/616
[jira] [Commented] (LUCENE-9573) Add back compat tests for VectorFormat to TestBackwardsCompatibility
[ https://issues.apache.org/jira/browse/LUCENE-9573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484711#comment-17484711 ] ASF subversion and git services commented on LUCENE-9573: - Commit 8dfdb261e76db9db3d6e56f2e8bb0aac5b6871c8 in lucene's branch refs/heads/main from Mayya Sharipova [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8dfdb26 ] LUCENE-9573 Add Vectors to TestBackwardsCompatibility (#616) This patch adds KNN vectors for testing backward compatible indices - Add a KnnVectorField to documents when creating a new backward compatible index - Add knn vectors search and check for vector values to the testing of search of backward compatible indices - Add tests for knn vector search when changing backward compatible indices (merging them and adding new documents to them) > Add back compat tests for VectorFormat to TestBackwardsCompatibility > > > Key: LUCENE-9573 > URL: https://issues.apache.org/jira/browse/LUCENE-9573 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael Sokolov >Assignee: Mayya Sharipova >Priority: Blocker > Fix For: 9.1 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > In LUCENE-9322 we add a new VectorFormat to the index. This issue is about > adding backwards compatibility tests for it once the index format has > crystallized into its 9.0 form -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10397) KnnVectorQuery doesn't tie break by doc ID
Adrien Grand created LUCENE-10397: - Summary: KnnVectorQuery doesn't tie break by doc ID Key: LUCENE-10397 URL: https://issues.apache.org/jira/browse/LUCENE-10397 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand

I was expecting KnnVectorQuery to tie-break by doc ID, so that if multiple documents get the same score then the ones that have the lowest doc IDs would get returned first, similarly to how SortField.SCORE also tie-breaks by doc ID. However the following test fails, suggesting that this is not the case.

{code:java}
public void testTieBreak() throws IOException {
  try (Directory d = newDirectory()) {
    try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) {
      for (int j = 0; j < 5; j++) {
        Document doc = new Document();
        doc.add(
            new KnnVectorField("field", new float[] {0, 1}, VectorSimilarityFunction.DOT_PRODUCT));
        w.addDocument(doc);
      }
    }
    try (IndexReader reader = DirectoryReader.open(d)) {
      assertEquals(1, reader.leaves().size());
      IndexSearcher searcher = new IndexSearcher(reader);
      KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, 3}, 3);
      TopDocs topHits = searcher.search(query, 3);
      assertEquals(0, topHits.scoreDocs[0].doc);
      assertEquals(1, topHits.scoreDocs[1].doc);
      assertEquals(2, topHits.scoreDocs[2].doc);
    }
  }
}
{code}
[jira] [Updated] (LUCENE-10397) KnnVectorQuery doesn't tie break by doc ID
[ https://issues.apache.org/jira/browse/LUCENE-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-10397: -- Issue Type: Bug (was: Task) > KnnVectorQuery doesn't tie break by doc ID > -- > > Key: LUCENE-10397 > URL: https://issues.apache.org/jira/browse/LUCENE-10397 > Project: Lucene - Core > Issue Type: Bug >Reporter: Adrien Grand >Priority: Minor > > I was expecting KnnVectorQUery to tie-break by doc ID so that if multiple > documents get the same score then the ones that have the lowest doc ID would > get returned first, similarly to how SortField.SCORE also tie-breaks by doc > ID. > However the following test fails, suggesting that it is not the case. > {code:java} > public void testTieBreak() throws IOException { > try (Directory d = newDirectory()) { > try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) { > for (int j = 0; j < 5; j++) { > Document doc = new Document(); > doc.add( > new KnnVectorField("field", new float[] {0, 1}, > VectorSimilarityFunction.DOT_PRODUCT)); > w.addDocument(doc); > } > } > try (IndexReader reader = DirectoryReader.open(d)) { > assertEquals(1, reader.leaves().size()); > IndexSearcher searcher = new IndexSearcher(reader); > KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, > 3}, 3); > TopDocs topHits = searcher.search(query, 3); > assertEquals(0, topHits.scoreDocs[0].doc); > assertEquals(1, topHits.scoreDocs[1].doc); > assertEquals(2, topHits.scoreDocs[2].doc); > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
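The expected behavior — equal scores resolved in favor of the lower doc ID, as SortField.SCORE does — comes down to the comparator used by the top-k heap. A plain-Java sketch (not KnnVectorQuery's actual internals; names are hypothetical): order the min-heap so that among equal scores the higher doc ID sits at the head and is evicted first.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

public class TieBreakSketch {
  // Select the top-k docs by score; equal scores tie-break toward lower doc IDs.
  public static int[] topKDocs(float[] scoreByDoc, int k) {
    // Min-heap whose head is the current *worst* hit: lower score first,
    // and among equal scores the higher doc ID first, so it gets evicted first.
    PriorityQueue<int[]> heap =
        new PriorityQueue<>(
            (a, b) ->
                scoreByDoc[a[0]] != scoreByDoc[b[0]]
                    ? Float.compare(scoreByDoc[a[0]], scoreByDoc[b[0]])
                    : Integer.compare(b[0], a[0]));
    for (int doc = 0; doc < scoreByDoc.length; doc++) {
      heap.offer(new int[] {doc});
      if (heap.size() > k) {
        heap.poll(); // evict the worst hit
      }
    }
    return heap.stream().mapToInt(e -> e[0]).sorted().toArray();
  }

  public static void main(String[] args) {
    // Five docs with identical scores, as in the failing test: expect docs 0, 1, 2.
    System.out.println(Arrays.toString(topKDocs(new float[] {1f, 1f, 1f, 1f, 1f}, 3)));
  }
}
```

Without the doc-ID clause in the comparator, which of the tied docs survive depends on heap insertion order, which is the unstable behavior the test exposes.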
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484759#comment-17484759 ] Joel Bernstein commented on LUCENE-10382: - I'm going to start the brute force implementation inside of KnnVectorQuery soon. My plan is to advance through the VectorValues using the BitsFilter and score the vectors with the KnnVectorField.vectorSimilarityFunction. If there is code available that does this already let me know. > Allow KnnVectorQuery to operate over a subset of liveDocs > - > > Key: LUCENE-10382 > URL: https://issues.apache.org/jira/browse/LUCENE-10382 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 9.0 >Reporter: Joel Bernstein >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > Currently the KnnVectorQuery selects the top K vectors from all live docs. > This ticket will change the interface to make it possible for the top K > vectors to be selected from a subset of the live docs. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484759#comment-17484759 ] Joel Bernstein edited comment on LUCENE-10382 at 1/31/22, 4:19 PM: --- I'm going to start the brute force implementation inside of KnnVectorQuery soon. My plan is to advance through the VectorValues using the bitsFilter and score the vectors with the KnnVectorField.vectorSimilarityFunction. If there is code available that does this already let me know. was (Author: joel.bernstein): I'm going to start the brute force implementation inside of KnnVectorQuery soon. My plan is to advance through the VectorValues using the BitsFilter and score the vectors with the KnnVectorField.vectorSimilarityFunction. If there is code available that does this already let me know. > Allow KnnVectorQuery to operate over a subset of liveDocs > - > > Key: LUCENE-10382 > URL: https://issues.apache.org/jira/browse/LUCENE-10382 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 9.0 >Reporter: Joel Bernstein >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > Currently the KnnVectorQuery selects the top K vectors from all live docs. > This ticket will change the interface to make it possible for the top K > vectors to be selected from a subset of the live docs. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484759#comment-17484759 ] Joel Bernstein edited comment on LUCENE-10382 at 1/31/22, 4:20 PM: --- I'm going to start the brute force implementation inside of KnnVectorQuery soon. My plan is to advance through the VectorValues using the bitsFilter and score the vectors with the KnnVectorField.vectorSimilarityFunction. If there is code available that does this already let me know. Adding a brute force flag to KnnVectorQuery might be a good idea as well. was (Author: joel.bernstein): I'm going to start the brute force implementation inside of KnnVectorQuery soon. My plan is to advance through the VectorValues using the bitsFilter and score the vectors with the KnnVectorField.vectorSimilarityFunction. If there is code available that does this already let me know. > Allow KnnVectorQuery to operate over a subset of liveDocs > - > > Key: LUCENE-10382 > URL: https://issues.apache.org/jira/browse/LUCENE-10382 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 9.0 >Reporter: Joel Bernstein >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > Currently the KnnVectorQuery selects the top K vectors from all live docs. > This ticket will change the interface to make it possible for the top K > vectors to be selected from a subset of the live docs. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
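The brute-force plan described in this thread — advance through the vector values, skip docs excluded by the filter bits, and score each candidate with the field's similarity function — could look roughly like this in plain Java. Dot product stands in for KnnVectorField's similarity function, a boolean[] stands in for the filter bits, and all names are hypothetical.

```java
public class BruteForceKnnSketch {
  // Stand-in for VectorSimilarityFunction.DOT_PRODUCT.
  public static float dotProduct(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // Return the best-scoring doc among those whose filter bit is set;
  // ties go to the lower doc ID because later equal scores don't replace the best.
  public static int bestFilteredDoc(float[][] vectors, boolean[] filterBits, float[] query) {
    int best = -1;
    float bestScore = Float.NEGATIVE_INFINITY;
    for (int doc = 0; doc < vectors.length; doc++) {
      if (!filterBits[doc]) {
        continue; // skip docs outside the filter
      }
      float score = dotProduct(vectors[doc], query);
      if (score > bestScore) {
        bestScore = score;
        best = doc;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    float[][] vectors = {{1, 0}, {0, 1}, {1, 1}};
    boolean[] filter = {true, false, true}; // doc 1 excluded by the filter
    System.out.println(bestFilteredDoc(vectors, filter, new float[] {0, 2})); // 2
  }
}
```

A real implementation would keep a top-k heap rather than a single best doc, but the advance-filter-score loop is the core of the brute-force path.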
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
[ https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484830#comment-17484830 ] Adrien Grand commented on LUCENE-10396: --- One interesting question about this will be the read API. What I have in mind for now looks something like this:

{code:java}
class SparseIndexProducer {
  DocIdRangeIterator getIterator();
}

class DocIdRange {
  int first, last;
}

class DocIdRangeIterator {

  /**
   * Return a range of doc IDs that may contain {@code value} as the value of the sortPos-th sort
   * field. All documents between target included and {@code range.first} excluded are guaranteed
   * to not contain {@code value}. {@code range.first} equal to NO_MORE_DOCS is used to signal that
   * no more documents may contain {@code value}.
   */
  DocIdRange nextRangeNumeric(int target, int sortPos, long value);

  /** Likewise for SORTED doc values. */
  DocIdRange nextRangeSorted(int target, int sortPos, long ord);

  /** Likewise for BINARY doc values. */
  DocIdRange nextRangeBinary(int target, int sortPos, BytesRef binaryValue);
}
{code}

> Automatically create sparse indexes for sort fields > --- > > Key: LUCENE-10396 > URL: https://issues.apache.org/jira/browse/LUCENE-10396 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Attachments: sorted_conjunction.png > > > On Elasticsearch we're more and more leveraging index sorting not as a way to > be able to early terminate sorted queries, but as a way to cluster doc IDs > that share similar properties so that queries can take advantage of it. For > instance imagine you're maintaining a catalog of cars for sale, by sorting by > car type, then fuel type then price. Then all cars with the same type, fuel > type and similar prices will be stored in a contiguous range of doc IDs. > Without index sorting, conjunctions across these 3 fields would be almost a > worst-case scenario as every clause might match lots of documents while their > intersection might not. With index sorting enabled however, there's only a > very small number of calls to advance() that would lead to doc IDs that do > not match, because these advance() calls that do not lead to a match would > always jump over a large number of doc IDs. I had created that example for > ApacheCon last year that demonstrates the benefits of index sorting on > conjunctions. In both cases, the index is storing the same data, it just gets > different doc ID ordering thanks to index sorting: > !sorted_conjunction.png! > While index sorting can help improve query efficiency out-of-the-box, there > is a lot more we can do by taking advantage of the index sort explicitly. For > instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range > queries on fields that are primary sort fields by performing a binary search > to identify the first and last documents that match the range query. I would > like to introduce [sparse > indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields > that are used for index sorting, with the goal of improving the runtime of > {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive > and making it easier and more efficient to leverage index sorting to filter > on subsequent sort fields. A simple form of a sparse index could consist of > storing every N-th values of the fields that are used for index sorting. > In terms of implementation, sparse indexing should be cheap enough that we > wouldn't need to make it configurable and could enable it automatically as > soon as index sorting is enabled. And it would get its own file format > abstraction.
[GitHub] [lucene] gsmiller commented on pull request #632: LUCENE-10050 Remove DrillSideways#search(DrillDownQuery,Collector) in favor of DrillSideways#search(DrillDownQuery,CollectorManager)
gsmiller commented on pull request #632: URL: https://github.com/apache/lucene/pull/632#issuecomment-1026319472 Thanks for tackling this @gautamworah96! I'm hoping to dig into this PR tomorrow. Just letting you know I saw it and am planning to have a look. Thanks again!
[jira] [Created] (LUCENE-10398) Add static method for getting Terms from LeafReader
Marc D'Mello created LUCENE-10398: - Summary: Add static method for getting Terms from LeafReader Key: LUCENE-10398 URL: https://issues.apache.org/jira/browse/LUCENE-10398 Project: Lucene - Core Issue Type: Improvement Reporter: Marc D'Mello Hi all, {{LeafReader}} has methods like {{getBinaryDocValues(String field)}} that return {{null}} values if the field is not indexed. These methods also have equivalent {{DocValues}} static methods, such as {{DocValues.getBinary()}}, which return an {{emptyBinary()}} rather than a {{null}} if there is no field. I noticed that {{Terms}} does not have an equivalent static method for {{LeafReader.terms()}} like {{Terms.terms()}} or something similar. I was wondering if there was a reason for this, or if a method like this could be useful. Thanks!
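The DocValues-style convention being asked about — a static accessor that substitutes a shared empty instance for null, so callers never null-check — looks like this in a plain-Java sketch. A Map stands in for the LeafReader and a List of terms for the Terms object; the names are hypothetical, not Lucene's API.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class EmptyInsteadOfNullSketch {
  // Shared immutable empty instance, analogous to an "empty Terms" singleton.
  public static final List<String> EMPTY_TERMS = Collections.emptyList();

  // Like DocValues.getBinary(): never returns null, so callers can skip the null check.
  public static List<String> getTerms(Map<String, List<String>> reader, String field) {
    List<String> terms = reader.get(field); // may be null if the field was not indexed
    return terms == null ? EMPTY_TERMS : terms;
  }

  public static void main(String[] args) {
    Map<String, List<String>> reader = Map.of("title", List.of("lucene", "search"));
    System.out.println(getTerms(reader, "missing").size()); // 0, not a NullPointerException
  }
}
```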
[GitHub] [lucene] zacharymorn commented on pull request #418: LUCENE-10061: Implements dynamic pruning support for CombinedFieldsQuery
zacharymorn commented on pull request #418: URL: https://github.com/apache/lucene/pull/418#issuecomment-1026512709

Hi @jpountz, just want to check again to see if I have addressed your concern with the utility approach? To better illustrate the potential issue I'm considering with the higher-level abstraction approach, I've copied over some key code snippets here. If we were to extract out `SynonymImpactsSource` (from `SynonymQuery`), it may look something like this:

```java
public class SynonymImpactsSource implements ImpactsSource {
  private final ImpactsEnum[] impactsEnums;
  private final Impacts[] impacts;
  private final float[] boosts;
  private Impacts lead;

  public SynonymImpactsSource(ImpactsEnum[] impactsEnums, float[] boosts) {
    ...
  }

  @Override
  public Impacts getImpacts() throws IOException {
    // Use the impacts that have the lower next boundary as a lead.
    // It will decide on the number of levels and the block boundaries.
    if (lead == null) {
      Impacts tmpLead = null;
      for (int i = 0; i < impactsEnums.length; ++i) {
        impacts[i] = impactsEnums[i].getImpacts();
        if (tmpLead == null || impacts[i].getDocIdUpTo(0) < tmpLead.getDocIdUpTo(0)) {
          tmpLead = impacts[i];
        }
      }
      lead = tmpLead;
    }
    return new Impacts() {
      @Override
      public int numLevels() {
        // Delegate to the lead
        return lead.numLevels();
      }

      @Override
      public int getDocIdUpTo(int level) {
        // Delegate to the lead
        return lead.getDocIdUpTo(level);
      }

      @Override
      public List<Impact> getImpacts(int level) {
        final int docIdUpTo = getDocIdUpTo(level);
        return ImpactsMergingUtils.mergeImpactsPerField(
            impactsEnums, impacts, boosts, docIdUpTo, false);
      }
    };
  }

  @Override
  public void advanceShallow(int target) throws IOException {
    for (ImpactsEnum impactsEnum : impactsEnums) {
      if (impactsEnum.docID() < target) {
        impactsEnum.advanceShallow(target);
      }
    }
  }

  public Impacts[] impacts() {
    return impacts;
  }
}
```

However, the above `SynonymImpactsSource#getImpacts(level)` may not be directly usable for `CombinedFieldsQuery`, as illustrated below:

```java
Impacts CombinedFieldsQuery$ImpactsSource#getImpacts() {
  for (entry : Map) {
    // This is potentially problematic: this method only takes in level info, and it
    // internally looks up a field-specific docIdUpTo to get valid impacts (see the
    // definition above). For CombinedFieldsQuery, however, docIdUpTo may differ
    // across fields at the same level.
    combine SynonymImpactsSource.getImpacts(level)
  }
  return combined Impacts
}
```

Hence we may actually need to change some APIs if we were to make `SynonymImpactsSource` work for the `CombinedFieldsQuery` use case?
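The docIdUpTo mismatch described in the second snippet can be shown with a toy example. This is a self-contained illustration, not Lucene code: `boundaries` stands in for one field's per-level impact block boundaries.

```java
public class ImpactsLevelMismatch {
  /** Smallest level whose block boundary covers the target doc ID, or -1 if none does. */
  static int levelFor(int[] boundaries, int target) {
    for (int level = 0; level < boundaries.length; level++) {
      if (boundaries[level] >= target) {
        return level;
      }
    }
    return -1; // no level reaches this far
  }
}
```

With boundaries {128, 4096, 65536} for one field and {100, 1000, 100000} for another, the same target doc ID maps to different levels per field, so a shared `level` argument cannot select consistent docIdUpTo cut-offs across fields; an explicit docIdUpTo would.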
[jira] [Comment Edited] (LUCENE-10396) Automatically create sparse indexes for sort fields
[ https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484830#comment-17484830 ] Adrien Grand edited comment on LUCENE-10396 at 2/1/22, 7:31 AM:

One interesting question about this will be the read API. What I have in mind for now looks something like this:

{code:java}
class SparseIndexProducer {
  /**
   * Get an iterator over ranges of doc IDs that may contain values that fall
   * in the given range of values.
   */
  DocIdRangeIterator getNumeric(FieldInfo field, long rangeMin, long rangeMax);

  DocIdRangeIterator getSorted(FieldInfo field, long rangeMinOrd, long rangeMaxOrd);

  DocIdRangeIterator getBinary(FieldInfo field, BytesRef rangeMin, BytesRef rangeMax);
}

class DocIdRange {
  int first, last;
}

class DocIdRangeIterator {
  /**
   * Return the next range of doc IDs so that range.first is greater than or
   * equal to the target.
   */
  DocIdRange advance(int target);
}
{code}

was (Author: jpountz):
One interesting question about this will be the read API. What I have in mind for now looks something like this:

{code:java}
class SparseIndexProducer {
  DocIdRangeIterator getIterator();
}

class DocIdRange {
  int first, last;
}

class DocIdRangeIterator {
  /**
   * Return a range of doc IDs that may contain {@code value} as the value of
   * the sortPos-th sort field. All documents between target included and
   * {@code range.first} excluded are guaranteed to not contain {@code value}.
   * {@code range.first} equal to NO_MORE_DOCS is used to signal that no more
   * documents may contain {@code value}.
   */
  DocIdRange nextRangeNumeric(int target, int sortPos, long value);

  /** Likewise for SORTED doc values. */
  DocIdRange nextRangeSorted(int target, int sortPos, long ord);

  /** Likewise for BINARY doc values. */
  DocIdRange nextRangeBinary(int target, int sortPos, BytesRef binaryValue);
}
{code}
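A minimal sketch of the "store every N-th value" idea, in plain Java rather than against Lucene's APIs: values are assumed to be in doc-ID order (i.e. the field is the primary sort field), and `candidateRange` returns a doc-ID range that may contain matches for [min, max], analogous to what a `DocIdRangeIterator` would produce. All names here are hypothetical.

```java
public class SparseIndexSketch {
  private final long[] samples; // value of every N-th document, in doc-ID order
  private final int interval;   // N
  private final int maxDoc;

  /** sortedValues[i] is the sort-field value of doc i, ascending by doc ID. */
  public SparseIndexSketch(long[] sortedValues, int interval) {
    this.interval = interval;
    this.maxDoc = sortedValues.length;
    int numSamples = (sortedValues.length + interval - 1) / interval;
    this.samples = new long[numSamples];
    for (int i = 0; i < numSamples; i++) {
      samples[i] = sortedValues[i * interval];
    }
  }

  /** Returns {firstDoc, lastDocExclusive} that may contain values in [min, max]. */
  public int[] candidateRange(long min, long max) {
    int first = 0;
    int last = maxDoc;
    // Skip leading blocks: if samples[i + 1] < min, every doc in block i has a
    // value below min, so matches cannot start before block i + 1.
    for (int i = 0; i + 1 < samples.length && samples[i + 1] < min; i++) {
      first = (i + 1) * interval;
    }
    // Truncate at the first block whose leading sample already exceeds max.
    for (int i = 0; i < samples.length; i++) {
      if (samples[i] > max) {
        last = i * interval;
        break;
      }
    }
    return new int[] {first, last};
  }
}
```

Only the samples array (maxDoc/N entries) is consulted, which is what would make the range query far less I/O intensive than walking per-document doc values.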
[GitHub] [lucene] vigyasharma opened a new pull request #633: [WIP] LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma opened a new pull request #633: URL: https://github.com/apache/lucene/pull/633

# Description

> _Work in Progress PR to share the approach and get early feedback on changes_

The addIndexes(CodecReader...) API today merges all provided readers in a single merge, in one large blocking call. We want to add concurrency here by invoking multiple parallel merges on subsets of readers, in a way that is configurable by users. The merged segments created can later be merged further in the regular, non-blocking background merges triggered by Lucene. Currently, users are responsible for managing their API run times by invoking it multiple times with subsets of readers.

JIRA - https://issues.apache.org/jira/browse/LUCENE-10216

# Solution

In this change, we leverage the configured MergeScheduler to invoke all merges required by the addIndexes API. MergePolicy also exposes a `findMerges(List<CodecReader> readers)` API that users can override to configure how their readers should be clustered into merges. We add a default implementation that retains the current behavior of adding all readers into a single merge.

# Tests

- Existing tests passing
- Pending: New tests to be added

# Checklist

Please review the following and check all that apply:

- [ ] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [ ] I have created a Jira issue and added the issue ID to my pull request title.
- [ ] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [ ] I have developed this patch against the `main` branch.
- [ ] I have run `./gradlew check`.
- [ ] I have added tests for my changes.
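The proposed flow can be sketched in plain Java under stated assumptions. This is not the Lucene API: `findMerges` stands in for the proposed MergePolicy hook, `mergeGroup` for the per-group merge work, and a fixed thread pool for the configured MergeScheduler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ConcurrentAddIndexesSketch {
  /** Default clustering: all readers go into a single merge (current behavior). */
  static <T> List<List<T>> findMerges(List<T> readers) {
    return List.of(new ArrayList<>(readers));
  }

  /** Submits one merge task per group and blocks until all of them complete. */
  static <T, R> List<R> addIndexes(List<T> readers, Function<List<T>, R> mergeGroup) {
    ExecutorService scheduler = Executors.newFixedThreadPool(4);
    try {
      List<Future<R>> futures = new ArrayList<>();
      for (List<T> group : findMerges(readers)) {
        futures.add(scheduler.submit(() -> mergeGroup.apply(group)));
      }
      List<R> merged = new ArrayList<>();
      for (Future<R> future : futures) {
        merged.add(future.get()); // the API stays blocking overall
      }
      return merged;
    } catch (Exception e) {
      throw new RuntimeException(e);
    } finally {
      scheduler.shutdown();
    }
  }
}
```

Overriding `findMerges` to return one group per reader would give the concurrent 1:1 segment conversion described in LUCENE-10216, while the default keeps today's single-merge behavior.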
[jira] [Commented] (LUCENE-10216) Add concurrency to addIndexes(CodecReader…) API
[ https://issues.apache.org/jira/browse/LUCENE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485083#comment-17485083 ] Vigya Sharma commented on LUCENE-10216: --- I took a shot at the MergePolicy + MergeScheduler idea. Have raised a draft, work-in-progress PR - https://github.com/apache/lucene/pull/633 - to get some high-level feedback on the approach. It has existing tests passing, and is pending new tests specific to this change. > Add concurrency to addIndexes(CodecReader…) API > --- > > Key: LUCENE-10216 > URL: https://issues.apache.org/jira/browse/LUCENE-10216 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Vigya Sharma >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > I work at Amazon Product Search, and we use Lucene to power search for the > e-commerce platform. I'm working on a project that involves applying > metadata+ETL transforms and indexing documents on n different _indexing_ > boxes, combining them into a single index on a separate _reducer_ box, and > making it available for queries on m different _search_ boxes (replicas). > Segments are asynchronously copied from indexers to reducers to searchers as > they become available for the next layer to consume. > I am using the addIndexes API to combine multiple indexes into one on the > reducer boxes. Since we also have taxonomy data, we need to remap facet field > ordinals, which means I need to use the {{addIndexes(CodecReader…)}} version > of this API. The API leverages {{SegmentMerger.merge()}} to create segments > with new ordinal values while also merging all provided segments in the > process. > _This is however a blocking call that runs in a single thread._ Until we have > written segments with new ordinal values, we cannot copy them to searcher > boxes, which increases the time to make documents available for search. > I was playing around with the API by creating multiple concurrent merges, > each with only a single reader, creating a concurrently running 1:1 > conversion from old segments to new ones (with new ordinal values). We follow > this up with non-blocking background merges. This lets us copy the segments > to searchers and replicas as soon as they are available, and later replace > them with merged segments as background jobs complete. On the Amazon dataset > I profiled, this gave us around a 2.5 to 3x improvement in addIndexes() time. > Each call was given about 5 readers to add on average. > This might be a useful addition to Lucene. We could create another {{addIndexes()}} > API with a {{boolean}} flag for concurrency, that internally submits multiple > merge jobs (each with a single reader) to the {{ConcurrentMergeScheduler}}, > and waits for them to complete before returning. > While this is doable from outside Lucene by using your own thread pool, starting > multiple addIndexes() calls and waiting for them to complete, I felt it needs > some understanding of what addIndexes does, why you need to wait on the merge, > and why it makes sense to pass a single reader in the addIndexes API. > Out-of-the-box support in Lucene could simplify this for folks with a similar use case.
[jira] [Commented] (LUCENE-10054) Handle hierarchy in HNSW graph
[ https://issues.apache.org/jira/browse/LUCENE-10054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485084#comment-17485084 ] Adrien Grand commented on LUCENE-10054: --- It looks like this change made both indexing (+5%) and searching (+7%) vectors slightly faster. http://people.apache.org/~mikemccand/lucenebench/VectorSearch.html > Handle hierarchy in HNSW graph > -- > > Key: LUCENE-10054 > URL: https://issues.apache.org/jira/browse/LUCENE-10054 > Project: Lucene - Core > Issue Type: Task >Reporter: Mayya Sharipova >Priority: Major > Time Spent: 20h > Remaining Estimate: 0h > > Currently HNSW graph is represented as a single layer graph. > We would like to extend it to handle hierarchy as per > [discussion|https://issues.apache.org/jira/browse/LUCENE-9004?focusedCommentId=17393216&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17393216]. > > > TODO tasks: > - add multiple layers in the HnswGraph class > - modify the format in Lucene90HnswVectorsWriter and > Lucene90HnswVectorsReader to handle multiple layers > - modify graph construction and search algorithm to handle hierarchy > - run benchmarks
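The hierarchical search in the TODO list can be sketched generically. This is an illustrative toy, not Lucene's HnswGraph code (real HNSW uses a beam search with an ef parameter at the bottom layer, which is omitted here): each layer is an adjacency map, and search greedily descends from the top layer, reusing each layer's best node as the entry point for the layer below.

```java
import java.util.List;
import java.util.Map;

public class HnswDescentSketch {
  /** Greedy search on one layer: hop to a closer neighbor until none exists. */
  static int greedyStep(Map<Integer, int[]> layer, double[][] vectors, double[] query, int entry) {
    int cur = entry;
    boolean improved = true;
    while (improved) {
      improved = false;
      for (int nb : layer.getOrDefault(cur, new int[0])) {
        if (dist(vectors[nb], query) < dist(vectors[cur], query)) {
          cur = nb;
          improved = true;
        }
      }
    }
    return cur;
  }

  /** Descend from the top layer to layer 0, refining the entry point per level. */
  static int search(List<Map<Integer, int[]>> layers, double[][] vectors, double[] query, int entry) {
    int cur = entry;
    for (int level = layers.size() - 1; level >= 0; level--) {
      cur = greedyStep(layers.get(level), vectors, query, cur);
    }
    return cur;
  }

  static double dist(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum; // squared Euclidean distance
  }
}
```

The upper layers contain few, long-range links, so the descent cheaply lands near the query before the dense bottom layer is searched, which is the intuition behind the reported speedups.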