[GitHub] [lucene] jpountz opened a new pull request #223: LUCENE-10015: Make VectorSimilarityFunction a codec parameter.
jpountz opened a new pull request #223: URL: https://github.com/apache/lucene/pull/223 VectorSimilarityFunction moves to the `org.apache.lucene.util.hnsw` package. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10015) Remove VectorValues.SimilarityFunction, remove NONE
[ https://issues.apache.org/jira/browse/LUCENE-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385377#comment-17385377 ] Adrien Grand commented on LUCENE-10015: --- Thanks Tomoko. I opened a PR to see how it played out and it looks like making the similarity function a codec parameter works rather nicely. We're losing the ability to validate that all segments have the same similarity function, but it's probably ok as changing the similarity function is a super expert option? > Remove VectorValues.SimilarityFunction, remove NONE > --- > > Key: LUCENE-10015 > URL: https://issues.apache.org/jira/browse/LUCENE-10015 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Blocker > Fix For: 9.0 > > Time Spent: 40m > Remaining Estimate: 0h > > This stuff is HNSW-implementation specific. It can be moved to a codec > parameter. > The NONE option should be removed: it just makes the codec more complex. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10015) Remove VectorValues.SimilarityFunction, remove NONE
[ https://issues.apache.org/jira/browse/LUCENE-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385408#comment-17385408 ] Julie Tibshirani commented on LUCENE-10015: --- I think it makes sense to keep the ability to configure the similarity function at the field level. I don't see it as a very 'expert' option -- based on what the vectors represent and how they've been processed, it's necessary to use the right similarity function to obtain good results. Also unlike 'maxConn' and 'beamWidth' (which were specific to our HNSW implementation), it's a concept that makes sense across NN algorithms generally. To the best of my knowledge, many NN algorithms can handle the full set of common similarity functions (Euclidean, dot product, cosine). In case it's helpful context: currently we only support Euclidean and cosine distance, which is technically redundant. For cosine similarity, users could normalize the vectors to unit length and use Euclidean. But I'm assuming we'll add support for inner product too, which seems very popular and cannot be expressed in terms of Euclidean distance. The FAISS library currently supports only Euclidean distance and inner product. > Remove VectorValues.SimilarityFunction, remove NONE > --- > > Key: LUCENE-10015 > URL: https://issues.apache.org/jira/browse/LUCENE-10015 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Blocker > Fix For: 9.0 > > Time Spent: 40m > Remaining Estimate: 0h > > This stuff is HNSW-implementation specific. It can be moved to a codec > parameter. > The NONE option should be removed: it just makes the codec more complex. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a change in pull request #222: LUCENE-10032: Remove leafDocMaps from MergeState
jpountz commented on a change in pull request #222: URL: https://github.com/apache/lucene/pull/222#discussion_r674713006 ## File path: lucene/core/src/java/org/apache/lucene/index/MergeState.java ## @@ -227,19 +222,7 @@ public int get(int docID) { } private List maybeSortReaders( Review comment: maybe should rename and make it void now that it only seems to perform validation? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10015) Remove VectorValues.SimilarityFunction, remove NONE
[ https://issues.apache.org/jira/browse/LUCENE-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385463#comment-17385463 ] Adrien Grand commented on LUCENE-10015: --- If there's agreement we should keep the similarity function, then let's resolve this issue as the other thing we discussed, the removal of NONE, has already been merged? > Remove VectorValues.SimilarityFunction, remove NONE > --- > > Key: LUCENE-10015 > URL: https://issues.apache.org/jira/browse/LUCENE-10015 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Blocker > Fix For: 9.0 > > Time Spent: 40m > Remaining Estimate: 0h > > This stuff is HNSW-implementation specific. It can be moved to a codec > parameter. > The NONE option should be removed: it just makes the codec more complex. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dnhatn commented on a change in pull request #222: LUCENE-10032: Remove leafDocMaps from MergeState
dnhatn commented on a change in pull request #222: URL: https://github.com/apache/lucene/pull/222#discussion_r674773365 ## File path: lucene/core/src/java/org/apache/lucene/index/MergeState.java ## @@ -227,19 +222,7 @@ public int get(int docID) { } private List maybeSortReaders( Review comment: ++, I pushed https://github.com/apache/lucene/pull/222/commits/707f4f982fc831f56d51db1288e1ff468025eee0 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10015) Remove VectorValues.SimilarityFunction, remove NONE
[ https://issues.apache.org/jira/browse/LUCENE-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385488#comment-17385488 ] Tomoko Uchida commented on LUCENE-10015: I am fine with the current field API, sorry for the disturbance. Besides whether the similarity function should be configurable by users, I agree with it is surely search-algorithm independent feature - the most ann/knn algorithms are distance-measure agnostic in my understanding (maybe Humming distance would be also worth supported). > Remove VectorValues.SimilarityFunction, remove NONE > --- > > Key: LUCENE-10015 > URL: https://issues.apache.org/jira/browse/LUCENE-10015 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Blocker > Fix For: 9.0 > > Time Spent: 40m > Remaining Estimate: 0h > > This stuff is HNSW-implementation specific. It can be moved to a codec > parameter. > The NONE option should be removed: it just makes the codec more complex. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10015) Remove VectorValues.SimilarityFunction, remove NONE
[ https://issues.apache.org/jira/browse/LUCENE-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385488#comment-17385488 ] Tomoko Uchida edited comment on LUCENE-10015 at 7/22/21, 1:15 PM: -- I am fine with the current field API, sorry for the disturbance. Besides whether the similarity function should be configurable by users, I agree with it is surely search-algorithm independent feature - most ann/knn algorithms are distance-measure agnostic in my understanding (maybe Humming distance would be also worth supported). was (Author: tomoko uchida): I am fine with the current field API, sorry for the disturbance. Besides whether the similarity function should be configurable by users, I agree with it is surely search-algorithm independent feature - the most ann/knn algorithms are distance-measure agnostic in my understanding (maybe Humming distance would be also worth supported). > Remove VectorValues.SimilarityFunction, remove NONE > --- > > Key: LUCENE-10015 > URL: https://issues.apache.org/jira/browse/LUCENE-10015 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Blocker > Fix For: 9.0 > > Time Spent: 40m > Remaining Estimate: 0h > > This stuff is HNSW-implementation specific. It can be moved to a codec > parameter. > The NONE option should be removed: it just makes the codec more complex. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10015) Remove VectorValues.SimilarityFunction, remove NONE
[ https://issues.apache.org/jira/browse/LUCENE-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385546#comment-17385546 ] Michael Sokolov commented on LUCENE-10015: -- Yes, let's keep the similarity function! It's an essential component of a metric space (or non-metric space), and is implicit in the vectors: you can't just use any arbitrary function, as Julie says. It makes my life a little more complicated that we removed NONE since it simplifies schema management at the next level. You can define a schema with fields of type vector, and then separately say whether you want to support KNN search for such a field. We used to support such vectors with BinaryDocValues, and then switched to this impl. Now we will have to switch back. But I guess we are now saying these fields are *only* to be be used for NN search, so it makes sense, and I won't block it. > Remove VectorValues.SimilarityFunction, remove NONE > --- > > Key: LUCENE-10015 > URL: https://issues.apache.org/jira/browse/LUCENE-10015 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Blocker > Fix For: 9.0 > > Time Spent: 40m > Remaining Estimate: 0h > > This stuff is HNSW-implementation specific. It can be moved to a codec > parameter. > The NONE option should be removed: it just makes the codec more complex. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on a change in pull request #218: LUCENE-9855: Rename knn search vector format
msokolov commented on a change in pull request #218: URL: https://github.com/apache/lucene/pull/218#discussion_r674861123 ## File path: lucene/CHANGES.txt ## @@ -7,7 +7,7 @@ http://s.apache.org/luceneversions New Features -* LUCENE-9322: Vector-valued fields, Lucene90 Codec (Mike Sokolov, Julie Tibshirani, Tomoko Uchida) +* LUCENE-9322 LUCENE-9855: Vector-valued fields, Lucene90 Codec (Mike Sokolov, Julie Tibshirani, Tomoko Uchida) * LUCENE-9004: Approximate nearest vector search via NSW graphs Review comment: I think you and I were mostly working on it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10015) Remove VectorValues.SimilarityFunction, remove NONE
[ https://issues.apache.org/jira/browse/LUCENE-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-10015. --- Resolution: Fixed > Remove VectorValues.SimilarityFunction, remove NONE > --- > > Key: LUCENE-10015 > URL: https://issues.apache.org/jira/browse/LUCENE-10015 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Blocker > Fix For: 9.0 > > Time Spent: 40m > Remaining Estimate: 0h > > This stuff is HNSW-implementation specific. It can be moved to a codec > parameter. > The NONE option should be removed: it just makes the codec more complex. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10015) Remove VectorValues.SimilarityFunction.NONE
[ https://issues.apache.org/jira/browse/LUCENE-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-10015: -- Summary: Remove VectorValues.SimilarityFunction.NONE (was: Remove VectorValues.SimilarityFunction, remove NONE) > Remove VectorValues.SimilarityFunction.NONE > --- > > Key: LUCENE-10015 > URL: https://issues.apache.org/jira/browse/LUCENE-10015 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Blocker > Fix For: 9.0 > > Time Spent: 40m > Remaining Estimate: 0h > > This stuff is HNSW-implementation specific. It can be moved to a codec > parameter. > The NONE option should be removed: it just makes the codec more complex. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz closed pull request #223: LUCENE-10015: Make VectorSimilarityFunction a codec parameter.
jpountz closed pull request #223: URL: https://github.com/apache/lucene/pull/223 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta commented on a change in pull request #218: LUCENE-9855: Rename knn search vector format
mocobeta commented on a change in pull request #218: URL: https://github.com/apache/lucene/pull/218#discussion_r674908384 ## File path: lucene/CHANGES.txt ## @@ -7,7 +7,7 @@ http://s.apache.org/luceneversions New Features -* LUCENE-9322: Vector-valued fields, Lucene90 Codec (Mike Sokolov, Julie Tibshirani, Tomoko Uchida) +* LUCENE-9322 LUCENE-9855: Vector-valued fields, Lucene90 Codec (Mike Sokolov, Julie Tibshirani, Tomoko Uchida) * LUCENE-9004: Approximate nearest vector search via NSW graphs Review comment: Thanks, I'll add the credit on this pr. Actually, I cannot identify who is supposed to be listed here since so many people were involved (though I know your name should be firstly credited)... please let me know if there are other outstanding contributors. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta commented on a change in pull request #218: LUCENE-9855: Rename knn search vector format
mocobeta commented on a change in pull request #218: URL: https://github.com/apache/lucene/pull/218#discussion_r674908384 ## File path: lucene/CHANGES.txt ## @@ -7,7 +7,7 @@ http://s.apache.org/luceneversions New Features -* LUCENE-9322: Vector-valued fields, Lucene90 Codec (Mike Sokolov, Julie Tibshirani, Tomoko Uchida) +* LUCENE-9322 LUCENE-9855: Vector-valued fields, Lucene90 Codec (Mike Sokolov, Julie Tibshirani, Tomoko Uchida) * LUCENE-9004: Approximate nearest vector search via NSW graphs Review comment: Thanks, I'll add the credit on this pr. Actually, I cannot identify who are supposed to be listed here since so many people were involved (though I know your name should be firstly credited)... please let me know if there are other outstanding contributors. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dnhatn commented on a change in pull request #221: LUCENE-10031: Speed up SortedDocIdMerger on low-cardinality sort fields.
dnhatn commented on a change in pull request #221: URL: https://github.com/apache/lucene/pull/221#discussion_r674914345 ## File path: lucene/core/src/java/org/apache/lucene/index/DocIDMerger.java ## @@ -144,59 +159,45 @@ protected boolean lessThan(Sub a, Sub b) { public void reset() throws IOException { // caller may not have fully consumed the queue: queue.clear(); + current = null; boolean first = true; for (T sub : subs) { if (first) { // by setting mappedDocID = -1, this entry is guaranteed to be the top of the queue // so the first call to next() will advance it sub.mappedDocID = -1; + current = sub; first = false; -} else { - int mappedDocID; - while (true) { -int docID = sub.nextDoc(); -if (docID == NO_MORE_DOCS) { - mappedDocID = NO_MORE_DOCS; - break; -} -mappedDocID = sub.docMap.get(docID); -if (mappedDocID != -1) { - break; -} - } - if (mappedDocID == NO_MORE_DOCS) { -// all docs in this sub were deleted; do not add it to the queue! -continue; - } - sub.mappedDocID = mappedDocID; -} -queue.add(sub); +} else if (sub.nextMappedDoc() != NO_MORE_DOCS) { + queue.add(sub); +} // else all docs in this sub were deleted; do not add it to the queue! } + queueNextMappedDoc = queue.size() == 0 ? NO_MORE_DOCS : queue.top().mappedDocID; } @Override public T next() throws IOException { - T top = queue.top(); - - while (true) { -int docID = top.nextDoc(); -if (docID == NO_MORE_DOCS) { - queue.pop(); - top = queue.top(); - break; -} -int mappedDocID = top.docMap.get(docID); -if (mappedDocID == -1) { - // doc was deleted - continue; + int nextDoc = current.nextMappedDoc(); + // the below condition is unlikely when the index sort is on a low-cardinality field + if (nextDoc >= queueNextMappedDoc) { +if (nextDoc == NO_MORE_DOCS) { + if (queue.size() == 0) { +current = null; +return null; + } else { +current = queue.pop(); +queueNextMappedDoc = queue.size() == 0 ? NO_MORE_DOCS : queue.top().mappedDocID; Review comment: I think we can remove `queueNextMappedDoc` and replace it with `queue.top().mappedDocID`? Below is my version without `queueNextMappedDoc`. WDYT? ```java int nextDoc = current.nextMappedDoc(); // the below condition is unlikely when the index sort is on a low-cardinality field if (nextDoc == NO_MORE_DOCS) { if (queue.size() == 0) { return null; } current = queue.pop(); } else if (queue.size() > 0 && nextDoc > queue.top().mappedDocID) { T newCurrent = queue.top(); queue.updateTop(current); current = newCurrent; } return current; ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dnhatn commented on a change in pull request #221: LUCENE-10031: Speed up SortedDocIdMerger on low-cardinality sort fields.
dnhatn commented on a change in pull request #221: URL: https://github.com/apache/lucene/pull/221#discussion_r674914345 ## File path: lucene/core/src/java/org/apache/lucene/index/DocIDMerger.java ## @@ -144,59 +159,45 @@ protected boolean lessThan(Sub a, Sub b) { public void reset() throws IOException { // caller may not have fully consumed the queue: queue.clear(); + current = null; boolean first = true; for (T sub : subs) { if (first) { // by setting mappedDocID = -1, this entry is guaranteed to be the top of the queue // so the first call to next() will advance it sub.mappedDocID = -1; + current = sub; first = false; -} else { - int mappedDocID; - while (true) { -int docID = sub.nextDoc(); -if (docID == NO_MORE_DOCS) { - mappedDocID = NO_MORE_DOCS; - break; -} -mappedDocID = sub.docMap.get(docID); -if (mappedDocID != -1) { - break; -} - } - if (mappedDocID == NO_MORE_DOCS) { -// all docs in this sub were deleted; do not add it to the queue! -continue; - } - sub.mappedDocID = mappedDocID; -} -queue.add(sub); +} else if (sub.nextMappedDoc() != NO_MORE_DOCS) { + queue.add(sub); +} // else all docs in this sub were deleted; do not add it to the queue! } + queueNextMappedDoc = queue.size() == 0 ? NO_MORE_DOCS : queue.top().mappedDocID; } @Override public T next() throws IOException { - T top = queue.top(); - - while (true) { -int docID = top.nextDoc(); -if (docID == NO_MORE_DOCS) { - queue.pop(); - top = queue.top(); - break; -} -int mappedDocID = top.docMap.get(docID); -if (mappedDocID == -1) { - // doc was deleted - continue; + int nextDoc = current.nextMappedDoc(); + // the below condition is unlikely when the index sort is on a low-cardinality field + if (nextDoc >= queueNextMappedDoc) { +if (nextDoc == NO_MORE_DOCS) { + if (queue.size() == 0) { +current = null; +return null; + } else { +current = queue.pop(); +queueNextMappedDoc = queue.size() == 0 ? NO_MORE_DOCS : queue.top().mappedDocID; Review comment: I think we can remove `queueNextMappedDoc` and replace it with `queue.top().mappedDocID`? Below is my version without `queueNextMappedDoc`. WDYT? ```java int nextDoc = current.nextMappedDoc(); // the below condition is unlikely when the index sort is on a low-cardinality field if (nextDoc == NO_MORE_DOCS) { current = queue.pop(); } else if (queue.size() > 0 && nextDoc > queue.top().mappedDocID) { T newCurrent = queue.top(); queue.updateTop(current); current = newCurrent; } return current; ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9450) Taxonomy index should use DocValues not StoredFields
[ https://issues.apache.org/jira/browse/LUCENE-9450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385612#comment-17385612 ] Michael McCandless commented on LUCENE-9450: OK, +1 for Scenario 2 (use index created version)! > Taxonomy index should use DocValues not StoredFields > > > Key: LUCENE-9450 > URL: https://issues.apache.org/jira/browse/LUCENE-9450 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: 8.5.2 >Reporter: Gautam Worah >Priority: Minor > Labels: performance > Fix For: main (9.0) > > Attachments: LUCENE-9450-localrun.py-v1, wip_taxonomy_patch > > Time Spent: 4h > Remaining Estimate: 0h > > The taxonomy index that maps binning labels to ordinals was created before > Lucene added BinaryDocValues. > I've attached a WIP patch (does not pass tests currently) > Issue suggested by [~mikemccand] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] cpoerschke merged pull request #2514: Fix some psuedo->pseudo typos in Solr Ref Guide.
cpoerschke merged pull request #2514: URL: https://github.com/apache/lucene-solr/pull/2514 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a change in pull request #221: LUCENE-10031: Speed up SortedDocIdMerger on low-cardinality sort fields.
jpountz commented on a change in pull request #221: URL: https://github.com/apache/lucene/pull/221#discussion_r674980048 ## File path: lucene/core/src/java/org/apache/lucene/index/DocIDMerger.java ## @@ -144,59 +159,45 @@ protected boolean lessThan(Sub a, Sub b) { public void reset() throws IOException { // caller may not have fully consumed the queue: queue.clear(); + current = null; boolean first = true; for (T sub : subs) { if (first) { // by setting mappedDocID = -1, this entry is guaranteed to be the top of the queue // so the first call to next() will advance it sub.mappedDocID = -1; + current = sub; first = false; -} else { - int mappedDocID; - while (true) { -int docID = sub.nextDoc(); -if (docID == NO_MORE_DOCS) { - mappedDocID = NO_MORE_DOCS; - break; -} -mappedDocID = sub.docMap.get(docID); -if (mappedDocID != -1) { - break; -} - } - if (mappedDocID == NO_MORE_DOCS) { -// all docs in this sub were deleted; do not add it to the queue! -continue; - } - sub.mappedDocID = mappedDocID; -} -queue.add(sub); +} else if (sub.nextMappedDoc() != NO_MORE_DOCS) { + queue.add(sub); +} // else all docs in this sub were deleted; do not add it to the queue! } + queueNextMappedDoc = queue.size() == 0 ? NO_MORE_DOCS : queue.top().mappedDocID; } @Override public T next() throws IOException { - T top = queue.top(); - - while (true) { -int docID = top.nextDoc(); -if (docID == NO_MORE_DOCS) { - queue.pop(); - top = queue.top(); - break; -} -int mappedDocID = top.docMap.get(docID); -if (mappedDocID == -1) { - // doc was deleted - continue; + int nextDoc = current.nextMappedDoc(); + // the below condition is unlikely when the index sort is on a low-cardinality field + if (nextDoc >= queueNextMappedDoc) { +if (nextDoc == NO_MORE_DOCS) { + if (queue.size() == 0) { +current = null; +return null; + } else { +current = queue.pop(); +queueNextMappedDoc = queue.size() == 0 ? NO_MORE_DOCS : queue.top().mappedDocID; Review comment: Thanks. I had to add a check whether the queue is empty under `if (nextDoc == NO_MORE_DOCS) {` for it to work. I ran a quick benchmark with a merge that merges 10M docs that have 10 single-valued SORTED_SET doc-value fields each. Without index sorting, the merge runs in ~740ms. With index sorting on a NumericDocValues field of cardinality 8, I get the following merging times: - master: ~1300ms - the first version of this PR: ~860ms, - your suggestion: ~910ms I think the extra simplicity is worth the minor slowdown, it's still significantly faster than master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dnhatn commented on a change in pull request #221: LUCENE-10031: Speed up SortedDocIdMerger on low-cardinality sort fields.
dnhatn commented on a change in pull request #221: URL: https://github.com/apache/lucene/pull/221#discussion_r674984362 ## File path: lucene/core/src/java/org/apache/lucene/index/DocIDMerger.java ## @@ -144,59 +159,45 @@ protected boolean lessThan(Sub a, Sub b) { public void reset() throws IOException { // caller may not have fully consumed the queue: queue.clear(); + current = null; boolean first = true; for (T sub : subs) { if (first) { // by setting mappedDocID = -1, this entry is guaranteed to be the top of the queue // so the first call to next() will advance it sub.mappedDocID = -1; + current = sub; first = false; -} else { - int mappedDocID; - while (true) { -int docID = sub.nextDoc(); -if (docID == NO_MORE_DOCS) { - mappedDocID = NO_MORE_DOCS; - break; -} -mappedDocID = sub.docMap.get(docID); -if (mappedDocID != -1) { - break; -} - } - if (mappedDocID == NO_MORE_DOCS) { -// all docs in this sub were deleted; do not add it to the queue! -continue; - } - sub.mappedDocID = mappedDocID; -} -queue.add(sub); +} else if (sub.nextMappedDoc() != NO_MORE_DOCS) { + queue.add(sub); +} // else all docs in this sub were deleted; do not add it to the queue! } + queueNextMappedDoc = queue.size() == 0 ? NO_MORE_DOCS : queue.top().mappedDocID; } @Override public T next() throws IOException { - T top = queue.top(); - - while (true) { -int docID = top.nextDoc(); -if (docID == NO_MORE_DOCS) { - queue.pop(); - top = queue.top(); - break; -} -int mappedDocID = top.docMap.get(docID); -if (mappedDocID == -1) { - // doc was deleted - continue; + int nextDoc = current.nextMappedDoc(); + // the below condition is unlikely when the index sort is on a low-cardinality field + if (nextDoc >= queueNextMappedDoc) { +if (nextDoc == NO_MORE_DOCS) { + if (queue.size() == 0) { +current = null; +return null; + } else { +current = queue.pop(); +queueNextMappedDoc = queue.size() == 0 ? NO_MORE_DOCS : queue.top().mappedDocID; Review comment: @jpountz Thanks for benchmarking this. I am fine with either versions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings
Adrien Grand created LUCENE-10033: - Summary: Encode doc values in smaller blocks of values, like postings Key: LUCENE-10033 URL: https://issues.apache.org/jira/browse/LUCENE-10033 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand This is a follow-up to the discussion on this thread: https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E. Our current approach for doc values uses large blocks of 16k values where values can be decompressed independently, using DirectWriter/DirectReader. This is a bit inefficient in some cases, e.g. a single outlier can grow the number of bits per value for the entire block, we can't easily use run-length compression, etc. Plus, it encourages using a different sub-class for every compression technique, which puts pressure on the JVM. We'd like to move to an approach that would be more similar to postings with smaller blocks (e.g. 128 values) whose values get all decompressed at once (using SIMD instructions), with skip data within blocks in order to efficiently skip to arbitrary doc IDs (or maybe still use jump tables as today's doc values, and as discussed here for postings: https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on a change in pull request #217: LUCENE-10030: [DrillSidewaysScorer.doQueryFirstScoring] disable scoring for near-miss hits
gsmiller commented on a change in pull request #217: URL: https://github.com/apache/lucene/pull/217#discussion_r675059243 ## File path: lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysScorer.java ## @@ -195,11 +195,8 @@ private void doQueryFirstScoring(Bits acceptDocs, LeafCollector collector, DocsA collectDocID = docID; Review comment: I was thinking about this a little more and wanted to see what you thought about a slightly different approach. Full disclosure, I’ve been away from my computer for a day and am on my phone right now, so I haven’t looked into this as deeply as I’d like. It looks to me, when doing “query first scoring”, that the ScoreAndDoc wrapper is essentially useless. The baseScorer will be positioned on the “collectDoc” whenever it collects a full hit or near miss. Can’t we just pass baseScorer into the collectors (via setScorer) when doing queryFirstScoring and avoid using a ScoreAndDoc instance completely? That would also allow the score to be lazily evaluated since the collectors would just be calling into baseScorer directly if they need the score. If I remember correctly, there’s a caching scorer wrapper we could use here as well to ensure the score is only computed once in the case that multiple drill sideways collectors call back for the score. Like I said, I’m on my phone and haven’t been able to try this out at all, so maybe it’s completely flawed for some reason. But maybe you could have a look and see if something along these lines would work? Thanks again for taking this on! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on a change in pull request #218: LUCENE-9855: Rename knn search vector format
msokolov commented on a change in pull request #218: URL: https://github.com/apache/lucene/pull/218#discussion_r675162465 ## File path: lucene/CHANGES.txt ## @@ -7,7 +7,7 @@ http://s.apache.org/luceneversions New Features -* LUCENE-9322: Vector-valued fields, Lucene90 Codec (Mike Sokolov, Julie Tibshirani, Tomoko Uchida) +* LUCENE-9322 LUCENE-9855: Vector-valued fields, Lucene90 Codec (Mike Sokolov, Julie Tibshirani, Tomoko Uchida) * LUCENE-9004: Approximate nearest vector search via NSW graphs Review comment: Yeah I looked at the discussion on the issue, and there were lots of folks weighing in there with helpful contributions, but given there were so many, and as far as I can remember only the two of us actually contributed any code, I'd just list us. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta commented on pull request #218: LUCENE-9855: Rename knn search vector format
mocobeta commented on pull request #218: URL: https://github.com/apache/lucene/pull/218#issuecomment-885343046 Thanks, @msokolov for your comment. I'll wait for feedback from others (if there are further suggestions) for a while before merging. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller merged pull request #196: LUCENE-10000: Make MultiCollectorManager consistent with MultiCollector
gsmiller merged pull request #196: URL: https://github.com/apache/lucene/pull/196 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10000) MultiCollectorManager should have parity with MultiCollector behavior
[ https://issues.apache.org/jira/browse/LUCENE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385845#comment-17385845 ] ASF subversion and git services commented on LUCENE-1: -- Commit ad7746d6e31b698dd0f7d124288de576b4f95686 in lucene's branch refs/heads/main from Greg Miller [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ad7746d ] LUCENE-1: Make MultiCollectorManager consistent with MultiCollector (#196) MultiCollectorManager is now consistent with MultiCollector with respect to early termination, min score setting and score caching. > MultiCollectorManager should have parity with MultiCollector behavior > - > > Key: LUCENE-1 > URL: https://issues.apache.org/jira/browse/LUCENE-1 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: main (9.0) >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Minor > Time Spent: 40m > Remaining Estimate: 0h > > The behavior of {{MultiCollectorManager}} and {{MultiCollector}} differ with > respect to how they handle {{CollectionTerminationException}} and > minCompetitiveScore calls. > In {{MultiCollector}}, remaining collectors will continue collecting when one > or more have thrown a {{CollectionTerminationException}}, but > {{MultiCollectorManager}} will allow the exception to propagate and stop > collecting against all collectors as soon as one throws (see: LUCENE-6772). > Also, {{MultiCollector}} properly handles setting of min competitive scores > by ensuring the lowest set across all wrapped collectors is used, or by > making it a no-op in cases where at least one collector is using > {{ScoreMode.COMPLETE}} (see: LUCENE-9402). {{MultiCollectorManager}} will > share the same {{Scorable}} across all collectors, which seems like the wrong > thing to do. > Finally, {{MultiCollector}} will cache scores when a {{Scorable}} is shared > by more than one collector (see: LUCENE-6263). > We should update {{MultiCollectorManager}} to use common behavior with > {{MultiCollector}} in all these areas. It should be easy to do by delegating > to {{MultiCollector}} within {{MultiCollectorManager}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #196: LUCENE-10000: Make MultiCollectorManager consistent with MultiCollector
gsmiller commented on pull request #196: URL: https://github.com/apache/lucene/pull/196#issuecomment-885353122 > Thanks for fixing this! Sure. Thanks for the feedback! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10000) MultiCollectorManager should have parity with MultiCollector behavior
[ https://issues.apache.org/jira/browse/LUCENE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller updated LUCENE-1: - Fix Version/s: main (9.0) > MultiCollectorManager should have parity with MultiCollector behavior > - > > Key: LUCENE-1 > URL: https://issues.apache.org/jira/browse/LUCENE-1 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: main (9.0) >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Minor > Fix For: main (9.0) > > Time Spent: 50m > Remaining Estimate: 0h > > The behavior of {{MultiCollectorManager}} and {{MultiCollector}} differ with > respect to how they handle {{CollectionTerminationException}} and > minCompetitiveScore calls. > In {{MultiCollector}}, remaining collectors will continue collecting when one > or more have thrown a {{CollectionTerminationException}}, but > {{MultiCollectorManager}} will allow the exception to propagate and stop > collecting against all collectors as soon as one throws (see: LUCENE-6772). > Also, {{MultiCollector}} properly handles setting of min competitive scores > by ensuring the lowest set across all wrapped collectors is used, or by > making it a no-op in cases where at least one collector is using > {{ScoreMode.COMPLETE}} (see: LUCENE-9402). {{MultiCollectorManager}} will > share the same {{Scorable}} across all collectors, which seems like the wrong > thing to do. > Finally, {{MultiCollector}} will cache scores when a {{Scorable}} is shared > by more than one collector (see: LUCENE-6263). > We should update {{MultiCollectorManager}} to use common behavior with > {{MultiCollector}} in all these areas. It should be easy to do by delegating > to {{MultiCollector}} within {{MultiCollectorManager}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10000) MultiCollectorManager should have parity with MultiCollector behavior
[ https://issues.apache.org/jira/browse/LUCENE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller resolved LUCENE-1. -- Resolution: Fixed > MultiCollectorManager should have parity with MultiCollector behavior > - > > Key: LUCENE-1 > URL: https://issues.apache.org/jira/browse/LUCENE-1 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: main (9.0) >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Minor > Fix For: main (9.0) > > Time Spent: 50m > Remaining Estimate: 0h > > The behavior of {{MultiCollectorManager}} and {{MultiCollector}} differ with > respect to how they handle {{CollectionTerminationException}} and > minCompetitiveScore calls. > In {{MultiCollector}}, remaining collectors will continue collecting when one > or more have thrown a {{CollectionTerminationException}}, but > {{MultiCollectorManager}} will allow the exception to propagate and stop > collecting against all collectors as soon as one throws (see: LUCENE-6772). > Also, {{MultiCollector}} properly handles setting of min competitive scores > by ensuring the lowest set across all wrapped collectors is used, or by > making it a no-op in cases where at least one collector is using > {{ScoreMode.COMPLETE}} (see: LUCENE-9402). {{MultiCollectorManager}} will > share the same {{Scorable}} across all collectors, which seems like the wrong > thing to do. > Finally, {{MultiCollector}} will cache scores when a {{Scorable}} is shared > by more than one collector (see: LUCENE-6263). > We should update {{MultiCollectorManager}} to use common behavior with > {{MultiCollector}} in all these areas. It should be easy to do by delegating > to {{MultiCollector}} within {{MultiCollectorManager}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org