[GitHub] [lucene] vsop-479 commented on pull request #12528: Early terminate visit BKD leaf when current value greater than upper point in sorted dim.
vsop-479 commented on PR #12528: URL: https://github.com/apache/lucene/pull/12528#issuecomment-1733041087 @iverase Does it make sense to you if MatchState defined in other class, such as BKDReader or IntersectVisitor, and only leave the sorted dimension in IntersectVisitor' visit method as a parameter? I just implemented visitWithState in PointRangeQuery, and there is some speed up. I think it is same to other queries, such as LatLonPointDistanceQuery, which use a similar IntersectVisitor. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] sylph-eu commented on issue #11507: Increase the number of dims for KNN vectors to 2048 [LUCENE-10471]
sylph-eu commented on issue #11507: URL: https://github.com/apache/lucene/issues/11507#issuecomment-1733118806 Last comment is already a couple of months old, so please let me clarify the status of this initiative. If there's a chance it's going to be merged? If there's any blocker or action item that prevents one from being merged? The context of my inquiry is that Lucene-based solutions (e.g. OpenSearch) are commonly deployed within enterprises, which makes them good candidates to experiment with vector search and commercial LLM-offerings, without deploying and maintaining specialized technologies. Max dimensionality of 1024, however, puts certain restrictions (similar thoughts are here https://arxiv.org/abs/2308.14963). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gf2121 opened a new pull request, #12586: Remove over-counting of deleted terms
gf2121 opened a new pull request, #12586: URL: https://github.com/apache/lucene/pull/12586 `BufferedUpdates` used to count deleted terms without deduplication to respect `IndexWriterConfig.setMaxBufferedDeleteTerms`. As `IndexWriterConfig.setMaxBufferedDeleteTerms` is removed since [LUCENE-7868](https://issues.apache.org/jira/browse/LUCENE-7868), the overcounting can be avoided now. Previous talk: https://github.com/apache/lucene/pull/12573/files#r1332924157 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gf2121 commented on a diff in pull request #12586: Remove over-counting of deleted terms
gf2121 commented on code in PR #12586: URL: https://github.com/apache/lucene/pull/12586#discussion_r1335560326 ## lucene/core/src/java/org/apache/lucene/index/BufferedUpdates.java: ## @@ -284,6 +276,13 @@ void forEachOrdered(DeletedTermConsumer consumer) throw public long ramBytesUsed() { return bytesUsed.get(); } + +@Override +public String toString() { Review Comment: `DeletedTerms#toString` could be called when `VERBOSE_DELETES` enabled. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on issue #11507: Increase the number of dims for KNN vectors to 2048 [LUCENE-10471]
uschindler commented on issue #11507: URL: https://github.com/apache/lucene/issues/11507#issuecomment-1733223795 Hi, actually this issue is already resolved, although the DEFAULT did not change (and won't change due to performance risks), see here: https://github.com/apache/lucene/pull/12436 - this PR allows users of Lucene to raise the limit (at least for HNSW codec) on codec level. To implement (on your own risk), create your own ` KnnVectorsFormat` and let it return a different number fog `getMaxDimensions()`. Then construct your own codec from it and index your data. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on issue #11507: Increase the number of dims for KNN vectors to 2048 [LUCENE-10471]
uschindler commented on issue #11507: URL: https://github.com/apache/lucene/issues/11507#issuecomment-1733230392 @mayya-sharipova: Should we close this issue or are there any plans to also change the default maximum? I don't think so. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] iverase merged pull request #12581: Allow reading / writing binary stored fields as DataInput
iverase merged PR #12581: URL: https://github.com/apache/lucene/pull/12581 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] iverase closed issue #12556: Allow reading binary stored values as DataInput
iverase closed issue #12556: Allow reading binary stored values as DataInput URL: https://github.com/apache/lucene/issues/12556 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] iverase commented on pull request #12528: Early terminate visit BKD leaf when current value greater than upper point in sorted dim.
iverase commented on PR #12528: URL: https://github.com/apache/lucene/pull/12528#issuecomment-1733441807 My recommendation is to use the following method as you are just trying to flag if the visit method needs to keep processing points. ``` /** Similar to {@link IntersectVisitor#visit(int, byte[])} but data is visited in * increasing order on the {@sortedDim}, and in the case of ties, in increasing docID order. * Implementers can stop processing points on the leaf by returning false when for example the * sorted dimension value is too high to be matched by the query. * * @return true if the visitor should continue visiting points on this leaf, otherwise false. * */ default boolean visitWithSortedDim(int docID, byte[] packedValue, int sortedDim) throws IOException { visit(docID, packedValue); return true; } ``` I would remove the "inverse" case as the inverse visitors are implementation details and IMHO should not be part of the API. > just implemented visitWithState in PointRangeQuery, and there is some speed up We should run the performance test in the [lucene benchmarks](lucene benchmarks) to check if there are slowdowns due to the extra check for each visited point. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gf2121 opened a new pull request, #12587: Use radix sort to speed up the sorting of terms in TermInSetQuery
gf2121 opened a new pull request, #12587: URL: https://github.com/apache/lucene/pull/12587 ### Description Sort terms in TermInSetQuery with radix sort. This helps TermInSetQueries with a number of terms. ### Benchmark I made a simple benchmark on sorting `BytesRef[]` with random bytes to verify the improvements. https://bytedance.feishu.cn/sheets/G5dwsdvZ7hOxXftyfDkcvUkYnqB"; data-importRangeRawData-range="'Sheet1'!A2:D12"> | timsort ( took nanos ) | radixsort ( took nanos ) | took diff -- | -- | -- | -- 10 terms (16 bytes per term) | 1292 | 1083 | -16.18% 100 terms (16 bytes per term) | 17959 | 11750 | -34.57% 1000 terms (16 bytes per term) | 387916 | 50375 | -87.01% 1 terms (16 bytes per term) | 5407208 | 1062500 | -80.35% 10 terms (16 bytes per term) | 65577084 | 5404958 | -91.76% 10 terms (256 bytes per term) | 3500 | 1750 | -50.00% 100 terms (256 bytes per term) | 18000 | 11708 | -34.96% 1000 terms (256 bytes per term) | 410959 | 52417 | -87.25% 1 terms (256 bytes per term) | 5325666 | 1299125 | -75.61% 10 terms (256 bytes per term) | 71316500 | 11346584 | -84.09% -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #12382: Run top-level conjunctions of term queries with a specialized BulkScorer.
jpountz commented on PR #12382: URL: https://github.com/apache/lucene/pull/12382#issuecomment-1733490475 FWIW I ran the benchmark from https://tantivy-search.github.io/bench/ and also observed a speedup on conjunctions, so I think that the speedup is indeed real. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz merged pull request #12382: Run top-level conjunctions of term queries with a specialized BulkScorer.
jpountz merged PR #12382: URL: https://github.com/apache/lucene/pull/12382 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna opened a new pull request, #12588: Shared executor for LuceneTestCase#newSearcher callers
javanna opened a new pull request, #12588: URL: https://github.com/apache/lucene/pull/12588 Until now, LuceneTestCase#newSearcher randomly associates the returned IndexSearcher instance with an executor that is ad-hoc created, which gets shut down when the index reader is closed. This has made us catch a couple of cases where we were not properly closing readers in tests. Most recently, we have been seeing test failures (OOM - unable to create thread) due to too many executor instances created as part of the same test. This is to be attributed to creating too many searcher instance, each one getting its separate executor, which all get shutdown at the end of the entire suite. The main offender for this is QueryUtils which creates a new searcher for each leaf reader, and the top-level reader gets closed in the AfterClass, hence all the executors will stay around for the entire duration of the test suite that relies on QueryUtils. This commit eagerly creates an executor in an additional before class method for LuceneTestCase, and associates that with each searcher that is supposed to get a non null executor. Note that the executor is shutdown in the after class to ensure that no threads leak in tests. This has the additional advantage that it removes the need to close the executor as part of an index reader close listener, which also requires the reader to have an associated reader cache helper. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #12588: Shared executor for LuceneTestCase#newSearcher callers
rmuir commented on PR #12588: URL: https://github.com/apache/lucene/pull/12588#issuecomment-1733714024 this is nice, much cleaner. I think i added the original cache-helper-hack. Just one question: can we reduce the number of threads used? I have 2 cores. shouldn't 2 be enough to occasionally find bugs rather than 8? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna commented on pull request #12588: Shared executor for LuceneTestCase#newSearcher callers
javanna commented on PR #12588: URL: https://github.com/apache/lucene/pull/12588#issuecomment-1733731499 Thanks @rmuir for looking, I lowered the number of threads -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz opened a new pull request, #12589: Sometimes intersect the essential clause and the best non-essential clause.
jpountz opened a new pull request, #12589: URL: https://github.com/apache/lucene/pull/12589 The idea behind MAXSCORE is to run disjunctions as `+(essentialClause1 ... essentialClauseM) nonEssentialClause1 ... nonEssentialClauseN`, moving more and more clauses from the essential list to the non-essential list as the minimum competitive score increases. For instance, a query such as `the book of life` which I found in the Tantivy benchmark ends up running as `+book the of life` after some time, ie. with one required clause and other clauses optional. This is because matching `the`, `of` and `life` alone is not good enough for yielding a match. Here some statistics in that case: - min competitive score: 3.4781857 - max_window_score(book): 2.8796153 - max_window_score(life): 2.037863 - max_window_score(the): 0.103848875 - max_window_score(of): 0.19427927 Actually if you look at these statistics, we could do better, because a match may only be competitive if it matches both `book` and `life`, so this query could actually execute as `+book +life the of`, which may help evaluate fewer documents compared to `+book the of life`. Especially if you enable recursive graph bisection. This is what this PR tries to achieve: in the event when there is a single essential clause and matching all clauses but the best non-essential clause cannot produce a competitive match, then the scorer will only evaluate documents that match the intersection of the essential clause and the best non-essential clause. It's worth noting that this optimization would kick in very frequently on 2-clauses disjunctions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #12589: Sometimes intersect the essential clause and the best non-essential clause.
jpountz commented on PR #12589: URL: https://github.com/apache/lucene/pull/12589#issuecomment-1733875246 Opening as a draft as I still need to figure out how to test this optimization. I tested on wikibigall where this yielded a good speedup. I would expect an even better speedup if recursive graph bisection is enabled. ``` TaskQPS baseline StdDevQPS my_modified_version StdDevPct diff p-value CountOrHighHigh 59.20 (15.6%) 57.96 (14.8%) -2.1% ( -28% - 33%) 0.671 CountOrHighMed 91.87 (15.6%) 90.04 (14.7%) -2.0% ( -27% - 33%) 0.684 MedTerm 451.03 (9.1%) 446.68 (8.1%) -1.0% ( -16% - 17%) 0.730 LowTerm 751.75 (8.3%) 744.56 (7.1%) -1.0% ( -15% - 15%) 0.703 MedPhrase 40.42 (7.8%) 40.05 (7.9%) -0.9% ( -15% - 16%) 0.717 HighTerm 339.69 (10.6%) 337.52 (9.5%) -0.6% ( -18% - 21%) 0.844 HighPhrase 35.00 (6.4%) 34.82 (6.7%) -0.5% ( -12% - 13%) 0.809 AndHighHigh 63.82 (4.2%) 63.55 (2.9%) -0.4% ( -7% -7%) 0.721 AndHighMed 182.62 (3.6%) 181.94 (2.4%) -0.4% ( -6% -5%) 0.705 LowPhrase 41.82 (4.8%) 41.69 (5.6%) -0.3% ( -10% - 10%) 0.848 HighTermDayOfYearSort 234.81 (1.5%) 234.33 (1.8%) -0.2% ( -3% -3%) 0.697 AndHighLow 823.51 (3.2%) 821.97 (2.6%) -0.2% ( -5% -5%) 0.846 OrHighLow 630.34 (2.9%) 630.52 (3.9%)0.0% ( -6% -6%) 0.979 Respell 66.16 (1.4%) 66.19 (1.7%)0.0% ( -2% -3%) 0.938 Fuzzy1 128.80 (1.4%) 128.97 (1.5%)0.1% ( -2% -3%) 0.777 CountAndHighMed 123.28 (2.8%) 123.60 (3.6%)0.3% ( -5% -6%) 0.803 Fuzzy2 114.67 (1.3%) 114.97 (1.5%)0.3% ( -2% -3%) 0.567 CountPhrase3.38 (9.0%)3.39 (9.3%)0.3% ( -16% - 20%) 0.918 CountAndHighHigh 40.81 (2.8%) 40.96 (3.8%)0.4% ( -6% -7%) 0.742 PKLookup 224.07 (2.3%) 225.12 (2.7%)0.5% ( -4% -5%) 0.566 Wildcard 144.23 (2.4%) 145.22 (3.1%)0.7% ( -4% -6%) 0.451 HighTermMonthSort 5090.41 (2.8%) 5142.25 (4.2%)1.0% ( -5% -8%) 0.384 Prefix3 241.24 (4.3%) 244.33 (4.4%)1.3% ( -7% - 10%) 0.368 CountTerm17070.19 (5.1%)17289.73 (4.9%)1.3% ( -8% - 11%) 0.429 IntNRQ 88.47 (11.5%) 89.76 (13.8%)1.5% ( -21% - 30%) 0.725 OrHighMed 191.58 (3.5%) 206.42 (3.7%)7.7% ( 0% - 15%) 0.000 OrHighHigh 60.88 (4.4%) 68.91 (4.4%) 13.2% ( 4% - 23%) 0.000 ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna merged pull request #12588: Shared executor for LuceneTestCase#newSearcher callers
javanna merged PR #12588: URL: https://github.com/apache/lucene/pull/12588 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] kaivalnp opened a new pull request, #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results
kaivalnp opened a new pull request, #12590: URL: https://github.com/apache/lucene/pull/12590 ### Context Vector search is performed in [`AbstractKnnVectorQuery`](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java), where individual HNSW searches are delegated to sub-classes via [`#approximateSearch`](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L174). This is useful to implement custom functionality (like say pro-rating `k` across segments for performance, or requesting some additional results over `k` for higher recall with Exact KNN) While the class in itself is `package-private`, we extend the corresponding sub-classes for [byte](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/KnnByteVectorQuery.java) and [float](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java) vectors (which are `public`) for implementing any custom functionality like above ### Issue After searching across all segments, we retain the index-level `topk` results [here](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L88), and immediately call [`#createRewrittenQuery`](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L219) to rewrite them as a [`DocAndScoreQuery`](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L297) So the implementing classes do not have access to the final `topK` results anywhere, which may be useful to read / modify like a post-process step (for example metric emission, or counting / only keeping results above a threshold) ### Proposal Can we make [`#createRewrittenQuery`](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L219) `protected` to allow sub-classes to override it (and ultimately access the `topK` results)? They can simply delegate to the original function at the end, or even implement a custom `Query` if required I don't how common of a use-case this is, and wanted to get some opinions Closes #12575 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results
benwtrent commented on PR #12590: URL: https://github.com/apache/lucene/pull/12590#issuecomment-1734167969 The new KnnCollector abstraction doesn't already address these needs? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] kaivalnp commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results
kaivalnp commented on PR #12590: URL: https://github.com/apache/lucene/pull/12590#issuecomment-1734230703 Thanks for the quick response @benwtrent! As far as I understand (please let me know if I'm missing something), the new [`KnnCollector`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/KnnCollector.java#L26) interface and corresponding [`#searchNearestVectors`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/LeafReader.java#L303-L330) API still collects per-leaf results right? Irrespective of whether we pass a custom implementation for this interface in [`#approximateSearch`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java#L81-L82), we will not have access to the final results after [merging across all leaves](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L88) - that is, no `KnnCollector` object holds results *across all segments*? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results
benwtrent commented on PR #12590: URL: https://github.com/apache/lucene/pull/12590#issuecomment-1734248507 @kaivalnp I guess you could have a collector that spans all the segments (created once at the query level). I am not really against this change, I am just wondering if there is a better existing option for what you are trying to achieve. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] kaivalnp commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results
kaivalnp commented on PR #12590: URL: https://github.com/apache/lucene/pull/12590#issuecomment-1734321355 > you could have a collector that spans all the segments (created once at the query level). Interesting, so a single `KnnCollector` passed to all [`searchNearestVectors`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java#L82) calls? However, `AbstractKnnVectorQuery` now [executes HNSW searches in parallel](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L104), and the `KnnCollector` is passed all the way to the low-level [`HnswGraphSearcher#searchLevel`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L189-L202), so we'll need some synchronization in collecting individual docs across threads? I'm not sure how much overhead this will cause, but seems like a lot? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results
benwtrent commented on PR #12590: URL: https://github.com/apache/lucene/pull/12590#issuecomment-1734371468 @kaivalnp depends on what you need to do. You can easily get around all this without any expensive locking. The collector has a "topDocs" method that could call some higher level collector. But that aside, you might as well just override the "rewrite" method if you need this fined grained control over doing specific things with top-k. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] kaivalnp commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results
kaivalnp commented on PR #12590: URL: https://github.com/apache/lucene/pull/12590#issuecomment-1734432239 > You can easily get around all this without any expensive locking. > The collector has a "topDocs" method that could call some higher level collector. Nice idea! So basically we pass a subclass of `TopKnnCollector` [here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java#L82), where the "topDocs" function is overloaded to pass the final results of each segment to a high-level collector when called? I wonder how this would be any different from passing the final `TopDocs` at the end of [`#approximateSearch`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java#L79) directly to the high-level collector? In any case, even if we pass the `TopDocs` from each segment to a high-level collector, it would have to combine the individual `TopDocs` results across all segments to figure out the index-level `topK` independently of `AbstractKnnVectorQuery` right? (duplicating [this piece of work](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L88) in both places) > you might as well just override the "rewrite" method Overriding `rewrite` (first calling `super.rewrite` and looking at the rewritten `Query`) may not work because the [`DocAndScoreQuery`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L297) class is `package-private` so we cannot read the internal [`docs`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L299) without reflection? Moreover, they are sorted in ascending order of `docid` already (and the ordering along scores is lost - so this may not be as useful) We can always copy the [entire `rewrite` function](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L64-L93) and the [`DocAndScoreQuery`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L297-L452) class into all implementers of `AbstractKnnVectorQuery` to get access to the `topK`, but this seems a bit overkill? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] vsop-479 commented on pull request #12528: Early terminate visit BKD leaf when current value greater than upper point in sorted dim.
vsop-479 commented on PR #12528: URL: https://github.com/apache/lucene/pull/12528#issuecomment-1734701754 @iverase Thanks for your recommendation, it may makes code more clear. I will try to implement it, and run the performance test. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org