Re: [PR] Fix for changelog verifier and milestone setter automation [lucene]
pseudo-nymous commented on PR #14369: URL: https://github.com/apache/lucene/pull/14369#issuecomment-2739129987 Moved it to draft state to address all the failures first. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Remove nonexistent PackedBlockLength reference in document [lucene]
amosbird opened a new pull request, #14377: URL: https://github.com/apache/lucene/pull/14377 ### Description Remove nonexistent `PackedBlockLength` reference in document. This seems to be a documentation artifact from version 912 onward, with no corresponding implementation found in the codebase.
Re: [PR] Add Issue Tracker Link under 'Editing Content on the Lucene™ Sites' [lucene-site]
DivyanshIITB commented on code in PR #78: URL: https://github.com/apache/lucene-site/pull/78#discussion_r2003517630 ## content/pages/site-instructions.md: ## @@ -3,8 +3,10 @@ URL: site-instructions.html save_as: site-instructions.html template: lucene/tlp/page + ## Editing Content on the Lucene™ sites +The web site is hosted in its own git repository **lucene-site** (see [Github](https://github.com/apache/lucene-site) and [Gitbox](https://gitbox.apache.org/repos/asf?p=lucene-site.git)). -The web site is hosted in its own git repository `lucene-site` (see [Github](https://github.com/apache/lucene-site/) and [Gitbox](https://gitbox.apache.org/repos/asf/lucene-site.git)). +Pushing to the `main` branch will update the staging site while pushing to `production` branch will update the main web site. Read the `README.md` file for further instructions. -Pushing to the `main` branch will update the [staging site](https://lucene.staged.apache.org) while pushing to `production` branch will update the main web site. Read the [README.md](https://github.com/apache/lucene-site/blob/main/README.md) file for further instructions. +For reporting website-related issues or suggesting improvements, please visit our [Issue Tracker](https://issues.apache.org/jira/projects/LUCENE). Review Comment: Thanks for the clarification! I've updated the issue tracker link to point to GitHub Issues instead of JIRA.
[I] Handling concurrent search in QueryProfiler [lucene]
jainankitk opened a new issue, #14375: URL: https://github.com/apache/lucene/issues/14375 ### Description Based on the discussion from [this email thread](https://lists.apache.org/thread.html/r7957a2d9ca38af45b1c370753b3c10542fd9faaf9bf95944c5224e12%40%3Cdev.lucene.apache.org%3E), https://github.com/apache/lucene/pull/144 added logic for compiling timings for different pieces of a query or multiple queries. The profiler logic doesn't account for multiple slices of concurrent segment search. We recently introduced [ConcurrentQueryProfiler](https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/search/profile/query/ConcurrentQueryProfiler.java) . It tries to merge the information across multiple slices into avg, min, max for each count/time metric, but the output is rather verbose (pasted at the end). I am wondering if this is something Lucene will benefit from? ``` "query": [ { "type": "TermQuery", "description": "field1:one", "time_in_nanos": 3944209, "max_slice_time_in_nanos": 3944209, "min_slice_time_in_nanos": 3892625, "avg_slice_time_in_nanos": 3918417, "breakdown": { "max_match": 0, "set_min_competitive_score_count": 0, "match_count": 0, "avg_score_count": 4, "shallow_advance_count": 0, "next_doc": 50625, "min_build_scorer": 1327791, "score_count": 8, "compute_max_score_count": 0, "advance": 96583, "min_set_min_competitive_score": 0, "min_advance": 4042, "score": 96751, "avg_set_min_competitive_score_count": 0, "min_match_count": 0, "avg_score": 65438, "max_next_doc_count": 7, "max_compute_max_score_count": 0, "avg_shallow_advance": 0, "max_shallow_advance_count": 0, "set_min_competitive_score": 0, "min_build_scorer_count": 2, "next_doc_count": 8, "min_match": 0, "avg_next_doc": 26250, "compute_max_score": 0, "min_set_min_competitive_score_count": 0, "max_build_scorer": 1722750, "avg_match_count": 0, "avg_advance": 50125, "build_scorer_count": 6, "avg_build_scorer_count": 3, "min_next_doc_count": 1, 
"min_shallow_advance_count": 0, "max_score_count": 7, "avg_match": 0, "avg_compute_max_score": 0, "max_advance": 96208, "avg_shallow_advance_count": 0, "avg_set_min_competitive_score": 0, "avg_compute_max_score_count": 0, "avg_build_scorer": 1525270, "max_set_min_competitive_score_count": 0, "advance_count": 3, "max_build_scorer_count": 4, "shallow_advance": 0, "min_compute_max_score": 0, "max_match_count": 0, "create_weight_count": 1, "build_scorer": 1830250, "max_set_min_competitive_score": 0, "max_compute_max_score": 0, "min_shallow_advance": 0,
Re: [PR] Add Issue Tracker Link under 'Editing Content on the Lucene™ Sites' [lucene-site]
dweiss merged PR #78: URL: https://github.com/apache/lucene-site/pull/78
Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]
gf2121 merged PR #14365: URL: https://github.com/apache/lucene/pull/14365
[PR] Add leafReaders() Method to IndexReader and Unit Test [lucene]
DivyanshIITB opened a new pull request, #14370: URL: https://github.com/apache/lucene/pull/14370 This PR introduces leafReaders() in IndexReader for direct access to LeafReader instances, improving usability over leaves(). A corresponding unit test ensures correctness by validating retrieval consistency and resource management. - Provides a convenient way to access leaf readers without manually iterating over leaves(). - Enhances code readability and usability for developers working with Lucene’s indexing system. - Ensures functionality is tested to prevent regressions. Testing & Validation: ✅ Successfully retrieves LeafReader instances. ✅ Ensures the size matches reader.leaves().size(). ✅ Properly closes resources to avoid memory leaks. Fixes #14367
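For illustration only, the convenience described in this PR amounts to unwrapping each leaf context into its reader. The sketch below uses simplified stand-in types so that it runs without Lucene on the classpath: the `LeafReader` and `LeafReaderContext` records and the `leafReaders` helper are written for this example and are not Lucene's classes or the PR's implementation.

```java
import java.util.List;

final class LeafReadersSketch {
  // Stand-in for a per-segment reader; only a name is needed for the demo.
  record LeafReader(String name) {}

  // Stand-in for the context object that wraps a leaf reader and its doc base.
  record LeafReaderContext(LeafReader reader, int docBase) {}

  // Hypothetical helper: unwrap every context into its reader, preserving order.
  static List<LeafReader> leafReaders(List<LeafReaderContext> leaves) {
    return leaves.stream().map(LeafReaderContext::reader).toList();
  }

  public static void main(String[] args) {
    List<LeafReaderContext> leaves = List.of(
        new LeafReaderContext(new LeafReader("_0"), 0),
        new LeafReaderContext(new LeafReader("_1"), 100));
    List<LeafReader> readers = leafReaders(leaves);
    // The unwrapped list has one reader per leaf context, in the same order.
    System.out.println(readers.size() == leaves.size());
  }
}
```

The size check mirrors the validation listed in the PR description (the number of readers matches the number of leaves).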
Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]
gf2121 commented on PR #14365: URL: https://github.com/apache/lucene/pull/14365#issuecomment-2736218538
I ran some benchmarks to find the main reason for the speedup:
**Baseline**: main branch
**Candidate**: collecting docs greater than maxDocVisited into bitset (instead of `DocIdSetBuilder`)
```
Task                 QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff            p-value
CountFilteredIntNRQ         41.78  (3.9%)                    41.05  (2.9%)     -1.7% ( -8% -  5%)  0.111
IntSet                      84.83  (2.5%)                    84.34  (2.3%)     -0.6% ( -5% -  4%)  0.441
FilteredIntNRQ              77.52  (3.5%)                    78.06  (3.2%)      0.7% ( -5% -  7%)  0.516
IntNRQ                      80.49  (3.1%)                    82.13  (3.2%)      2.0% ( -4% -  8%)  0.041
TermDTSort                  59.85  (2.1%)                    66.70  (2.4%)     11.4% (  6% - 16%)  0.000
TermDayOfYearSort           61.19  (2.3%)                    68.41  (4.3%)     11.8% (  4% - 18%)  0.000
```
**Baseline**: collecting docs greater than maxDocVisited into bitset
**Candidate**: collecting all docs into bitset (no `if (doc > maxDocVisited)`)
```
Task                 QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff            p-value
IntNRQ                      82.00  (3.5%)                    80.69  (2.8%)     -1.6% ( -7% -  4%)  0.309
IntSet                      84.61  (2.0%)                    84.22  (2.7%)     -0.5% ( -5% -  4%)  0.697
FilteredIntNRQ              78.12  (1.5%)                    78.13  (1.8%)      0.0% ( -3% -  3%)  0.991
TermDTSort                  66.41  (2.9%)                    67.74  (3.2%)      2.0% ( -4% -  8%)  0.192
CountFilteredIntNRQ         40.68  (4.8%)                    41.57  (2.3%)      2.2% ( -4% -  9%)  0.244
TermDayOfYearSort           69.90  (2.8%)                    71.88  (3.3%)      2.8% ( -3% -  9%)  0.064
```
It looks like 'more chance to become a bitset' contributes more to the speedup.
Re: [PR] Fix for changelog verifier and milestone setter automation [lucene]
pseudo-nymous commented on PR #14369: URL: https://github.com/apache/lucene/pull/14369#issuecomment-2736166568 Yes, the fix here would fix the `fatal: bad object` failure. I haven't seen the first failure before; let me address this.
Re: [PR] Completion FSTs to be loaded off-heap at all times [lucene]
javanna commented on PR #14364: URL: https://github.com/apache/lucene/pull/14364#issuecomment-2736271001 Cool, then I will target this PR at main only, and open a separate PR for `branch_10x`. Out of curiosity, what are the use cases where you'd expect users to call `NRTSuggester#load` directly?
Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]
DivyanshIITB commented on PR #14335: URL: https://github.com/apache/lucene/pull/14335#issuecomment-2735879968 Thank you for your help, @vigyasharma! I would love to explore smaller and more focused issues to start with!
Re: [PR] Completion FSTs to be loaded off-heap at all times [lucene]
jpountz commented on PR #14364: URL: https://github.com/apache/lucene/pull/14364#issuecomment-2735857094 I was thinking of keeping the `load(IndexInput, FSTLoadMode)` static method, documenting that the load mode is ignored and deprecating it. Indeed that would require keeping the `FSTLoadMode` enum, which would be deprecated too.
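A minimal sketch of the back-compat shape suggested here, under stated assumptions: `Input`, `Suggester`, and the `FSTLoadMode` enum below are placeholders defined for this example, not Lucene's `IndexInput`, `NRTSuggester`, or the real enum, and the enum constants are illustrative. The deprecated overload accepts the mode, ignores it, and delegates to the new entry point.

```java
final class LoadShimSketch {
  /** Placeholder for the enum that would be kept but deprecated. */
  @Deprecated
  enum FSTLoadMode { ON_HEAP, OFF_HEAP, AUTO }

  /** Placeholder for the input being read. */
  record Input(String name) {}

  /** Placeholder for the loaded suggester; records whether it is off-heap. */
  record Suggester(String source, boolean offHeap) {}

  /** New entry point: always loads off-heap. */
  static Suggester load(Input in) {
    return new Suggester(in.name(), true);
  }

  /**
   * @deprecated the load mode is ignored; use {@link #load(Input)} instead.
   */
  @Deprecated
  static Suggester load(Input in, FSTLoadMode ignoredMode) {
    return load(in);
  }

  public static void main(String[] args) {
    // Even when ON_HEAP is requested, the shim loads off-heap.
    Suggester s = load(new Input("completion.fst"), FSTLoadMode.ON_HEAP);
    System.out.println(s.offHeap());
  }
}
```

Keeping the old signature lets existing callers compile against a minor release, while the deprecation notice points them at the single-argument method.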
Re: [PR] Completion FSTs to be loaded off-heap at all times [lucene]
javanna commented on PR #14364: URL: https://github.com/apache/lucene/pull/14364#issuecomment-2735835683 Thanks @jpountz, what's your suggestion around back-compat? Sounds like you are suggesting not backporting the removal of the FST load mode enum but only the switch to off-heap by default, is that accurate?
Re: [PR] Fix for changelog verifier and milestone setter automation [lucene]
stefanvodita commented on PR #14369: URL: https://github.com/apache/lucene/pull/14369#issuecomment-2735852996 I'm happy to try this out, but I have some doubts it addresses the issue we're experiencing now. For example, take the failure [here](https://github.com/apache/lucene/actions/runs/13876712827/job/38829937408). In step 1, we `Could not resolve to a PullRequest with the number of 14301.` The PR definitely exists. Then, in step 2 we get `fatal: bad object`. The fix here is meant to fix that, right? But I wonder if the issue goes deeper. What do you think @pseudo-nymous?
[PR] Adjust visibility of NRTSuggester#load [lucene]
javanna opened a new pull request, #14372: URL: https://github.com/apache/lucene/pull/14372 load is a public static method, but its corresponding builder NRTSuggesterBuilder is package private. That means that there is no reason for load to be public.
Re: [I] Add issue tracker for website [lucene-site]
dweiss closed issue #72: Add issue tracker for website URL: https://github.com/apache/lucene-site/issues/72
Re: [PR] Add Issue Tracker Link under 'Editing Content on the Lucene™ Sites' [lucene-site]
DivyanshIITB commented on PR #78: URL: https://github.com/apache/lucene-site/pull/78#issuecomment-2737508089 Just a gentle reminder @dweiss
Re: [PR] Speedup merging of HNSW graphs [lucene]
mayya-sharipova commented on PR #14331: URL: https://github.com/apache/lucene/pull/14331#issuecomment-2737598747
I've done additional benchmarks with the new Optimized Scalar Quantization format that quantizes 32x, down to a single bit (Lucene102HnswBinaryQuantizedVectorsFormat). Here the improvements are smaller, but still present:
### Experiment 3 new QSQ format:
The average speedups from baseline to candidate are:
Index Time Speedup: **1.33x**
Force Merge Speedup: **1.34x**
Evaluation is done with Luceneutil on these datasets:
1. **quora-E5-small**; 522931 docs; 384 dims; 7 bits quantized; cosine metric
   - baseline: index time: **70.71s**, force merge: **59.38s**
   - candidate: index time: **58.25s**, force merge: **40.15s**
2. **cohere-wikipedia-v2**; 1M docs; 768 dims; 7 bits quantized; cosine metric
   - baseline: index time: **203.08s**, force merge: **107.27s**
   - candidate: index time: **142.27s**, force merge: **85.68s**
3. **gist**; 960 dims, 1M docs; 7 bits quantized; euclidean metric
   - baseline: index time: **110.35s**, force merge: **323.66s**
   - candidate: index time: **105.52s**, force merge: **202.20s**
4. **cohere-wikipedia-v3**; 1M docs; 1024 dims; 7 bits quantized; dot_product metric
   - baseline: index time: **313.43s**, force merge: **165.98s**
   - candidate: index time: **190.63s**, force merge: **159.95s**
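As a quick arithmetic check, and assuming the reported averages are arithmetic means of the per-dataset baseline/candidate ratios (the comment does not say how they were computed), the four datasets above reproduce the quoted 1.33x and 1.34x figures. The class and method names in this sketch are made up for the example.

```java
import java.util.Locale;

final class SpeedupCheck {
  // Arithmetic mean of the per-dataset ratios baseline[i] / candidate[i].
  static double meanSpeedup(double[] baseline, double[] candidate) {
    double sum = 0;
    for (int i = 0; i < baseline.length; i++) {
      sum += baseline[i] / candidate[i];
    }
    return sum / baseline.length;
  }

  public static void main(String[] args) {
    // Times in seconds, in dataset order 1..4 as listed in the comment.
    double[] indexBase = {70.71, 203.08, 110.35, 313.43};
    double[] indexCand = {58.25, 142.27, 105.52, 190.63};
    double[] mergeBase = {59.38, 107.27, 323.66, 165.98};
    double[] mergeCand = {40.15, 85.68, 202.20, 159.95};

    System.out.println(String.format(Locale.ROOT,
        "index: %.2fx, force merge: %.2fx",
        meanSpeedup(indexBase, indexCand),
        meanSpeedup(mergeBase, mergeCand)));
  }
}
```

The per-dataset ratios vary widely (roughly 1.05x to 1.64x for index time), so the mean alone hides which datasets benefit most.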
Re: [PR] Completion FSTs to be loaded off-heap at all times [lucene]
jpountz commented on PR #14364: URL: https://github.com/apache/lucene/pull/14364#issuecomment-2736470811 Good question. The class is public and looked like a user-facing API, hence my comment, but you can't serialize an NRTSuggester yourself since `NRTSuggesterBuilder` is pkg-private. So it looks like this load method should really be pkg-private too?
Re: [PR] Add leafReaders() Method to IndexReader and Unit Test [lucene]
jainankitk commented on PR #14370: URL: https://github.com/apache/lucene/pull/14370#issuecomment-273732 > Eh, I am not sold that this change needs to occur if ever. While, "this is how its always been" isn't a good argument for some things, I think expanding the public, and then backwards compatible API needs careful consideration. I agree with @benwtrent. That's why I left the comment below in [the original issue](https://github.com/apache/lucene/issues/14367): _Really minor at this point, and probably not worth going through the pain of deprecating IndexReader#leaves and changing at few hundred places_ Additionally, I don't think this change really addresses the confusion being called out in https://github.com/apache/lucene/issues/14367. Ideally, the default behavior of `IndexReader#leaves` should be what is being achieved here via `IndexReader#leafReaders`, and there should be a more explicit method called `IndexReader#leafReaderContexts` for what `IndexReader#leaves` is doing today.
[PR] Dummy [lucene]
ogprakash opened a new pull request, #14376: URL: https://github.com/apache/lucene/pull/14376 ### Description
Re: [PR] Completion FSTs to be loaded off-heap at all times [lucene]
javanna commented on PR #14364: URL: https://github.com/apache/lucene/pull/14364#issuecomment-2737720139 I merged main in after merging #14372 and added a changelog entry. I believe this is ready to go, and can now be backported to branch_10x as-is.
Re: [PR] Dummy [lucene]
ogprakash commented on PR #14376: URL: https://github.com/apache/lucene/pull/14376#issuecomment-2737721997 It was meant for testing on a fork branch.
Re: [PR] Dummy [lucene]
ogprakash closed pull request #14376: Dummy URL: https://github.com/apache/lucene/pull/14376
Re: [PR] Avoid using time zones that emit warnings (jdk25+) [lucene]
dweiss commented on PR #14328: URL: https://github.com/apache/lucene/pull/14328#issuecomment-273592 I've backported this to branch_10x.
Re: [PR] Add CHANGES entry for CheckIndex HNSW work [lucene]
javanna commented on PR #14120: URL: https://github.com/apache/lucene/pull/14120#issuecomment-2737700794 Heads up: the original change was backported, but the changelog entry (filed under 10.2) was not; I just backported it to branch_10x.
Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]
jpountz commented on PR #14365: URL: https://github.com/apache/lucene/pull/14365#issuecomment-2736720587 Interesting. I remember playing with calling `BulkAdder#grow` on the estimated number of matching points (to upgrade to a bitset immediately instead of waiting for docs to be collected) a while back and it didn't help, but maybe it does now with the great speedups that you merged lately.
Re: [PR] Add Issue Tracker Link under 'Editing Content on the Lucene™ Sites' [lucene-site]
DivyanshIITB commented on PR #78: URL: https://github.com/apache/lucene-site/pull/78#issuecomment-2736871034 Just a gentle reminder @sebbASF
[PR] Optimize ParallelLeafReader to improve term vector fetching efficienc [lucene]
DivyanshIITB opened a new pull request, #14373: URL: https://github.com/apache/lucene/pull/14373 This PR optimizes ParallelLeafReader to avoid redundant term vector fetching. - Replaces per-field term vector fetching with a single call per reader. - Reduces complexity from O(n^2) to O(n). - Improves performance when handling large numbers of fields. - Verified via existing tests. Closes #7926
Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]
jpountz commented on code in PR #14365: URL: https://github.com/apache/lucene/pull/14365#discussion_r2004331440 ## lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java: ## @@ -47,6 +47,8 @@ public sealed interface BulkAdder permits FixedBitSetAdder, BufferAdder { void add(IntsRef docs); void add(DocIdSetIterator iterator) throws IOException; + +void add(IntsRef docs, int docLowerBoundExclusive); Review Comment: nit: I'd prefer the lower bound to be inclusive, it's more consistent with e.g. DocIdSetIterator#advance(target) where the target is inclusive as well
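To make the inclusive-bound suggestion concrete, here is a small sketch that uses `java.util.BitSet` and a plain `int[]` in place of Lucene's bitset and `IntsRef`. The `add` helper and its parameter name are hypothetical and are not the method under review; the point is only that an inclusive bound keeps every doc ID greater than or equal to the target.

```java
import java.util.BitSet;

final class LowerBoundAddSketch {
  // Hypothetical helper: set a bit for every doc ID >= docLowerBoundInclusive.
  static void add(BitSet bits, int[] docs, int docLowerBoundInclusive) {
    for (int doc : docs) {
      if (doc >= docLowerBoundInclusive) {
        bits.set(doc);
      }
    }
  }

  public static void main(String[] args) {
    BitSet bits = new BitSet();
    // With an inclusive bound of 8, doc 8 itself is kept; 3 and 7 are skipped.
    add(bits, new int[] {3, 7, 8, 12}, 8);
    System.out.println(bits); // prints {8, 12}
  }
}
```

Under this convention, a caller that previously filtered with `doc > maxDocVisited` would pass `maxDocVisited + 1` as the bound.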
Re: [PR] Completion FSTs to be loaded off-heap at all times [lucene]
javanna merged PR #14364: URL: https://github.com/apache/lucene/pull/14364
Re: [PR] Fixing quantization interval initialization for optimized sq [lucene]
benwtrent merged PR #14374: URL: https://github.com/apache/lucene/pull/14374
Re: [PR] Speedup merging of HNSW graphs [lucene]
benwtrent commented on PR #14331: URL: https://github.com/apache/lucene/pull/14331#issuecomment-2737726576 > Experiment 3 new QSQ format: ... These improvements make sense to me. The overall bottleneck of vector ops is way lower here, so simply doing fewer ops isn't going to have as big an impact, though it's nice to see that the impact is measurable and still nice :D
Re: [I] Handling concurrent search in QueryProfiler [lucene]
jpountz commented on issue #14375: URL: https://github.com/apache/lucene/issues/14375#issuecomment-2738214946 This looks like it could be useful. Maybe it tries to do too much by providing min/avg/max aggregates and it should just provide per-slice breakdowns, leaving whether and how to compile aggregates to the application? Out of curiosity, how do you know if two calls to `advance()` come from the same slice or different slices?
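One way to picture the split proposed here: the profiler would expose only per-slice numbers, and the application would derive min/avg/max itself. The sketch below is an assumption-laden illustration, not a profiler API: `SliceTiming` and `summarize` are hypothetical names, and the two input values are the max and min slice times from the sample output in the issue, treated as if they came from two slices.

```java
import java.util.List;
import java.util.LongSummaryStatistics;

final class SliceAggregationSketch {
  // Hypothetical per-slice record: a slice id and the time it spent, in nanos.
  record SliceTiming(int sliceId, long timeInNanos) {}

  // Application-side aggregation over whatever per-slice data is exposed.
  static LongSummaryStatistics summarize(List<SliceTiming> slices) {
    return slices.stream().mapToLong(SliceTiming::timeInNanos).summaryStatistics();
  }

  public static void main(String[] args) {
    // Assumed two-slice input, using the issue's max/min slice times.
    List<SliceTiming> slices = List.of(
        new SliceTiming(0, 3944209L),
        new SliceTiming(1, 3892625L));
    LongSummaryStatistics stats = summarize(slices);
    System.out.println("min=" + stats.getMin()
        + " max=" + stats.getMax()
        + " avg=" + (long) stats.getAverage());
  }
}
```

With these two inputs the computed average is 3918417, which matches `avg_slice_time_in_nanos` in the issue's sample output, so the aggregates carry no information beyond the per-slice values.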
Re: [PR] Completion FSTs to be loaded off-heap at all times [lucene]
javanna commented on PR #14364: URL: https://github.com/apache/lucene/pull/14364#issuecomment-2738284353 Thanks @jpountz for all the help!
[PR] Speed up advancing within a sparse block in IndexedDISI. [lucene]
vsop-479 opened a new pull request, #14371: URL: https://github.com/apache/lucene/pull/14371 ### Description Similar to https://github.com/apache/lucene/pull/13692.
Re: [PR] Speedup merging of HNSW graphs [lucene]
msokolov commented on PR #14331: URL: https://github.com/apache/lucene/pull/14331#issuecomment-2737095119 Yes, looks good, I think this is the right tradeoff. We even seem to get improved query performance in some cases. +1 to merge this.
Re: [PR] Completion FSTs to be loaded off-heap at all times [lucene]
javanna commented on PR #14364: URL: https://github.com/apache/lucene/pull/14364#issuecomment-2736648444 I opened #14372 to address the visibility issue of `load`; once merged, that should simplify this PR and its backport.
Re: [PR] Speedup merging of HNSW graphs [lucene]
benwtrent commented on code in PR #14331: URL: https://github.com/apache/lucene/pull/14331#discussion_r2003974286

## lucene/core/src/java/org/apache/lucene/util/hnsw/ConcurrentHnswMerger.java: ##
@@ -51,19 +57,85 @@ protected HnswBuilder createBuilder(KnnVectorValues mergedVectorValues, int maxO
     OnHeapHnswGraph graph;
     BitSet initializedNodes = null;
-    if (initReader == null) {
+    if (graphReaders.size() == 0) {
       graph = new OnHeapHnswGraph(M, maxOrd);
     } else {
+      graphReaders.sort(Comparator.comparingInt(GraphReader::graphSize).reversed());
+      GraphReader initGraphReader = graphReaders.get(0);
+      KnnVectorsReader initReader = initGraphReader.reader();
+      MergeState.DocMap initDocMap = initGraphReader.initDocMap();
+      int initGraphSize = initGraphReader.graphSize();
       HnswGraph initializerGraph = ((HnswGraphProvider) initReader).getGraph(fieldInfo.name);
+
       if (initializerGraph.size() == 0) {
         graph = new OnHeapHnswGraph(M, maxOrd);
       } else {
         initializedNodes = new FixedBitSet(maxOrd);
-        int[] oldToNewOrdinalMap = getNewOrdMapping(mergedVectorValues, initializedNodes);
+        int[] oldToNewOrdinalMap =
+            getNewOrdMapping(
+                fieldInfo,
+                initReader,
+                initDocMap,
+                initGraphSize,
+                mergedVectorValues,
+                initializedNodes);
         graph = InitializedHnswGraphBuilder.initGraph(initializerGraph, oldToNewOrdinalMap, maxOrd);
       }
     }
     return new HnswConcurrentMergeBuilder(
         taskExecutor, numWorker, scorerSupplier, beamWidth, graph, initializedNodes);
   }
+
+  /**
+   * Creates a new mapping from old ordinals to new ordinals and returns the total number of vectors
+   * in the newly merged segment.
+   *
+   * @param mergedVectorValues vector values in the merged segment
+   * @param initializedNodes track what nodes have been initialized
+   * @return the mapping from old ordinals to new ordinals
+   * @throws IOException If an error occurs while reading from the merge state
+   */
+  private static final int[] getNewOrdMapping(
+      FieldInfo fieldInfo,
+      KnnVectorsReader initReader,
+      MergeState.DocMap initDocMap,
+      int initGraphSize,
+      KnnVectorValues mergedVectorValues,
+      BitSet initializedNodes)
+      throws IOException {
+    KnnVectorValues.DocIndexIterator initializerIterator = null;
+
+    switch (fieldInfo.getVectorEncoding()) {
+      case BYTE -> initializerIterator = initReader.getByteVectorValues(fieldInfo.name).iterator();
+      case FLOAT32 ->
+          initializerIterator = initReader.getFloatVectorValues(fieldInfo.name).iterator();
+    }
+
+    IntIntHashMap newIdToOldOrdinal = new IntIntHashMap(initGraphSize);
+    int maxNewDocID = -1;
+    for (int docId = initializerIterator.nextDoc();
+        docId != NO_MORE_DOCS;
+        docId = initializerIterator.nextDoc()) {
+      int newId = initDocMap.get(docId);
+      maxNewDocID = Math.max(newId, maxNewDocID);
+      newIdToOldOrdinal.put(newId, initializerIterator.index());
+    }
+
+    if (maxNewDocID == -1) {
+      return new int[0];
+    }
+    final int[] oldToNewOrdinalMap = new int[initGraphSize];
+    KnnVectorValues.DocIndexIterator mergedVectorIterator = mergedVectorValues.iterator();
+    for (int newDocId = mergedVectorIterator.nextDoc();
+        newDocId <= maxNewDocID;
+        newDocId = mergedVectorIterator.nextDoc()) {
+      int hashDocIndex = newIdToOldOrdinal.indexOf(newDocId);
+      if (newIdToOldOrdinal.indexExists(hashDocIndex)) {

Review Comment: Is this stuff around `indexOf`, `indexExists`, etc. just a performance improvement over a simple `newIdToOldOrdinal.get(...)`? Looking at the `IntIntHashMap`, it's weird that "does not exist" may actually just be `0`, where `0` is a valid doc id :/.
## lucene/core/src/java/org/apache/lucene/util/hnsw/ConcurrentHnswMerger.java: ## @@ -51,19 +57,85 @@ protected HnswBuilder createBuilder(KnnVectorValues mergedVectorValues, int maxO OnHeapHnswGraph graph; BitSet initializedNodes = null; -if (initReader == null) { +if (graphReaders.size() == 0) { graph = new OnHeapHnswGraph(M, maxOrd); } else { + graphReaders.sort(Comparator.comparingInt(GraphReader::graphSize).reversed()); + GraphReader initGraphReader = graphReaders.get(0); + KnnVectorsReader initReader = initGraphReader.reader(); + MergeState.DocMap initDocMap = initGraphReader.initDocMap(); + int initGraphSize = initGraphReader.graphSize(); HnswGraph initializerGraph = ((HnswGraphProvider) initReader).getGraph(fieldInfo.name); + if (initializerGraph.size() == 0) { graph = new OnHeapHnswGraph(M, maxOrd); } else { initializedNodes = new FixedBitSet(maxOrd); -int[] oldToNewOrdinalMap = getNewOrdMapping(mergedVectorValues, init
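The ambiguity raised in the review comment can be shown with a small stand-in. The sketch below uses a hypothetical `TinyIntIntMap` written for this note; it is not Lucene's `IntIntHashMap`, and it only assumes the same general shape (a `get` that returns `0` for a missing key, plus slot-based `indexOf`/`indexExists`/`indexGet`). It has a fixed capacity and no resizing, so it must be created larger than the number of entries.

```java
/** Hypothetical stand-in for a primitive int-to-int map; not Lucene's IntIntHashMap. */
public class TinyIntIntMap {
  private final int[] keys;
  private final int[] values;
  private final boolean[] used;

  /** Capacity is fixed and must exceed the number of entries (no resizing in this sketch). */
  public TinyIntIntMap(int capacity) {
    keys = new int[capacity];
    values = new int[capacity];
    used = new boolean[capacity];
  }

  /** Returns the slot holding key, or the bitwise complement of the free slot it would occupy. */
  public int indexOf(int key) {
    int slot = Math.floorMod(key, keys.length);
    while (used[slot]) {
      if (keys[slot] == key) {
        return slot;
      }
      slot = (slot + 1) % keys.length; // linear probing
    }
    return ~slot; // negative value signals "absent"
  }

  public boolean indexExists(int index) {
    return index >= 0;
  }

  public int indexGet(int index) {
    return values[index];
  }

  public void put(int key, int value) {
    int slot = indexOf(key);
    if (slot < 0) {
      slot = ~slot;
      used[slot] = true;
      keys[slot] = key;
    }
    values[slot] = value;
  }

  /** Returns 0 for a missing key, which is indistinguishable from a stored 0. */
  public int get(int key) {
    int slot = indexOf(key);
    return slot >= 0 ? values[slot] : 0;
  }

  public static void main(String[] args) {
    TinyIntIntMap newIdToOldOrdinal = new TinyIntIntMap(8);
    newIdToOldOrdinal.put(5, 0); // new doc 5 maps to old ordinal 0, a valid value

    // get() cannot tell these two cases apart: both print 0.
    System.out.println(newIdToOldOrdinal.get(5));
    System.out.println(newIdToOldOrdinal.get(6));

    // The slot-based lookup can: prints true, then false.
    System.out.println(newIdToOldOrdinal.indexExists(newIdToOldOrdinal.indexOf(5)));
    System.out.println(newIdToOldOrdinal.indexExists(newIdToOldOrdinal.indexOf(6)));
  }
}
```

Under these assumptions, the `indexOf`/`indexExists` pair is about correctness as much as speed: it distinguishes "mapped to ordinal 0" from "not mapped" with a single probe.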
Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]
gf2121 commented on PR #14365: URL: https://github.com/apache/lucene/pull/14365#issuecomment-2737667396

> I remember playing with calling BulkAdder#grow on the estimated number of matching points (to upgrade to a bitset immediately instead of waiting for docs to be collected) a while back and it didn't help.

This is a neat idea. I tried the approach just now and saw less of an improvement:

```
Task                 QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff            p-value
CountFilteredIntNRQ  41.25         (1.6%)  40.31                    (2.7%)  -2.3% ( -6% - 2%)   0.040
IntNRQ               81.60         (2.5%)  80.25                    (2.9%)  -1.7% ( -6% - 3%)   0.214
FilteredIntNRQ       77.33         (1.6%)  77.43                    (3.1%)   0.1% ( -4% - 4%)   0.918
IntSet               84.32         (2.3%)  84.86                    (2.3%)   0.6% ( -3% - 5%)   0.584
TermDayOfYearSort    59.13         (3.0%)  60.22                    (3.1%)   1.8% ( -4% - 8%)   0.224
TermDTSort           58.72         (1.2%)  61.36                    (3.1%)   4.5% ( 0% - 8%)    0.000
```

So:

1. Bulk adding without the `if` gets ~10% faster.
2. Adding to a bitset with the `if` gets ~10% faster.
3. Pre-growing `docIdSetBuilder` only gets less than 5% faster.

This is a bit confusing. Rethinking these cases, it occurs to me that the cause could be the abstraction layer of the bulk adder, which does not exist in cases 1 and 2 but does exist in case 3. So I tried introducing a `void add(IntsRef docs, int docLowerBoundExclusive);`, and it works:

```
Task                 QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff            p-value
FilteredIntNRQ       77.52         (2.6%)  76.83                    (3.5%)  -0.9% ( -6% - 5%)   0.496
IntSet               82.79         (1.4%)  82.80                    (3.2%)   0.0% ( -4% - 4%)   0.990
IntNRQ               79.16         (2.2%)  79.76                    (4.0%)   0.8% ( -5% - 7%)   0.580
CountFilteredIntNRQ  40.34         (2.5%)  40.79                    (3.0%)   1.1% ( -4% - 6%)   0.347
TermDTSort           59.16         (2.3%)  66.19                    (2.2%)  11.9% ( 7% - 16%)   0.000
TermDayOfYearSort    59.71         (3.0%)  67.65                    (3.5%)  13.3% ( 6% - 20%)   0.000
```
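As a rough illustration of what a bulk `add` with an exclusive lower bound might look like, here is a minimal sketch. It is not the code from the PR: the class name `LowerBoundBulkAdd` is hypothetical, a plain `int[]` with offset and length stands in for `IntsRef`, `java.util.BitSet` stands in for the doc ID set being built, and the assumed semantics (only doc IDs strictly greater than the bound are added) are a guess based on the parameter name.

```java
import java.util.BitSet;

public class LowerBoundBulkAdd {

  /**
   * Adds docs[offset .. offset + length) to target in one call, skipping any
   * doc ID that is less than or equal to docLowerBoundExclusive.
   */
  public static void add(
      int[] docs, int offset, int length, int docLowerBoundExclusive, BitSet target) {
    for (int i = offset, end = offset + length; i < end; i++) {
      int doc = docs[i];
      if (doc > docLowerBoundExclusive) {
        target.set(doc);
      }
    }
  }

  public static void main(String[] args) {
    BitSet collected = new BitSet();
    // Docs 3 and 7 are at or below the bound and are skipped.
    add(new int[] {3, 7, 10, 42}, 0, 4, 7, collected);
    System.out.println(collected); // prints {10, 42}
  }
}
```

The point of the sketch is that the bound check and the loop live in one method, so the caller makes a single call per batch rather than one virtual call per document.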
Re: [PR] Adjust visibility of NRTSuggester#load [lucene]
javanna merged PR #14372: URL: https://github.com/apache/lucene/pull/14372
Re: [PR] Cover all DataType [lucene]
javanna commented on PR #14091: URL: https://github.com/apache/lucene/pull/14091#issuecomment-2737689624 Heya, the entry in the changelog was filed under 10.2, but the change was never backported. Either we move the changelog entry, or we backport the change :)