Re: [PR] Fix for changelog verifier and milestone setter automation [lucene]

2025-03-19 Thread via GitHub


pseudo-nymous commented on PR #14369:
URL: https://github.com/apache/lucene/pull/14369#issuecomment-2739129987

   Moved it to draft state to address all the failures first.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Remove nonexistent PackedBlockLength reference in document [lucene]

2025-03-19 Thread via GitHub


amosbird opened a new pull request, #14377:
URL: https://github.com/apache/lucene/pull/14377

   ### Description
   
   Remove the nonexistent `PackedBlockLength` reference from the documentation. 
This seems to be a documentation artifact from version 912 onward, with no 
corresponding implementation found in the codebase.
   
   
   





Re: [PR] Add Issue Tracker Link under 'Editing Content on the Lucene™ Sites' [lucene-site]

2025-03-19 Thread via GitHub


DivyanshIITB commented on code in PR #78:
URL: https://github.com/apache/lucene-site/pull/78#discussion_r2003517630


##
content/pages/site-instructions.md:
##
@@ -3,8 +3,10 @@ URL: site-instructions.html
 save_as: site-instructions.html
 template: lucene/tlp/page
 
+
 ## Editing Content on the Lucene™ sites
+The web site is hosted in its own git repository **lucene-site** (see 
[Github](https://github.com/apache/lucene-site) and 
[Gitbox](https://gitbox.apache.org/repos/asf?p=lucene-site.git)).
 
-The web site is hosted in its own git repository `lucene-site` (see 
[Github](https://github.com/apache/lucene-site/) and 
[Gitbox](https://gitbox.apache.org/repos/asf/lucene-site.git)).
+Pushing to the `main` branch will update the staging site while pushing to 
`production` branch will update the main web site. Read the `README.md` file 
for further instructions.
 
-Pushing to the `main` branch will update the [staging 
site](https://lucene.staged.apache.org) while pushing to `production` branch 
will update the main web site. Read the 
[README.md](https://github.com/apache/lucene-site/blob/main/README.md) file for 
further instructions.
+For reporting website-related issues or suggesting improvements, please visit 
our [Issue Tracker](https://issues.apache.org/jira/projects/LUCENE).

Review Comment:
   Thanks for the clarification! I've updated the issue tracker link to point 
to GitHub Issues instead of JIRA.






[I] Handling concurrent search in QueryProfiler [lucene]

2025-03-19 Thread via GitHub


jainankitk opened a new issue, #14375:
URL: https://github.com/apache/lucene/issues/14375

   ### Description
   
   Based on the discussion from [this email 
thread](https://lists.apache.org/thread.html/r7957a2d9ca38af45b1c370753b3c10542fd9faaf9bf95944c5224e12%40%3Cdev.lucene.apache.org%3E),
 https://github.com/apache/lucene/pull/144 added logic for compiling timings 
for different pieces of a query or multiple queries. The profiler logic doesn't 
account for multiple slices of concurrent segment search.
   
   We recently introduced 
[ConcurrentQueryProfiler](https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/search/profile/query/ConcurrentQueryProfiler.java)
 . It tries to merge the information across multiple slices into avg, min, max 
for each count/time metric, but the output is rather verbose (pasted at the 
end). I am wondering if this is something Lucene will benefit from? 
   
   ```
   "query":
   [
   {
   "type": "TermQuery",
   "description": "field1:one",
   "time_in_nanos": 3944209,
   "max_slice_time_in_nanos": 3944209,
   "min_slice_time_in_nanos": 3892625,
   "avg_slice_time_in_nanos": 3918417,
   "breakdown":
   {
   "max_match": 0,
   "set_min_competitive_score_count": 0,
   "match_count": 0,
   "avg_score_count": 4,
   "shallow_advance_count": 0,
   "next_doc": 50625,
   "min_build_scorer": 1327791,
   "score_count": 8,
   "compute_max_score_count": 0,
   "advance": 96583,
   "min_set_min_competitive_score": 0,
   "min_advance": 4042,
   "score": 96751,
   "avg_set_min_competitive_score_count": 0,
   "min_match_count": 0,
   "avg_score": 65438,
   "max_next_doc_count": 7,
   "max_compute_max_score_count": 0,
   "avg_shallow_advance": 0,
   "max_shallow_advance_count": 0,
   "set_min_competitive_score": 0,
   "min_build_scorer_count": 2,
   "next_doc_count": 8,
   "min_match": 0,
   "avg_next_doc": 26250,
   "compute_max_score": 0,
   "min_set_min_competitive_score_count": 0,
   "max_build_scorer": 1722750,
   "avg_match_count": 0,
   "avg_advance": 50125,
   "build_scorer_count": 6,
   "avg_build_scorer_count": 3,
   "min_next_doc_count": 1,
   "min_shallow_advance_count": 0,
   "max_score_count": 7,
   "avg_match": 0,
   "avg_compute_max_score": 0,
   "max_advance": 96208,
   "avg_shallow_advance_count": 0,
   "avg_set_min_competitive_score": 0,
   "avg_compute_max_score_count": 0,
   "avg_build_scorer": 1525270,
   "max_set_min_competitive_score_count": 0,
   "advance_count": 3,
   "max_build_scorer_count": 4,
   "shallow_advance": 0,
   "min_compute_max_score": 0,
   "max_match_count": 0,
   "create_weight_count": 1,
   "build_scorer": 1830250,
   "max_set_min_competitive_score": 0,
   "max_compute_max_score": 0,
   "min_shallow_advance": 0,
  

Re: [PR] Add Issue Tracker Link under 'Editing Content on the Lucene™ Sites' [lucene-site]

2025-03-19 Thread via GitHub


dweiss merged PR #78:
URL: https://github.com/apache/lucene-site/pull/78





Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]

2025-03-19 Thread via GitHub


gf2121 merged PR #14365:
URL: https://github.com/apache/lucene/pull/14365





[PR] Add leafReaders() Method to IndexReader and Unit Test [lucene]

2025-03-19 Thread via GitHub


DivyanshIITB opened a new pull request, #14370:
URL: https://github.com/apache/lucene/pull/14370

   This PR introduces leafReaders() in IndexReader for direct access to 
LeafReader instances, improving usability over leaves(). A corresponding unit 
test ensures correctness by validating retrieval consistency and resource 
management.
   
   - Provides a convenient way to access leaf readers without manually 
iterating over leaves().
   - Enhances code readability and usability for developers working with 
Lucene’s indexing system.
   - Ensures functionality is tested to prevent regressions.
   
   Testing & Validation:
   ✅ Successfully retrieves LeafReader instances.
   ✅ Ensures the size matches reader.leaves().size().
   ✅ Properly closes resources to avoid memory leaks.
   
   Fixes #14367 
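
For context, a rough sketch of what the proposed convenience method would replace. The nested `LeafReader` and `LeafReaderContext` records are stand-ins that only mirror the shape of the Lucene classes so the example runs without the Lucene jar; the `leafReaders` helper is hypothetical and is not the code in this PR.

```java
import java.util.List;

/** Stand-in types for illustration; the real Lucene classes are much richer. */
final class LeafReadersSketch {
  record LeafReader(String name) {}

  record LeafReaderContext(int ord, LeafReader reader) {}

  /** Unwraps each context, which is what callers of leaves() do by hand today. */
  static List<LeafReader> leafReaders(List<LeafReaderContext> leaves) {
    return leaves.stream().map(LeafReaderContext::reader).toList();
  }

  public static void main(String[] args) {
    List<LeafReaderContext> leaves =
        List.of(
            new LeafReaderContext(0, new LeafReader("seg0")),
            new LeafReaderContext(1, new LeafReader("seg1")));
    // One reader per leaf, in the same order as the contexts.
    System.out.println(leafReaders(leaves));
  }
}
```

The unwrapping drops the context's ordinal (and, in Lucene, the doc base), which is one reason callers often need the context rather than the bare reader.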





Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]

2025-03-19 Thread via GitHub


gf2121 commented on PR #14365:
URL: https://github.com/apache/lucene/pull/14365#issuecomment-2736218538

   I ran some benchmarks to find the main reason for the speedup:
   
   **Baseline**: main branch  
   **Candidate**: collecting docs greater than maxDocVisited into bitset 
(instead of `DocIdSetBuilder`)
   ```
                   Task  QPS baseline  StdDev  QPS my_modified_version  StdDev                Pct diff  p-value
    CountFilteredIntNRQ         41.78  (3.9%)                    41.05  (2.9%)   -1.7% (  -8% -    5%)    0.111
                 IntSet         84.83  (2.5%)                    84.34  (2.3%)   -0.6% (  -5% -    4%)    0.441
         FilteredIntNRQ         77.52  (3.5%)                    78.06  (3.2%)    0.7% (  -5% -    7%)    0.516
                 IntNRQ         80.49  (3.1%)                    82.13  (3.2%)    2.0% (  -4% -    8%)    0.041
             TermDTSort         59.85  (2.1%)                    66.70  (2.4%)   11.4% (   6% -   16%)    0.000
      TermDayOfYearSort         61.19  (2.3%)                    68.41  (4.3%)   11.8% (   4% -   18%)    0.000
   ```
   
   **Baseline**: collecting docs greater than maxDocVisited into bitset 
   **Candidate**: collecting all docs into bitset (no `if (doc > 
maxDocVisited)`)
   ```
                   Task  QPS baseline  StdDev  QPS my_modified_version  StdDev                Pct diff  p-value
                 IntNRQ         82.00  (3.5%)                    80.69  (2.8%)   -1.6% (  -7% -    4%)    0.309
                 IntSet         84.61  (2.0%)                    84.22  (2.7%)   -0.5% (  -5% -    4%)    0.697
         FilteredIntNRQ         78.12  (1.5%)                    78.13  (1.8%)    0.0% (  -3% -    3%)    0.991
             TermDTSort         66.41  (2.9%)                    67.74  (3.2%)    2.0% (  -4% -    8%)    0.192
    CountFilteredIntNRQ         40.68  (4.8%)                    41.57  (2.3%)    2.2% (  -4% -    9%)    0.244
      TermDayOfYearSort         69.90  (2.8%)                    71.88  (3.3%)    2.8% (  -3% -    9%)    0.064
   ```
   
   It looks like the higher chance of upgrading to a bitset contributes more to 
the speedup than skipping the `doc > maxDocVisited` check does.





Re: [PR] Fix for changelog verifier and milestone setter automation [lucene]

2025-03-19 Thread via GitHub


pseudo-nymous commented on PR #14369:
URL: https://github.com/apache/lucene/pull/14369#issuecomment-2736166568

   Yes, the fix here would address the `fatal: bad object` failure. I haven't 
seen the first failure before; let me address it.





Re: [PR] Completion FSTs to be loaded off-heap at all times [lucene]

2025-03-19 Thread via GitHub


javanna commented on PR #14364:
URL: https://github.com/apache/lucene/pull/14364#issuecomment-2736271001

   Cool, then I will target this PR at main only and open a separate PR for 
`branch_10x`. Out of curiosity, what are the use cases where you'd expect users 
to call `NRTSuggester#load` directly?





Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]

2025-03-19 Thread via GitHub


DivyanshIITB commented on PR #14335:
URL: https://github.com/apache/lucene/pull/14335#issuecomment-2735879968

   Thank you for your help, @vigyasharma!
   I would love to explore smaller, more focused issues to start with!





Re: [PR] Completion FSTs to be loaded off-heap at all times [lucene]

2025-03-19 Thread via GitHub


jpountz commented on PR #14364:
URL: https://github.com/apache/lucene/pull/14364#issuecomment-2735857094

   I was thinking of keeping the `load(IndexInput, FSTLoadMode)` static method, 
documenting that the load mode is ignored and deprecating it. Indeed that would 
require keeping the `FSTLoadMode` enum, which would be deprecated too.
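
A sketch of the deprecation pattern being suggested, under stated assumptions. `Suggester`, `LoadMode`, and the `String` input are simplified stand-ins; the real `NRTSuggester`, `IndexInput`, and `FSTLoadMode` signatures are not reproduced here.

```java
/** Hypothetical stand-ins for illustration; not the Lucene suggest API. */
final class DeprecatedLoadSketch {

  /** Kept only so existing callers keep compiling. */
  @Deprecated
  enum LoadMode { ON_HEAP, OFF_HEAP, AUTO }

  record Suggester(String source, boolean offHeap) {}

  /** New entry point: the FST is always loaded off-heap. */
  static Suggester load(String input) {
    return new Suggester(input, true);
  }

  /**
   * @deprecated the load mode is ignored; use {@link #load(String)} instead.
   */
  @Deprecated
  static Suggester load(String input, LoadMode ignoredMode) {
    return load(input);
  }

  public static void main(String[] args) {
    // The old overload still works but behaves exactly like the new one.
    System.out.println(load("completion.fst", LoadMode.ON_HEAP));
  }
}
```

Delegating the old overload to the new one keeps a single code path, so the deprecated method and enum can later be deleted without touching behavior.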





Re: [PR] Completion FSTs to be loaded off-heap at all times [lucene]

2025-03-19 Thread via GitHub


javanna commented on PR #14364:
URL: https://github.com/apache/lucene/pull/14364#issuecomment-2735835683

   Thanks @jpountz, what's your suggestion around back-compat? It sounds like 
you are suggesting not backporting the removal of the FST load mode enum, but 
only the switch to off-heap by default. Is that accurate?





Re: [PR] Fix for changelog verifier and milestone setter automation [lucene]

2025-03-19 Thread via GitHub


stefanvodita commented on PR #14369:
URL: https://github.com/apache/lucene/pull/14369#issuecomment-2735852996

   I'm happy to try this out, but I have some doubts it addresses the issue 
we're experiencing now. For example, take the failure 
[here](https://github.com/apache/lucene/actions/runs/13876712827/job/38829937408).
   
   In step 1, we `Could not resolve to a PullRequest with the number of 14301.` 
The PR definitely exists.
   Then, in step 2 we get `fatal: bad object`. The fix here is meant to fix 
that, right? But I wonder if the issue goes deeper.
   
   What do you think @pseudo-nymous?





[PR] Adjust visibility of NRTSuggester#load [lucene]

2025-03-19 Thread via GitHub


javanna opened a new pull request, #14372:
URL: https://github.com/apache/lucene/pull/14372

   load is a public static method, but its corresponding builder 
NRTSuggesterBuilder is package private. That means that there is no reason for 
load to be public.
   





Re: [I] Add issue tracker for website [lucene-site]

2025-03-19 Thread via GitHub


dweiss closed issue #72: Add issue tracker for website
URL: https://github.com/apache/lucene-site/issues/72





Re: [PR] Add Issue Tracker Link under 'Editing Content on the Lucene™ Sites' [lucene-site]

2025-03-19 Thread via GitHub


DivyanshIITB commented on PR #78:
URL: https://github.com/apache/lucene-site/pull/78#issuecomment-2737508089

   Just a gentle reminder
   @dweiss 





Re: [PR] Speedup merging of HNSW graphs [lucene]

2025-03-19 Thread via GitHub


mayya-sharipova commented on PR #14331:
URL: https://github.com/apache/lucene/pull/14331#issuecomment-2737598747

   I've run additional benchmarks with the new Optimized Scalar Quantization 
format, which quantizes 32x down to a single bit 
(Lucene102HnswBinaryQuantizedVectorsFormat). Here the improvements are smaller, 
but still present:
   
   ### Experiment 3 new QSQ format:
   The average speedups from baseline to candidate are:
   
   Index Time Speedup: **1.33x**
   Force Merge Speedup: **1.34x**
   
   Evaluation is done with Luceneutil on these datasets:
   
   1. **quora-E5-small**; 522931 docs; 384 dims; 7 bits quantized; cosine metric
   
      - baseline: index time: **70.71s**, force merge: **59.38s**
   
      - candidate: index time: **58.25s**, force merge: **40.15s**
   
   2. **cohere-wikipedia-v2**; 1M docs; 768 dims; 7 bits quantized; cosine 
metric
   
      - baseline: index time: **203.08s**, force merge: **107.27s**
   
      - candidate: index time: **142.27s**, force merge: **85.68s**
   
   3. **gist**; 1M docs; 960 dims; 7 bits quantized; euclidean metric
   
      - baseline: index time: **110.35s**, force merge: **323.66s**
   
      - candidate: index time: **105.52s**, force merge: **202.20s**
   
   4. **cohere-wikipedia-v3**; 1M docs; 1024 dims; 7 bits quantized; 
dot_product metric
   
      - baseline: index time: **313.43s**, force merge: **165.98s**
   
      - candidate: index time: **190.63s**, force merge: **159.95s**
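
As a cross-check, the averages quoted above can be recomputed from the four datasets. This sketch assumes the average is the arithmetic mean of the per-dataset baseline/candidate ratios; the comment does not state the method, and `SpeedupAverage` is a hypothetical helper name.

```java
import java.util.Locale;

/** Hypothetical helper for checking the reported averages. */
final class SpeedupAverage {

  /** Arithmetic mean of baseline[i] / candidate[i] over all datasets. */
  static double meanSpeedup(double[] baseline, double[] candidate) {
    double sum = 0;
    for (int i = 0; i < baseline.length; i++) {
      sum += baseline[i] / candidate[i];
    }
    return sum / baseline.length;
  }

  public static void main(String[] args) {
    // Seconds, in dataset order: quora-E5-small, cohere-wikipedia-v2, gist, cohere-wikipedia-v3.
    double[] indexBase = {70.71, 203.08, 110.35, 313.43};
    double[] indexCand = {58.25, 142.27, 105.52, 190.63};
    double[] mergeBase = {59.38, 107.27, 323.66, 165.98};
    double[] mergeCand = {40.15, 85.68, 202.20, 159.95};
    System.out.println(String.format(Locale.ROOT, "index: %.2fx", meanSpeedup(indexBase, indexCand)));
    System.out.println(String.format(Locale.ROOT, "merge: %.2fx", meanSpeedup(mergeBase, mergeCand)));
  }
}
```

Under that assumption the means round to the reported 1.33x and 1.34x. A ratio of total times would give different figures, since the larger datasets would dominate.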
   
   
   
![10_multiple](https://github.com/user-attachments/assets/8478a24c-6bf7-4601-a3d3-1f927b17f409)
   
   
![10_single](https://github.com/user-attachments/assets/756b694e-4877-4400-8178-36e2755d91b1)
   
   
![100_multiple](https://github.com/user-attachments/assets/9dd06463-d9e5-406e-b78b-516213d9e5cc)
   
   
![100_single](https://github.com/user-attachments/assets/6bb02c94-024f-40a9-b88d-36728c390928)
   





Re: [PR] Completion FSTs to be loaded off-heap at all times [lucene]

2025-03-19 Thread via GitHub


jpountz commented on PR #14364:
URL: https://github.com/apache/lucene/pull/14364#issuecomment-2736470811

   Good question. The class is public and looked like a user-facing API, hence 
my comment, but you can't serialize an NRTSuggester yourself since 
`NRTSuggesterBuilder` is pkg-private. So it looks like this load method should 
really be pkg-private too?





Re: [PR] Add leafReaders() Method to IndexReader and Unit Test [lucene]

2025-03-19 Thread via GitHub


jainankitk commented on PR #14370:
URL: https://github.com/apache/lucene/pull/14370#issuecomment-273732

   > Eh, I am not sold that this change needs to occur, if ever. While "this is 
how it's always been" isn't a good argument for some things, I think expanding 
the public, and then backwards-compatible, API needs careful consideration.
   
   I agree with @benwtrent. That's why I had below comment in [the original 
issue](https://github.com/apache/lucene/issues/14367):
   
   _Really minor at this point, and probably not worth going through the pain 
of deprecating IndexReader#leaves and changing at few hundred places_
   
   Additionally, I don't think this change really addresses the confusion being 
called out in https://github.com/apache/lucene/issues/14367. Ideally, the 
default behavior of `IndexReader#leaves` should be what is being achieved here 
via `IndexReader#leafReaders`, and there should be a more explicit method called 
`IndexReader#leafReaderContexts` for what `IndexReader#leaves` is doing today.





[PR] Dummy [lucene]

2025-03-19 Thread via GitHub


ogprakash opened a new pull request, #14376:
URL: https://github.com/apache/lucene/pull/14376

   ### Description
   
   
   





Re: [PR] Completion FSTs to be loaded off-heap at all times [lucene]

2025-03-19 Thread via GitHub


javanna commented on PR #14364:
URL: https://github.com/apache/lucene/pull/14364#issuecomment-2737720139

   I merged main in after merging #14372 and added a changelog entry. I believe 
this is ready to go, and can now be backported to branch_10x as-is.





Re: [PR] Dummy [lucene]

2025-03-19 Thread via GitHub


ogprakash commented on PR #14376:
URL: https://github.com/apache/lucene/pull/14376#issuecomment-2737721997

   It was meant for testing on a fork branch.





Re: [PR] Dummy [lucene]

2025-03-19 Thread via GitHub


ogprakash closed pull request #14376: Dummy
URL: https://github.com/apache/lucene/pull/14376





Re: [PR] Avoid using time zones that emit warnings (jdk25+) [lucene]

2025-03-19 Thread via GitHub


dweiss commented on PR #14328:
URL: https://github.com/apache/lucene/pull/14328#issuecomment-273592

   I've backported this to branch_10x.





Re: [PR] Add CHANGES entry for CheckIndex HNSW work [lucene]

2025-03-19 Thread via GitHub


javanna commented on PR #14120:
URL: https://github.com/apache/lucene/pull/14120#issuecomment-2737700794

   Heads up: the original change was backported, but the changelog entry (filed 
under 10.2) was not. I just backported it to branch_10x.





Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]

2025-03-19 Thread via GitHub


jpountz commented on PR #14365:
URL: https://github.com/apache/lucene/pull/14365#issuecomment-2736720587

   Interesting. I remember playing with calling `BulkAdder#grow` on the 
estimated number of matching points (to upgrade to a bitset immediately instead 
of waiting for docs to be collected) a while back and it didn't help, but maybe 
it does now with the great speedups that you merged lately.





Re: [PR] Add Issue Tracker Link under 'Editing Content on the Lucene™ Sites' [lucene-site]

2025-03-19 Thread via GitHub


DivyanshIITB commented on PR #78:
URL: https://github.com/apache/lucene-site/pull/78#issuecomment-2736871034

   Just a gentle reminder
   @sebbASF 





[PR] Optimize ParallelLeafReader to improve term vector fetching efficienc [lucene]

2025-03-19 Thread via GitHub


DivyanshIITB opened a new pull request, #14373:
URL: https://github.com/apache/lucene/pull/14373

   This PR optimizes ParallelLeafReader to avoid redundant term vector fetching.
   - Replaces per-field term vector fetching with a single call per reader.
   - Reduces complexity from O(n^2) to O(n).
   - Improves performance when handling large numbers of fields.
   - Verified via existing tests.
   
   Closes #7926 





Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]

2025-03-19 Thread via GitHub


jpountz commented on code in PR #14365:
URL: https://github.com/apache/lucene/pull/14365#discussion_r2004331440


##
lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java:
##
@@ -47,6 +47,8 @@ public sealed interface BulkAdder permits FixedBitSetAdder, 
BufferAdder {
 void add(IntsRef docs);
 
 void add(DocIdSetIterator iterator) throws IOException;
+
+void add(IntsRef docs, int docLowerBoundExclusive);

Review Comment:
   nit: I'd prefer the lower bound to be inclusive, it's more consistent with 
e.g. DocIdSetIterator#advance(target) where the target is inclusive as well
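
To illustrate the nit, here is a small sketch of an inclusive lower bound using `java.util.BitSet`. It is not the `DocIdSetBuilder` implementation; the class, the `int[]` input in place of `IntsRef`, and the parameter name `docLowerBoundInclusive` are assumptions made for the example.

```java
import java.util.BitSet;

/** Hypothetical sketch; not the DocIdSetBuilder implementation. */
final class LowerBoundAddSketch {

  /** Sets every doc ID in docs[0..length) that is >= docLowerBoundInclusive. */
  static void add(BitSet bits, int[] docs, int length, int docLowerBoundInclusive) {
    for (int i = 0; i < length; i++) {
      if (docs[i] >= docLowerBoundInclusive) {
        bits.set(docs[i]);
      }
    }
  }

  public static void main(String[] args) {
    int maxDocVisited = 5;
    int[] docs = {2, 5, 6, 9};
    BitSet bits = new BitSet();
    // An exclusive bound of maxDocVisited is the same as an inclusive bound of maxDocVisited + 1.
    add(bits, docs, docs.length, maxDocVisited + 1);
    System.out.println(bits);
  }
}
```

With an inclusive bound the caller passes `maxDocVisited + 1`, which reads the same way as the target passed to `DocIdSetIterator#advance`: the first doc ID that is still of interest.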






Re: [PR] Completion FSTs to be loaded off-heap at all times [lucene]

2025-03-19 Thread via GitHub


javanna merged PR #14364:
URL: https://github.com/apache/lucene/pull/14364





Re: [PR] Fixing quantization interval initialization for optimized sq [lucene]

2025-03-19 Thread via GitHub


benwtrent merged PR #14374:
URL: https://github.com/apache/lucene/pull/14374





Re: [PR] Speedup merging of HNSW graphs [lucene]

2025-03-19 Thread via GitHub


benwtrent commented on PR #14331:
URL: https://github.com/apache/lucene/pull/14331#issuecomment-2737726576

   > Experiment 3 new QSQ format:
   ...
   
   These improvements make sense to me. The overall bottleneck of vector ops is 
way lower here, so simply doing fewer ops isn't going to have as big an 
impact, though it's nice to see that the impact is still measurable :D





Re: [I] Handling concurrent search in QueryProfiler [lucene]

2025-03-19 Thread via GitHub


jpountz commented on issue #14375:
URL: https://github.com/apache/lucene/issues/14375#issuecomment-2738214946

   This looks like it could be useful. Maybe it tries to do too much by 
providing min/avg/max aggregates and it should just provide per-slice 
breakdowns, leaving whether and how to compile aggregates to the application?
   
   Out of curiosity, how do you know if two calls to `advance()` come from the 
same slice or different slices?
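
As a rough sketch of the "per-slice breakdown, aggregate in the application" idea: assume the profiler exposed something like a map from slice id to time spent in `advance()`. That map shape and the names `SliceBreakdownDemo` and `summarize` are assumptions for illustration, not an existing QueryProfiler API. The application could then derive min/avg/max itself with the JDK's `LongSummaryStatistics`:

```java
import java.util.LinkedHashMap;
import java.util.LongSummaryStatistics;
import java.util.Map;

// Hypothetical illustration: the profiler reports raw per-slice timings,
// and the application decides whether and how to aggregate them.
final class SliceBreakdownDemo {

  // Computes min/avg/max over per-slice timings (nanoseconds) on the application side.
  static LongSummaryStatistics summarize(Map<Integer, Long> advanceNanosBySlice) {
    return advanceNanosBySlice.values().stream()
        .mapToLong(Long::longValue)
        .summaryStatistics();
  }

  public static void main(String[] args) {
    // Assumed shape of a per-slice breakdown: slice id -> nanoseconds spent in advance().
    Map<Integer, Long> advanceNanosBySlice = new LinkedHashMap<>();
    advanceNanosBySlice.put(0, 100L);
    advanceNanosBySlice.put(1, 300L);
    advanceNanosBySlice.put(2, 200L);

    LongSummaryStatistics stats = summarize(advanceNanosBySlice);
    System.out.println("min=" + stats.getMin()
        + " avg=" + stats.getAverage()
        + " max=" + stats.getMax());
  }
}
```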





Re: [PR] Completion FSTs to be loaded off-heap at all times [lucene]

2025-03-19 Thread via GitHub


javanna commented on PR #14364:
URL: https://github.com/apache/lucene/pull/14364#issuecomment-2738284353

   Thanks @jpountz for all the help!





[PR] Speed up advancing within a sparse block in IndexedDISI. [lucene]

2025-03-19 Thread via GitHub


vsop-479 opened a new pull request, #14371:
URL: https://github.com/apache/lucene/pull/14371

   ### Description
   
   Similar to https://github.com/apache/lucene/pull/13692.
   





Re: [PR] Speedup merging of HNSW graphs [lucene]

2025-03-19 Thread via GitHub


msokolov commented on PR #14331:
URL: https://github.com/apache/lucene/pull/14331#issuecomment-2737095119

   yes, looks good, I think this is the right tradeoff. We even seem to get 
improved query performance in some cases. +1 to merge this





Re: [PR] Completion FSTs to be loaded off-heap at all times [lucene]

2025-03-19 Thread via GitHub


javanna commented on PR #14364:
URL: https://github.com/apache/lucene/pull/14364#issuecomment-2736648444

   I opened #14372 to address the visibility issue of `load`, which should 
simplify this PR and its backport once merged.





Re: [PR] Speedup merging of HNSW graphs [lucene]

2025-03-19 Thread via GitHub


benwtrent commented on code in PR #14331:
URL: https://github.com/apache/lucene/pull/14331#discussion_r2003974286


##
lucene/core/src/java/org/apache/lucene/util/hnsw/ConcurrentHnswMerger.java:
##
@@ -51,19 +57,85 @@ protected HnswBuilder createBuilder(KnnVectorValues 
mergedVectorValues, int maxO
 OnHeapHnswGraph graph;
 BitSet initializedNodes = null;
 
-if (initReader == null) {
+if (graphReaders.size() == 0) {
   graph = new OnHeapHnswGraph(M, maxOrd);
 } else {
+  
graphReaders.sort(Comparator.comparingInt(GraphReader::graphSize).reversed());
+  GraphReader initGraphReader = graphReaders.get(0);
+  KnnVectorsReader initReader = initGraphReader.reader();
+  MergeState.DocMap initDocMap = initGraphReader.initDocMap();
+  int initGraphSize = initGraphReader.graphSize();
   HnswGraph initializerGraph = ((HnswGraphProvider) 
initReader).getGraph(fieldInfo.name);
+
   if (initializerGraph.size() == 0) {
 graph = new OnHeapHnswGraph(M, maxOrd);
   } else {
 initializedNodes = new FixedBitSet(maxOrd);
-int[] oldToNewOrdinalMap = getNewOrdMapping(mergedVectorValues, 
initializedNodes);
+int[] oldToNewOrdinalMap =
+getNewOrdMapping(
+fieldInfo,
+initReader,
+initDocMap,
+initGraphSize,
+mergedVectorValues,
+initializedNodes);
 graph = InitializedHnswGraphBuilder.initGraph(initializerGraph, 
oldToNewOrdinalMap, maxOrd);
   }
 }
 return new HnswConcurrentMergeBuilder(
 taskExecutor, numWorker, scorerSupplier, beamWidth, graph, 
initializedNodes);
   }
+
+  /**
+   * Creates a new mapping from old ordinals to new ordinals and returns the 
total number of vectors
+   * in the newly merged segment.
+   *
+   * @param mergedVectorValues vector values in the merged segment
+   * @param initializedNodes track what nodes have been initialized
+   * @return the mapping from old ordinals to new ordinals
+   * @throws IOException If an error occurs while reading from the merge state
+   */
+  private static final int[] getNewOrdMapping(
+  FieldInfo fieldInfo,
+  KnnVectorsReader initReader,
+  MergeState.DocMap initDocMap,
+  int initGraphSize,
+  KnnVectorValues mergedVectorValues,
+  BitSet initializedNodes)
+  throws IOException {
+KnnVectorValues.DocIndexIterator initializerIterator = null;
+
+switch (fieldInfo.getVectorEncoding()) {
+  case BYTE -> initializerIterator = 
initReader.getByteVectorValues(fieldInfo.name).iterator();
+  case FLOAT32 ->
+  initializerIterator = 
initReader.getFloatVectorValues(fieldInfo.name).iterator();
+}
+
+IntIntHashMap newIdToOldOrdinal = new IntIntHashMap(initGraphSize);
+int maxNewDocID = -1;
+for (int docId = initializerIterator.nextDoc();
+docId != NO_MORE_DOCS;
+docId = initializerIterator.nextDoc()) {
+  int newId = initDocMap.get(docId);
+  maxNewDocID = Math.max(newId, maxNewDocID);
+  newIdToOldOrdinal.put(newId, initializerIterator.index());
+}
+
+if (maxNewDocID == -1) {
+  return new int[0];
+}
+final int[] oldToNewOrdinalMap = new int[initGraphSize];
+KnnVectorValues.DocIndexIterator mergedVectorIterator = 
mergedVectorValues.iterator();
+for (int newDocId = mergedVectorIterator.nextDoc();
+newDocId <= maxNewDocID;
+newDocId = mergedVectorIterator.nextDoc()) {
+  int hashDocIndex = newIdToOldOrdinal.indexOf(newDocId);
+  if (newIdToOldOrdinal.indexExists(hashDocIndex)) {

Review Comment:
   Is all this `indexOf`/`indexExists` handling just a performance 
improvement over a simple `newIdToOldOrdinal.get(...)`?
   
   Looking at `IntIntHashMap`, it's weird that "does not exist" may actually 
just be `0`, where `0` is a valid doc id :/.
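
The ambiguity can be shown with a small stand-in that uses only `java.util`. It assumes, as the comment above suggests, a lookup that returns `0` for absent keys; `MissingKeyAmbiguity`, `getOrZero` and `isMapped` are hypothetical names and this is not the `IntIntHashMap` implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: a lookup that returns 0 for absent keys cannot
// distinguish "not present" from "mapped to 0", so a separate presence check is needed.
final class MissingKeyAmbiguity {

  // Stand-in for a primitive-map style get() that falls back to 0 when the key is absent.
  static int getOrZero(Map<Integer, Integer> map, int key) {
    return map.getOrDefault(key, 0);
  }

  // Explicit presence check, playing the role of an indexOf + indexExists pair.
  static boolean isMapped(Map<Integer, Integer> map, int key) {
    return map.containsKey(key);
  }

  public static void main(String[] args) {
    Map<Integer, Integer> newIdToOldOrdinal = new HashMap<>();
    newIdToOldOrdinal.put(7, 0); // doc 7 really maps to ordinal 0

    // Both lookups return 0, although only doc 7 is present.
    System.out.println(getOrZero(newIdToOldOrdinal, 7));
    System.out.println(getOrZero(newIdToOldOrdinal, 8));

    // The presence check tells the two cases apart.
    System.out.println(isMapped(newIdToOldOrdinal, 7));
    System.out.println(isMapped(newIdToOldOrdinal, 8));
  }
}
```

So even if the explicit existence check also saves a second hash probe, it is needed for correctness whenever `0` is a legitimate value.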



##
lucene/core/src/java/org/apache/lucene/util/hnsw/ConcurrentHnswMerger.java:
##
@@ -51,19 +57,85 @@ protected HnswBuilder createBuilder(KnnVectorValues 
mergedVectorValues, int maxO
 OnHeapHnswGraph graph;
 BitSet initializedNodes = null;
 
-if (initReader == null) {
+if (graphReaders.size() == 0) {
   graph = new OnHeapHnswGraph(M, maxOrd);
 } else {
+  
graphReaders.sort(Comparator.comparingInt(GraphReader::graphSize).reversed());
+  GraphReader initGraphReader = graphReaders.get(0);
+  KnnVectorsReader initReader = initGraphReader.reader();
+  MergeState.DocMap initDocMap = initGraphReader.initDocMap();
+  int initGraphSize = initGraphReader.graphSize();
   HnswGraph initializerGraph = ((HnswGraphProvider) 
initReader).getGraph(fieldInfo.name);
+
   if (initializerGraph.size() == 0) {
 graph = new OnHeapHnswGraph(M, maxOrd);
   } else {
 initializedNodes = new FixedBitSet(maxOrd);
-int[] oldToNewOrdinalMap = getNewOrdMapping(mergedVectorValues, 
init

Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]

2025-03-19 Thread via GitHub


gf2121 commented on PR #14365:
URL: https://github.com/apache/lucene/pull/14365#issuecomment-2737667396

   > I remember playing with calling BulkAdder#grow on the estimated number of 
matching points (to upgrade to a bitset immediately instead of waiting for docs 
to be collected) a while back and it didn't help.
   
   This is a neat idea. I tried the approach just now and saw smaller 
improvements:
   
   ```
   Task                 QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff            p-value
   CountFilteredIntNRQ         41.25  (1.6%)                    40.31  (2.7%)  -2.3% ( -6% -  2%)    0.040
   IntNRQ                      81.60  (2.5%)                    80.25  (2.9%)  -1.7% ( -6% -  3%)    0.214
   FilteredIntNRQ              77.33  (1.6%)                    77.43  (3.1%)   0.1% ( -4% -  4%)    0.918
   IntSet                      84.32  (2.3%)                    84.86  (2.3%)   0.6% ( -3% -  5%)    0.584
   TermDayOfYearSort           59.13  (3.0%)                    60.22  (3.1%)   1.8% ( -4% -  8%)    0.224
   TermDTSort                  58.72  (1.2%)                    61.36  (3.1%)   4.5% (  0% -  8%)    0.000
   ```
   
   So:
   
   1. Bulk adding without the `if` is ~10% faster.
   2. Adding to a bitset with the `if` is ~10% faster.
   3. Pre-growing `docIdSetBuilder` is less than 5% faster.
   
   This is a bit confusing. Thinking about these cases again, it occurred to me 
that it could probably be the abstraction layer of the bulk adder, which does 
not exist in cases 1 and 2 but does exist in case 3. So I tried introducing a 
`void add(IntsRef docs, int docLowerBoundExclusive);`, and it works:
   
   ```
   Task                 QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff            p-value
   FilteredIntNRQ              77.52  (2.6%)                    76.83  (3.5%)  -0.9% ( -6% -  5%)    0.496
   IntSet                      82.79  (1.4%)                    82.80  (3.2%)   0.0% ( -4% -  4%)    0.990
   IntNRQ                      79.16  (2.2%)                    79.76  (4.0%)   0.8% ( -5% -  7%)    0.580
   CountFilteredIntNRQ         40.34  (2.5%)                    40.79  (3.0%)   1.1% ( -4% -  6%)    0.347
   TermDTSort                  59.16  (2.3%)                    66.19  (2.2%)  11.9% (  7% - 16%)    0.000
   TermDayOfYearSort           59.71  (3.0%)                    67.65  (3.5%)  13.3% (  6% - 20%)    0.000
   ```
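
For readers following along, the idea of pushing the bound check down into the adder can be sketched roughly as below. This is an illustration under assumptions, not the PR's code: `BoundedBulkAdd`, its nested `BulkAdder` interface and `BitSetAdder` are hypothetical, a plain `int[]` with offset and length stands in for `IntsRef`, and `java.util.BitSet` stands in for Lucene's bit set.

```java
import java.util.BitSet;

// Hypothetical illustration: the adder itself applies the lower bound, so the
// caller makes one bulk call instead of checking each doc before a single add.
final class BoundedBulkAdd {

  interface BulkAdder {
    // Adds every doc in docs[offset, offset + length).
    void add(int[] docs, int offset, int length);

    // Adds only docs strictly greater than docLowerBoundExclusive.
    void add(int[] docs, int offset, int length, int docLowerBoundExclusive);
  }

  static final class BitSetAdder implements BulkAdder {
    final BitSet bits = new BitSet();

    @Override
    public void add(int[] docs, int offset, int length) {
      for (int i = offset; i < offset + length; i++) {
        bits.set(docs[i]);
      }
    }

    @Override
    public void add(int[] docs, int offset, int length, int docLowerBoundExclusive) {
      // The comparison lives inside the concrete adder's loop rather than in the caller.
      for (int i = offset; i < offset + length; i++) {
        if (docs[i] > docLowerBoundExclusive) {
          bits.set(docs[i]);
        }
      }
    }
  }

  public static void main(String[] args) {
    BitSetAdder adder = new BitSetAdder();
    adder.add(new int[] {2, 4, 6, 8}, 0, 4, 4);
    System.out.println(adder.bits); // docs 6 and 8 are set
  }
}
```

Whether this layout accounts for the measured speedup is the hypothesis stated above; the sketch only shows where the check moves.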





Re: [PR] Adjust visibility of NRTSuggester#load [lucene]

2025-03-19 Thread via GitHub


javanna merged PR #14372:
URL: https://github.com/apache/lucene/pull/14372





Re: [PR] Cover all DataType [lucene]

2025-03-19 Thread via GitHub


javanna commented on PR #14091:
URL: https://github.com/apache/lucene/pull/14091#issuecomment-2737689624

   Heya, the entry in the changelog was filed under 10.2, but the change was 
never backported. Either we move the changelog entry then, or we backport the 
change :)

