Re: [I] Multi-HNSW graphs per segment? [lucene]

2025-03-11 Thread via GitHub
atris commented on issue #14341: URL: https://github.com/apache/lucene/issues/14341#issuecomment-2716520829 Cool. Assigning this to myself -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the spe

Re: [I] Make Lucene smarter about long runs of matches [lucene]

2025-03-11 Thread via GitHub
jpountz closed issue #11915: Make Lucene smarter about long runs of matches URL: https://github.com/apache/lucene/issues/11915

Re: [PR] Optimize commit retention policy to maintain only the last 5 commits [lucene]

2025-03-11 Thread via GitHub
jpountz commented on PR #14325: URL: https://github.com/apache/lucene/pull/14325#issuecomment-2705668964 @DivyanshIITB Deletion policies are configurable via `IndexWriterConfig#setIndexDeletionPolicy`, see e.g. `SnapshotDeletionPolicy` which allows for finer-grained maintenance of snapshots

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-11 Thread via GitHub
lpld commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2710872661 @benwtrent @mikemccand I really appreciate your help and quick responses. May I also ask about the selection of datasets being used for the benchmarks? How do you choose them? Why I'm

Re: [PR] Support load per-iteration replacement of NamedSPI [lucene]

2025-03-11 Thread via GitHub
javanna commented on PR #14275: URL: https://github.com/apache/lucene/pull/14275#issuecomment-2711741530 > But I'm still questioning if there's actually a use-case for allowing something to be loaded either on-heap or off-heap in our codecs. For all examples that come to mind, I would rathe

Re: [PR] introduce new parameter onlyLongestMatchNoSubwords replacing onlyLongestMatch [lucene]

2025-03-11 Thread via GitHub
rmuir merged PR #14311: URL: https://github.com/apache/lucene/pull/14311

Re: [PR] Speedup slice calculation in IndexSearcher [lucene]

2025-03-11 Thread via GitHub
original-brownbear merged PR #14343: URL: https://github.com/apache/lucene/pull/14343

Re: [PR] Speedup slice calculation in IndexSearcher [lucene]

2025-03-11 Thread via GitHub
original-brownbear commented on PR #14343: URL: https://github.com/apache/lucene/pull/14343#issuecomment-2716102016 Thanks Michael!

Re: [PR] introduce new parameter onlyLongestMatchNoSubwords replacing onlyLongestMatch [lucene]

2025-03-11 Thread via GitHub
rmuir commented on PR #14311: URL: https://github.com/apache/lucene/pull/14311#issuecomment-2716086552 I'm just doing final tests. Thanks again @renatoh. I will backport it to 10.2. We can followup to remove the deprecated "sorta-kinda-longest-match" from lucene's `main` branch, and see if

[PR] deduplicate standard BKDConfig records [lucene]

2025-03-11 Thread via GitHub
iverase opened a new pull request, #14338: URL: https://github.com/apache/lucene/pull/14338 In the case you have many BKD readers on the same heap, it feels wasteful to have individual instances of BKDConfig records as most of the time those instances correspond to standard lucene fields. T
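One way to picture the deduplication described in this pull request is a static factory that hands out shared instances for common shapes and allocates only for unusual ones. The sketch below is an illustration under stated assumptions, not the code from PR #14338: the `PointShape` record, its two fields, and the `of` factory are hypothetical stand-ins for `BKDConfig`.

```java
/** Hypothetical stand-in for a small, immutable configuration record. */
record PointShape(int numDims, int bytesPerDim) {

  // Shared instances for shapes assumed to be common (one-dimensional int and long points).
  private static final PointShape INT_1D = new PointShape(1, Integer.BYTES);
  private static final PointShape LONG_1D = new PointShape(1, Long.BYTES);

  /** Returns a shared instance for the common shapes, otherwise a fresh record. */
  static PointShape of(int numDims, int bytesPerDim) {
    if (numDims == 1 && bytesPerDim == Integer.BYTES) return INT_1D;
    if (numDims == 1 && bytesPerDim == Long.BYTES) return LONG_1D;
    return new PointShape(numDims, bytesPerDim);
  }

  public static void main(String[] args) {
    // Common shape: both calls return the same shared object.
    System.out.println(PointShape.of(1, 4) == PointShape.of(1, 4));
    // Unusual shape: equal by value, but allocated separately.
    System.out.println(PointShape.of(3, 8).equals(PointShape.of(3, 8)));
  }
}
```

Because records compare by value, callers that use `equals` see no behavioral difference; only the number of live instances on the heap changes.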

[PR] Add venv to rat gitignore [lucene]

2025-03-11 Thread via GitHub
rmuir opened a new pull request, #14346: URL: https://github.com/apache/lucene/pull/14346 For the same reason the aws-jmh venv is ignored. Current rat version will go crazy on this. We should look into the rat version, I think they may have improved .gitignore support in recent relea

Re: [PR] python: enable all linting checks and type-hint the code [lucene]

2025-03-11 Thread via GitHub
rmuir merged PR #14326: URL: https://github.com/apache/lucene/pull/14326

Re: [PR] Speed up scoring conjunctions a bit. [lucene]

2025-03-11 Thread via GitHub
jpountz commented on PR #14345: URL: https://github.com/apache/lucene/pull/14345#issuecomment-2715856731 It gives a few small speedups (low p-value): ``` Task QPS baseline StdDev QPS my_modified_version StdDev Pct diff p-value

[PR] Speed up scoring conjunctions a bit. [lucene]

2025-03-11 Thread via GitHub
jpountz opened a new pull request, #14345: URL: https://github.com/apache/lucene/pull/14345 We currently push FILTER clauses as constant-scoring MUST clauses with a 0 score to `BlockMaxConjunctionBulkScorer`. This change improves efficiency a bit by reducing polymorphism a bit (TermScorer

Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]

2025-03-11 Thread via GitHub
DivyanshIITB commented on PR #14335: URL: https://github.com/apache/lucene/pull/14335#issuecomment-2710583795 Thank you for the review, @jpountz! I see your concern regarding equal resource distribution across IndexWriter instances potentially leading to inefficiencies when some write

Re: [I] Multi-HNSW graphs per segment? [lucene]

2025-03-11 Thread via GitHub
jpountz commented on issue #14341: URL: https://github.com/apache/lucene/issues/14341#issuecomment-2715706359 This idea sounds worth exploring to me too. Intuitively, it may help pre-filtering too. E.g. if I think of an e-commerce use-case with a filter on the category field, it is likely t

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-03-11 Thread via GitHub
kaivalnp commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2715605453 Thanks for the review @navneet1v! > lucene util branch You can find some (very hacky) changes [here](https://github.com/kaivalnp/luceneutil/tree/faiss). Broad steps to run

[I] TestIndexSortBackwardsCompatibility.testSortedIndexAddDocBlocks fails reproducibly [lucene]

2025-03-11 Thread via GitHub
dweiss opened a new issue, #14344: URL: https://github.com/apache/lucene/issues/14344 ### Description This fails for me every time on main: ``` ./gradlew -p lucene/backward-codecs -Ptests.seed=CF895D81F5B12730 test --tests TestIndexSortBackwardsCompatibility ... > j

Re: [I] Multi-HNSW graphs per segment? [lucene]

2025-03-11 Thread via GitHub
atris commented on issue #14341: URL: https://github.com/apache/lucene/issues/14341#issuecomment-2715539509 @navneet1v Skimming through the issue, I think they refer to different problem statements. What you primarily want in the referenced GH issue is the ability to filter on more m

Re: [I] Multi-HNSW graphs per segment? [lucene]

2025-03-11 Thread via GitHub
navneet1v commented on issue #14341: URL: https://github.com/apache/lucene/issues/14341#issuecomment-2715434628 @benwtrent I was also thinking on similar lines and I created this GH issue which eventually wants to create more than 1 graph at the segment level: https://github.com/apache/luce

Re: [PR] OptimisticKnnVectorQuery [lucene]

2025-03-11 Thread via GitHub
mikemccand commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2711223110 > > Net/net I think we ought to be adding some multithreaded test capability to KnnGraphTester. > > Agreed, I think a "num_search_threads" parameter would be beneficial. Then t

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-11 Thread via GitHub
mikemccand commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2710460602 > `fanout` makes the search queue when searching the HNSW graph larger. However, the searcher will still only return `k` results. So, searching for top `k=10` with `fanout=20` indicat
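The relationship between `k` and `fanout` described in the quoted text can be sketched with a toy graph search. This is a minimal illustration, not Lucene's HNSW searcher: the `FanoutSketch` class, the single-layer graph, and the one-dimensional points are hypothetical, and the sketch assumes the search queue holds `k + fanout` candidates.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

/** Toy best-first search over a single-layer neighbor graph (hypothetical, for illustration). */
final class FanoutSketch {

  /** Searches with a queue of k + fanout entries, then returns only the k closest ids. */
  static List<Integer> search(double[] points, int[][] neighbors, int entry,
                              double query, int k, int fanout) {
    int queueSize = k + fanout; // assumption: fanout widens the queue beyond k
    Comparator<Integer> byDistance =
        Comparator.comparingDouble(id -> Math.abs(points[id] - query));
    PriorityQueue<Integer> candidates = new PriorityQueue<>(byDistance); // closest first
    PriorityQueue<Integer> results = new PriorityQueue<>(byDistance.reversed()); // worst first
    Set<Integer> visited = new HashSet<>();
    candidates.add(entry);
    results.add(entry);
    visited.add(entry);
    while (!candidates.isEmpty()) {
      int current = candidates.poll();
      // Stop once the queue is full and the best remaining candidate cannot improve it.
      if (results.size() >= queueSize && byDistance.compare(current, results.peek()) > 0) {
        break;
      }
      for (int n : neighbors[current]) {
        if (!visited.add(n)) continue; // already seen
        if (results.size() < queueSize || byDistance.compare(n, results.peek()) < 0) {
          candidates.add(n);
          results.add(n);
          if (results.size() > queueSize) results.poll(); // evict the current worst
        }
      }
    }
    List<Integer> best = new ArrayList<>(results);
    best.sort(byDistance);
    return new ArrayList<>(best.subList(0, Math.min(k, best.size()))); // still only k results
  }

  public static void main(String[] args) {
    double[] points = {0, 1, 2, 3, 4, 5, 6, 7};
    int[][] chain = {{1}, {0, 2}, {1, 3}, {2, 4}, {3, 5}, {4, 6}, {5, 7}, {6}};
    System.out.println(search(points, chain, 0, 5.2, 2, 0));
    System.out.println(search(points, chain, 0, 5.2, 2, 3));
  }
}
```

A larger `fanout` lets the search keep more candidates alive, which can improve recall on graphs where a narrow queue would stop early, but the caller receives `k` ids either way.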

Re: [I] New binary vector format doesn't perform well with small-dimension datasets [lucene]

2025-03-11 Thread via GitHub
benwtrent commented on issue #14342: URL: https://github.com/apache/lucene/issues/14342#issuecomment-2715217336 > FashionMnist784 (60_000 x 784) That one looks weird to me. The others sort of make sense.

Re: [I] Multi-HNSW graphs per segment? [lucene]

2025-03-11 Thread via GitHub
benwtrent commented on issue #14341: URL: https://github.com/apache/lucene/issues/14341#issuecomment-2715213230 > Are you actively working on this? Or would you like me to explore more? I am not actively exploring it. A POC is definitely needed to explore if this is worth it at search

Re: [PR] Make single value BKDReader instances lighter [lucene]

2025-03-11 Thread via GitHub
original-brownbear merged PR #14337: URL: https://github.com/apache/lucene/pull/14337

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-03-11 Thread via GitHub
kaivalnp commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1982961727 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/FaissKnnVectorsWriter.java: ## @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation

Re: [PR] Make Lucene better at skipping long runs of matches. [lucene]

2025-03-11 Thread via GitHub
jpountz merged PR #14312: URL: https://github.com/apache/lucene/pull/14312

Re: [PR] OptimisticKnnVectorQuery [lucene]

2025-03-11 Thread via GitHub
msokolov commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2710427176 right that's what it looked like to me - I was only responding to the earlier message where you said: > The "simplified" version now only has a slight more latency compared to "

Re: [PR] Extract leaf-slice calculation path from IndexSearch#slices [lucene]

2025-03-11 Thread via GitHub
original-brownbear merged PR #14336: URL: https://github.com/apache/lucene/pull/14336

Re: [PR] OptimisticKnnVectorQuery [lucene]

2025-03-11 Thread via GitHub
dungba88 commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2710473316 Ah right, sorry it was a typo :)

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-11 Thread via GitHub
mikemccand commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2710473036 > > Could you please also share other parameters of your benchmark (ndoc, maxConn, beamWidthIndex, fanout, etc.) > > I have lost my test environment and I regrettably didn't wri

Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]

2025-03-11 Thread via GitHub
DivyanshIITB commented on PR #14335: URL: https://github.com/apache/lucene/pull/14335#issuecomment-2715055808 Thank you for the clarification, @jpountz! I'll drop the merge throttling aspect from the changes since it's disabled by default. Regarding the fixed thread pool approach (

Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]

2025-03-11 Thread via GitHub
jpountz commented on PR #14335: URL: https://github.com/apache/lucene/pull/14335#issuecomment-2714988316 Merge throttling is now disabled by default, IMO it's fine to ignore merge throttling for now. Regarding thread creation, I'm thinking of a shared fixed (e.g. number of processors / 2) t
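The shared fixed pool suggested here can be sketched with the standard `java.util.concurrent` executors. This is only an illustration of the sizing idea: the `SharedMergePool` class, the `poolSize` helper, and the printed "merge" tasks are hypothetical and are not part of `ConcurrentMergeScheduler`.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of one fixed pool shared by several writers. */
final class SharedMergePool {

  /** Half the available processors, but never fewer than one thread. */
  static int poolSize(int processors) {
    return Math.max(1, processors / 2);
  }

  public static void main(String[] args) throws InterruptedException {
    int threads = poolSize(Runtime.getRuntime().availableProcessors());
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    // Four hypothetical writers submit work to the same pool instead of spawning their own threads.
    for (int writer = 0; writer < 4; writer++) {
      final int id = writer;
      pool.submit(() ->
          System.out.println("merge for writer " + id + " on " + Thread.currentThread().getName()));
    }
    pool.shutdown(); // accept no new tasks, let queued ones finish
    pool.awaitTermination(10, TimeUnit.SECONDS);
  }
}
```

With a fixed pool the total number of merge threads stays bounded no matter how many writers exist; idle writers consume nothing, and busy writers queue behind one another rather than oversubscribing the CPU.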

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-03-11 Thread via GitHub
navneet1v commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2714965745 > @navneet1v I wonder if either of you were able to replicate benchmarks? @kaivalnp can you share your lucene util branch so that I can replicate your results.

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-03-11 Thread via GitHub
navneet1v commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1989679548 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java: ## @@ -0,0 +1,488 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

Re: [PR] deduplicate standard BKDConfig records [lucene]

2025-03-11 Thread via GitHub
jpountz commented on code in PR #14338: URL: https://github.com/apache/lucene/pull/14338#discussion_r1989674584 ## lucene/core/src/java/org/apache/lucene/util/bkd/BKDConfig.java: ## @@ -38,6 +39,18 @@ public record BKDConfig(int numDims, int numIndexDims, int bytesPerDim, int m

Re: [PR] deduplicate standard BKDConfig records [lucene]

2025-03-11 Thread via GitHub
iverase commented on code in PR #14338: URL: https://github.com/apache/lucene/pull/14338#discussion_r1989664702 ## lucene/core/src/java/org/apache/lucene/util/bkd/BKDConfig.java: ## @@ -38,6 +39,19 @@ public record BKDConfig(int numDims, int numIndexDims, int bytesPerDim, int m

Re: [PR] deduplicate standard BKDConfig records [lucene]

2025-03-11 Thread via GitHub
iverase commented on code in PR #14338: URL: https://github.com/apache/lucene/pull/14338#discussion_r1989663618 ## lucene/core/src/java/org/apache/lucene/util/bkd/BKDConfig.java: ## @@ -38,6 +39,19 @@ public record BKDConfig(int numDims, int numIndexDims, int bytesPerDim, int m

Re: [PR] deduplicate standard BKDConfig records [lucene]

2025-03-11 Thread via GitHub
jpountz commented on code in PR #14338: URL: https://github.com/apache/lucene/pull/14338#discussion_r1989627096 ## lucene/core/src/java/org/apache/lucene/util/bkd/BKDConfig.java: ## @@ -68,6 +82,16 @@ public record BKDConfig(int numDims, int numIndexDims, int bytesPerDim, int m

Re: [I] Multi-HNSW graphs per segment? [lucene]

2025-03-11 Thread via GitHub
tteofili commented on issue #14341: URL: https://github.com/apache/lucene/issues/14341#issuecomment-2714880037 > it is conceivable that clusters are of common distributions, consequently we can quit searching clusters early and only search a couple of the clusters at a time. I think

[PR] Speedup slice calculation in IndexSearcher [lucene]

2025-03-11 Thread via GitHub
original-brownbear opened a new pull request, #14343: URL: https://github.com/apache/lucene/pull/14343 It's in the title, some obvious speedups. This is fairly expensive logic for Elasticsearch when run over a larger number of shards. No need for streams, creating comparator instances and s

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-03-11 Thread via GitHub
kaivalnp commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2714811428 I've added a GH workflow (see [sample output](https://github.com/apache/lucene/actions/runs/13791742930/job/38573182600?pr=14178)) that builds and adds the C_API of Faiss before running

Re: [PR] Support load per-iteration replacement of NamedSPI [lucene]

2025-03-11 Thread via GitHub
jpountz commented on PR #14275: URL: https://github.com/apache/lucene/pull/14275#issuecomment-2714778925 > I do think that this is more generally useful than just the particular use case of on- or off-heap FST in completion postings. I'm curious about what other use-cases you have in mi

Re: [I] Multi-HNSW graphs per segment? [lucene]

2025-03-11 Thread via GitHub
atris commented on issue #14341: URL: https://github.com/apache/lucene/issues/14341#issuecomment-2714751735 It's actually crazy - I was thinking of starting a discussion on this today. One thing that I have been playing with is creating clusters with centroids that are at a certain ra

[I] Multi-HNSW graphs per segment? [lucene]

2025-03-11 Thread via GitHub
benwtrent opened a new issue, #14341: URL: https://github.com/apache/lucene/issues/14341 ### Description What do we think about clustering or grouping documents by centroids, or potentially in chunks of filters and allow multiple graphs per segment. If segments are random sub-samples

[I] New binary vector format doesn't perform well with small-dimension datasets [lucene]

2025-03-11 Thread via GitHub
lpld opened a new issue, #14342: URL: https://github.com/apache/lucene/issues/14342 Hi lucene team. Last week I was playing with the [quantization format](https://github.com/apache/lucene/pull/14078) that's been recently added to lucene. The main idea was to take the datasets from [ann-ben

Re: [PR] Create vectorized versions of ScalarQuantizer.quantize and recalculateCorrectiveOffset [lucene]

2025-03-11 Thread via GitHub
benwtrent commented on PR #14304: URL: https://github.com/apache/lucene/pull/14304#issuecomment-2713553719 On GCP, there isn't much difference. I wouldn't expect there to be a huge amount of difference as the dominant cost is the vector comparisons, not the quantization. I haven't tes

[PR] Reduce Lucene90DocValuesProducer memory footprint [lucene]

2025-03-11 Thread via GitHub
iverase opened a new pull request, #14340: URL: https://github.com/apache/lucene/pull/14340 Lucene90DocValuesProducer holds all the metadata found in the meta file on heap. At runtime, it processes that metadata to produce the right doc value flavour, e.g. dense vs sparse. This is a bit w

Re: [PR] Create vectorized versions of ScalarQuantizer.quantize and recalculateCorrectiveOffset [lucene]

2025-03-11 Thread via GitHub
benwtrent commented on code in PR #14304: URL: https://github.com/apache/lucene/pull/14304#discussion_r1988779142 ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java: ## @@ -907,4 +907,87 @@ public static long int4BitDotProduct128(byte

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-11 Thread via GitHub
benwtrent commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2713454213 Hey @lpld > May I also ask about the selection of datasets being used for the benchmarks? How do you choose them? I haven't tested with SIFT, though be sure to use euclid

Re: [PR] Support load per-iteration replacement of NamedSPI [lucene]

2025-03-11 Thread via GitHub
ChrisHegarty commented on PR #14275: URL: https://github.com/apache/lucene/pull/14275#issuecomment-2713450274 I do think that this is more generally useful than just the particular use case of on- or off-heap FST in completion postings. > If we want to allow configuring how a codec g