Re: [I] TestKnnGraph.testMultiThreadedSearch random test failure [lucene]

2025-03-17 Thread via GitHub
benwtrent commented on issue #14327: URL: https://github.com/apache/lucene/issues/14327#issuecomment-2730201462 Git bisect blames: a6a96cde1c65fddb65363f0090a0202fd6db329c Which, if the scores are the same between docs, makes sense to me. -- This is an automated message from the Apa

Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]

2025-03-17 Thread via GitHub
jpountz commented on code in PR #14365: URL: https://github.com/apache/lucene/pull/14365#discussion_r1999318407 ## lucene/core/src/java/org/apache/lucene/search/comparators/NumericComparator.java: ## @@ -251,6 +252,30 @@ public void visit(int docID, byte[] packedValue) {

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-17 Thread via GitHub
msfroh commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2730390727 Instead of a boolean flag, what if we define an interface that specifies the folding rules? It could have two methods: one that folds input characters to a canonical representatio

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-17 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2730397885 I think my ask is misunderstood: it is just to follow the Unicode standard. There are two mappings for simple case folding: * Default * Alternate (Turkish/Azeri)
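
The two mappings named in this comment can be illustrated with a minimal Java sketch. It is not code from the PR: the class and method names are hypothetical, and `Character.toLowerCase` is used only as a rough stand-in for the default simple folding (the real table is Unicode's CaseFolding.txt, which differs for some code points). The Turkic alternate differs from the default only in how it treats dotted and dotless i.

```java
// Hypothetical illustration only; not part of Lucene or of PR #14350.
final class SimpleCaseFoldSketch {

  /**
   * Rough approximation of default simple case folding.
   * Assumption: JDK lowercasing stands in for the CaseFolding.txt table.
   */
  static int foldDefault(int codePoint) {
    return Character.toLowerCase(codePoint);
  }

  /**
   * Alternate (Turkish/Azeri) folding: only the mappings for
   * capital I and capital I with dot above differ from the default.
   */
  static int foldTurkic(int codePoint) {
    if (codePoint == 'I') {
      return 0x0131; // LATIN SMALL LETTER DOTLESS I
    }
    if (codePoint == 0x0130) { // LATIN CAPITAL LETTER I WITH DOT ABOVE
      return 'i';
    }
    return foldDefault(codePoint);
  }

  public static void main(String[] args) {
    System.out.println(Integer.toHexString(foldDefault('I')));
    System.out.println(Integer.toHexString(foldTurkic('I')));
  }
}
```

Keeping the alternate as a thin layer over the default makes it visible that only the i family changes between the two mappings.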

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-17 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2730410784 If you want to do fancy Romanian accent removal, use an analyzer and normalize your data. That's what a search engine is all about. But if we want to provide some limited runtime ex

Re: [PR] add Automaton.toMermaid() for emitting mermaid state charts [lucene]

2025-03-17 Thread via GitHub
rmuir commented on PR #14360: URL: https://github.com/apache/lucene/pull/14360#issuecomment-2730119438 I'll keep the PR up here. Actually, as a first step, I'd rather improve the existing toDot() and regex toString(). It would help the logic here, too. There's no need to escape codepoints

Re: [PR] Decode doc ids in BKD leaves with auto-vectorized loops [lucene]

2025-03-17 Thread via GitHub
gf2121 commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2729902081 I raised a PR for the annotation: https://github.com/mikemccand/luceneutil/pull/354.

Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]

2025-03-17 Thread via GitHub
gf2121 commented on PR #14361: URL: https://github.com/apache/lucene/pull/14361#issuecomment-2729280384 OK, I get the expected result (a multiple of 16 is faster than a multiple of 8) when I force `-XX:UseAVX=3`. AVX3 appears to be slower on this chip, which may be why Java disabled it by defaul

Re: [PR] Decode doc ids in BKD leaves with auto-vectorized loops [lucene]

2025-03-17 Thread via GitHub
gf2121 commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2729806648 The nightly benchmark confirmed the speedup: https://benchmarks.mikemccandless.com/2025.03.16.18.04.58.html. Thanks again for the profiling guidance and for helping figure out simpler and faster co

Re: [I] Multi-threaded vector search over multiple segments can lead to inconsistent results [lucene]

2025-03-17 Thread via GitHub
tteofili commented on issue #14180: URL: https://github.com/apache/lucene/issues/14180#issuecomment-2730194375 > Could you please let me know which future version of Elasticsearch will resolve the vector search consistency problem? We are investigating a proper solution to this iss

Re: [PR] Completion FSTs to be loaded off-heap by default [lucene]

2025-03-17 Thread via GitHub
javanna commented on code in PR #14364: URL: https://github.com/apache/lucene/pull/14364#discussion_r1999422163 ## lucene/suggest/src/java/org/apache/lucene/search/suggest/document/CompletionPostingsFormat.java: ## @@ -122,11 +122,6 @@ public enum FSTLoadMode { private fina

Re: [PR] Speedup merging of HNSW graphs [lucene]

2025-03-17 Thread via GitHub
mayya-sharipova commented on PR #14331: URL: https://github.com/apache/lucene/pull/14331#issuecomment-2730506974 @msokolov Thanks for the comment. I've experimented with setting beamCandidates0 to `M * 3`, up from the previous `M * 2`, when building merged graphs. Graphs look bette

Re: [PR] Integrating GPU based Vector Search using cuVS [lucene]

2025-03-17 Thread via GitHub
kaivalnp commented on PR #14131: URL: https://github.com/apache/lucene/pull/14131#issuecomment-2730545360 Exciting change! Since this PR adds a new codec for vector search, I wanted to point to #14178 along similar lines -- adding a new Faiss-based KNN format to index and query vectors

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-17 Thread via GitHub
msfroh commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2730578138 Okay, got it! That's the piece that I was misunderstanding. I didn't realize that Turkish/Azeri is the **only** other valid folding. I kept thinking of it as just an example where the naï

Re: [PR] Completion FSTs to be loaded off-heap by default [lucene]

2025-03-17 Thread via GitHub
javanna commented on code in PR #14364: URL: https://github.com/apache/lucene/pull/14364#discussion_r1999421774 ## lucene/suggest/src/java/org/apache/lucene/search/suggest/document/Completion101PostingsFormat.java: ## @@ -25,17 +25,9 @@ * @lucene.experimental */ public clas

Re: [I] A little optimization about LZ4 [lucene]

2025-03-17 Thread via GitHub
jainankitk commented on issue #14347: URL: https://github.com/apache/lucene/issues/14347#issuecomment-2730860749 I am not sure it is a good idea to expose this as a user parameter. But I am wondering whether the default for `BEST_SPEED` should use a preset dict, as that compromises speed for co

[PR] Completion FSTs to be loaded off-heap by default [lucene]

2025-03-17 Thread via GitHub
javanna opened a new pull request, #14364: URL: https://github.com/apache/lucene/pull/14364 All the existing completion postings formats load their FSTs on-heap. It is possible to customize that behaviour by maintaining a custom postings format that overrides the FST load mode. TestSug

Re: [PR] Support load per-iteration replacement of NamedSPI [lucene]

2025-03-17 Thread via GitHub
javanna commented on PR #14275: URL: https://github.com/apache/lucene/pull/14275#issuecomment-2729969364 I opened #14364 to make the suggested change to the completion postings format, let me know what you think.

Re: [PR] Address completion fields testing gap and truly allow loading FST off heap [lucene]

2025-03-17 Thread via GitHub
javanna commented on PR #14270: URL: https://github.com/apache/lucene/pull/14270#issuecomment-2729966315 I opened #14364 to still address the testing gap, but also change the default load mode to off heap.

Re: [PR] Address completion fields testing gap and truly allow loading FST off heap [lucene]

2025-03-17 Thread via GitHub
javanna closed pull request #14270: Address completion fields testing gap and truly allow loading FST off heap URL: https://github.com/apache/lucene/pull/14270

Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-17 Thread via GitHub
dsmiley commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2729965328 An aside: `org.apache.lucene.search.DisjunctionScorer.TwoPhase#matches` looks kind of sad, in that each matches() call is going to build a priority queue of "unverified matches" (DisiWr

Re: [PR] Decode doc ids in BKD leaves with auto-vectorized loops [lucene]

2025-03-17 Thread via GitHub
jpountz commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2729975580 Fantastic speedup. Nice to see tasks like `TermDayOfYearSort` also take advantage of this change.

Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-17 Thread via GitHub
jpountz commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2729988278 The current approach is probably not the fastest indeed. We should add a task to nightly benchmarks if we want to optimize this. Something like a disjunction of phrase queries (possibly

Re: [I] Opening of vector files with ReadAdvice.RANDOM_PRELOAD [lucene]

2025-03-17 Thread via GitHub
jainankitk commented on issue #14348: URL: https://github.com/apache/lucene/issues/14348#issuecomment-2730902349 > Lucene currently uses ReadAdvice.RANDOM when opening these files. I think it would be better to use RANDOM_PRELOAD. As per the documentation for RANDOM_PRELOAD: _

Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-17 Thread via GitHub
dsmiley commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2730909459 BTW I don't have plans to explore this further. Anyone should feel free to take over. Or abandon if nobody cares -- I admit it's very unusual to even have a top level disjunction, let

Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]

2025-03-17 Thread via GitHub
mccullocht commented on code in PR #14335: URL: https://github.com/apache/lucene/pull/14335#discussion_r1999677796 ## lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java: ## @@ -0,0 +1,72 @@ +package org.apache.lucene.index; + +import java.util.concurrent

Re: [PR] Implement #docIDRunEnd() on `DisjunctionDISIApproximation`. [lucene]

2025-03-17 Thread via GitHub
jpountz merged PR #14363: URL: https://github.com/apache/lucene/pull/14363

Re: [PR] Completion FSTs to be loaded off-heap by default [lucene]

2025-03-17 Thread via GitHub
jpountz commented on code in PR #14364: URL: https://github.com/apache/lucene/pull/14364#discussion_r1999714466 ## lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestSuggestField.java: ## @@ -951,7 +951,16 @@ static IndexWriterConfig iwcWithSuggestField(Analyz

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-17 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2731329809 It is confusing, because the Unicode case folding algorithm is supposed to work for everyone. But here's the problem: for most of the world: * lowercase i has a dot, uppercase I has n

[PR] Add Issue Tracker Link under 'Editing Content on the Lucene™ Sites' [lucene-site]

2025-03-17 Thread via GitHub
DivyanshIITB opened a new pull request, #78: URL: https://github.com/apache/lucene-site/pull/78 This PR adds a direct link to the [Lucene Issue Tracker](https://issues.apache.org/jira/projects/LUCENE) under the "Editing Content on the Lucene™ sites" section in site-instructions.md. C

[PR] Implement bulk adding methods for dynamic pruning. [lucene]

2025-03-17 Thread via GitHub
gf2121 opened a new pull request, #14365: URL: https://github.com/apache/lucene/pull/14365 (no comment)

Re: [I] Opening of vector files with ReadAdvice.RANDOM_PRELOAD [lucene]

2025-03-17 Thread via GitHub
jpountz commented on issue #14348: URL: https://github.com/apache/lucene/issues/14348#issuecomment-2730966937 For what it's worth, it's possible to override the read advice of vectors with something like that: ```java Path path = ...; Directory dir = new FilterDirectory(FSDirect

Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]

2025-03-17 Thread via GitHub
DivyanshIITB commented on code in PR #14335: URL: https://github.com/apache/lucene/pull/14335#discussion_r1998645110 ## lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java: ## @@ -0,0 +1,44 @@ +package org.apache.lucene.index; + +import java.util.concurre

Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]

2025-03-17 Thread via GitHub
DivyanshIITB commented on code in PR #14335: URL: https://github.com/apache/lucene/pull/14335#discussion_r1998646386 ## lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java: ## @@ -0,0 +1,44 @@ +package org.apache.lucene.index; + +import java.util.concurre

Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]

2025-03-17 Thread via GitHub
DivyanshIITB commented on code in PR #14335: URL: https://github.com/apache/lucene/pull/14335#discussion_r1998645593 ## lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java: ## @@ -0,0 +1,44 @@ +package org.apache.lucene.index; + +import java.util.concurre

Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]

2025-03-17 Thread via GitHub
DivyanshIITB commented on code in PR #14335: URL: https://github.com/apache/lucene/pull/14335#discussion_r1998646829 ## lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java: ## @@ -0,0 +1,44 @@ +package org.apache.lucene.index; + +import java.util.concurre

Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]

2025-03-17 Thread via GitHub
DivyanshIITB commented on code in PR #14335: URL: https://github.com/apache/lucene/pull/14335#discussion_r1998648834 ## lucene/core/src/test/org/apache/lucene/index/TestMultiTenantMergeScheduler.java: ## @@ -0,0 +1,30 @@ +package org.apache.lucene.index; + +import org.apache.luc

Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]

2025-03-17 Thread via GitHub
DivyanshIITB commented on PR #14335: URL: https://github.com/apache/lucene/pull/14335#issuecomment-2729358357 I have a request to you. Kindly ignore the following two deleted files in the "Files Changed" section : "KeepOnlyLastCommitDeletionPolicy.java" "ConcurrentMergeScheduler.java"

Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]

2025-03-17 Thread via GitHub
gf2121 commented on PR #14361: URL: https://github.com/apache/lucene/pull/14361#issuecomment-2729094239 Results on `wikimediumall`: ``` TaskQPS baseline StdDevQPS my_modified_version StdDevPct diff p-value

Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]

2025-03-17 Thread via GitHub
jpountz commented on PR #14361: URL: https://github.com/apache/lucene/pull/14361#issuecomment-2729147376 Should we floor to a multiple of 16 instead of 8 so that we have a perfect second loop with AVX-512 as well? (By the way, which of your machines produced the above benchmark results?) Oth
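
For readers unfamiliar with the "floor to a multiple of 16" idea, here is a minimal Java sketch of the arithmetic. It is an illustration under stated assumptions, not the PR's code: the class and method names are hypothetical, and `step` is assumed to be a power of two such as 8 or 16. The floored count would bound a main loop whose trip count is a whole number of vector lanes, with the remainder left for a scalar tail loop.

```java
// Hypothetical illustration only; not the DocIdsWriter implementation.
final class FloorToMultipleSketch {

  /** Rounds a non-negative count down to a multiple of step (step must be a power of two). */
  static int floorToMultiple(int count, int step) {
    if (count < 0 || step <= 0 || (step & (step - 1)) != 0) {
      throw new IllegalArgumentException("count must be >= 0 and step a power of two");
    }
    // -step has all bits set except the low log2(step) bits, so the mask clears the remainder.
    return count & -step;
  }

  public static void main(String[] args) {
    int count = 47;
    int mainLoop = floorToMultiple(count, 16); // iterations handled by the vector-friendly loop
    int tail = count - mainLoop;               // iterations left for the scalar tail loop
    System.out.println(mainLoop + " + " + tail);
  }
}
```

With a step of 16, a count of 47 splits into 32 values for the main loop and 15 for the tail; with a step of 8 the split would be 40 and 7.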

Re: [PR] Add support for two-phase iterators to DenseConjunctionBulkScorer. [lucene]

2025-03-17 Thread via GitHub
jpountz commented on code in PR #14359: URL: https://github.com/apache/lucene/pull/14359#discussion_r1998546217 ## lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java: ## @@ -238,9 +296,77 @@ private void scoreWindowUsingBitSet( windowMatches.clear

Re: [PR] Add support for two-phase iterators to DenseConjunctionBulkScorer. [lucene]

2025-03-17 Thread via GitHub
jpountz commented on code in PR #14359: URL: https://github.com/apache/lucene/pull/14359#discussion_r1998546819 ## lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java: ## @@ -238,9 +296,77 @@ private void scoreWindowUsingBitSet( windowMatches.clear

Re: [PR] Add support for two-phase iterators to DenseConjunctionBulkScorer. [lucene]

2025-03-17 Thread via GitHub
gf2121 commented on code in PR #14359: URL: https://github.com/apache/lucene/pull/14359#discussion_r1998596143 ## lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java: ## @@ -238,9 +296,77 @@ private void scoreWindowUsingBitSet( windowMatches.clear(

Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]

2025-03-17 Thread via GitHub
gf2121 commented on PR #14361: URL: https://github.com/apache/lucene/pull/14361#issuecomment-2729250628 Thanks for the feedback. > Should we floor to a multiple of 16 instead of 8 so that we have a perfect second loop with AVX-512 as well? That is what I thought initially. But my A

[PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]

2025-03-17 Thread via GitHub
gf2121 opened a new pull request, #14361: URL: https://github.com/apache/lucene/pull/14361 This PR tries another way to implement the idea of https://github.com/apache/lucene/pull/13521, taking advantage of an auto-vectorized loop to decode ints, as we did for bpv24 in https://github.com/

Re: [PR] Add support for two-phase iterators to DenseConjunctionBulkScorer. [lucene]

2025-03-17 Thread via GitHub
gf2121 commented on code in PR #14359: URL: https://github.com/apache/lucene/pull/14359#discussion_r1998258104 ## lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java: ## @@ -238,9 +296,77 @@ private void scoreWindowUsingBitSet( windowMatches.clear(

Re: [I] TestKnnGraph.testMultiThreadedSearch random test failure [lucene]

2025-03-17 Thread via GitHub
gf2121 commented on issue #14327: URL: https://github.com/apache/lucene/issues/14327#issuecomment-2729377397 Seeing a similar failure as well: ``` > java.lang.AssertionError: [doc=0 score=0.990099 shardIndex=-1, doc=3 score=0.49751243 shardIndex=-1, doc=5 score=0.21691975 shardIn

Re: [PR] add Automaton.toMermaid() for emitting mermaid state charts [lucene]

2025-03-17 Thread via GitHub
dweiss commented on PR #14360: URL: https://github.com/apache/lucene/pull/14360#issuecomment-2728443809 I've looked at the docs of mermaid and toyed around a bit. I agree that infinite loops are so ugly that one's eyes start to bleed. Maybe we should stick to graphviz.

Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]

2025-03-17 Thread via GitHub
vigyasharma commented on code in PR #14335: URL: https://github.com/apache/lucene/pull/14335#discussion_r1998041333 ## lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java: ## @@ -0,0 +1,44 @@ +package org.apache.lucene.index; + +import java.util.concurren

Re: [PR] Add support for two-phase iterators to DenseConjunctionBulkScorer. [lucene]

2025-03-17 Thread via GitHub
gf2121 commented on code in PR #14359: URL: https://github.com/apache/lucene/pull/14359#discussion_r1998219212 ## lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java: ## @@ -238,9 +296,77 @@ private void scoreWindowUsingBitSet( windowMatches.clear(

[I] Support modifying segmentInfos.counter in IndexWriter [lucene]

2025-03-17 Thread via GitHub
guojialiang92 opened a new issue, #14362: URL: https://github.com/apache/lucene/issues/14362 ### Description Can we support modifying `segmentInfos.counter` in `IndexWriter`? This can be used to skip some segment names when writing. In the scenario of enabling `segment replicatio

Re: [PR] Specialise DirectMonotonicReader when it only contains one block [lucene]

2025-03-17 Thread via GitHub
iverase closed pull request #14358: Specialise DirectMonotonicReader when it only contains one block URL: https://github.com/apache/lucene/pull/14358