[I] TestLogMergePolicy#testNoPathologicalMerge reproducible failure [lucene]

2025-02-06 Thread via GitHub
iverase opened a new issue, #14206: URL: https://github.com/apache/lucene/issues/14206 The following seed reproduces the issue: ``` ./gradlew :lucene:core:test --tests "org.apache.lucene.index.TestLogMergePolicy.testNoPathologicalMerges" -Ptests.seed=5C1CAC337454D389 > Ta

[PR] Make Operations#union merge accept states that have no outgoing transition. [lucene]

2025-02-06 Thread via GitHub
jpountz opened a new pull request, #14207: URL: https://github.com/apache/lucene/pull/14207 This helps generate simpler automata, especially when these automata are later combined through other operations such as `Operations#concat`. -- This is an automated message from the Apache Git Ser

Re: [PR] Automatically remove dead states from concatenated automata. [lucene]

2025-02-06 Thread via GitHub
rmuir commented on PR #14205: URL: https://github.com/apache/lucene/pull/14205#issuecomment-2639660109 It's in my PR over there too. I think we should avoid addEpsilon in the test. We don't even need any transitions.

[PR] Automatically remove dead states from concatenated automata. [lucene]

2025-02-06 Thread via GitHub
jpountz opened a new pull request, #14205: URL: https://github.com/apache/lucene/pull/14205 Concatenating automata frequently creates dead states. This PR suggests that `Operations#concatenate` automatically removes these dead states. This is not unseen: `Operations#repeat`, `Operations#uni
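To make "dead states" concrete, here is a minimal sketch under stated assumptions. It treats a state as live when it is reachable from the initial state and can still reach an accept state; everything else is dead. The class `DeadStateSketch` and its methods are hypothetical names for a toy graph and are not Lucene's `Automaton` or `Operations` code; labels on transitions are ignored.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Deque;
import java.util.List;

// Toy automaton for illustration only; this is NOT Lucene's Automaton class.
// States are ints, state 0 is the initial state, and transition labels are ignored.
final class DeadStateSketch {
  private final List<List<Integer>> out = new ArrayList<>(); // outgoing edges
  private final List<List<Integer>> in = new ArrayList<>();  // incoming edges
  private final BitSet accept = new BitSet();

  DeadStateSketch(int numStates) {
    for (int i = 0; i < numStates; i++) {
      out.add(new ArrayList<>());
      in.add(new ArrayList<>());
    }
  }

  void addTransition(int from, int to) {
    out.get(from).add(to);
    in.get(to).add(from);
  }

  void setAccept(int state) {
    accept.set(state);
  }

  // A state is live if it is reachable from state 0 AND can reach an accept state.
  BitSet liveStates() {
    if (out.isEmpty()) {
      return new BitSet();
    }
    BitSet initial = new BitSet();
    initial.set(0);
    BitSet live = reach(initial, out); // forward search from the initial state
    live.and(reach(accept, in));       // backward search from the accept states
    return live;
  }

  // Breadth-first search over the given edge lists, starting from all seed states.
  private static BitSet reach(BitSet seeds, List<List<Integer>> edges) {
    BitSet seen = (BitSet) seeds.clone();
    Deque<Integer> queue = new ArrayDeque<>();
    for (int s = seeds.nextSetBit(0); s >= 0; s = seeds.nextSetBit(s + 1)) {
      queue.add(s);
    }
    while (!queue.isEmpty()) {
      int s = queue.poll();
      for (int t : edges.get(s)) {
        if (!seen.get(t)) {
          seen.set(t);
          queue.add(t);
        }
      }
    }
    return seen;
  }

  public static void main(String[] args) {
    DeadStateSketch a = new DeadStateSketch(4);
    a.addTransition(0, 1); // 0 -> 1, and 1 accepts
    a.addTransition(0, 2); // 2 is reachable but can never reach an accept state
    a.addTransition(3, 1); // 3 can reach an accept state but is unreachable
    a.setAccept(1);
    System.out.println("live states: " + a.liveStates());
  }
}
```

In this toy example only states 0 and 1 are live, so a cleanup pass would drop states 2 and 3. Two plain graph searches are used here because they are easy to check by hand; the approach taken in the PR may differ.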

Re: [PR] Migrate OpenNLP 'ant train-test-models' to Gradle [lucene]

2025-02-06 Thread via GitHub
dweiss commented on code in PR #14198: URL: https://github.com/apache/lucene/pull/14198#discussion_r1944528365 ## lucene/analysis/opennlp/build.gradle: ## @@ -26,3 +26,33 @@ dependencies { moduleTestImplementation project(':lucene:test-framework') } + +ext { + testModelDa

Re: [PR] Reduce virtual calls when visiting bpv24-encoded doc ids in BKD leaves [lucene]

2025-02-06 Thread via GitHub
gf2121 commented on PR #14176: URL: https://github.com/apache/lucene/pull/14176#issuecomment-2639462765 Thanks @jpountz ! Updated. Could you please also help review https://github.com/mikemccand/luceneutil/pull/335 ? I'd like to merge it first so that nightly benchmark can ca

Re: [PR] Introduce bpv24 vectorized decoding for DocIdsWriter [lucene]

2025-02-06 Thread via GitHub
gf2121 commented on PR #14176: URL: https://github.com/apache/lucene/pull/14176#issuecomment-2639090793 Thanks @iverase ! For the vectorized decoding, I benchmarked the decoding method with JMH; the result on my M2 Mac: ``` Benchmark Mode Cnt

[PR] Add histogram facet capabilities. [lucene]

2025-02-06 Thread via GitHub
jpountz opened a new pull request, #14204: URL: https://github.com/apache/lucene/pull/14204 This is inspired from a paper by Tencent where the authors describe how they speed up so-called "histogram queries" by sorting the index by timestamp and translating ranges of values corresponding to

Re: [PR] Add Automata.makeCharSet/makeCharClass to optimize regexp [lucene]

2025-02-06 Thread via GitHub
rmuir commented on code in PR #14193: URL: https://github.com/apache/lucene/pull/14193#discussion_r1944626865 ## lucene/core/src/test/org/apache/lucene/util/automaton/TestAutomaton.java: ## @@ -667,11 +667,14 @@ public void testConcatenatePreservesDet() throws Exception { }

Re: [I] Fix Operations.reverse() to not add nondetermistic dead states [lucene]

2025-02-06 Thread via GitHub
rmuir commented on issue #14211: URL: https://github.com/apache/lucene/issues/14211#issuecomment-2640743462 PR: https://github.com/apache/lucene/pull/14212

Re: [PR] Support DataInput as source for StoredField [lucene]

2025-02-06 Thread via GitHub
Tim-Brooks commented on PR #14213: URL: https://github.com/apache/lucene/pull/14213#issuecomment-2640892230 I am opening this proposed change to support writing a stored field from a byte source which does not require a contiguous array allocation. The reason I am proposing this is because

[PR] Support DataInput as source for StoredField [lucene]

2025-02-06 Thread via GitHub
Tim-Brooks opened a new pull request, #14213: URL: https://github.com/apache/lucene/pull/14213 Allows a StoredField to be created from a DataInput.

Re: [PR] SortedSet DV Multi Range query [lucene]

2025-02-06 Thread via GitHub
mkhludnev commented on PR #13974: URL: https://github.com/apache/lucene/pull/13974#issuecomment-2640847271 Thanks. I'm happy to hear. Here's what I have to work on: - @gsmiller what's your feeling about the [proposed API](https://github.com/apache/lucene/blob/c56caeb26a5af4b0afc5f2cb04a4f

Re: [PR] Add updateable random scorer interface for vector index building [lucene]

2025-02-06 Thread via GitHub
benwtrent commented on PR #14181: URL: https://github.com/apache/lucene/pull/14181#issuecomment-2639727635 OK, benchmarks showed pretty much no difference when I tested on my machine. But the recall numbers all check out, and the API is nicer and makes more sense for vector merging. So

Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-06 Thread via GitHub
jpountz commented on PR #14204: URL: https://github.com/apache/lucene/pull/14204#issuecomment-2640553823 I agree that providing a small interval is a bad usage pattern. I don't know how to validate this though, since we can't know the range of values of the docs that match the query up-fron

Re: [PR] Deprecate Operations.concat(a1, a2) and Operations.union(a1, a2) [lucene]

2025-02-06 Thread via GitHub
rmuir commented on PR #14209: URL: https://github.com/apache/lucene/pull/14209#issuecomment-2640553292 To me the deprecation is easy enough; developers are usually responsive to such things. It comes across differently than a hard break in a few ways: usually you question why the p

Re: [PR] Add new Acorn-esque filtered HNSW search heuristic [lucene]

2025-02-06 Thread via GitHub
benwtrent commented on PR #14160: URL: https://github.com/apache/lucene/pull/14160#issuecomment-2640589065 OK, the current implementation is about as good as I can figure it. - We explore greater than neighbor-neighbors if we gathered < maxConn/4 vectors to score - We will explor

Re: [PR] Add Automata.makeCharSet/makeCharClass to optimize regexp [lucene]

2025-02-06 Thread via GitHub
rmuir merged PR #14193: URL: https://github.com/apache/lucene/pull/14193

[I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]

2025-02-06 Thread via GitHub
benwtrent opened a new issue, #14208: URL: https://github.com/apache/lucene/issues/14208 ### Description I am not sure of other structures, but HNSW merges can allocate a pretty large chunk of memory on heap. For example: Let's have the max_conn set to 16. Thus connectio

Re: [I] Fix Operations.reverse() to not add nondetermistic dead states [lucene]

2025-02-06 Thread via GitHub
rmuir commented on issue #14211: URL: https://github.com/apache/lucene/issues/14211#issuecomment-2640589820 We have one use of reverse() outside of tests: but it is an important one (getCommonSuffix) and it already cleans up after reverse by calling removeDeadStates(). So to me, the obvious

Re: [I] Fix Operations.reverse() to not add nondetermistic dead states [lucene]

2025-02-06 Thread via GitHub
rmuir commented on issue #14211: URL: https://github.com/apache/lucene/issues/14211#issuecomment-2640601140 I understand it now: the problem is the Set of new initialStates being populated as a side effect, which is what Brzozowski is using. Some of those are dead states, we remove them, bu

Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-06 Thread via GitHub
iverase commented on PR #14204: URL: https://github.com/apache/lucene/pull/14204#issuecomment-2640605039 > So I guess that the only option is to fail at runtime. I can do that. What looks like a reasonable cap on the number of returned intervals? 1024? 1024 sounds like a good default and ma
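The idea discussed above, failing at runtime once a histogram would return more than a fixed number of intervals, can be sketched as follows. This is a hypothetical, self-contained illustration and not the `HistogramCollector` code from PR #14204: the class name, the `MAX_BUCKETS` constant, the use of `Math.floorDiv` to assign buckets, and the exception type are all assumptions. Only the 1024 figure comes from the thread.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; not the HistogramCollector from the PR.
final class HistogramCapSketch {
  // Cap floated in the thread; treating it as a fixed constant is an assumption.
  static final int MAX_BUCKETS = 1024;

  // Counts values per bucket, where bucket = floor(value / interval).
  // Fails as soon as the number of distinct buckets exceeds MAX_BUCKETS.
  static Map<Long, Integer> histogram(long[] values, long interval) {
    if (interval <= 0) {
      throw new IllegalArgumentException("interval must be positive, got " + interval);
    }
    Map<Long, Integer> counts = new TreeMap<>();
    for (long value : values) {
      long bucket = Math.floorDiv(value, interval); // rounds toward negative infinity
      counts.merge(bucket, 1, Integer::sum);
      if (counts.size() > MAX_BUCKETS) {
        throw new IllegalStateException(
            "more than " + MAX_BUCKETS + " buckets; consider a larger interval");
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    long[] timestamps = {0, 5, 9, 10, -1};
    System.out.println(histogram(timestamps, 10));
  }
}
```

Checking the cap while collecting, rather than validating the interval up front, matches the constraint raised in the thread: the range of values of the matching documents is not known in advance.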

[jira] [Updated] (LUCENE-10471) Increase the number of dims for KNN vectors to 2048

2025-02-06 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/LUCENE-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated LUCENE-10471: Labels: pull-request-available (was: ) > Increase the number of dims for KNN vectors to

Re: [I] Increase the number of dims for KNN vectors to 2048 [LUCENE-10471] [lucene]

2025-02-06 Thread via GitHub
jzwolak commented on issue #11507: URL: https://github.com/apache/lucene/issues/11507#issuecomment-2640596317 @asfimport I suggest making it easier to change the limit. I appreciate the need to have a limit for optimization and performance. The models that have more than 1024 dimensi

Re: [PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]

2025-02-06 Thread via GitHub
stefanvodita commented on code in PR #13914: URL: https://github.com/apache/lucene/pull/13914#discussion_r1944983208 ## lucene/facet/src/java/org/apache/lucene/facet/range/DynamicRangeUtil.java: ## @@ -202,66 +208,83 @@ public SegmentOutput(int hitsLength) { * is used to

Re: [PR] Deprecate Operations.concat(a1, a2) and Operations.union(a1, a2) [lucene]

2025-02-06 Thread via GitHub
rmuir commented on PR #14209: URL: https://github.com/apache/lucene/pull/14209#issuecomment-2640351185 I will look into the test failure. I did run them many times, but I was also pretty aggressive about trying to clean up tests, so that we can just remove the deprecations in a follow-up commit,

Re: [I] flaky test: concatenate turns NFA into a DFA and it causes test fail [lucene]

2025-02-06 Thread via GitHub
rmuir commented on issue #14210: URL: https://github.com/apache/lucene/issues/14210#issuecomment-2640522300 I pushed a fix to correct the build, but I will add more tracing to this seed to see who is adding the "nondeterministic dead states". If it is an automaton method, we need to fix

[I] Fix Operations.reverse() to not add nondetermistic dead states [lucene]

2025-02-06 Thread via GitHub
rmuir opened a new issue, #14211: URL: https://github.com/apache/lucene/issues/14211 ### Description Operations.reverse() doesn't just add dead states; it adds dead states with nondeterminism, such that the returned automaton's `isDeterministic()` becomes `false`. That's really a bit more th

Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

2025-02-06 Thread via GitHub
rmuir commented on code in PR #14192: URL: https://github.com/apache/lucene/pull/14192#discussion_r1945855034 ## lucene/core/src/java/org/apache/lucene/util/automaton/CaseFoldingUtil.java: ## @@ -0,0 +1,338 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or

Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2025-02-06 Thread via GitHub
dsmiley commented on issue #13387: URL: https://github.com/apache/lucene/issues/13387#issuecomment-2641899421 Addressing this need would be amazing! Many search architectures (including where I work) always filter to a specific field (say a doc type or tenant/user; it depends). That 50-60

Re: [I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]

2025-02-06 Thread via GitHub
Vikasht34 commented on issue #14208: URL: https://github.com/apache/lucene/issues/14208#issuecomment-2642077599 Hierarchical Merge Execution (Layer-by-Layer Merging): Instead of merging all HNSW layers at once, which leads to high peak memory usage, merges can be executed incrementally, lay

Re: [PR] Migrate OpenNLP 'ant train-test-models' to Gradle [lucene]

2025-02-06 Thread via GitHub
msfroh commented on code in PR #14198: URL: https://github.com/apache/lucene/pull/14198#discussion_r1945585486 ## lucene/analysis/opennlp/build.gradle: ## @@ -26,3 +26,33 @@ dependencies { moduleTestImplementation project(':lucene:test-framework') } + +ext { + testModelDa

Re: [PR] Make Operations#union merge accept states that have no outgoing transition. [lucene]

2025-02-06 Thread via GitHub
rmuir commented on code in PR #14207: URL: https://github.com/apache/lucene/pull/14207#discussion_r1945858708 ## lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java: ## @@ -1052,6 +1052,77 @@ public static Automaton removeDeadStates(Automaton a) { return r

Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-06 Thread via GitHub
gsmiller commented on code in PR #14204: URL: https://github.com/apache/lucene/pull/14204#discussion_r1945846299 ## lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java: ## @@ -0,0 +1,252 @@ +/* + * Licensed to the Apache Software Foundation (ASF) unde

Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

2025-02-06 Thread via GitHub
john-wagster commented on PR #14192: URL: https://github.com/apache/lucene/pull/14192#issuecomment-2641159156 Iterated here a bit after the changes in https://github.com/apache/lucene/pull/14193 went in and also pivoted to using https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt. I

Re: [PR] Migrate OpenNLP 'ant train-test-models' to Gradle [lucene]

2025-02-06 Thread via GitHub
msfroh commented on code in PR #14198: URL: https://github.com/apache/lucene/pull/14198#discussion_r1945713513 ## lucene/analysis/opennlp/build.gradle: ## @@ -26,3 +26,33 @@ dependencies { moduleTestImplementation project(':lucene:test-framework') } + +ext { + testModelDa

[PR] Deprecate Operations.concat(a1, a2) and Operations.union(a1, a2) [lucene]

2025-02-06 Thread via GitHub
rmuir opened a new pull request, #14209: URL: https://github.com/apache/lucene/pull/14209 These algorithms run in linear time, so it is trappy to offer two-arg options: they encourage users to call them in a loop, which takes quadratic time. Send a strong signal to the user's editor/IDE to
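The quadratic trap described above can be illustrated with a toy cost model. The sketch below does not call Lucene: plain `int[]` arrays stand in for automata, and it assumes that each two-argument combine copies both of its inputs into a fresh result (consistent with "linear time" per call). The class and method names are hypothetical. It counts how many elements get copied when folding pairwise in a loop versus combining all operands in one pass.

```java
import java.util.ArrayList;
import java.util.List;

// Toy cost model; int[] arrays stand in for automata. This does not use Lucene.
final class ConcatCostSketch {

  // Folds the parts left to right with a two-argument concat and
  // returns the total number of elements copied along the way.
  static long copiesPairwise(List<int[]> parts) {
    int[] result = new int[0];
    long copied = 0;
    for (int[] part : parts) {
      // Each step re-copies everything accumulated so far, plus the new part.
      int[] next = new int[result.length + part.length];
      System.arraycopy(result, 0, next, 0, result.length);
      System.arraycopy(part, 0, next, result.length, part.length);
      copied += next.length;
      result = next;
    }
    return copied;
  }

  // Combines all parts in one pass; each element is copied exactly once.
  static long copiesBulk(List<int[]> parts) {
    int total = 0;
    for (int[] part : parts) {
      total += part.length;
    }
    int[] result = new int[total];
    int offset = 0;
    long copied = 0;
    for (int[] part : parts) {
      System.arraycopy(part, 0, result, offset, part.length);
      offset += part.length;
      copied += part.length;
    }
    return copied;
  }

  public static void main(String[] args) {
    List<int[]> parts = new ArrayList<>();
    for (int i = 0; i < 100; i++) {
      parts.add(new int[] {i}); // 100 single-element "automata"
    }
    System.out.println("pairwise copies: " + copiesPairwise(parts));
    System.out.println("bulk copies:     " + copiesBulk(parts));
  }
}
```

With 100 single-element parts, the pairwise loop copies 1 + 2 + ... + 100 = 5050 elements, while the single pass copies 100. The gap grows with the square of the number of operands, which is the behavior the deprecation is meant to steer users away from.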

Re: [PR] Add updateable random scorer interface for vector index building [lucene]

2025-02-06 Thread via GitHub
benwtrent merged PR #14181: URL: https://github.com/apache/lucene/pull/14181

Re: [I] Evaluate adding a double addressing vector scorer [lucene]

2025-02-06 Thread via GitHub
benwtrent closed issue #13966: Evaluate adding a double addressing vector scorer URL: https://github.com/apache/lucene/issues/13966

Re: [PR] Add updateable random scorer interface for vector index building [lucene]

2025-02-06 Thread via GitHub
ChrisHegarty commented on PR #14181: URL: https://github.com/apache/lucene/pull/14181#issuecomment-2640293165 Thank you @benwtrent - let's wait for the next lucene nightly. If no perf improvement, that's ok. There should be a lot less garbage created, and CPU devoted to cleaning young heap

Re: [I] Evaluate adding a double addressing vector scorer [lucene]

2025-02-06 Thread via GitHub
benwtrent commented on issue #13966: URL: https://github.com/apache/lucene/issues/13966#issuecomment-2640281978 Fixed by: https://github.com/apache/lucene/pull/14181

Re: [PR] Add Automata.makeCharSet/makeCharClass to optimize regexp [lucene]

2025-02-06 Thread via GitHub
dweiss commented on PR #14193: URL: https://github.com/apache/lucene/pull/14193#issuecomment-2640330745 > Finally! The concatenate() issue was an easy fix, it neglected to clean up its dead states. All of its partners in crime do this, but the fact we neglect it for concatenate messes up to

Re: [I] flaky test: concatenate turns NFA into a DFA and it causes test fail [lucene]

2025-02-06 Thread via GitHub
rmuir commented on issue #14210: URL: https://github.com/apache/lucene/issues/14210#issuecomment-2640461459 And of course the automaton is massive, nightmare: ![Image](https://github.com/user-attachments/assets/d61e0f36-b4bf-4cd2-ac4a-4556922b7142)

Re: [I] flaky test: concatenate turns NFA into a DFA and it causes test fail [lucene]

2025-02-06 Thread via GitHub
rmuir commented on issue #14210: URL: https://github.com/apache/lucene/issues/14210#issuecomment-2640481736 If I comment out `removeDeadStates()` then the test passes. If I add the `removeDeadStates()` to the test before doing any concatenate, the test passes. The problem happens b

Re: [I] flaky test: concatenate turns NFA into a DFA and it causes test fail [lucene]

2025-02-06 Thread via GitHub
asfgit closed issue #14210: flaky test: concatenate turns NFA into a DFA and it causes test fail URL: https://github.com/apache/lucene/issues/14210

Re: [PR] Deprecate Operations.concat(a1, a2) and Operations.union(a1, a2) [lucene]

2025-02-06 Thread via GitHub
rmuir commented on PR #14209: URL: https://github.com/apache/lucene/pull/14209#issuecomment-2640379859 The problem is not related to this PR from what I can tell, the issue is that `concatenate()` turns an NFA into a DFA and this fails the test. I can reproduce it in main: I will deal

Re: [PR] Deprecate Operations.concat(a1, a2) and Operations.union(a1, a2) [lucene]

2025-02-06 Thread via GitHub
rmuir commented on PR #14209: URL: https://github.com/apache/lucene/pull/14209#issuecomment-2640417498 https://github.com/apache/lucene/issues/14210

Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-06 Thread via GitHub
iverase commented on code in PR #14204: URL: https://github.com/apache/lucene/pull/14204#discussion_r1944680152 ## lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java: ## @@ -0,0 +1,252 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

Re: [PR] Automatically remove dead states from concatenated automata. [lucene]

2025-02-06 Thread via GitHub
jpountz closed pull request #14205: Automatically remove dead states from concatenated automata. URL: https://github.com/apache/lucene/pull/14205

Re: [PR] Automatically remove dead states from concatenated automata. [lucene]

2025-02-06 Thread via GitHub
jpountz commented on PR #14205: URL: https://github.com/apache/lucene/pull/14205#issuecomment-2639972179 Superseded by #14193

Re: [PR] Add Automata.makeCharSet/makeCharClass to optimize regexp [lucene]

2025-02-06 Thread via GitHub
jpountz commented on code in PR #14193: URL: https://github.com/apache/lucene/pull/14193#discussion_r1944819744 ## lucene/core/src/test/org/apache/lucene/util/automaton/TestAutomaton.java: ## @@ -667,11 +667,14 @@ public void testConcatenatePreservesDet() throws Exception {

Re: [PR] Integrating GPU based Vector Search using cuVS [lucene]

2025-02-06 Thread via GitHub
chatman commented on code in PR #14131: URL: https://github.com/apache/lucene/pull/14131#discussion_r1945235820 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/vectorsearch/CuVSKnnFloatVectorQuery.java: ## @@ -0,0 +1,53 @@ +/* + * Licensed to the Apache Software Foundation