date:20250206

Re: [I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]

2025-02-06 Thread via GitHub

Vikasht34 commented on issue #14208: URL: https://github.com/apache/lucene/issues/14208#issuecomment-2642077599 Hierarchical Merge Execution (Layer-by-Layer Merging): Instead of merging all HNSW layers at once, which leads to high peak memory usage, merges can be executed incrementally, lay

Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2025-02-06 Thread via GitHub

dsmiley commented on issue #13387: URL: https://github.com/apache/lucene/issues/13387#issuecomment-2641899421 Addressing this need would be amazing! Many search architectures (including where I work) always filter to a specific field (say a doc type or tenant/user; it depends). That 50-60

Re: [PR] Make Operations#union merge accept states that have no outgoing transition. [lucene]

2025-02-06 Thread via GitHub

rmuir commented on code in PR #14207: URL: https://github.com/apache/lucene/pull/14207#discussion_r1945858708 ## lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java: ## @@ -1052,6 +1052,77 @@ public static Automaton removeDeadStates(Automaton a) { return r

Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-06 Thread via GitHub

gsmiller commented on code in PR #14204: URL: https://github.com/apache/lucene/pull/14204#discussion_r1945846299 ## lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java: ## @@ -0,0 +1,252 @@ +/* + * Licensed to the Apache Software Foundation (ASF) unde

Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

2025-02-06 Thread via GitHub

rmuir commented on code in PR #14192: URL: https://github.com/apache/lucene/pull/14192#discussion_r1945855034 ## lucene/core/src/java/org/apache/lucene/util/automaton/CaseFoldingUtil.java: ## @@ -0,0 +1,338 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or

Re: [PR] Migrate OpenNLP 'ant train-test-models' to Gradle [lucene]

2025-02-06 Thread via GitHub

msfroh commented on code in PR #14198: URL: https://github.com/apache/lucene/pull/14198#discussion_r1945713513 ## lucene/analysis/opennlp/build.gradle: ## @@ -26,3 +26,33 @@ dependencies { moduleTestImplementation project(':lucene:test-framework') } + +ext { + testModelDa

Re: [PR] Migrate OpenNLP 'ant train-test-models' to Gradle [lucene]

2025-02-06 Thread via GitHub

msfroh commented on code in PR #14198: URL: https://github.com/apache/lucene/pull/14198#discussion_r1945585486 ## lucene/analysis/opennlp/build.gradle: ## @@ -26,3 +26,33 @@ dependencies { moduleTestImplementation project(':lucene:test-framework') } + +ext { + testModelDa

Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

2025-02-06 Thread via GitHub

john-wagster commented on PR #14192: URL: https://github.com/apache/lucene/pull/14192#issuecomment-2641159156 Iterated here a bit after the changes in https://github.com/apache/lucene/pull/14193 went in and also pivoted to using https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt. I

[PR] Support DataInput as source for StoredField [lucene]

2025-02-06 Thread via GitHub

Tim-Brooks opened a new pull request, #14213: URL: https://github.com/apache/lucene/pull/14213 Allows a StoredField to be created from a DataInput. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

Re: [PR] Support DataInput as source for StoredField [lucene]

2025-02-06 Thread via GitHub

Tim-Brooks commented on PR #14213: URL: https://github.com/apache/lucene/pull/14213#issuecomment-2640892230 I am opening this proposed change to support writing a stored field from a byte source which does not require a contiguous array allocation. The reason I am proposing this is because

Re: [PR] SortedSet DV Multi Range query [lucene]

2025-02-06 Thread via GitHub

mkhludnev commented on PR #13974: URL: https://github.com/apache/lucene/pull/13974#issuecomment-2640847271 Thanks. I'm happy to hear. Here's what I have to work on: - @gsmiller what's your feeling about the [proposed API](https://github.com/apache/lucene/blob/c56caeb26a5af4b0afc5f2cb04a4f

Re: [I] Fix Operations.reverse() to not add nondetermistic dead states [lucene]

2025-02-06 Thread via GitHub

rmuir commented on issue #14211: URL: https://github.com/apache/lucene/issues/14211#issuecomment-2640743462 PR: https://github.com/apache/lucene/pull/14212

Re: [PR] Integrating GPU based Vector Search using cuVS [lucene]

2025-02-06 Thread via GitHub

chatman commented on code in PR #14131: URL: https://github.com/apache/lucene/pull/14131#discussion_r1945235820 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/vectorsearch/CuVSKnnFloatVectorQuery.java: ## @@ -0,0 +1,53 @@ +/* + * Licensed to the Apache Software Foundation

Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-06 Thread via GitHub

iverase commented on PR #14204: URL: https://github.com/apache/lucene/pull/14204#issuecomment-2640605039 >So I guess that the only option is to fail at runtime. I can do that. What looks like a reasonable cap on the number of returned intervals? 1024? 1024 sounds like a good default and ma

Re: [I] Fix Operations.reverse() to not add nondetermistic dead states [lucene]

2025-02-06 Thread via GitHub

rmuir commented on issue #14211: URL: https://github.com/apache/lucene/issues/14211#issuecomment-2640601140 I understand it now: the problem is the Set of new initialStates being populated as a side-effect, which is what Brzozowski is using. Some of those are dead states, we remove them, bu

[jira] [Updated] (LUCENE-10471) Increase the number of dims for KNN vectors to 2048

2025-02-06 Thread ASF GitHub Bot (Jira)

[ https://issues.apache.org/jira/browse/LUCENE-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated LUCENE-10471: Labels: pull-request-available (was: ) > Increase the number of dims for KNN vectors to

Re: [I] Increase the number of dims for KNN vectors to 2048 [LUCENE-10471] [lucene]

2025-02-06 Thread via GitHub

jzwolak commented on issue #11507: URL: https://github.com/apache/lucene/issues/11507#issuecomment-2640596317 @asfimport I suggest making it easier to change the limit. I appreciate the need to have a limit for optimization and performance. The models that have more than 1024 dimensi

Re: [I] Fix Operations.reverse() to not add nondetermistic dead states [lucene]

2025-02-06 Thread via GitHub

rmuir commented on issue #14211: URL: https://github.com/apache/lucene/issues/14211#issuecomment-2640589820 We have one use of reverse() outside of tests: but it is an important one (getCommonSuffix) and it already cleans up after reverse by calling removeDeadStates(). So to me, the obvious

Re: [PR] Add new Acorn-esque filtered HNSW search heuristic [lucene]

2025-02-06 Thread via GitHub

benwtrent commented on PR #14160: URL: https://github.com/apache/lucene/pull/14160#issuecomment-2640589065 OK, the current implementation is about as good as I can figure it. - We explore greater than neighbor-neighbors if we gathered < maxConn/4 vectors to score - We will explor

Re: [PR] Deprecate Operations.concat(a1, a2) and Operations.union(a1, a2) [lucene]

2025-02-06 Thread via GitHub

rmuir commented on PR #14209: URL: https://github.com/apache/lucene/pull/14209#issuecomment-2640553292 To me the deprecation is easy enough, developer is usually responsive to such things. It comes across different than just a hard break in a few ways, usually you question why the p

Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-06 Thread via GitHub

jpountz commented on PR #14204: URL: https://github.com/apache/lucene/pull/14204#issuecomment-2640553823 I agree that providing a small interval is a bad usage pattern. I don't know how to validate this though, since we can't know the range of values of the docs that match the query up-fron

[I] Fix Operations.reverse() to not add nondetermistic dead states [lucene]

2025-02-06 Thread via GitHub

rmuir opened a new issue, #14211: URL: https://github.com/apache/lucene/issues/14211 ### Description Operations.reverse() doesn't just add dead-states, it adds dead-states with nondeterminism, such that returned automaton `isDeterministic()` becomes `false`. That's really a bit more th

Re: [I] flaky test: concatenate turns NFA into a DFA and it causes test fail [lucene]

2025-02-06 Thread via GitHub

rmuir commented on issue #14210: URL: https://github.com/apache/lucene/issues/14210#issuecomment-2640522300 I pushed a fix to correct the build, but I will add more tracing to this seed to see who is adding the "nondeterministic dead states". If it is an automaton method, we need to fix

Re: [I] flaky test: concatenate turns NFA into a DFA and it causes test fail [lucene]

2025-02-06 Thread via GitHub

asfgit closed issue #14210: flaky test: concatenate turns NFA into a DFA and it causes test fail URL: https://github.com/apache/lucene/issues/14210

Re: [I] flaky test: concatenate turns NFA into a DFA and it causes test fail [lucene]

2025-02-06 Thread via GitHub

rmuir commented on issue #14210: URL: https://github.com/apache/lucene/issues/14210#issuecomment-2640481736 If I comment out `removeDeadStates()` then the test passes. If I add the `removeDeadStates()` to the test before doing any concatenate, the test passes. The problem happens b

Re: [I] flaky test: concatenate turns NFA into a DFA and it causes test fail [lucene]

2025-02-06 Thread via GitHub

rmuir commented on issue #14210: URL: https://github.com/apache/lucene/issues/14210#issuecomment-2640461459 And of course the automaton is massive, nightmare: ![Image](https://github.com/user-attachments/assets/d61e0f36-b4bf-4cd2-ac4a-4556922b7142)

Re: [PR] Deprecate Operations.concat(a1, a2) and Operations.union(a1, a2) [lucene]

2025-02-06 Thread via GitHub

rmuir commented on PR #14209: URL: https://github.com/apache/lucene/pull/14209#issuecomment-2640417498 https://github.com/apache/lucene/issues/14210

Re: [PR] Deprecate Operations.concat(a1, a2) and Operations.union(a1, a2) [lucene]

2025-02-06 Thread via GitHub

rmuir commented on PR #14209: URL: https://github.com/apache/lucene/pull/14209#issuecomment-2640379859 The problem is not related to this PR from what I can tell, the issue is that `concatenate()` turns an NFA into a DFA and this fails the test. I can reproduce it in main: I will deal

Re: [PR] Deprecate Operations.concat(a1, a2) and Operations.union(a1, a2) [lucene]

2025-02-06 Thread via GitHub

rmuir commented on PR #14209: URL: https://github.com/apache/lucene/pull/14209#issuecomment-2640351185 I will look into the test fail, I did run them many times, but I was also pretty aggressive about trying to clean up tests, so that we can just remove the deprecations in a followup commit,

Re: [PR] Add Automata.makeCharSet/makeCharClass to optimize regexp [lucene]

2025-02-06 Thread via GitHub

dweiss commented on PR #14193: URL: https://github.com/apache/lucene/pull/14193#issuecomment-2640330745 > Finally! The concatenate() issue was an easy fix, it neglected to clean up its dead states. All of its partners in crime do this, but the fact we neglect it for concatenate messes up to

Re: [I] Evaluate adding a double addressing vector scorer [lucene]

2025-02-06 Thread via GitHub

benwtrent commented on issue #13966: URL: https://github.com/apache/lucene/issues/13966#issuecomment-2640281978 Fixed by: https://github.com/apache/lucene/pull/14181

Re: [PR] Add updateable random scorer interface for vector index building [lucene]

2025-02-06 Thread via GitHub

ChrisHegarty commented on PR #14181: URL: https://github.com/apache/lucene/pull/14181#issuecomment-2640293165 Thank you @benwtrent - let's wait for the next lucene nightly. If no perf improvement, that's ok. There should be a lot less garbage created, and CPU devoted to cleaning young heap

Re: [I] Evaluate adding a double addressing vector scorer [lucene]

2025-02-06 Thread via GitHub

benwtrent closed issue #13966: Evaluate adding a double addressing vector scorer URL: https://github.com/apache/lucene/issues/13966

Re: [PR] Add updateable random scorer interface for vector index building [lucene]

2025-02-06 Thread via GitHub

benwtrent merged PR #14181: URL: https://github.com/apache/lucene/pull/14181

[PR] Deprecate Operations.concat(a1, a2) and Operations.union(a1, a2) [lucene]

2025-02-06 Thread via GitHub

rmuir opened a new pull request, #14209: URL: https://github.com/apache/lucene/pull/14209 These algorithms run in linear time: it is trappy to offer two-arg options: they encourage users to use them in a loop and create quadratic time. Send a strong signal to the user's editor/IDE to
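The quadratic-time trap described in this pull request can be illustrated with a toy model. The sketch below is not Lucene code: plain integer lists stand in for automata, and the hypothetical helpers `concat` and `concatAll` simply count how many elements get copied. It assumes only what the message states, namely that each call does work linear in the size of its inputs.

```java
import java.util.ArrayList;
import java.util.List;

public class ConcatCost {
    // Toy stand-in for a two-argument concat: the result is a fresh copy of both inputs.
    static List<Integer> concat(List<Integer> a, List<Integer> b, long[] copied) {
        List<Integer> out = new ArrayList<>(a.size() + b.size());
        out.addAll(a);
        out.addAll(b);
        copied[0] += a.size() + b.size();
        return out;
    }

    // Toy stand-in for a list-accepting concat: every input is copied exactly once.
    static List<Integer> concatAll(List<List<Integer>> parts, long[] copied) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> p : parts) {
            out.addAll(p);
            copied[0] += p.size();
        }
        return out;
    }

    // Builds n inputs of the given size (n is assumed to be at least 1).
    static List<List<Integer>> makeParts(int n, int size) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            List<Integer> p = new ArrayList<>();
            for (int j = 0; j < size; j++) {
                p.add(j);
            }
            parts.add(p);
        }
        return parts;
    }

    // Folds the inputs pairwise in a loop; the accumulator is re-copied on every step.
    static long pairwiseCopies(int n, int size) {
        List<List<Integer>> parts = makeParts(n, size);
        long[] copied = {0};
        List<Integer> acc = parts.get(0);
        for (int i = 1; i < n; i++) {
            acc = concat(acc, parts.get(i), copied);
        }
        return copied[0];
    }

    // Hands all inputs to a single call.
    static long singlePassCopies(int n, int size) {
        long[] copied = {0};
        concatAll(makeParts(n, size), copied);
        return copied[0];
    }

    public static void main(String[] args) {
        System.out.println("pairwise loop: " + pairwiseCopies(100, 10));
        System.out.println("single call:   " + singlePassCopies(100, 10));
    }
}
```

In this model the pairwise loop copies a number of elements that grows with the square of the number of inputs, while the single call grows linearly, which is the usage pattern the deprecation is meant to discourage.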

Re: [PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]

2025-02-06 Thread via GitHub

stefanvodita commented on code in PR #13914: URL: https://github.com/apache/lucene/pull/13914#discussion_r1944983208 ## lucene/facet/src/java/org/apache/lucene/facet/range/DynamicRangeUtil.java: ## @@ -202,66 +208,83 @@ public SegmentOutput(int hitsLength) { * is used to

[I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]

2025-02-06 Thread via GitHub

benwtrent opened a new issue, #14208: URL: https://github.com/apache/lucene/issues/14208 ### Description I am not sure of other structures, but HNSW merges can allocate a pretty large chunk of memory on heap. For example: Let's have the max_conn set to 16. Thus connectio

Re: [PR] Add Automata.makeCharSet/makeCharClass to optimize regexp [lucene]

2025-02-06 Thread via GitHub

jpountz commented on code in PR #14193: URL: https://github.com/apache/lucene/pull/14193#discussion_r1944819744 ## lucene/core/src/test/org/apache/lucene/util/automaton/TestAutomaton.java: ## @@ -667,11 +667,14 @@ public void testConcatenatePreservesDet() throws Exception {

Re: [PR] Automatically remove dead states from concatenated automata. [lucene]

2025-02-06 Thread via GitHub

jpountz commented on PR #14205: URL: https://github.com/apache/lucene/pull/14205#issuecomment-2639972179 Superseded by #14193

Re: [PR] Automatically remove dead states from concatenated automata. [lucene]

2025-02-06 Thread via GitHub

jpountz closed pull request #14205: Automatically remove dead states from concatenated automata. URL: https://github.com/apache/lucene/pull/14205

Re: [PR] Add Automata.makeCharSet/makeCharClass to optimize regexp [lucene]

2025-02-06 Thread via GitHub

rmuir merged PR #14193: URL: https://github.com/apache/lucene/pull/14193

Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-06 Thread via GitHub

iverase commented on code in PR #14204: URL: https://github.com/apache/lucene/pull/14204#discussion_r1944680152 ## lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java: ## @@ -0,0 +1,252 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

Re: [PR] Add updateable random scorer interface for vector index building [lucene]

2025-02-06 Thread via GitHub

benwtrent commented on PR #14181: URL: https://github.com/apache/lucene/pull/14181#issuecomment-2639727635 OK, benchmarks show pretty much no difference when I was testing on my machine. But, recall numbers all check out and the API is nicer and makes more sense for vector merging. So

Re: [PR] Add Automata.makeCharSet/makeCharClass to optimize regexp [lucene]

2025-02-06 Thread via GitHub

rmuir commented on code in PR #14193: URL: https://github.com/apache/lucene/pull/14193#discussion_r1944626865 ## lucene/core/src/test/org/apache/lucene/util/automaton/TestAutomaton.java: ## @@ -667,11 +667,14 @@ public void testConcatenatePreservesDet() throws Exception { }

Re: [PR] Automatically remove dead states from concatenated automata. [lucene]

2025-02-06 Thread via GitHub

rmuir commented on PR #14205: URL: https://github.com/apache/lucene/pull/14205#issuecomment-2639660109 It's in my PR over there too. I think we should avoid addEpsilon in the test. We don't even need any transitions.

[PR] Make Operations#union merge accept states that have no outgoing transition. [lucene]

2025-02-06 Thread via GitHub

jpountz opened a new pull request, #14207: URL: https://github.com/apache/lucene/pull/14207 This helps generate simpler automata, especially when these automata are later combined through other operations such as `Operations#concat`.

[I] TestLogMergePolicy#testNoPathologicalMerge reproducible failure [lucene]

2025-02-06 Thread via GitHub

iverase opened a new issue, #14206: URL: https://github.com/apache/lucene/issues/14206 The following seed reproduces the issue: ``` ./gradlew :lucene:core:test --tests "org.apache.lucene.index.TestLogMergePolicy.testNoPathologicalMerges" -Ptests.seed=5C1CAC337454D389 > Ta

[PR] Automatically remove dead states from concatenated automata. [lucene]

2025-02-06 Thread via GitHub

jpountz opened a new pull request, #14205: URL: https://github.com/apache/lucene/pull/14205 Concatenating automata frequently creates dead states. This PR suggests that `Operations#concatenate` automatically removes these dead states. This is not unseen: `Operations#repeat`, `Operations#uni

Re: [PR] Migrate OpenNLP 'ant train-test-models' to Gradle [lucene]

2025-02-06 Thread via GitHub

dweiss commented on code in PR #14198: URL: https://github.com/apache/lucene/pull/14198#discussion_r1944528365 ## lucene/analysis/opennlp/build.gradle: ## @@ -26,3 +26,33 @@ dependencies { moduleTestImplementation project(':lucene:test-framework') } + +ext { + testModelDa

Re: [PR] Reduce virtual calls when visiting bpv24-encoded doc ids in BKD leaves [lucene]

2025-02-06 Thread via GitHub

gf2121 commented on PR #14176: URL: https://github.com/apache/lucene/pull/14176#issuecomment-2639462765 Thanks @jpountz ! Updated. Could you please also help review https://github.com/mikemccand/luceneutil/pull/335 ? I'd like to merge it first so that nightly benchmark can ca

[PR] Add histogram facet capabilities. [lucene]

2025-02-06 Thread via GitHub

jpountz opened a new pull request, #14204: URL: https://github.com/apache/lucene/pull/14204 This is inspired from a paper by Tencent where the authors describe how they speed up so-called "histogram queries" by sorting the index by timestamp and translating ranges of values corresponding to

Re: [PR] Introduce bpv24 vectorized decoding for DocIdsWriter [lucene]

2025-02-06 Thread via GitHub

gf2121 commented on PR #14176: URL: https://github.com/apache/lucene/pull/14176#issuecomment-2639090793 Thanks @iverase ! For the vectorized decoding, I benchmarked the decoding method with jmh, the result on my M2 Mac: ``` Benchmark Mode Cnt

52 matches
