Re: [PR] Add CaseFolding.fold(), inverse of expand(), move to UnicodeUtil, add filter [lucene]

2025-03-21 Thread via GitHub
msfroh commented on PR #14389: URL: https://github.com/apache/lucene/pull/14389#issuecomment-2744397587 Awesome! Can I go ahead and use this for https://github.com/apache/lucene/pull/14350 once it's merged? -- This is an automated message from the Apache Git Service. To respond to the mes

[PR] RegExp: add CASE_INSENSITIVE_RANGE support [lucene]

2025-03-21 Thread via GitHub
rmuir opened a new pull request, #14381: URL: https://github.com/apache/lucene/pull/14381 Add optional flag to support case-insensitive ranges. A minimal DFA is always created. This works with Unicode but may have a performance cost. Each codepoint in the range must be iterated, and a

Re: [PR] RegExp: add CASE_INSENSITIVE_RANGE support [lucene]

2025-03-21 Thread via GitHub
rmuir commented on code in PR #14381: URL: https://github.com/apache/lucene/pull/14381#discussion_r2007006500 ## lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java: ## @@ -778,6 +786,53 @@ private int[] toCaseInsensitiveChar(int codepoint) { } } + /** +

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-03-21 Thread via GitHub
alessandrobenedetti commented on code in PR #14173: URL: https://github.com/apache/lucene/pull/14173#discussion_r2007476642 ## lucene/core/src/java/org/apache/lucene/util/hnsw/UpdatableScoreHeap.java: ## Review Comment: For example, what are the benefits of this in comparis

Re: [PR] Adjust equivalent min similarity HNSW exploration logic [lucene]

2025-03-21 Thread via GitHub
benwtrent merged PR #14366: URL: https://github.com/apache/lucene/pull/14366

Re: [I] TestKnnGraph.testMultiThreadedSearch random test failure [lucene]

2025-03-21 Thread via GitHub
benwtrent closed issue #14327: TestKnnGraph.testMultiThreadedSearch random test failure URL: https://github.com/apache/lucene/issues/14327

[I] Reduce memory usage when merging bkd trees [lucene]

2025-03-21 Thread via GitHub
iverase opened a new issue, #14382: URL: https://github.com/apache/lucene/issues/14382 When building BKD trees, we hold two arrays in memory whose sizes grow linearly with the number of leaf nodes. One of the arrays contains the pointer to the start of a leaf node, and the other containing

Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]

2025-03-21 Thread via GitHub
benwtrent commented on code in PR #14094: URL: https://github.com/apache/lucene/pull/14094#discussion_r2007365351 ## lucene/core/src/java/org/apache/lucene/util/hnsw/OrdinalTranslatedKnnCollector.java: ## @@ -50,4 +51,11 @@ public TopDocs topDocs() { : TotalHits

Re: [PR] build: generate CaseFolding.java from "gradle regenerate" [lucene]

2025-03-21 Thread via GitHub
rmuir commented on PR #14384: URL: https://github.com/apache/lucene/pull/14384#issuecomment-2743662885 This one is pretty easy to understand, the `CaseFolding` class now just gives you `UnicodeSet(ch).closeOver(UnicodeSet.SIMPLE_CASE_INSENSITIVE)` without requiring that you have ICU.

Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-03-21 Thread via GitHub
gf2121 commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2007802395 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java: ## @@ -0,0 +1,552 @@ +/* + * Licensed to the Apache Software Foundation (ASF) unde

Re: [PR] build: generate CaseFolding.java from "gradle regenerate" [lucene]

2025-03-21 Thread via GitHub
rmuir commented on PR #14384: URL: https://github.com/apache/lucene/pull/14384#issuecomment-2743717079 It was easy because @uschindler already created a similar groovy script before.

Re: [I] Case insensitive regex query with character range [lucene]

2025-03-21 Thread via GitHub
rmuir closed issue #14378: Case insensitive regex query with character range URL: https://github.com/apache/lucene/issues/14378

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-21 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2744342290 Maybe this one helps the issue: https://github.com/apache/lucene/pull/14389

Re: [PR] Add CaseFolding.fold(), inverse of expand(), move to UnicodeUtil, add filter [lucene]

2025-03-21 Thread via GitHub
john-wagster commented on PR #14389: URL: https://github.com/apache/lucene/pull/14389#issuecomment-2744360276 This is great; it helps me progress some of the regex work in ES, which is why I started that CaseFolding work. Thanks for iterating on this @rmuir.

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-03-21 Thread via GitHub
alessandrobenedetti commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2743148001 Catching up on this and trying to understand how far we are now from my original idea and implementation: https://github.com/apache/lucene/pull/12314 Obviously, my c

Re: [I] Handling concurrent search in QueryProfiler [lucene]

2025-03-21 Thread via GitHub
jpountz commented on issue #14375: URL: https://github.com/apache/lucene/issues/14375#issuecomment-271819 Done!

Re: [I] Handle degenerate case where all HNSW search candidates are filtered [lucene]

2025-03-21 Thread via GitHub
benwtrent commented on issue #11787: URL: https://github.com/apache/lucene/issues/11787#issuecomment-2743830018 I think this has been fixed with all our HNSW filtering fixes: - we drop to brute force if we explore too much - we bypass the graph if the filter passes <= `k` docs

Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]

2025-03-21 Thread via GitHub
jpountz commented on PR #14365: URL: https://github.com/apache/lucene/pull/14365#issuecomment-273990 Hurray! - https://benchmarks.mikemccandless.com/TermDayOfYearSort.html - https://benchmarks.mikemccandless.com/TermDTSort.html

Re: [PR] build: generate CaseFolding.java from "gradle regenerate" [lucene]

2025-03-21 Thread via GitHub
rmuir commented on PR #14384: URL: https://github.com/apache/lucene/pull/14384#issuecomment-2743831323 There was something about gradle itself that was upset about dependencies wrt generation tasks, if I recall... cycle detection or something was complaining about it.

Re: [I] Handle degenerate case where all HNSW search candidates are filtered [lucene]

2025-03-21 Thread via GitHub
benwtrent closed issue #11787: Handle degenerate case where all HNSW search candidates are filtered URL: https://github.com/apache/lucene/issues/11787

Re: [PR] build: generate CaseFolding.java from "gradle regenerate" [lucene]

2025-03-21 Thread via GitHub
rmuir commented on PR #14384: URL: https://github.com/apache/lucene/pull/14384#issuecomment-2743736828 I will follow up with an ICU upgrade PR to this one. I don't expect that this file will change except for the version in the comment though.

[PR] Implement #docIDRunEnd() on PostingsEnum. [lucene]

2025-03-21 Thread via GitHub
jpountz opened a new pull request, #14390: URL: https://github.com/apache/lucene/pull/14390 This implements `BlockPostingsEnum#docIDRunEnd()` by comparing the delta between doc IDs and between doc counts on the various skip levels.

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-03-21 Thread via GitHub
vigyasharma commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2744562872 Thanks for looking into this PR @alessandrobenedetti , this is the latest iteration on multi-vector support. It does build on the same central idea of assigning a unique ordina

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-03-21 Thread via GitHub
vigyasharma commented on code in PR #14173: URL: https://github.com/apache/lucene/pull/14173#discussion_r2008411867 ## lucene/core/src/java/org/apache/lucene/util/hnsw/UpdatableScoreHeap.java: ## Review Comment: I'd like to keep the logic to update scores for already ingest

Re: [I] TestKnnGraph.testMultiThreadedSearch random test failure [lucene]

2025-03-21 Thread via GitHub
rmuir commented on issue #14327: URL: https://github.com/apache/lucene/issues/14327#issuecomment-2743546449 thank you @benwtrent

Re: [I] Address gradle temp file pollution insanity [lucene]

2025-03-21 Thread via GitHub
dweiss commented on issue #14385: URL: https://github.com/apache/lucene/issues/14385#issuecomment-2743927366 Ok, I've added gradle's "user home" tmp cleaning as well. Anything older than 3 hours is removed. This folder may be shared across builds so the time limit is there to prevent accide
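
The age-based cleanup described in this comment can be illustrated with a minimal sketch. This is not the build code from the PR (which lives in the Gradle build scripts and is not shown in this listing); the class name `StaleTempCleaner` and the method `deleteOlderThan` are hypothetical, and only JDK `java.nio.file` calls are used. It assumes a flat directory of regular files and treats anything last modified before the cutoff as stale.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical helper: removes regular files older than a given age from one directory. */
public class StaleTempCleaner {

  /**
   * Deletes regular files directly under {@code dir} whose last-modified time is
   * earlier than {@code now - maxAge}. Returns the number of files deleted.
   */
  static int deleteOlderThan(Path dir, Duration maxAge, Instant now) throws IOException {
    Instant cutoff = now.minus(maxAge);
    List<Path> stale = new ArrayList<>();

    // Collect candidates first so the directory stream is closed before deleting.
    try (Stream<Path> entries = Files.list(dir)) {
      for (Path p : (Iterable<Path>) entries::iterator) {
        if (Files.isRegularFile(p)
            && Files.getLastModifiedTime(p).toInstant().isBefore(cutoff)) {
          stale.add(p);
        }
      }
    }

    int deleted = 0;
    for (Path p : stale) {
      // deleteIfExists tolerates a concurrent build removing the file first.
      if (Files.deleteIfExists(p)) {
        deleted++;
      }
    }
    return deleted;
  }

  public static void main(String[] args) throws IOException {
    // Assumed location, taken from the PR title; adjust for a real setup.
    Path tmp = Path.of(System.getProperty("user.home"), ".gradle", ".tmp");
    if (Files.isDirectory(tmp)) {
      int n = deleteOlderThan(tmp, Duration.ofHours(3), Instant.now());
      System.out.println("Removed " + n + " stale file(s) from " + tmp);
    }
  }
}
```

The three-hour limit is passed in as a parameter because, as the comment notes, the folder may be shared across builds: a file younger than the limit may still be in use by another running build.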

[PR] bump antlr 4.11.1 -> 4.13.2 [lucene]

2025-03-21 Thread via GitHub
rmuir opened a new pull request, #14388: URL: https://github.com/apache/lucene/pull/14388 Dependency is outdated, the main changes to generated code avoid warnings in java21+ This one didn't magically work like ICU, I simply force-regenerated. I tried messing around with the gradle d

Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]

2025-03-21 Thread via GitHub
jpountz commented on PR #14365: URL: https://github.com/apache/lucene/pull/14365#issuecomment-2744460409 I pushed an annotation

Re: [PR] Clean up junk from gradle's user home (~/.gradle/.tmp). #14385 [lucene]

2025-03-21 Thread via GitHub
dweiss merged PR #14387: URL: https://github.com/apache/lucene/pull/14387

[I] Address gradle temp file pollution insanity [lucene]

2025-03-21 Thread via GitHub
dweiss opened a new issue, #14385: URL: https://github.com/apache/lucene/issues/14385 ### Description Gradle creates temp files it never cleans up. Until this is resolved, let's try to keep some housekeeping ourselves. Related issues: * #10215 * #10510 * https://githu

Re: [I] Address gradle temp file pollution insanity [lucene]

2025-03-21 Thread via GitHub
dweiss commented on issue #14385: URL: https://github.com/apache/lucene/issues/14385#issuecomment-2743998381 There are also *.log files to wipe clean.

Re: [PR] Clean up junk from gradle's user home (~/.gradle/.tmp). #14385 [lucene]

2025-03-21 Thread via GitHub
dweiss commented on PR #14387: URL: https://github.com/apache/lucene/pull/14387#issuecomment-2743985270 I'll merge this in. Low risk and we can always revert if needed.

Re: [I] Address gradle temp file pollution insanity [lucene]

2025-03-21 Thread via GitHub
dweiss closed issue #14385: Address gradle temp file pollution insanity URL: https://github.com/apache/lucene/issues/14385

Re: [I] gradle build leaks tons of gradle-worker-classpath* files in tmpdir [LUCENE-9175] [lucene]

2025-03-21 Thread via GitHub
dweiss closed issue #10215: gradle build leaks tons of gradle-worker-classpath* files in tmpdir [LUCENE-9175] URL: https://github.com/apache/lucene/issues/10215

Re: [PR] RegExp: add CASE_INSENSITIVE_RANGE support [lucene]

2025-03-21 Thread via GitHub
rmuir merged PR #14381: URL: https://github.com/apache/lucene/pull/14381

Re: [PR] bump antlr 4.11.1 -> 4.13.2 [lucene]

2025-03-21 Thread via GitHub
dweiss commented on code in PR #14388: URL: https://github.com/apache/lucene/pull/14388#discussion_r2008140343 ## lucene/expressions/src/generated/checksums/generateAntlr.json: ## @@ -1,7 +1,8 @@ { "lucene/expressions/src/java/org/apache/lucene/expressions/js/Javascript.g

Re: [PR] build: generate CaseFolding.java from "gradle regenerate" [lucene]

2025-03-21 Thread via GitHub
dweiss commented on PR #14384: URL: https://github.com/apache/lucene/pull/14384#issuecomment-2743758629 I mean the entire structure of tasks that are used in regenerate. It's complex. I remember I couldn't do it in any easier way before - maybe something has changed that would allow it to b

Re: [PR] RegExp: add CASE_INSENSITIVE_RANGE support [lucene]

2025-03-21 Thread via GitHub
rmuir commented on PR #14381: URL: https://github.com/apache/lucene/pull/14381#issuecomment-2743798573 after fixing the turkish here's the (correct) automaton for `/[a-z]/`: the only special cases are long-s and kelvin sign as you expect: ![graphviz (6)](https://github.com/user-attac
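
The claim about long-s and the Kelvin sign can be checked with a small JDK-only sketch. This does not use Lucene's `RegExp` or `CaseFolding` classes; it relies on `Character.toLowerCase(int)` and `Character.toUpperCase(int)`, whose simple mappings also pair U+0130 and U+0131 with ASCII `i`/`I`, so the sketch takes a flag to skip those two Turkish code points. The class and method names are hypothetical.

```java
import java.util.TreeSet;

/** Hypothetical helper: finds non-ASCII code points that case-map onto ASCII letters. */
public class AsciiCaseVariants {

  /**
   * Scans every code point above ASCII and collects those whose simple JDK
   * lowercase mapping lands in [a-z] or whose uppercase mapping lands in [A-Z].
   * When skipTurkish is true, U+0130 and U+0131 are excluded up front.
   */
  static TreeSet<Integer> nonAsciiVariantsOfAsciiLetters(boolean skipTurkish) {
    TreeSet<Integer> out = new TreeSet<>();
    for (int cp = 0x80; cp <= Character.MAX_CODE_POINT; cp++) {
      if (skipTurkish && (cp == 0x0130 || cp == 0x0131)) {
        continue; // dotted capital I and dotless small i
      }
      int lower = Character.toLowerCase(cp);
      int upper = Character.toUpperCase(cp);
      boolean hitsLower = lower >= 'a' && lower <= 'z';
      boolean hitsUpper = upper >= 'A' && upper <= 'Z';
      if (hitsLower || hitsUpper) {
        out.add(cp);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    for (int cp : nonAsciiVariantsOfAsciiLetters(true)) {
      System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
    }
  }
}
```

With the Turkish code points skipped, the result includes U+017F (LATIN SMALL LETTER LONG S, which uppercases to `S`) and U+212A (KELVIN SIGN, which lowercases to `k`), matching the two special cases named in the comment.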

Re: [I] Init HNSW merge with graph containing deleted documents [lucene]

2025-03-21 Thread via GitHub
benwtrent commented on issue #12533: URL: https://github.com/apache/lucene/issues/12533#issuecomment-2743826644 I think in addition to the recent merge improvements (https://github.com/apache/lucene/pull/14331), the ability to "fix up" the individual graphs that have deletions and THEN doin

Re: [PR] upgrade icu dependency from 74.2 -> 77.1 [lucene]

2025-03-21 Thread via GitHub
rmuir commented on PR #14386: URL: https://github.com/apache/lucene/pull/14386#issuecomment-2743954863 @dweiss I know you dislike the complexity, but the `gradlew regenerate` really saves a metric ton of human time and prevents mistakes for updates like these.

Re: [PR] build: generate CaseFolding.java from "gradle regenerate" [lucene]

2025-03-21 Thread via GitHub
rmuir merged PR #14384: URL: https://github.com/apache/lucene/pull/14384

Re: [I] Address gradle temp file pollution insanity [lucene]

2025-03-21 Thread via GitHub
dweiss commented on issue #14385: URL: https://github.com/apache/lucene/issues/14385#issuecomment-2743756362 It's this commit that moved the temp folder from java.io.tmpdir, which we redirected and cleaned up. https://github.com/gradle/gradle/commit/8c2f6b7db50ab071a289fb5c4cbb9b2125

Re: [PR] BlockJoinBulkScorer could check for parent deletions (not children) [lucene]

2025-03-21 Thread via GitHub
jimczi closed pull request #14067: BlockJoinBulkScorer could check for parent deletions (not children) URL: https://github.com/apache/lucene/pull/14067

Re: [PR] upgrade icu dependency from 74.2 -> 77.1 [lucene]

2025-03-21 Thread via GitHub
dweiss commented on PR #14386: URL: https://github.com/apache/lucene/pull/14386#issuecomment-2743977198 I know, I know. I don't think we should remove it - I just hope it can be implemented in a less hairy way.

Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]

2025-03-21 Thread via GitHub
tteofili commented on code in PR #14094: URL: https://github.com/apache/lucene/pull/14094#discussion_r2007923461 ## lucene/core/src/java/org/apache/lucene/search/HnswQueueSaturationCollector.java: ## @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

Re: [PR] Clean up junk from gradle's user home (~/.gradle/.tmp). #14385 [lucene]

2025-03-21 Thread via GitHub
rmuir commented on PR #14387: URL: https://github.com/apache/lucene/pull/14387#issuecomment-2743932179 `./gradlew -XX:UseDweissTempFileGC`

Re: [I] Handling concurrent search in QueryProfiler [lucene]

2025-03-21 Thread via GitHub
jainankitk commented on issue #14375: URL: https://github.com/apache/lucene/issues/14375#issuecomment-2744045551 @jpountz - Can you assign this issue to me? I don't have permissions to do that myself

Re: [PR] Optimize ParallelLeafReader to improve term vector fetching efficienc [lucene]

2025-03-21 Thread via GitHub
vigyasharma commented on code in PR #14373: URL: https://github.com/apache/lucene/pull/14373#discussion_r2008678599 ## lucene/core/src/java/org/apache/lucene/index/ParallelLeafReader.java: ## @@ -348,15 +348,24 @@ public void prefetch(int docID) throws IOException { @Over

Re: [PR] bump antlr 4.11.1 -> 4.13.2 [lucene]

2025-03-21 Thread via GitHub
dweiss commented on PR #14388: URL: https://github.com/apache/lucene/pull/14388#issuecomment-2744155828 > This one didn't magically work like ICU I've pushed a commit that should do the trick. ICU version wasn't in the inputs so the build didn't know it'd been updated.

Re: [PR] bump antlr 4.11.1 -> 4.13.2 [lucene]

2025-03-21 Thread via GitHub
dweiss commented on code in PR #14388: URL: https://github.com/apache/lucene/pull/14388#discussion_r2008129712 ## lucene/expressions/src/generated/checksums/generateAntlr.json: ## @@ -1,7 +1,13 @@ { + "../../../../../.gradle/caches/modules-2/files-2.1/com.ibm.icu/icu4j/72.1

Re: [PR] Add CaseFolding.fold(), inverse of expand(), move to UnicodeUtil, add filter [lucene]

2025-03-21 Thread via GitHub
rmuir commented on PR #14389: URL: https://github.com/apache/lucene/pull/14389#issuecomment-2744704384 I will straighten out the build, this one is kinda draftish as it needs more tests etc. just wanted to toss out the idea. If it is autogenerated we can easily maintain some cohesive