Re: [PR] Add Automata.makeCharSet/makeCharClass to optimize regexp [lucene]

2025-02-04 Thread via GitHub
rmuir commented on PR #14193: URL: https://github.com/apache/lucene/pull/14193#issuecomment-2635997545 My example for this one, if you have something like `[^a-gklM-O\s]`, with the case-insensitive flag maybe, it just calls the new `makeCharClass(int[],int[])` method and you get minimal aut

Re: [PR] Add Automata.makeCharSet/makeCharClass to optimize regexp [lucene]

2025-02-04 Thread via GitHub
rmuir commented on PR #14193: URL: https://github.com/apache/lucene/pull/14193#issuecomment-2635939281 Anyway, I think this is the right path: rather than fight with union(), let's just get it out of our way. With this change, union() is only used for the union operator (`|`) and not internally.

Re: [PR] Add Automata.makeCharSet/makeCharClass to optimize regexp [lucene]

2025-02-04 Thread via GitHub
rmuir commented on PR #14193: URL: https://github.com/apache/lucene/pull/14193#issuecomment-2635936648 That's error-prone that's broke trying to do some null analysis :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use

Re: [PR] Add Automata.makeCharSet/makeCharClass to optimize regexp [lucene]

2025-02-04 Thread via GitHub
rmuir commented on PR #14193: URL: https://github.com/apache/lucene/pull/14193#issuecomment-2635933607 I generalized this to `makeCharClass(int[],int[])`, added a "character class" node to use it instead of unioning many nodes, replaced the pre-built class functionality with it too.
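To illustrate why a dedicated character-class builder can avoid unioning many nodes, here is a minimal sketch. It assumes the two `int[]` arguments are parallel arrays of inclusive range starts and ends (the snippet above does not say so), and `CharClassSketch.mergeRanges` is a hypothetical helper, not Lucene's implementation. Once the ranges are sorted and merged, each one could become a single transition from the initial state to one accept state, so no union or determinization step would be needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch; not the code from the PR. */
final class CharClassSketch {

  /**
   * Merges inclusive code point ranges [starts[i], ends[i]] into a sorted list
   * of disjoint, non-adjacent ranges.
   */
  static List<int[]> mergeRanges(int[] starts, int[] ends) {
    if (starts.length != ends.length) {
      throw new IllegalArgumentException("starts and ends must have the same length");
    }
    int[][] pairs = new int[starts.length][];
    for (int i = 0; i < starts.length; i++) {
      if (starts[i] > ends[i]) {
        throw new IllegalArgumentException("range start exceeds end at index " + i);
      }
      pairs[i] = new int[] {starts[i], ends[i]};
    }
    // Sort by range start so overlapping or adjacent ranges become neighbours.
    Arrays.sort(pairs, Comparator.comparingInt(p -> p[0]));

    List<int[]> merged = new ArrayList<>();
    for (int[] p : pairs) {
      int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
      // Use long arithmetic so that last[1] + 1 cannot overflow.
      if (last != null && (long) p[0] <= (long) last[1] + 1) {
        last[1] = Math.max(last[1], p[1]); // extend the previous range
      } else {
        merged.add(p); // start a new disjoint range
      }
    }
    return merged;
  }
}
```

For the positive part of a class such as `[a-gklM-O]`, the four input ranges collapse to three: `M-O`, `a-g`, and `k-l`.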

[PR] Update package-info.java [lucene]

2025-02-04 Thread via GitHub
MaruHyl opened a new pull request, #14199: URL: https://github.com/apache/lucene/pull/14199 fix typo

Re: [PR] supports force merge based on specified segments. [lucene]

2025-02-04 Thread via GitHub
cheng66551 commented on PR #14163: URL: https://github.com/apache/lucene/pull/14163#issuecomment-2635560399 > I don't think we should merge this change, but it's good that you were able to use it to confirm that merging would reclaim these deleted docs. > > Can you add your data about

Re: [PR] supports force merge based on specified segments. [lucene]

2025-02-04 Thread via GitHub
cheng66551 commented on PR #14163: URL: https://github.com/apache/lucene/pull/14163#issuecomment-2635558626 > I don't think we should merge this change, but it's good that you were able to use it to confirm that merging would reclaim these deleted docs. > > Can you add your data about

Re: [PR] supports force merge based on specified segments. [lucene]

2025-02-04 Thread via GitHub
cheng66551 commented on PR #14163: URL: https://github.com/apache/lucene/pull/14163#issuecomment-2635548335 > I don't think we should merge this change, but it's good that you were able to use it to confirm that merging would reclaim these deleted docs. > > Can you add your data about

Re: [PR] supports force merge based on specified segments. [lucene]

2025-02-04 Thread via GitHub
cheng66551 commented on PR #14163: URL: https://github.com/apache/lucene/pull/14163#issuecomment-2635542291 > If you are able to turn on `InfoStream` for the ES shard that won't merge segments with so many deletions, and post a chunk here, I can have a look and see if there are clues.

Re: [PR] supports force merge based on specified segments. [lucene]

2025-02-04 Thread via GitHub
cheng66551 closed pull request #14163: supports force merge based on specified segments. URL: https://github.com/apache/lucene/pull/14163

Re: [PR] supports force merge based on specified segments. [lucene]

2025-02-04 Thread via GitHub
cheng66551 commented on PR #14163: URL: https://github.com/apache/lucene/pull/14163#issuecomment-2635519801 > It's terrible that `TieredMergePolicy` was not merging these segments, naturally or under `forceMerge` -- let's understand why it's failing to do so? It's like we need an `explain`

Re: [PR] supports force merge based on specified segments. [lucene]

2025-02-04 Thread via GitHub
cheng66551 commented on PR #14163: URL: https://github.com/apache/lucene/pull/14163#issuecomment-2635518397 > It's terrible that `TieredMergePolicy` was not merging these segments, naturally or under `forceMerge` -- let's understand why it's failing to do so? It's like we need an `explain`

Re: [PR] supports force merge based on specified segments. [lucene]

2025-02-04 Thread via GitHub
cheng66551 commented on PR #14163: URL: https://github.com/apache/lucene/pull/14163#issuecomment-2635516819 > It's terrible that `TieredMergePolicy` was not merging these segments, naturally or under `forceMerge` -- let's understand why it's failing to do so? It's like we need an `explain`

[PR] Migrate OpenNLP 'ant train-test-models' to Gradle [lucene]

2025-02-04 Thread via GitHub
msfroh opened a new pull request, #14198: URL: https://github.com/apache/lucene/pull/14198 ### Description This resurrects the OpenNLP model training task from Ant (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.2/lucene/analysis/opennlp/build.xml#L52-L84) to Gr

Re: [PR] Fix acceptOrds in EmptyOffHeapVectorValues to match no bits [lucene]

2025-02-04 Thread via GitHub
github-actions[bot] commented on PR #14119: URL: https://github.com/apache/lucene/pull/14119#issuecomment-2635430815 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributi

Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

2025-02-04 Thread via GitHub
rmuir commented on code in PR #14192: URL: https://github.com/apache/lucene/pull/14192#discussion_r1942013537 ## lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java: ## @@ -696,17 +896,52 @@ private Automaton toAutomaton( return a; } - private Automaton

Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

2025-02-04 Thread via GitHub
rmuir commented on code in PR #14192: URL: https://github.com/apache/lucene/pull/14192#discussion_r1942009773 ## lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java: ## @@ -436,6 +478,160 @@ public enum Kind { */ @Deprecated public static final int DEPRECATE

Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

2025-02-04 Thread via GitHub
john-wagster commented on code in PR #14192: URL: https://github.com/apache/lucene/pull/14192#discussion_r1941963110 ## lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java: ## @@ -696,17 +896,52 @@ private Automaton toAutomaton( return a; } - private Aut

Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

2025-02-04 Thread via GitHub
john-wagster commented on code in PR #14192: URL: https://github.com/apache/lucene/pull/14192#discussion_r1941960613 ## lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java: ## @@ -436,6 +478,160 @@ public enum Kind { */ @Deprecated public static final int DE

[PR] Correct bug with seeded vector queries with incorrect entrypoint ids [lucene]

2025-02-04 Thread via GitHub
benwtrent opened a new pull request, #14197: URL: https://github.com/apache/lucene/pull/14197 The tests caught a bug! Good thing! The code wasn't taking into account the underlying leaf context doc base when creating the top doc iterator for a segment. No changes entry as it's a b
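The kind of doc base adjustment described here can be sketched as follows. This is a hedged illustration only: it assumes the seed doc IDs are index-wide and must be converted to segment-local IDs before a per-segment iterator is built, and `DocBaseSketch.toSegmentLocal` is a hypothetical name rather than code from the PR.

```java
import java.util.Arrays;

/** Hypothetical sketch; not the fix from the PR. */
final class DocBaseSketch {

  /**
   * Keeps only the index-wide doc IDs that fall inside the segment
   * [docBase, docBase + maxDoc) and rebases them to segment-local IDs.
   */
  static int[] toSegmentLocal(int[] globalDocIds, int docBase, int maxDoc) {
    return Arrays.stream(globalDocIds)
        .filter(doc -> doc >= docBase && doc < docBase + maxDoc)
        .map(doc -> doc - docBase) // forgetting this subtraction yields wrong entry points
        .toArray();
  }
}
```

With `docBase = 10` and `maxDoc = 10`, the global IDs `{3, 12, 17, 25}` map to the local IDs `{2, 7}`; the other two belong to different segments.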

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-02-04 Thread via GitHub
kaivalnp commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2635120100 Build failure seems unrelated, created #14196

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-02-04 Thread via GitHub
kaivalnp commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2635103176 ### Some more points / thoughts - Built for Faiss `v1.10.0` (version is validated at runtime) - Can be compiled with lower versions of Java, and run with 22+ (using an MR-JAR) -

[I] TestSeededKnnFloatVectorQuery.testSeedWithTimeout fails reproducibly [lucene]

2025-02-04 Thread via GitHub
kaivalnp opened a new issue, #14196: URL: https://github.com/apache/lucene/issues/14196 ### Description [`TestSeededKnnFloatVectorQuery.testSeedWithTimeout`](https://github.com/apache/lucene/blob/e4321619bba8669e93311ffb9456fa043d519b21/lucene/core/src/test/org/apache/lucene/search/Te

[I] TestSeededKnn[Byte|Float]VectorQuery.testWithTimeout failure [lucene]

2025-02-04 Thread via GitHub
benwtrent opened a new issue, #14195: URL: https://github.com/apache/lucene/issues/14195 ### Description java.lang.IllegalArgumentException: The number of entry points provided is less than the number of entry points requested ``` java.lang.IllegalArgumentException: The numb

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-02-04 Thread via GitHub
kaivalnp commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1941904975 ## lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java: ## @@ -0,0 +1,268 @@ +/* + * Licensed to the Apache Software Foundation (ASF) unde

Re: [PR] Add posTagFormat parameter for OpenNLPPOSFilter [lucene]

2025-02-04 Thread via GitHub
msfroh commented on PR #14194: URL: https://github.com/apache/lucene/pull/14194#issuecomment-2635079150 Test failure: ``` Reproduce with: gradlew :lucene:core:test --tests "org.apache.lucene.search.TestSeededKnnByteVectorQuery.testSeedWithTimeout" -Ptests.jvms=1 -Ptests.jvmargs= -

[PR] Add posTagFormat parameter for OpenNLPPOSFilter [lucene]

2025-02-04 Thread via GitHub
msfroh opened a new pull request, #14194: URL: https://github.com/apache/lucene/pull/14194 ### Description This allows users to use either a Penn or UD part-of-speech tagging model, but output tags in the other format. This allows users to combine a Penn POS tagging model with a lemm

Re: [PR] Consistent KNN query results with multiple leafs [lucene]

2025-02-04 Thread via GitHub
tteofili commented on PR #14191: URL: https://github.com/apache/lucene/pull/14191#issuecomment-2634636630 I'm going to try a more promising way of slicing segments to threads

Re: [PR] Consistent KNN query results with multiple leafs [lucene]

2025-02-04 Thread via GitHub
tteofili commented on PR #14191: URL: https://github.com/apache/lucene/pull/14191#issuecomment-2634633824 my previous luceneutil runs were useless, now with changes in luceneutil (`NoMergePolicy` and no force merge on index side, using the `ExecutorService` on the search side), I get far di

Re: [I] Add easier segment tracing / verbosity / transparency to `IndexWriter` [lucene]

2025-02-04 Thread via GitHub
rmuir commented on issue #14182: URL: https://github.com/apache/lucene/issues/14182#issuecomment-2634632482 @mikemccand rather than mess with info stream logging, could we consider adding some counters to `IndexWriter` to give visibility? E.g. if you have a flush count with a simple int getter,
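A counter of the sort suggested here might look like the sketch below. Everything in it is hypothetical: `WriterCountersSketch`, `onFlush`, and `getFlushCount` are illustrative names, not existing `IndexWriter` API.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of a flush counter; not part of IndexWriter. */
final class WriterCountersSketch {
  // Atomic so that concurrent indexing threads can bump it without locking.
  private final AtomicInteger flushCount = new AtomicInteger();

  /** Would be called once per completed flush. */
  void onFlush() {
    flushCount.incrementAndGet();
  }

  /** Simple int getter that monitoring code could poll. */
  int getFlushCount() {
    return flushCount.get();
  }
}
```

A caller could poll the getter periodically and report the difference between readings, which needs no log parsing.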

Re: [PR] Consistent KNN query results with multiple leafs [lucene]

2025-02-04 Thread via GitHub
tteofili commented on PR #14191: URL: https://github.com/apache/lucene/pull/14191#issuecomment-2634489270 I've adjusted `AbstractKnnVectorQuery` to pick the largest `LeafReaderContext` (largest `#reader().numDocs()`) for the first search, this introduces an additive O(|leafReaderContexts|)
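Picking the largest leaf is a single linear scan, which is where the additive O(|leafReaderContexts|) cost comes from. A minimal sketch, assuming the per-leaf `numDocs()` values have already been collected into an array (`LargestLeafSketch` is a hypothetical name, not code from the PR):

```java
/** Hypothetical sketch; not the code from the PR. */
final class LargestLeafSketch {

  /**
   * Returns the index of the leaf with the most documents, or -1 if there are
   * no leaves. Ties go to the earliest leaf.
   */
  static int indexOfLargest(int[] numDocsPerLeaf) {
    int best = -1;
    for (int i = 0; i < numDocsPerLeaf.length; i++) {
      if (best == -1 || numDocsPerLeaf[i] > numDocsPerLeaf[best]) {
        best = i;
      }
    }
    return best;
  }
}
```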

Re: [I] Add easier segment tracing / verbosity / transparency to `IndexWriter` [lucene]

2025-02-04 Thread via GitHub
mikemccand closed issue #14182: Add easier segment tracing / verbosity / transparency to `IndexWriter` URL: https://github.com/apache/lucene/issues/14182

Re: [PR] Add new Acorn-esque filtered HNSW search heuristic [lucene]

2025-02-04 Thread via GitHub
benchaplin commented on PR #14160: URL: https://github.com/apache/lucene/pull/14160#issuecomment-2634270282 @benwtrent Yep, everything's in the PR. I ran on 1M docs, 100 queries to keep the benchmark under an hour.

Re: [PR] Add new Acorn-esque filtered HNSW search heuristic [lucene]

2025-02-04 Thread via GitHub
benwtrent commented on PR #14160: URL: https://github.com/apache/lucene/pull/14160#issuecomment-2634245107 @benchaplin I found another bug. The recall numbers were indeed way too good to be true. I was returning duplicate documents 🤦 . So, recall was great because we contained a valid docum

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-02-04 Thread via GitHub
benwtrent commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2634182175 > Java limits the size of arrays (and lists) to 'int max' and does not allow 'long' array indices. These will need to be changed to use a different data structure. Yeah, I don't
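One common way around the `int` array limit is to split storage into fixed-size pages and address them with a `long` index. The sketch below shows that idea only; `PagedFloats` is a hypothetical class, and nothing here is meant to suggest which data structure the PR will adopt.

```java
/** Hypothetical sketch of long-indexed float storage backed by pages. */
final class PagedFloats {
  private static final int PAGE_SHIFT = 16;
  private static final int PAGE_SIZE = 1 << PAGE_SHIFT; // 65,536 floats per page
  private static final int PAGE_MASK = PAGE_SIZE - 1;

  private final float[][] pages;
  private final long size;

  PagedFloats(long size) {
    if (size < 0) {
      throw new IllegalArgumentException("size must be non-negative");
    }
    this.size = size;
    // Round up to whole pages; toIntExact fails loudly if even the page count overflows.
    int pageCount = Math.toIntExact((size + PAGE_SIZE - 1) >>> PAGE_SHIFT);
    pages = new float[pageCount][];
    for (int i = 0; i < pageCount; i++) {
      long remaining = size - ((long) i << PAGE_SHIFT);
      pages[i] = new float[(int) Math.min(PAGE_SIZE, remaining)];
    }
  }

  float get(long index) {
    checkIndex(index);
    // High bits select the page, low bits select the slot within it.
    return pages[(int) (index >>> PAGE_SHIFT)][(int) (index & PAGE_MASK)];
  }

  void set(long index, float value) {
    checkIndex(index);
    pages[(int) (index >>> PAGE_SHIFT)][(int) (index & PAGE_MASK)] = value;
  }

  long size() {
    return size;
  }

  private void checkIndex(long index) {
    if (index < 0 || index >= size) {
      throw new IndexOutOfBoundsException("index " + index + " out of bounds for size " + size);
    }
  }
}
```

The page size is a power of two so that the page and slot can be found with a shift and a mask instead of division.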

Re: [I] Add easier segment tracing / verbosity / transparency to `IndexWriter` [lucene]

2025-02-04 Thread via GitHub
mikemccand commented on issue #14182: URL: https://github.com/apache/lucene/issues/14182#issuecomment-2634071572 I have been tinkering with fun little Python tools in [luceneutil](https://github.com/mikemccand/luceneutil) to 1) [parse a full `InfoStream` log](https://github.com/mikemccand/

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-02-04 Thread via GitHub
benwtrent commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1941120908 ## lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java: ## @@ -0,0 +1,268 @@ +/* + * Licensed to the Apache Software Foundation (ASF) und

Re: [PR] Add Automata.makeCharSet(int[]) to optimize caseless matching. [lucene]

2025-02-04 Thread via GitHub
rmuir commented on PR #14193: URL: https://github.com/apache/lucene/pull/14193#issuecomment-2633741911 I considered it but then didn't use any varargs after Dawid's email about compiler performance problems coming from them.

Re: [PR] Introduce bpv24 vectorized decoding for DocIdsWriter [lucene]

2025-02-04 Thread via GitHub
jpountz commented on code in PR #14176: URL: https://github.com/apache/lucene/pull/14176#discussion_r1940875523 ## lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java: ## @@ -115,30 +117,24 @@ void writeDocIds(int[] docIds, int start, int count, DataOutput out) th

Re: [PR] Integrating GPU based Vector Search using cuVS [lucene]

2025-02-04 Thread via GitHub
ChrisHegarty commented on PR #14131: URL: https://github.com/apache/lucene/pull/14131#issuecomment-2633256419 I've made the cuvs-java api Java 21 friendly, with an spi and a java-22 specific impl in the versioned section of an mrjar - MemorySegment and Arena have been removed from the api,

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-02-04 Thread via GitHub
kaivalnp commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1940738982 ## lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java: ## @@ -0,0 +1,268 @@ +/* + * Licensed to the Apache Software Foundation (ASF) unde