Re: [PR] Add new Acorn-esque filtered HNSW search heuristic [lucene]

2025-02-03 Thread via GitHub
benchaplin commented on PR #14160: URL: https://github.com/apache/lucene/pull/14160#issuecomment-2632762867 Baseline: ``` recall latency (ms) nDoc topK fanout maxConn beamWidth visited selectivity correlation filterType 1.000 9.020 100 100 100

Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

2025-02-03 Thread via GitHub
rmuir commented on code in PR #14192: URL: https://github.com/apache/lucene/pull/14192#discussion_r1940403543 ## lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java: ## @@ -436,6 +478,160 @@ public enum Kind { */ @Deprecated public static final int DEPRECATE

[PR] Add Automata.makeCharSet(int[]) to optimize caseless matching. [lucene]

2025-02-03 Thread via GitHub
rmuir opened a new pull request, #14193: URL: https://github.com/apache/lucene/pull/14193 Previously caseless matching was implemented via code such as this: ```java Operations.union(Automata.makeChar('x'), Automata.makeChar('X')) ``` Proposed unicode caseless matching (

Re: [PR] Add Automata.makeCharSet(int[]) to optimize caseless matching. [lucene]

2025-02-03 Thread via GitHub
rmuir commented on PR #14193: URL: https://github.com/apache/lucene/pull/14193#issuecomment-2632590044 also `makeCharUnion()` comes to mind as a compelling alternative name, since there is already a `makeStringUnion()`. Naming is hard. just want to get the idea out there, since caseless reg

Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

2025-02-03 Thread via GitHub
rmuir commented on code in PR #14192: URL: https://github.com/apache/lucene/pull/14192#discussion_r1940371117 ## lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java: ## @@ -696,17 +896,52 @@ private Automaton toAutomaton( return a; } - private Automaton

Re: [PR] Remove usage of IndexSearcher#search(Query, Collector) from join package [lucene]

2025-02-03 Thread via GitHub
github-actions[bot] commented on PR #13747: URL: https://github.com/apache/lucene/pull/13747#issuecomment-2632478843 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributi

Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

2025-02-03 Thread via GitHub
rmuir commented on code in PR #14192: URL: https://github.com/apache/lucene/pull/14192#discussion_r1940264096 ## lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java: ## @@ -436,6 +478,160 @@ public enum Kind { */ @Deprecated public static final int DEPRECATE

Re: [PR] Add new Directory implementation for AWS S3 [lucene]

2025-02-03 Thread via GitHub
dsmiley commented on PR #13949: URL: https://github.com/apache/lucene/pull/13949#issuecomment-2632403618 By the way, the Apache Solr project has an impressive "[BlockCache](https://github.com/apache/solr/blob/07943e87fb762b69a66932f777d56eb14cc72e78/solr/modules/hdfs/src/java/org/apache/solr

Re: [PR] Add new Directory implementation for AWS S3 [lucene]

2025-02-03 Thread via GitHub
dsmiley commented on PR #13949: URL: https://github.com/apache/lucene/pull/13949#issuecomment-2632388607 Couldn't S3 and other file storage be implemented as an NIO FileSystem instead? AKA JSR-203. Would the Lucene Directory abstraction level have certain advantages (what)? Ideally we'd

Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

2025-02-03 Thread via GitHub
john-wagster commented on code in PR #14192: URL: https://github.com/apache/lucene/pull/14192#discussion_r1940026321 ## lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java: ## @@ -436,6 +478,160 @@ public enum Kind { */ @Deprecated public static final int DE

Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

2025-02-03 Thread via GitHub
john-wagster commented on code in PR #14192: URL: https://github.com/apache/lucene/pull/14192#discussion_r1940023844 ## lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java: ## @@ -424,6 +426,46 @@ public enum Kind { /** Allows case insensitive matching of ASCII

Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

2025-02-03 Thread via GitHub
rmuir commented on code in PR #14192: URL: https://github.com/apache/lucene/pull/14192#discussion_r1939974637 ## lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java: ## @@ -436,6 +478,160 @@ public enum Kind { */ @Deprecated public static final int DEPRECATE

Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

2025-02-03 Thread via GitHub
rmuir commented on code in PR #14192: URL: https://github.com/apache/lucene/pull/14192#discussion_r1939944485 ## lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java: ## @@ -424,6 +426,46 @@ public enum Kind { /** Allows case insensitive matching of ASCII charact

Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

2025-02-03 Thread via GitHub
rmuir commented on code in PR #14192: URL: https://github.com/apache/lucene/pull/14192#discussion_r1939946041 ## lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java: ## @@ -436,6 +478,160 @@ public enum Kind { */ @Deprecated public static final int DEPRECATE

Re: [PR] Add UnwrappingReuseStrategy for AnalyzerWrapper [lucene]

2025-02-03 Thread via GitHub
benwtrent commented on code in PR #14154: URL: https://github.com/apache/lucene/pull/14154#discussion_r1939910069 ## lucene/core/src/java/org/apache/lucene/analysis/AnalyzerWrapper.java: ## @@ -151,4 +157,78 @@ protected final Reader initReaderForNormalization(String fieldName,

Re: [PR] Add new Acorn-esque filtered HNSW search heuristic [lucene]

2025-02-03 Thread via GitHub
benwtrent commented on PR #14160: URL: https://github.com/apache/lucene/pull/14160#issuecomment-2631840332 > I think this 'correlation' is important to test as I imagine many real world filters involve some correlation, rather than the random filters we get in luceneutil benchmarks.

Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

2025-02-03 Thread via GitHub
john-wagster commented on code in PR #14192: URL: https://github.com/apache/lucene/pull/14192#discussion_r1939889352 ## lucene/core/src/test/org/apache/lucene/util/automaton/TestRegExp.java: ## @@ -35,6 +43,320 @@ public void testSmoke() { assertFalse(run.run("ad")); }

[PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

2025-02-03 Thread via GitHub
john-wagster opened a new pull request, #14192: URL: https://github.com/apache/lucene/pull/14192 About four years ago ASCII-only case insensitive matching (https://github.com/apache/lucene-solr/pull/1541) was added to Lucene. In the past couple of years a couple of requests have been mad

Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

2025-02-03 Thread via GitHub
john-wagster commented on PR #14192: URL: https://github.com/apache/lucene/pull/14192#issuecomment-2631833927 @jpountz, @jimczi, @mayya-sharipova y'all may be interested in this PR, so just tagging you here. -- This is an automated message from the Apache Git S

Re: [PR] Consistent KNN query results with multiple leafs [lucene]

2025-02-03 Thread via GitHub
mayya-sharipova commented on code in PR #14191: URL: https://github.com/apache/lucene/pull/14191#discussion_r1939834681 ## lucene/core/src/java/org/apache/lucene/search/knn/MultiLeafKnnCollector.java: ## @@ -89,6 +91,24 @@ public MultiLeafKnnCollector( this.nonCompetitiveQu

Re: [PR] Wrap Executor in TaskExecutor to never reject [lucene]

2025-02-03 Thread via GitHub
original-brownbear commented on PR #13622: URL: https://github.com/apache/lucene/pull/13622#issuecomment-2631740301 I agree @dsmiley , I actually did continue to work on this on the ES side lately in https://github.com/elastic/elasticsearch/pull/120024. What I did there was introduce log

Re: [PR] Wrap Executor in TaskExecutor to never reject [lucene]

2025-02-03 Thread via GitHub
dsmiley commented on PR #13622: URL: https://github.com/apache/lucene/pull/13622#issuecomment-2631616345 Looking back at this, might it have been better to instead wrap `TaskExecutor.invokeAll`'s call of `executor.execute` in a loop to catch `RejectedExecutionException` and then don't both
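
For readers skimming this thread, the idea in the PR title (an `Executor` that never rejects) can be sketched with the JDK alone. This is a minimal illustration, not Lucene's `TaskExecutor` code: the class name `NeverRejectingExecutor` is hypothetical, and running the rejected task on the calling thread is an assumed fallback, since the comment above is truncated before it states what should happen after the exception is caught.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.RejectedExecutionException;

/** Hypothetical wrapper for illustration only; not part of Lucene. */
final class NeverRejectingExecutor implements Executor {
  private final Executor delegate;

  NeverRejectingExecutor(Executor delegate) {
    this.delegate = delegate;
  }

  @Override
  public void execute(Runnable task) {
    try {
      // Normal path: hand the task to the wrapped executor.
      delegate.execute(task);
    } catch (RejectedExecutionException e) {
      // Assumed fallback: run the task on the calling thread instead of failing.
      task.run();
    }
  }
}
```

With this shape, callers never see `RejectedExecutionException`; the trade-off is that a saturated pool pushes work back onto the submitting thread.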

Re: [PR] Bump floor segment size to 16MB. [lucene]

2025-02-03 Thread via GitHub
jpountz commented on PR #14189: URL: https://github.com/apache/lucene/pull/14189#issuecomment-2631518128 For reference, this is roughly a 10x increase of the floor segment size, so given that `TieredMergePolicy` defaults to 10 segments per tier, indexes should have about 10 fewer segments a

Re: [PR] Bump floor segment size to 16MB. [lucene]

2025-02-03 Thread via GitHub
jpountz commented on PR #14189: URL: https://github.com/apache/lucene/pull/14189#issuecomment-2631514489 Thanks for the feedback, I was hesitating. Let's pull this in 10.2 then.

Re: [PR] Consistent KNN query results with multiple leafs [lucene]

2025-02-03 Thread via GitHub
tteofili commented on PR #14191: URL: https://github.com/apache/lucene/pull/14191#issuecomment-2631453289 preliminary tests with _luceneutil_ on Cohere-768. **with force-merge=true** _baseline_ ``` recall latency (ms) nDoc topK fanout maxConn beamWidth quantized vis

[PR] Consistent KNN query results with multiple leafs [lucene]

2025-02-03 Thread via GitHub
tteofili opened a new pull request, #14191: URL: https://github.com/apache/lucene/pull/14191 This is a first attempt at fixing https://github.com/apache/lucene/issues/14180. It's based on @jpountz idea mentioned [here](https://github.com/apache/lucene/pull/14167#issuecomment-2616408185).

Re: [PR] Adjusts Seeded knn searches to clean up user and internal interfaces [lucene]

2025-02-03 Thread via GitHub
benwtrent merged PR #14170: URL: https://github.com/apache/lucene/pull/14170

Re: [PR] Disable the query cache by default. [lucene]

2025-02-03 Thread via GitHub
mikemccand commented on code in PR #14187: URL: https://github.com/apache/lucene/pull/14187#discussion_r1939310871 ## lucene/CHANGES.txt: ## @@ -30,6 +30,10 @@ Bug Fixes * GITHUB#14075: Remove duplicate and add missing entry on brazilian portuguese stopwords list. (Arthur Ca

Re: [I] Create a bot to add milestones to new PRs [lucene]

2025-02-03 Thread via GitHub
mikemccand commented on issue #14190: URL: https://github.com/apache/lucene/issues/14190#issuecomment-2630817713 It might in theory check the `lucene/CHANGES.txt` to look for an entry (with the PR/issue number) summarizing the PR? Then it could see which Lucene release the issue is under.

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-02-03 Thread via GitHub
kaivalnp commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1935407529 ## lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java: ## @@ -0,0 +1,268 @@ +/* + * Licensed to the Apache Software Foundation (ASF) unde