Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-12 Thread via GitHub
msfroh commented on code in PR #14350: URL: https://github.com/apache/lucene/pull/14350#discussion_r1992783083 ## lucene/core/src/java/org/apache/lucene/util/automaton/StringsToAutomaton.java: ## @@ -209,7 +209,25 @@ private static int convert( int i = 0; int[] labels

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-12 Thread via GitHub
rmuir commented on code in PR #14350: URL: https://github.com/apache/lucene/pull/14350#discussion_r1992524713 ## lucene/core/src/java/org/apache/lucene/util/automaton/StringsToAutomaton.java: ## @@ -209,7 +209,25 @@ private static int convert( int i = 0; int[] labels =

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-12 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2719555494 OG paper: https://aclanthology.org/J00-1002.pdf -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-12 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2719541789 To me it seems like a potentially safe and practical addition. The idea would be that we can add transition "alternatives" (e.g. `A` vs `a`) and it doesn't break the high-level algorithm, due to
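
For readers skimming the archive, the "transition alternatives" idea can be sketched with a plain trie. This is only an illustration under stated assumptions: `CaseInsensitiveTrie` is a hypothetical class, not Lucene's `StringsToAutomaton`; it performs no minimization, and it adds alternatives for ASCII letters only (any other code point must match exactly).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; this is not Lucene's StringsToAutomaton. */
final class CaseInsensitiveTrie {
  private static final class Node {
    final Map<Integer, Node> next = new HashMap<>();
    boolean accept;
  }

  private final Node root = new Node();

  /** Builds a trie that accepts every term in any ASCII upper/lower-case mix. */
  static CaseInsensitiveTrie of(List<String> terms) {
    CaseInsensitiveTrie trie = new CaseInsensitiveTrie();
    for (String term : terms) {
      trie.add(term);
    }
    return trie;
  }

  /** Returns the other-case form of an ASCII letter, or cp itself otherwise. */
  private static int flipAsciiCase(int cp) {
    if (cp >= 'a' && cp <= 'z') return cp - 32;
    if (cp >= 'A' && cp <= 'Z') return cp + 32;
    return cp;
  }

  private void add(String term) {
    Node node = root;
    for (int i = 0; i < term.length(); ) {
      int cp = term.codePointAt(i);
      i += Character.charCount(cp);
      Node target = node.next.get(cp);
      if (target == null) {
        target = new Node();
      }
      // Both case variants are always registered together and lead to the
      // same target, so the structure stays deterministic.
      node.next.put(cp, target);
      node.next.put(flipAsciiCase(cp), target);
      node = target;
    }
    node.accept = true;
  }

  /** Exact walk over the candidate; the alternatives already live in the trie. */
  boolean matches(String candidate) {
    Node node = root;
    for (int i = 0; i < candidate.length(); ) {
      int cp = candidate.codePointAt(i);
      i += Character.charCount(cp);
      node = node.next.get(cp);
      if (node == null) {
        return false;
      }
    }
    return node.accept;
  }
}
```

With terms `Foo` and `bar`, `matches("fOO")` and `matches("BAR")` hold while `matches("fo")` does not. Restricting the alternatives to ASCII avoids code points whose upper- and lower-case forms do not pair up one-to-one, which a deterministic structure like this one cannot represent without extra work.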

Re: [PR] Case-insensitive TermInSetQuery Implementation (Proof of Concept) [lucene]

2025-03-12 Thread via GitHub
rmuir commented on PR #14349: URL: https://github.com/apache/lucene/pull/14349#issuecomment-2719519004 In my head, that's what we need. There is a crazy difference in construction and execution time between a "native" union and using the efficient linear-time algorithm, as opposed to going

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-12 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2719524190 @dweiss understands this one the best, he implemented it.

Re: [PR] Disable the query cache by default. [lucene]

2025-03-12 Thread via GitHub
msfroh commented on code in PR #14187: URL: https://github.com/apache/lucene/pull/14187#discussion_r1992514802 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -77,7 +77,8 @@ public class IndexSearcher { static int maxClauseCount = 1024; - privat

Re: [PR] Case-insensitive TermInSetQuery Implementation (Proof of Concept) [lucene]

2025-03-12 Thread via GitHub
msfroh commented on PR #14349: URL: https://github.com/apache/lucene/pull/14349#issuecomment-2719491949 @willdickerson -- I took a stab at modifying `StringsToAutomaton`, to support case-insensitive matching: https://github.com/apache/lucene/pull/14350

[PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-12 Thread via GitHub
msfroh opened a new pull request, #14350: URL: https://github.com/apache/lucene/pull/14350 ### Description This is a rough attempt to make `StringsToAutomaton` support case-insensitive strings.

Re: [PR] Add support for querying multiple fields to QueryBuilder. [lucene]

2025-03-12 Thread via GitHub
github-actions[bot] commented on PR #14262: URL: https://github.com/apache/lucene/pull/14262#issuecomment-2719429972 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributi

Re: [PR] Disable the query cache by default. [lucene]

2025-03-12 Thread via GitHub
github-actions[bot] commented on PR #14187: URL: https://github.com/apache/lucene/pull/14187#issuecomment-2719430102 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributi

Re: [PR] Case-insensitive TermInSetQuery Implementation (Proof of Concept) [lucene]

2025-03-12 Thread via GitHub
msfroh commented on code in PR #14349: URL: https://github.com/apache/lucene/pull/14349#discussion_r1992372705 ## lucene/core/src/test/org/apache/lucene/search/TestCaseInsensitiveTermInSetQuery.java: ## @@ -0,0 +1,377 @@ +/* + * Licensed to the Apache Software Foundation (ASF) u

Re: [PR] Case-insensitive TermInSetQuery Implementation (Proof of Concept) [lucene]

2025-03-12 Thread via GitHub
msfroh commented on code in PR #14349: URL: https://github.com/apache/lucene/pull/14349#discussion_r1992424195 ## lucene/core/src/java/org/apache/lucene/search/CaseInsensitiveTermInSetQuery.java: ## @@ -81,58 +89,95 @@ public void visit(QueryVisitor visitor) { visitor.con

Re: [PR] Case-insensitive TermInSetQuery Implementation (Proof of Concept) [lucene]

2025-03-12 Thread via GitHub
willdickerson commented on code in PR #14349: URL: https://github.com/apache/lucene/pull/14349#discussion_r1992405437 ## lucene/core/src/test/org/apache/lucene/search/TestCaseInsensitiveTermInSetQuery.java: ## @@ -0,0 +1,377 @@ +/* + * Licensed to the Apache Software Foundation

Re: [PR] Case-insensitive TermInSetQuery Implementation (Proof of Concept) [lucene]

2025-03-12 Thread via GitHub
rmuir commented on code in PR #14349: URL: https://github.com/apache/lucene/pull/14349#discussion_r1992333636 ## lucene/core/src/java/org/apache/lucene/search/CaseInsensitiveTermInSetQuery.java: ## @@ -0,0 +1,185 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

Re: [PR] Case-insensitive TermInSetQuery Implementation (Proof of Concept) [lucene]

2025-03-12 Thread via GitHub
rmuir commented on code in PR #14349: URL: https://github.com/apache/lucene/pull/14349#discussion_r1992332504 ## lucene/core/src/java/org/apache/lucene/search/CaseInsensitiveTermInSetQuery.java: ## @@ -0,0 +1,185 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

Re: [I] [DISCUSS] Could we have a different ANN algorithm for Learned Sparse Vectors? [lucene]

2025-03-12 Thread via GitHub
atris commented on issue #13675: URL: https://github.com/apache/lucene/issues/13675#issuecomment-2719042595 Yes, been playing with some stuff. Should be able to get something related up for review soon

[PR] Case-insensitive TermInSetQuery Implementation (Proof of Concept) [lucene]

2025-03-12 Thread via GitHub
willdickerson opened a new pull request, #14349: URL: https://github.com/apache/lucene/pull/14349 ## Overview This PR introduces a proof of concept for a case-insensitive variant of TermInSetQuery. The implementation provides an efficient way to search for terms regardless of case with

Re: [I] Opening of vector files with ReadAdvice.RANDOM_PRELOAD [lucene]

2025-03-12 Thread via GitHub
viliam-durina commented on issue #14348: URL: https://github.com/apache/lucene/issues/14348#issuecomment-2718850463 I have the necessary change ready in my fork of Lucene, and it works for us. I wanted input from maintainers whether they think this is a good idea in general for Lucene.

Re: [I] Incorrect use of fsync [lucene]

2025-03-12 Thread via GitHub
rmuir commented on issue #14334: URL: https://github.com/apache/lucene/issues/14334#issuecomment-2717869141 I also want to point out here, that current usage is not "incorrect". The idea that there is a "correct" way that will always work is 100% broken. look at what fsync() does on m

Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]

2025-03-12 Thread via GitHub
svilen-mihaylov-elastic commented on code in PR #14094: URL: https://github.com/apache/lucene/pull/14094#discussion_r1991707917 ## lucene/core/src/test/org/apache/lucene/search/TestPatienceFloatVectorQuery.java: ## @@ -0,0 +1,98 @@ +/* + * Licensed to the Apache Software Foundat

Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]

2025-03-12 Thread via GitHub
svilen-mihaylov-elastic commented on code in PR #14094: URL: https://github.com/apache/lucene/pull/14094#discussion_r1991703342 ## lucene/core/src/java/org/apache/lucene/search/HnswKnnCollector.java: ## @@ -0,0 +1,32 @@ +/* + * Licensed to the Apache Software Foundation (ASF) un

Re: [I] Opening of vector files with ReadAdvice.RANDOM_PRELOAD [lucene]

2025-03-12 Thread via GitHub
DivyanshIITB commented on issue #14348: URL: https://github.com/apache/lucene/issues/14348#issuecomment-2717999626 Hi @viliam-durina, I find this issue interesting and would like to work on it. I see that Lucene currently uses ReadAdvice.RANDOM when opening vector files (.vec and .ve

Re: [I] Incorrect use of fsync [lucene]

2025-03-12 Thread via GitHub
rmuir commented on issue #14334: URL: https://github.com/apache/lucene/issues/14334#issuecomment-2717565686 > we still need the fsync on the parent directory to persist the file metadata on Linux Blows a giant hole in your argument, that it is ok to write to this file and separately

Re: [PR] introduce new parameter onlyLongestMatchNoSubwords replacing onlyLongestMatch [lucene]

2025-03-12 Thread via GitHub
rmuir commented on PR #14311: URL: https://github.com/apache/lucene/pull/14311#issuecomment-2717885448 @renatoh we can clean up `main` at any time as it is marked deprecated for 10.2 now

Re: [PR] introduce new parameter onlyLongestMatchNoSubwords replacing onlyLongestMatch [lucene]

2025-03-12 Thread via GitHub
renatoh commented on PR #14311: URL: https://github.com/apache/lucene/pull/14311#issuecomment-2717827401 > I'm just doing final tests. Thanks again @renatoh. I will backport it to 10.2. We can followup to remove the deprecated "sorta-kinda-longest-match" from lucene's `main` branch, and see

Re: [PR] OptimisticKnnVectorQuery [lucene]

2025-03-12 Thread via GitHub
dungba88 commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2717762620 I published the simplified version here for reference: https://github.com/dungba88/lucene/commit/278d7c919bc6ca6e1618868a892bcf3d4970cea5

[I] Opening of vector files with ReadAdvice.RANDOM_PRELOAD [lucene]

2025-03-12 Thread via GitHub
viliam-durina opened a new issue, #14348: URL: https://github.com/apache/lucene/issues/14348 ### Description Vector similarity search using HNSW accesses the vectors (the `vec` or `veq` files) very heavily during the search, even more than the HNSW graph itself (the `vex` file). If t

Re: [I] Incorrect use of fsync [lucene]

2025-03-12 Thread via GitHub
viliam-durina commented on issue #14334: URL: https://github.com/apache/lucene/issues/14334#issuecomment-2716978401 > personally I think we should just simply fsync the files before we close them: nothing more fancy than that. If we rely on the file being ever durably stored, then the

Re: [PR] Reduce the number of comparisons when lowerPoint is equal to upperPoint [lucene]

2025-03-12 Thread via GitHub
hanbj commented on PR #14267: URL: https://github.com/apache/lucene/pull/14267#issuecomment-2716944118 The previously failing test case was `org.apache.lucene.index.TestKnnGraph.testMultiThreadedSearch`. I have confirmed the testMultiThreadedSearch method, which uses KnnFloatVectorQuery for s