Re: [I] New binary vector format doesn't perform well with small-dimension datasets [lucene]

2025-03-18 Thread via GitHub


benwtrent commented on issue #14342:
URL: https://github.com/apache/lucene/issues/14342#issuecomment-2733531171

   First, thank you @lpld for digging in and running these benchmarks!
   
   OK, I think I see the weirdness with the `mnist` data set. It's not about it 
being a transformer model; it has to do with the distribution of the 
components. 
   
   I think we can significantly improve performance here for non-normally 
distributed vector components. 
   
   Let me illustrate. 
   
   Here is the centroid-centered distribution of e5-small over the quora 
dataset:
   
   
![Image](https://github.com/user-attachments/assets/4187d470-f109-4c3b-b447-b28dbdf4541f)
   
   Here is the centroid-centered distribution of fashion-mnist:
   
   
![Image](https://github.com/user-attachments/assets/695c232d-b4b8-4e93-b8c3-dbf20f33f038)
   
   Not normal at all. 
   
   GIST-1M is an example of a dataset that isn't "optimal", but still works:
   
   
![Image](https://github.com/user-attachments/assets/b7253633-9af1-4650-8c4a-9e416c28e7e7)
   
   
   
   The initialization parameters for optimized scalar quantization make an 
assumption about the distribution of vector components. However, I think we 
can improve this by:
   
   Option 0:
   
   There might just be a bug...I will spend some time seeing if I can find 
one...
   
   Option 1:
- Testing the distribution of the components to verify normality. This can 
be done safely over a sample of the vector set without too much compute 
power.
- Adjusting the initialization parameters for the anisotropic loss 
optimizations.
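A moment-based normality screen over a sampled set of components could look like the sketch below. This is plain Java, not Lucene code; the skewness/kurtosis test and its thresholds are illustrative assumptions, not tuned values.

```java
import java.util.Random;

public class NormalityScreen {

  /** Returns true if the sampled components look roughly normal. */
  static boolean looksNormal(float[] sample) {
    int n = sample.length;
    double mean = 0;
    for (float v : sample) mean += v;
    mean /= n;
    // Central moments m2..m4 of the sample.
    double m2 = 0, m3 = 0, m4 = 0;
    for (float v : sample) {
      double d = v - mean;
      m2 += d * d;
      m3 += d * d * d;
      m4 += d * d * d * d;
    }
    m2 /= n; m3 /= n; m4 /= n;
    double skew = m3 / Math.pow(m2, 1.5);
    double exKurt = m4 / (m2 * m2) - 3.0;
    // Illustrative thresholds: a Gaussian sample has skew ~ 0 and excess kurtosis ~ 0.
    return Math.abs(skew) < 0.5 && Math.abs(exKurt) < 1.0;
  }

  public static void main(String[] args) {
    Random r = new Random(42);
    float[] gaussian = new float[10000];
    float[] bimodal = new float[10000];
    for (int i = 0; i < gaussian.length; i++) {
      gaussian[i] = (float) r.nextGaussian();
      // MNIST-like: most mass at 0 with a second mode at 1.
      bimodal[i] = r.nextFloat() < 0.8 ? 0f : 1f;
    }
    System.out.println(looksNormal(gaussian)); // prints true
    System.out.println(looksNormal(bimodal));  // prints false
  }
}
```

Computing the moments over a few thousand sampled components is a single pass, so the check adds negligible cost next to index building.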
   
   Option 2:
   There might be something simpler: just allow folks to provide a static 
confidence value as the initialization parameter. This would bypass our 
initialization parameters and do anisotropic loss from the calculated 
intervals. 
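A minimal sketch of what that static-confidence initialization might look like, assuming a z-score-style interval over the centered components; the class name, method, and formula are illustrative assumptions, not Lucene's actual math.

```java
public class StaticConfidenceInterval {

  /** Derives a quantization interval [mean - z*std, mean + z*std] from the components. */
  static double[] interval(float[] components, double z) {
    double mean = 0;
    for (float v : components) mean += v;
    mean /= components.length;
    double var = 0;
    for (float v : components) var += (v - mean) * (v - mean);
    double std = Math.sqrt(var / components.length);
    // Clip the quantization range to mean +/- z standard deviations.
    return new double[] {mean - z * std, mean + z * std};
  }

  public static void main(String[] args) {
    float[] comps = {-1f, -0.5f, 0f, 0.5f, 1f};
    double[] iv = interval(comps, 1.96); // ~95% confidence under normality
    System.out.printf("[%.3f, %.3f]%n", iv[0], iv[1]);
  }
}
```

The appeal of this option is that the user-supplied `z` sidesteps any normality assumption: the anisotropic loss optimization then starts from whatever interval the user asked for.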
   
   Option 3 (really not an option with HNSW, I think):
   
   Another option is to utilize multiple centroids. However, using multiple 
centroids without HNSW actually knowing about them is incredibly inefficient 
and will cause compute issues.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Completion FSTs to be loaded off-heap at all times [lucene]

2025-03-18 Thread via GitHub


javanna commented on code in PR #14364:
URL: https://github.com/apache/lucene/pull/14364#discussion_r2000898675


##
lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestSuggestField.java:
##
@@ -951,7 +951,16 @@ static IndexWriterConfig iwcWithSuggestField(Analyzer 
analyzer, final Set

Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]

2025-03-18 Thread via GitHub


gf2121 commented on PR #14365:
URL: https://github.com/apache/lucene/pull/14365#issuecomment-2733285724

   I'm seeing even results on `wikimediumall`
   
   ```
Task                   QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff             p-value
TermDTSort                    59.46  (2.8%)                     59.15  (1.9%)  -0.5% (  -5% -    4%)   0.503
IntSet                        84.77  (2.8%)                     85.17  (2.3%)   0.5% (  -4% -    5%)   0.563
CountFilteredIntNRQ           40.75  (2.5%)                     41.33  (3.0%)   1.4% (  -4% -    7%)   0.106
TermDayOfYearSort             60.69  (2.8%)                     61.62  (2.8%)   1.5% (  -3% -    7%)   0.085
IntNRQ                        81.34  (3.2%)                     82.78  (3.4%)   1.8% (  -4% -    8%)   0.091
FilteredIntNRQ                77.49  (2.6%)                     78.92  (3.7%)   1.8% (  -4% -    8%)   0.068
   ```





Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-18 Thread via GitHub


dweiss commented on code in PR #14350:
URL: https://github.com/apache/lucene/pull/14350#discussion_r2000465944


##
lucene/core/src/java/org/apache/lucene/util/automaton/Automata.java:
##
@@ -608,7 +608,24 @@ public static Automaton makeStringUnion(Iterable<BytesRef> utf8Strings) {
 if (utf8Strings.iterator().hasNext() == false) {
   return makeEmpty();
 } else {
-  return StringsToAutomaton.build(utf8Strings, false);
+  return StringsToAutomaton.build(utf8Strings, false, false, false);
+}
+  }
+
+  /**
+   * Returns a new (deterministic and minimal) automaton that accepts the 
union of the given
+   * collection of {@link BytesRef}s representing UTF-8 encoded strings.
+   *
+   * @param utf8Strings The input strings, UTF-8 encoded. The collection must 
be in sorted order.
+   * @return An {@link Automaton} accepting all input strings. The resulting 
automaton is codepoint
+   * based (full unicode codepoints on transitions).
+   */
+  public static Automaton makeCaseInsensitiveStringUnion(

Review Comment:
   turkic parameter is missing from the javadoc.






Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]

2025-03-18 Thread via GitHub


gf2121 closed pull request #13521: Introduce new encoding of BPV 21 for 
DocIdsWriter used in BKD Tree
URL: https://github.com/apache/lucene/pull/13521





Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]

2025-03-18 Thread via GitHub


gf2121 merged PR #14361:
URL: https://github.com/apache/lucene/pull/14361





Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]

2025-03-18 Thread via GitHub


gf2121 commented on PR #13521:
URL: https://github.com/apache/lucene/pull/13521#issuecomment-2732037641

   Closing this in favor of https://github.com/apache/lucene/pull/14361.





Re: [I] Support modifying segmentInfos.counter in IndexWriter [lucene]

2025-03-18 Thread via GitHub


vigyasharma commented on issue #14362:
URL: https://github.com/apache/lucene/issues/14362#issuecomment-2731919938

   Hi @guojialiang92, could you elaborate more on how you plan to use this 
capability? It's not immediately obvious why modifying `segmentInfos.counter` 
would help with peer recovery or segment replication conflicts.





Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-18 Thread via GitHub


rmuir commented on code in PR #14350:
URL: https://github.com/apache/lucene/pull/14350#discussion_r2002272398


##
lucene/core/src/java/org/apache/lucene/util/automaton/CaseFolding.java:
##
@@ -743,4 +743,42 @@ static int[] lookupAlternates(int codepoint) {
 
 return alts;
   }
+
+  /**
+   * Folds the case of the given character according to {@link 
Character#toLowerCase(int)}, but with
+   * exceptions if the turkic flag is set.
+   *
+   * @param codepoint to code point for the character to fold
+   * @param turkic if true, then apply tr/az folding rules
+   * @return the folded character
+   */
+  static int foldCase(int codepoint, boolean turkic) {
+if (turkic) {
+  if (codepoint == 0x00130) { // İ [LATIN CAPITAL LETTER I WITH DOT ABOVE]
+return 0x00069; // i [LATIN SMALL LETTER I]
+  } else if (codepoint == 0x49) { //  I [LATIN CAPITAL LETTER I]
+return 0x00131; // ı [LATIN SMALL LETTER DOTLESS I]
+  }
+}
+return Character.toLowerCase(codepoint);

Review Comment:
   Maybe, depending on what we are going to do with it? If done correctly, we could 
replace `LowerCaseFilter`, `GreekLowerCaseFilter`, etc. in the analysis chain. Of 
course "correctly" there is a difficult bar, as it would impact 100% of users 
in a very visible way and could easily bottleneck indexing / waste resources if 
not done correctly. For example, large arrays of objects or even primitives are a 
big no here. See https://www.strchr.com/multi-stage_tables and look at what the JDK 
and ICU do already.
   
   But for the purpose of this PR, we may want to start simpler (this is the 
same approach I mentioned on the regex caseless PR). We should avoid huge arrays 
and large data files in lucene-core just for adding more inefficient user 
regular expressions that aren't really related to searching. On the other hand, 
if we are going to get serious benefit everywhere (e.g. improve all analyzers), 
then maybe the tradeoff makes sense.
   
   And I don't understand why we'd parse text files versus just writing the 
generator itself to use ICU... especially since we already use such an approach 
in the build: 
https://github.com/apache/lucene/blob/main/gradle/generation/icu/GenerateUnicodeProps.groovy
   
   Still, I wouldn't immediately jump to generation as a start; it is a lot of 
work, and we should iterate. First I'd compare 
`Character.toLowerCase(Character.toUpperCase(x))` to `UCharacter.foldCase(int, 
false)` to see what the delta really needs to be as far as data. I'd expect 
this to be very small. You can start prototyping with that instead of investing 
a ton of up-front time.
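The JDK side of that comparison can be prototyped with nothing but `java.lang.Character`; the ICU half is omitted here since icu4j isn't assumed to be on the classpath. This sketch enumerates code points where the round-trip disagrees with plain lowercasing, each of which is a candidate for explicit folding data:

```java
public class FoldDelta {
  public static void main(String[] args) {
    int count = 0;
    for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
      if (!Character.isDefined(cp)) continue;
      // Round-trip folding approximation vs. plain lowercasing.
      int roundTrip = Character.toLowerCase(Character.toUpperCase(cp));
      int lower = Character.toLowerCase(cp);
      if (roundTrip != lower) {
        count++;
        if (count <= 5) {
          System.out.printf("U+%04X: lower=U+%04X roundTrip=U+%04X%n", cp, lower, roundTrip);
        }
      }
    }
    System.out.println("total deltas: " + count);
  }
}
```

One known hit is final sigma: `toLowerCase('ς')` is `ς` itself, while the round trip through `Σ` yields `σ`, matching the `C` entries in CaseFolding.txt.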
   






Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]

2025-03-18 Thread via GitHub


vigyasharma commented on code in PR #14335:
URL: https://github.com/apache/lucene/pull/14335#discussion_r2000345662


##
lucene/core/src/test/org/apache/lucene/index/TestMultiTenantMergeScheduler.java:
##
@@ -0,0 +1,73 @@
+package org.apache.lucene.index;
+
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.RAMDirectory;
+import org.apache.lucene.analysis.standard.StandardAnalyzer;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.TextField;
+import org.apache.lucene.util.LuceneTestCase;
+
+public class TestMultiTenantMergeScheduler extends LuceneTestCase {
+
+public void testMultiTenantMergeScheduler() throws Exception {
+Directory dir = new RAMDirectory();
+IndexWriterConfig config = new IndexWriterConfig(new 
MockAnalyzer(random()));
+MultiTenantMergeScheduler scheduler = new MultiTenantMergeScheduler();
+config.setMergeScheduler(scheduler);
+
+IndexWriter writer1 = new IndexWriter(dir, config);
+IndexWriter writer2 = new IndexWriter(dir, config);
+
+// Add documents and trigger merges
+for (int i = 0; i < 50; i++) {
+writer1.addDocument(new Document());
+writer2.addDocument(new Document());
+if (i % 10 == 0) {
+writer1.commit();
+writer2.commit();
+}
+}
+
+writer1.forceMerge(1);
+writer2.forceMerge(1);
+
+// Close writers at different times
+writer1.close();
+Thread.sleep(500);
+writer2.close();
+
+// Ensure scheduler is properly closed
+scheduler.close();
+MultiTenantMergeScheduler.shutdownThreadPool();
+
+dir.close();
+}
+
+public void testConcurrentMerging() throws Exception {
+Directory dir = new RAMDirectory();
+IndexWriterConfig config = new IndexWriterConfig(new 
MockAnalyzer(random()));
+MultiTenantMergeScheduler scheduler = new MultiTenantMergeScheduler();
+config.setMergeScheduler(scheduler);
+
+IndexWriter writer = new IndexWriter(dir, config);
+
+// Add documents
+for (int i = 0; i < 100; i++) {
+writer.addDocument(new Document());
+}
+writer.commit();
+
+long startTime = System.currentTimeMillis();
+writer.forceMerge(1);
+long endTime = System.currentTimeMillis();
+
+writer.close();
+scheduler.close();
+MultiTenantMergeScheduler.shutdownThreadPool();
+
+// Check if merging took less time than sequential execution would
+assertTrue("Merges did not happen concurrently!", (endTime - 
startTime) < 5000);

Review Comment:
   It's not reliable to assert or depend on elapsed time. Lucene is run on many 
different machines and this timing will differ.
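One timing-free alternative, sketched here with a generic executor rather than the scheduler under review: have the tasks rendezvous and record peak concurrency, then assert on that instead of wall-clock time.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PeakConcurrency {

  /** Runs n tasks that all rendezvous, and returns the peak number running at once. */
  static int measurePeak(int n) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(n);
    AtomicInteger running = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    CountDownLatch barrier = new CountDownLatch(n);
    for (int i = 0; i < n; i++) {
      pool.execute(() -> {
        int now = running.incrementAndGet();
        peak.accumulateAndGet(now, Math::max);
        barrier.countDown();
        try {
          barrier.await(); // hold until every task has started, guaranteeing overlap
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
        running.decrementAndGet();
      });
    }
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.SECONDS);
    return peak.get();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(measurePeak(4)); // prints 4: all tasks ran concurrently
  }
}
```

Because every task holds its slot until all have started, the peak equals the task count deterministically, so the assertion does not depend on machine speed.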



##
lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java:
##
@@ -0,0 +1,72 @@
+package org.apache.lucene.index;
+
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.io.IOException;
+
+/**
+ * A multi-tenant merge scheduler that shares a global thread pool across 
multiple IndexWriters.
+ */
+public class MultiTenantMergeScheduler extends MergeScheduler {
+
+// Shared global thread pool with lazy initialization
+private static class LazyHolder {
+static final ExecutorService MERGE_THREAD_POOL = 
+
Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() / 2);
+}
+
+private static ExecutorService getMergeThreadPool() {
+return LazyHolder.MERGE_THREAD_POOL;
+}
+
+// Use getMergeThreadPool() instead of direct access

Review Comment:
   We can skip the `getMergeThreadPool()` function since we only need to access 
the threadpool internally. Something like:
   ```java
 private static class MergeThreadPool {
 private static final ExecutorService INSTANCE = 
 
Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() / 2);
 }
 // and then just use MergeThreadPool.INSTANCE inside merge() and other 
functions.
 // ...
   }
   ```



##
lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java:
##
@@ -0,0 +1,72 @@
+package org.apache.lucene.index;
+
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.io.IOException;
+
+/**
+ * A multi-tenant merge scheduler that shares a global thread pool across 
multiple IndexWriters.
+ */
+public class MultiTenantMergeScheduler extends MergeScheduler {
+
+// Shared global thread pool with lazy initialization
+private static class LazyHolder {
+static final ExecutorService MERGE_THREAD_POOL = 
+
Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() / 2);
+}
+
+priv

Re: [I] Create a bot to check if there is a CHANGES entry for new PRs [lucene]

2025-03-18 Thread via GitHub


stefanvodita commented on issue #13898:
URL: https://github.com/apache/lucene/issues/13898#issuecomment-2733970610

   Just to clarify - the restriction @dweiss mentioned applies to the 
`changelog-enforcer` action, but not to the `checkout` action we are using.
   
   @pseudo-nymous - I'm seeing new failures now, e.g. a 
[fatal](https://github.com/apache/lucene/actions/runs/13876712827/job/388299374080).
   It's frustrating that the tests we perform on our forks aren't reflective of 
the behaviour we see in the main repo.





Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]

2025-03-18 Thread via GitHub


jpountz commented on PR #14365:
URL: https://github.com/apache/lucene/pull/14365#issuecomment-2734597577

   Maybe we should stop only adding doc IDs to the `BulkAdder` when they are 
greater than the max collected doc so far. Skipping these doc IDs looks like it 
hurts vectorization; I played with disabling these if statements locally and 
got a good speedup on queries sorted by a numeric field:
   
   ```
Task                  QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff             p-value
TermDTSort                  503.38  (2.7%)                    529.86  (2.6%)   5.3% (   0% -   10%)   0.001
TermDayOfYearSort           700.49  (2.0%)                    772.06  (1.2%)  10.2% (   6% -   13%)   0.000
   ```
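A toy illustration (not the actual `BulkAdder` code) of why the guard can hurt: the per-element, data-dependent branch blocks auto-vectorization, while the unguarded version is a straight bulk copy the JIT can vectorize, with non-competitive docs filtered later.

```java
public class BulkAddSketch {

  // Guarded: a data-dependent branch per element defeats vectorization.
  static int addCompetitive(int[] docs, int maxCollected, int[] buffer) {
    int count = 0;
    for (int doc : docs) {
      if (doc > maxCollected) {
        buffer[count++] = doc;
      }
    }
    return count;
  }

  // Unguarded: a branch-free bulk copy; extra docs are discarded downstream.
  static int addAll(int[] docs, int[] buffer) {
    System.arraycopy(docs, 0, buffer, 0, docs.length);
    return docs.length;
  }

  public static void main(String[] args) {
    int[] docs = {3, 8, 15, 4, 23};
    int[] buffer = new int[docs.length];
    System.out.println(addCompetitive(docs, 7, buffer)); // prints 3
    System.out.println(addAll(docs, buffer));            // prints 5
  }
}
```

The trade-off is the one discussed later in the thread: the unguarded path collects more docs, trading memory for the branch-free inner loop.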





Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-18 Thread via GitHub


msfroh commented on code in PR #14350:
URL: https://github.com/apache/lucene/pull/14350#discussion_r2002184400


##
lucene/core/src/java/org/apache/lucene/util/automaton/CaseFolding.java:
##
@@ -743,4 +743,42 @@ static int[] lookupAlternates(int codepoint) {
 
 return alts;
   }
+
+  /**
+   * Folds the case of the given character according to {@link 
Character#toLowerCase(int)}, but with
+   * exceptions if the turkic flag is set.
+   *
+   * @param codepoint to code point for the character to fold
+   * @param turkic if true, then apply tr/az folding rules
+   * @return the folded character
+   */
+  static int foldCase(int codepoint, boolean turkic) {
+if (turkic) {
+  if (codepoint == 0x00130) { // İ [LATIN CAPITAL LETTER I WITH DOT ABOVE]
+return 0x00069; // i [LATIN SMALL LETTER I]
+  } else if (codepoint == 0x49) { //  I [LATIN CAPITAL LETTER I]
+return 0x00131; // ı [LATIN SMALL LETTER DOTLESS I]
+  }
+}
+return Character.toLowerCase(codepoint);

Review Comment:
   Got it. 
   
   Checking https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt, indeed I 
can see those entries:
   
   ```
   03A3; C; 03C3; # GREEK CAPITAL LETTER SIGMA
   ...
   03C2; C; 03C3; # GREEK SMALL LETTER FINAL SIGMA
   ```
   
   Ideally, I'd love to just use those folding rules.
   
   I could get them from `UCharacter.foldCase(int, bool)`, but that involves 
pulling in icu4j as a dependency, which is an extra 12MB jar.
   
   Would it be worthwhile to write a generator that pulls 
https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt (updated to whatever 
the current Unicode spec is) and generates a `foldCase` method that's 
functionally equivalent to `UCharacter.foldCase(int, bool)`?
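A sketch of the parsing step such a generator would need, assuming the published CaseFolding.txt format (`<code>; <status>; <mapping>; # <name>`); the class and method names here are made up for illustration. A simple-folding table keeps status `C` (common) and `S` (simple) entries and skips `F` (full) and `T` (turkic).

```java
import java.util.HashMap;
import java.util.Map;

public class CaseFoldingParser {

  /** Parses CaseFolding.txt data into a simple-folding map (C and S entries only). */
  static Map<Integer, Integer> parseSimpleFoldings(String data) {
    Map<Integer, Integer> map = new HashMap<>();
    for (String line : data.split("\n")) {
      int hash = line.indexOf('#');
      if (hash >= 0) line = line.substring(0, hash); // strip trailing comment
      line = line.trim();
      if (line.isEmpty()) continue;
      String[] fields = line.split(";");
      String status = fields[1].trim();
      if (status.equals("C") || status.equals("S")) {
        map.put(Integer.parseInt(fields[0].trim(), 16),
                Integer.parseInt(fields[2].trim(), 16));
      }
    }
    return map;
  }

  public static void main(String[] args) {
    // Sample lines copied from the Unicode data file.
    String sample =
        "0049; T; 0131; # LATIN CAPITAL LETTER I\n"
        + "0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE\n"
        + "03A3; C; 03C3; # GREEK CAPITAL LETTER SIGMA\n"
        + "03C2; C; 03C3; # GREEK SMALL LETTER FINAL SIGMA\n";
    Map<Integer, Integer> foldings = parseSimpleFoldings(sample);
    System.out.println(foldings.get(0x03A3) == 0x03C3); // prints true
    System.out.println(foldings.containsKey(0x0049));   // prints false: T entries skipped
  }
}
```

The `T` entries would instead feed the turkic branch of `foldCase`, matching the special-casing already in the diff above.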






Re: [I] New binary vector format doesn't perform well with small-dimension datasets [lucene]

2025-03-18 Thread via GitHub


benwtrent commented on issue #14342:
URL: https://github.com/apache/lucene/issues/14342#issuecomment-2734567137

   OK, a colleague and I spent some time digging into this, and Option 0 (a bug) 
turned out to be the case. It's a 5-character change (like all good bugs), but 
here are the new recall numbers for fashion-mnist:
   
   Still not mid-90s at 5x oversampling, but WAY better than the abysmal 
results from before.
   
   ```
   FLAT Results:
   recall  latency(ms)  nDoc  topK  fanout  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  overSample  vec_disk(MB)  vec_RAM(MB)  indexType
    0.444        4.110   610    50          1 bits         0.00      Infinity             1          186.43       1.000       181.646        2.203       FLAT
    0.629        4.383   610    50          1 bits         0.00      Infinity             1          186.43       2.000       181.646        2.203       FLAT
    0.730        4.437   610    50          1 bits         0.00      Infinity             1          186.43       3.000       181.646        2.203       FLAT
    0.792        4.455   610    50          1 bits         0.00      Infinity             1          186.43       4.000       181.646        2.203       FLAT
    0.833        4.445   610    50          1 bits         0.00      Infinity             1          186.43       5.000       181.646        2.203       FLAT
    0.926        4.607   610    50          1 bits         0.00      Infinity             1          186.43      10.000       181.646        2.203       FLAT
   ```
   
   ```
   HNSW
   recall  latency(ms)  nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  overSample  vec_disk(MB)  vec_RAM(MB)  indexType
    0.443        0.188   610    50          64       250        1 bits        0.00      Infinity             1          189.55       1.000       181.646        2.203       HNSW
    0.629        0.274   610    50          64       250        1 bits        0.00      Infinity             1          189.55       2.000       181.646        2.203       HNSW
    0.730        0.349   610    50          64       250        1 bits        0.00      Infinity             1          189.55       3.000       181.646        2.203       HNSW
    0.792        0.471   610    50          64       250        1 bits        0.00      Infinity             1          189.55       4.000       181.646        2.203       HNSW
    0.833        0.479   610    50          64       250        1 bits        0.00      Infinity             1          189.55       5.000       181.646        2.203       HNSW
    0.926        0.786   610    50          64       250        1 bits        0.00      Infinity             1          189.55      10.000       181.646        2.203       HNSW
   ```
   
   For the curious, it had to do with shifting the normal distribution 
initialization parameters correctly given the standard deviation of the actual 
vector distribution. We had the mean & std flipped. When these are well 
behaved, this sort of bug has a tiny effect (which is why we never caught it), 
but mnist isn't well behaved and brought this nasty little bug to light.
   
   I am gonna run some more benchmarks and will open a PR soon with the fix. 
   
   
   As an aside, there are likely even more gains for non-normally distributed 
vectors like mnist, but they will take more time and effort.





[PR] Fix for changelog verifier and milestone setter automation [lucene]

2025-03-18 Thread via GitHub


pseudo-nymous opened a new pull request, #14369:
URL: https://github.com/apache/lucene/pull/14369

   ### Description
   This pull request contains a fix for the changelog automation that was 
added recently. We have seen failures where either the diff calculation wrt the base 
commit was wrong or the base commit was not even available, leading to a fatal error. 
   
   This happens because we are using the `pull_request_target` event type, which 
checks out the `merge base commit`, whereas we want to check out the `head commit` to do 
a proper diff calculation wrt the changelog file. My initial testing was based on 
`pull_request`, which worked as expected since it checks out the `merge head commit`.
   
   We tried a fix for this here: https://github.com/apache/lucene/pull/14355, 
but we didn't fetch the full history appropriately. This PR fixes that 
by pulling all the associated history using `git pull --unshallow origin ${{ 
github.event.pull_request.head.ref }}`.
   
   Tested all the changes here: https://github.com/pseudo-nymous/lucene/pull/22
   
   Also added an optimization with `ref: ${{ github.event.pull_request.head.sha 
}}`, so it will only pull the head commit instead of pulling all the 
references.
   
   





Re: [I] Create a bot to check if there is a CHANGES entry for new PRs [lucene]

2025-03-18 Thread via GitHub


pseudo-nymous commented on issue #13898:
URL: https://github.com/apache/lucene/issues/13898#issuecomment-2735279420

   @stefanvodita  I have added a fix for it. Please take a look. 
https://github.com/apache/lucene/pull/14369





Re: [PR] Fix for changelog verifier and milestone setter automation [lucene]

2025-03-18 Thread via GitHub


pseudo-nymous commented on PR #14369:
URL: https://github.com/apache/lucene/pull/14369#issuecomment-2735285383

   We can fetch all history using the checkout action itself with the flag 
`fetch-depth: 0`, but it fetches all the history for all branches and tags, 
which is not required here.





Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]

2025-03-18 Thread via GitHub


gf2121 commented on PR #14365:
URL: https://github.com/apache/lucene/pull/14365#issuecomment-2735314110

   Thanks for running the benchmark, the speedup is great!
   
   > Skipping these doc IDs looks like it hurts vectorization, I played with 
disabling these if statements locally and get a good speedup on queries sorted 
by numeric field.
   
   I tried it after seeing your comment and see a similar speedup. I think there 
might be more reasons contributing to the speedup:
   
   * We are collecting more docs into the `DocIdSetBuilder`, and the 
competitiveIterator has more chance to be a bitset iterator, which reduces the 
overhead of `LSBRadixSorter` and speeds up `competitiveIterator#intoBitset`. 
This seems more like a performance-memory trade-off to me.
   
   * The DayOfYear field has medium cardinality (365), so some blocks are stored as 
bitsets; bulk adding `DocBaseBitsetIterator` might also help performance.





[PR] Align comments with upgraded postings format [lucene]

2025-03-18 Thread via GitHub


amosbird opened a new pull request, #14368:
URL: https://github.com/apache/lucene/pull/14368

   ### Description
   
   Update prefetch heuristic comments to reflect that skip data is now inlined 
into postings lists.
   
   
   





Re: [PR] Align comments with upgraded postings format [lucene]

2025-03-18 Thread via GitHub


jpountz merged PR #14368:
URL: https://github.com/apache/lucene/pull/14368





Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]

2025-03-18 Thread via GitHub


vigyasharma commented on code in PR #14335:
URL: https://github.com/apache/lucene/pull/14335#discussion_r2001826532


##
lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java:
##
@@ -0,0 +1,70 @@
+package org.apache.lucene.index;
+
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.io.IOException;
+
+/**
+ * A multi-tenant merge scheduler that shares a global thread pool across 
multiple IndexWriters.
+ */
+public class MultiTenantMergeScheduler extends MergeScheduler {
+
+// Shared global thread pool
+private static final ExecutorService MERGE_THREAD_POOL = 
Executors.newFixedThreadPool(
+Runtime.getRuntime().availableProcessors() / 2
+);
+
+// Track active merges per writer
+private final List> activeMerges = 
Collections.synchronizedList(new ArrayList<>());
+
+@Override
+public void merge(MergeScheduler.MergeSource mergeSource, MergeTrigger 
trigger) throws IOException {
+while (true) {
+MergePolicy.OneMerge merge = mergeSource.getNextMerge();
+if (merge == null) break;  // No more merges
+
+// Submit merge task and track future
+Future future = MERGE_THREAD_POOL.submit(() -> {
+try {
+mergeSource.merge(merge);
+} catch (IOException e) {
+throw new RuntimeException("Merge operation failed", e);
+}
+});
+
+activeMerges.add(future);
+
+// Cleanup completed merges
+activeMerges.removeIf(Future::isDone);
+}
+}
+
+private final ConcurrentHashMap> activeMerges = 
new ConcurrentHashMap<>();
+
+@Override
+public void close() throws IOException {
+IndexWriter currentWriter = getCurrentIndexWriter();  // Method to get 
the calling writer
+List merges = activeMerges.getOrDefault(currentWriter, 
Collections.emptyList());
+
+for (Merge merge : merges) {
+merge.waitForCompletion(); // Only wait for merges related to this 
writer
+}
+
+activeMerges.remove(currentWriter); // Cleanup after closing

Review Comment:
   This doesn't line up, `activeMerges` is a List, you're calling Map APIs on 
it. Did this compile?



##
lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java:
##
@@ -0,0 +1,70 @@
+package org.apache.lucene.index;
+
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.io.IOException;
+
+/**
+ * A multi-tenant merge scheduler that shares a global thread pool across 
multiple IndexWriters.
+ */
+public class MultiTenantMergeScheduler extends MergeScheduler {
+
+// Shared global thread pool
+private static final ExecutorService MERGE_THREAD_POOL = 
Executors.newFixedThreadPool(
+Runtime.getRuntime().availableProcessors() / 2
+);
+
+// Track active merges per writer
+private final List> activeMerges = 
Collections.synchronizedList(new ArrayList<>());
+
+@Override
+public void merge(MergeScheduler.MergeSource mergeSource, MergeTrigger 
trigger) throws IOException {
+while (true) {
+MergePolicy.OneMerge merge = mergeSource.getNextMerge();
+if (merge == null) break;  // No more merges
+
+// Submit merge task and track future
+Future future = MERGE_THREAD_POOL.submit(() -> {
+try {
+mergeSource.merge(merge);
+} catch (IOException e) {
+throw new RuntimeException("Merge operation failed", e);
+}
+});
+
+activeMerges.add(future);
+
+// Cleanup completed merges
+activeMerges.removeIf(Future::isDone);
+}
+}
+
+private final ConcurrentHashMap> activeMerges = 
new ConcurrentHashMap<>();
+
+@Override
+public void close() throws IOException {
+IndexWriter currentWriter = getCurrentIndexWriter();  // Method to get 
the calling writer

Review Comment:
   This is not defined anywhere!






Re: [PR] Align comments with upgraded postings format [lucene]

2025-03-18 Thread via GitHub


jpountz commented on PR #14368:
URL: https://github.com/apache/lucene/pull/14368#issuecomment-2734638261

   Good catch!

