[PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]

2024-05-27 Thread via GitHub


bruno-roustant opened a new pull request, #13422:
URL: https://github.com/apache/lucene/pull/13422

   - Add the required classes to the hppc fork.
   - Remove the hppc dependency from the facet, join, spatial modules.
   - Remove the hppc version from versions.props and versions.lock.
   
   It remains to move the hppc fork to oal.internal.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]

2024-05-27 Thread via GitHub


uschindler commented on PR #13422:
URL: https://github.com/apache/lucene/pull/13422#issuecomment-2132806647

   Ah I see you mentioned the missing move in the PR description, sorry!
   
   So except the `@lucene.internal` javadocs (should be added always, also when 
in private package), +1


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]

2024-05-27 Thread via GitHub


dweiss commented on PR #13422:
URL: https://github.com/apache/lucene/pull/13422#issuecomment-2132824844

   I'll do it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]

2024-05-27 Thread via GitHub


bruno-roustant commented on PR #13422:
URL: https://github.com/apache/lucene/pull/13422#issuecomment-2132834604

   Ok, Dawid I let you move the fork.
   I added the @lucene.internal annotations. I like that this hppc integration 
is cleaner and well defined now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]

2024-05-27 Thread via GitHub


bruno-roustant commented on PR #13422:
URL: https://github.com/apache/lucene/pull/13422#issuecomment-2132844966

   Do you want to move in this PR, or in another PR once this one is merged?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move bulkScorer() from Weight to ScorerSupplier [lucene]

2024-05-27 Thread via GitHub


jpountz merged PR #13408:
URL: https://github.com/apache/lucene/pull/13408


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]

2024-05-27 Thread via GitHub


dweiss commented on PR #13422:
URL: https://github.com/apache/lucene/pull/13422#issuecomment-2132876361

   I'll push to this PR, if you don't mind?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Avoid SegmentTermsEnumFrame reload block. [lucene]

2024-05-27 Thread via GitHub


vsop-479 commented on code in PR #13253:
URL: https://github.com/apache/lucene/pull/13253#discussion_r1615645890


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##
@@ -287,6 +287,68 @@ void rewind() {
 */
   }
 
+  // Only rewind, don't force reload block.
+  // Reset readers' position, don't read, decompress.
+  // Current term greater than target, reduce endCount.
+  void rewindWithoutReload() {
+// Set nextEnt to 0, to prevent force load.
+nextEnt = 0;
+suffixesReader.setPosition(0);
+suffixLengthsReader.setPosition(0);
+statsReader.setPosition(0);
+bytesReader.setPosition(0);
+
+// TODO: Since we only rewind without reload for fist 
floor(currentFrame.fp ==

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Avoid SegmentTermsEnumFrame reload block. [lucene]

2024-05-27 Thread via GitHub


vsop-479 commented on code in PR #13253:
URL: https://github.com/apache/lucene/pull/13253#discussion_r1615650665


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java:
##
@@ -710,8 +761,28 @@ public SeekStatus seekCeil(BytesRef target) throws 
IOException {
 // System.out.println("  target is before current (shares prefixLen=" 
+ targetUpto + ");
 // rewind frame ord=" + lastFrame.ord);
 // }
+
+// We got lastFrame by comparing target and term, and target less than 
last seeked term in
+// currentFrame. If lastFrame's fp is same with currentFrame's fp, we 
can reduce entCount to
+// nextEnt.

Review Comment:
   > We don't seem to reduce entCount anymore?
   
   We can reduce entCount to nextEnt if we finally seek the same block.
   This has been already implemented in: 
   
   
   // We still stay on withOutReload frame, reduce entCount to nextEnt.
   int origEntCount = -1;
   if (currentFrame.fp == withOutReloadFp && origNextEnt != 0) {
 origEntCount = currentFrame.entCount;
 currentFrame.entCount = origNextEnt;
   }
   



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java:
##
@@ -518,7 +541,20 @@ public boolean seekExact(BytesRef target) throws 
IOException {
 
 currentFrame.loadBlock();
 
+// We still stay on withOutReload frame, reduce entCount to nextEnt.
+int origEntCount = -1;
+if (currentFrame.fp == withOutReloadFp && origNextEnt != 0) {
+  origEntCount = currentFrame.entCount;
+  currentFrame.entCount = origNextEnt;
+}
+
 final SeekStatus result = currentFrame.scanToTerm(target, true);
+
+// Revert entCount to origEntCount.

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Avoid SegmentTermsEnumFrame reload block. [lucene]

2024-05-27 Thread via GitHub


vsop-479 commented on code in PR #13253:
URL: https://github.com/apache/lucene/pull/13253#discussion_r1615651266


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java:
##
@@ -573,7 +609,20 @@ public boolean seekExact(BytesRef target) throws 
IOException {
 
 currentFrame.loadBlock();
 
+// We still stay on withOutReload frame, reduce entCount to nextEnt.
+int origEntCount = -1;
+if (currentFrame.fp == withOutReloadFp && origNextEnt != 0) {
+  origEntCount = currentFrame.entCount;
+  currentFrame.entCount = origNextEnt;
+}
+
 final SeekStatus result = currentFrame.scanToTerm(target, true);
+
+// Revert entCount to origEntCount.

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Avoid SegmentTermsEnumFrame reload block. [lucene]

2024-05-27 Thread via GitHub


vsop-479 commented on code in PR #13253:
URL: https://github.com/apache/lucene/pull/13253#discussion_r1615655198


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java:
##
@@ -434,8 +436,29 @@ public boolean seekExact(BytesRef target) throws 
IOException {
 //   System.out.println("  target is before current (shares 
prefixLen=" + targetUpto + ");
 // rewind frame ord=" + lastFrame.ord);
 // }
+
+// We got lastFrame by comparing target and term, and target less than 
last seeked term in
+// currentFrame. If lastFrame's fp is same with currentFrame's fp, we 
can reduce entCount to
+// nextEnt.
+boolean currentIsLast = currentFrame.fp == lastFrame.fp;
 currentFrame = lastFrame;
-currentFrame.rewind();
+
+// Only rewindWithoutReload for non-floor block or first floor block.
+// TODO: We need currentFrame's first entry to judge whether we can 
rewindWithoutReload for
+// non-first floor blocks.
+if (currentFrame.fp != currentFrame.fpOrig

Review Comment:
   > Could you add comment for each of these reasons to still reload the frame?
   
   Added.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use `IndexInput#prefetch` for terms dictionary lookups. [lucene]

2024-05-27 Thread via GitHub


jpountz commented on PR #13359:
URL: https://github.com/apache/lucene/pull/13359#issuecomment-2132902792

   
   
   Now that #13408 has been merged, I could update the benchmark to simply call 
IndexSearcher#search.
   
   
   ```java
   import java.io.IOException;
   import java.io.UncheckedIOException;
   import java.nio.file.Path;
   import java.nio.file.Paths;
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Random;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.ThreadLocalRandom;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.atomic.AtomicLong;
   
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.Field.Store;
   import org.apache.lucene.document.StringField;
   import org.apache.lucene.index.DirectoryReader;
   import org.apache.lucene.index.IndexReader;
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.index.IndexWriterConfig;
   import org.apache.lucene.index.Term;
   import org.apache.lucene.index.TieredMergePolicy;
   import org.apache.lucene.search.BooleanClause.Occur;
   import org.apache.lucene.search.BooleanQuery;
   import org.apache.lucene.search.IndexSearcher;
   import org.apache.lucene.search.Sort;
   import org.apache.lucene.search.TermQuery;
   import org.apache.lucene.store.Directory;
   import org.apache.lucene.store.FSDirectory;
   
   public class TermsEnumPrefetchBench {
   
 private static final int NUM_TERMS = 3;
 public static int DUMMY;
 
 public static void main(String[] args) throws Exception {
   Path dirPath = Paths.get(args[0]);
   Directory dir = FSDirectory.open(dirPath);
   if (DirectoryReader.indexExists(dir) == false) {
 TieredMergePolicy mp = new TieredMergePolicy();
 mp.setSegmentsPerTier(100);
 mp.setMaxMergeAtOnce(100);
 mp.setMaxMergedSegmentMB(1024);
 try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig()
 .setMergePolicy(mp)
 .setRAMBufferSizeMB(1024))) {
   ExecutorService executor = 
Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
   AtomicLong indexed = new AtomicLong(0);
   for (int task = 0; task < 1000; ++task) {
 executor.execute(() -> {
   Random r = ThreadLocalRandom.current();
   for (int i = 0; i < 1_000; ++i) {
 Document doc = new Document();
 for (int j = 0; j < 10_000; ++j) {
   doc.add(new StringField("f", 
Long.toString(r.nextLong(20_000_000_000L)), Store.NO));
 }
 try {
   w.addDocument(doc);
 } catch (IOException e) {
   throw new UncheckedIOException(e);
 }
 final long actualIndexed = indexed.incrementAndGet(); 
 if (actualIndexed % 10_000 == 0) {
   System.out.println("Indexed: " + actualIndexed);
 }
   }
 });
   }
   
   executor.shutdown();
   executor.awaitTermination(1, TimeUnit.DAYS);
   w.commit();
   System.out.println("Start force merging");
   w.forceMerge(1);
   System.out.println("Done force merging");
   w.commit();
 }
   }
   List latencies = new ArrayList<>();
   try (IndexReader reader = DirectoryReader.open(dir)) {
 IndexSearcher searcher = new IndexSearcher(reader);
   
 Random r = ThreadLocalRandom.current();
 for (int i = 0; i < 10_000; ++i) {
   long start = System.nanoTime();
   BooleanQuery.Builder query = new BooleanQuery.Builder();
   for (int t = 0; t < NUM_TERMS; ++t) {
 query.add(new TermQuery(new Term("f", 
Long.toString(r.nextLong(20_000_000_000L, Occur.SHOULD);
   }
   DUMMY += searcher.search(query.build(), 1, 
Sort.INDEXORDER).totalHits.value;
   long end = System.nanoTime();
   latencies.add((end - start) / 1000);
 }
   }
   latencies.sort(null);
   System.out.println("P50: " + latencies.get(latencies.size() / 2));
   System.out.println("P90: " + latencies.get(latencies.size() * 9 / 10));
   System.out.println("P99: " + latencies.get(latencies.size() * 99 / 100));
 }
   
   }
   
   ```
   
   
   
   Results still look good.
   
   Before the change:
   P50: 282
   P90: 387
   P99: 537
   
   After the change:
   P50: 161
   P90: 253
   P99: 379


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] Remove Scorer#getWeight. [lucene]

2024-05-27 Thread via GitHub


jpountz commented on issue #13410:
URL: https://github.com/apache/lucene/issues/13410#issuecomment-2132905389

   @navneet1v Since a `Scorer` must be created from a `Weight`, callers would 
be expected to keep track of the `Weight` that created the `Scorer` if they 
need it instead of relying on Lucene to do this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]

2024-05-27 Thread via GitHub


dweiss commented on PR #13422:
URL: https://github.com/apache/lucene/pull/13422#issuecomment-2132910636

   So compilation works, but tests won't fly with this change. Recall we run 
tests in "classpath mode" so even though modules compile, for tests they're on 
classpath and then can't access internal packages from the core. Here's an 
example:
   ```
  > java.lang.IllegalAccessError: class 
org.apache.lucene.search.join.DiversifyingNearestChildrenKnnCollector$NodeIdCachingHeap
 (in unnamed module @0x68472863) cannot access class 
org.apache.lucene.util.hppc.IntIntHashMap (in module org.apache.lucene.core) 
because module org.apache.lucene.core does not export 
org.apache.lucene.util.hppc to unnamed module @0x68472863
  > at 
__randomizedtesting.SeedInfo.seed([9A5633A49FE14BF7:FC682151F0E8CD34]:0)
  > at 
org.apache.lucene.search.join.DiversifyingNearestChildrenKnnCollector$NodeIdCachingHeap.(DiversifyingNearestChildrenKnnCollector.java:132)
  > at 
org.apache.lucene.search.join.DiversifyingNearestChildrenKnnCollector.(DiversifyingNearestChildrenKnnCollector.java:47)
  > at 
org.apache.lucene.search.join.DiversifyingNearestChildrenKnnCollectorManager.newCollector(DiversifyingNearestChildrenKnnCollectorManager.java:68)
  > at 
org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.search.TimeLimitingKnnCollectorManager.newCollector(TimeLimitingKnnCollectorManager.java:41)
   ```
   
   Fixing this requires going down the rabbit hole of compiling and running 
tests with module patching and/or moving tests to a separate module and 
reorganizing their internals so that there are no split packages. Invasive...
   
   I wonder if the first baby step shouldn't be to move the hppc forked classes 
to .internal. (with proper annotation), but - sadly - export that package until 
we figure out how to compile and run tests in full modular mode.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Avoid SegmentTermsEnumFrame reload block. [lucene]

2024-05-27 Thread via GitHub


vsop-479 commented on code in PR #13253:
URL: https://github.com/apache/lucene/pull/13253#discussion_r1615706277


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java:
##
@@ -434,8 +436,29 @@ public boolean seekExact(BytesRef target) throws 
IOException {
 //   System.out.println("  target is before current (shares 
prefixLen=" + targetUpto + ");
 // rewind frame ord=" + lastFrame.ord);
 // }
+
+// We got lastFrame by comparing target and term, and target less than 
last seeked term in
+// currentFrame. If lastFrame's fp is same with currentFrame's fp, we 
can reduce entCount to
+// nextEnt.
+boolean currentIsLast = currentFrame.fp == lastFrame.fp;
 currentFrame = lastFrame;
-currentFrame.rewind();
+
+// Only rewindWithoutReload for non-floor block or first floor block.
+// TODO: We need currentFrame's first entry to judge whether we can 
rewindWithoutReload for
+// non-first floor blocks.

Review Comment:
   > Let's leave it as a future TODO?
   
   OK.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use SPI instead of Enum for VectorSimilarityFunctions [lucene]

2024-05-27 Thread via GitHub


ChrisHegarty commented on PR #13401:
URL: https://github.com/apache/lucene/pull/13401#issuecomment-2132962554

   > The main motivation behind this change was to get rid of ENUM 
implementation which is tightly coupled to field-info. This has caused 
inconvenience in deprecating the COSINE function from the list. Not sure what's 
the best approach then. Will take a look at your PR and comments to understand 
more about this.
   
   This is not a very compelling reason supporting the proposed change. I agree 
with @benwtrent, there are several challenges that would need to be ironed out 
before we could consider moving this forward.
   
   The recent addition of FlatVectorsScorer has certainly improved the 
extensibility in this area. For now at least, it appears to offer what is 
needed to plugin in new implementations.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fixes failing test case for TestOrdinalMap.testRamBytesUsed [lucene]

2024-05-27 Thread via GitHub


pseudo-nymous commented on code in PR #13421:
URL: https://github.com/apache/lucene/pull/13421#discussion_r1615714149


##
lucene/core/src/test/org/apache/lucene/index/TestOrdinalMap.java:
##
@@ -52,7 +53,9 @@ public long accumulateObject(
 long shallowSize,
 java.util.Map fieldValues,
 java.util.Collection queue) {
-  if (o == LongValues.ZEROES || o == LongValues.IDENTITY) {
+  if (o == LongValues.ZEROES
+  || o == LongValues.IDENTITY
+  || o == 
PackedInts.NullReader.DEFAULT_PACKED_LONG_VALUES_PAGE_SIZE) {

Review Comment:
   Sure. Thanks for quick review.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]

2024-05-27 Thread via GitHub


bruno-roustant commented on PR #13422:
URL: https://github.com/apache/lucene/pull/13422#issuecomment-2132983280

   I agree with the baby step, and this PR is already massive.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Avoid SegmentTermsEnumFrame reload block. [lucene]

2024-05-27 Thread via GitHub


vsop-479 commented on code in PR #13253:
URL: https://github.com/apache/lucene/pull/13253#discussion_r1615722587


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java:
##
@@ -434,8 +436,29 @@ public boolean seekExact(BytesRef target) throws 
IOException {
 //   System.out.println("  target is before current (shares 
prefixLen=" + targetUpto + ");
 // rewind frame ord=" + lastFrame.ord);
 // }
+
+// We got lastFrame by comparing target and term, and target less than 
last seeked term in
+// currentFrame. If lastFrame's fp is same with currentFrame's fp, we 
can reduce entCount to
+// nextEnt.
+boolean currentIsLast = currentFrame.fp == lastFrame.fp;
 currentFrame = lastFrame;
-currentFrame.rewind();
+
+// Only rewindWithoutReload for non-floor block or first floor block.
+// TODO: We need currentFrame's first entry to judge whether we can 
rewindWithoutReload for
+// non-first floor blocks.
+if (currentFrame.fp != currentFrame.fpOrig

Review Comment:
   > Maybe currentFrame.nextEnt == -1 is enough for both unloaded block, I will 
confirm that.
   
   We may meet a block `nextEnt` is -1 and `entCount` is not 0( set by 
`SegmentTermsEnumFrame.scanToFloorFrame`), but we should not meet a block 
`entCount` is 0 and `nextEnt` is not -1. 
   
   So `currentFrame.nextEnt == -1` is enough for unloaded block.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]

2024-05-27 Thread via GitHub


dweiss commented on PR #13422:
URL: https://github.com/apache/lucene/pull/13422#issuecomment-2133003069

   I think the only thing missing is perhaps a changes entry and migration 
entry warning people that the dependency is gone and that the currently exposed 
internal package will be gone without a warning in the future (if we want to 
apply it to 9x as well).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]

2024-05-27 Thread via GitHub


dweiss commented on PR #13422:
URL: https://github.com/apache/lucene/pull/13422#issuecomment-2133074149

   I can't remember what the problem was, to be honest. Will have to return to 
it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]

2024-05-27 Thread via GitHub


bruno-roustant commented on PR #13422:
URL: https://github.com/apache/lucene/pull/13422#issuecomment-2133113969

   Thank you Dawid!
   I can work on the backport to 9x.
   
   How long should we wait for this PR before merging it?
   Given it is large, it has higher probabilities of conflicts, in main and 9x, 
so I would prefer to merge it quickly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]

2024-05-27 Thread via GitHub


dweiss commented on PR #13422:
URL: https://github.com/apache/lucene/pull/13422#issuecomment-2133132538

   I'd merge it right away, I don't see the point in waiting. If there are 
conflicts - they should be easy to resolve.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]

2024-05-27 Thread via GitHub


bruno-roustant merged PR #13422:
URL: https://github.com/apache/lucene/pull/13422


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]

2024-05-27 Thread via GitHub


uschindler commented on PR #13422:
URL: https://github.com/apache/lucene/pull/13422#issuecomment-2133150944

   Go ahead!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]

2024-05-27 Thread via GitHub


bruno-roustant commented on PR #13422:
URL: https://github.com/apache/lucene/pull/13422#issuecomment-2133194059

   Backport to branch_9x complete!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-05-27 Thread via GitHub


RS146BIJAY commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2133354180

   > I agree that better organizing data across segments yields significant 
benefits, I'm only advocating for doing this by maintaining a separate 
IndexWriter for each group instead of doing it inside of DocumentsWriter.
   
   Sorry missed answering this part in my earlier response. We did explore this 
approach of creating an IndexWriter/Lucene Index (or OpenSearch shard) for each 
group. However, implementing this approach would require significant changes on 
the client (OpenSearch) side. Besides, having multiple additional shards will 
also lead to huge overhead due to task like metadata management. On the other 
hand, maintaining separate DWPT pools for different groups would require 
minimal changes inside Lucene. The overhead will be lesser here as Lucene shard 
will still be maintained as a single physical unit.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Add prefetching support to stored fields. [lucene]

2024-05-27 Thread via GitHub


jpountz opened a new pull request, #13424:
URL: https://github.com/apache/lucene/pull/13424

   This adds `StoredFields#prefetch(int)`, which mostly delegates to 
`IndexInput#prefetch`. Callers can take advantage of this API to parallelize 
I/O across multiple stored documents by first calling `StoredFields#prefetch` 
on all doc IDs before calling `StoredFields#document` on all doc IDs.
   
   I added a cache of recently prefetched blocks to the default codec, in order 
to avoid prefetching the same block multiple times in a short period of time. 
This felt sensible given that doc ID reordering via recursive graph bisection 
or index sorting are likely to result in search results being clustered.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Rewrite newSlowRangeQuery to MatchNoDocsQuery when upper > lower [lucene]

2024-05-27 Thread via GitHub


ioanatia opened a new pull request, #13425:
URL: https://github.com/apache/lucene/pull/13425

   We already have a check in place that rewrites 
`SortedNumericDocValuesRangeQuery` to a `FieldExistsQuery` when the lookup 
range interval is `[ Long.MIN_VALUE, Long.MAX_VALUE]`.
   
   This PR adds another check to see whether the upper bound is smaller than 
the lower bound.
   If it is not, we can confidently rewrite `SortedNumericDocValuesRangeQuery` 
to a `MatchNoDocsQuery`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add prefetching support to stored fields. [lucene]

2024-05-27 Thread via GitHub


jpountz commented on PR #13424:
URL: https://github.com/apache/lucene/pull/13424#issuecomment-2133420215

   Like for previous changes, I wrote a synthetic benchmark to make sure that 
this new API actually helps.
   
   
   This benchmark simulates fetching 20 random stored documents in 
parallel. The index it creates is 39GB while my page cache only has a capacity 
of 25GB.
   
   ```java
   import java.io.IOException;
   import java.io.UncheckedIOException;
   import java.nio.file.Path;
   import java.nio.file.Paths;
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Random;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.ThreadLocalRandom;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.atomic.AtomicLong;
   
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.StoredField;
   import org.apache.lucene.index.DirectoryReader;
   import org.apache.lucene.index.FilterMergePolicy;
   import org.apache.lucene.index.IndexReader;
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.index.IndexWriterConfig;
   import org.apache.lucene.index.MergePolicy;
   import org.apache.lucene.index.SegmentCommitInfo;
   import org.apache.lucene.index.SegmentInfos;
   import org.apache.lucene.index.StoredFields;
   import org.apache.lucene.index.TieredMergePolicy;
   import org.apache.lucene.store.Directory;
   import org.apache.lucene.store.FSDirectory;
   
   public class StoredFieldsPrefetchBench {
   
 public static int DUMMY;
   
 public static void main(String[] args) throws Exception {
   Path dirPath = Paths.get(args[0]);
   Directory dir = FSDirectory.open(dirPath);
   if (DirectoryReader.indexExists(dir) == false) {
 MergePolicy mergePolicy = new FilterMergePolicy(new 
TieredMergePolicy()) {
   @Override
   public boolean useCompoundFile(SegmentInfos infos, SegmentCommitInfo 
mergedInfo, MergeContext mergeContext)
   throws IOException {
 return false;
   }
 };
 try (IndexWriter w = new IndexWriter(dir, new 
IndexWriterConfig().setUseCompoundFile(false).setMergePolicy(mergePolicy))) {
   ExecutorService executor = Executors.newFixedThreadPool(4);
   AtomicLong indexed = new AtomicLong(0);
   for (int task = 0; task < 1000; ++task) {
 executor.execute(() -> {
   Random r = ThreadLocalRandom.current();
   for (int i = 0; i < 40_000; ++i) {
 Document doc = new Document();
 byte[] bytes = new byte[1024];
 r.nextBytes(bytes);
 doc.add(new StoredField("content", bytes));
 try {
   w.addDocument(doc);
 } catch (IOException e) {
   throw new UncheckedIOException(e);
 }
 final long actualIndexed = indexed.incrementAndGet();
 if (actualIndexed % 1_000_000 == 0) {
   System.out.println("Indexed: " + actualIndexed);
   try {
 DirectoryReader.open(w).close();
   } catch (IOException e) {
 throw new UncheckedIOException(e);
   }
 }
   }
 });
   }
   
   executor.shutdown();
   executor.awaitTermination(1, TimeUnit.DAYS);
   w.commit();
   System.out.println("Done indexing");
 }
   }
   List latencies = new ArrayList<>();
   try (IndexReader reader = DirectoryReader.open(dir)) {
   
 Random r = ThreadLocalRandom.current();
 for (int i = 0; i < 10_000; ++i) {
   StoredFields storedFields = reader.storedFields();
   
   long start = System.nanoTime();
   
   int[] ids = new int[20];
   for (int j = 0; j < ids.length; ++j) {
 ids[j] = r.nextInt(reader.maxDoc());
   }
   for (int doc : ids) {
 storedFields.prefetch(doc);
   }
   for (int doc : ids) {
 DUMMY += 
storedFields.document(doc).getBinaryValue("content").hashCode();
   }
   
   long end = System.nanoTime();
   latencies.add((end - start) / 1000);
 }
   }
   latencies.sort(null);
   System.out.println("P50: " + latencies.get(latencies.size() / 2));
   System.out.println("P90: " + latencies.get(latencies.size() * 9 / 10));
   System.out.println("P99: " + latencies.get(latencies.size() * 99 / 100));
 }
   
   }
   ```
   
   
   Before the change:
   P50: 2942
   P90: 3900
   P99: 4726
   
   After the change:
   P50: 650
   P90: 801
   P99: 970


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


[I] Examine assertion in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]

2024-05-27 Thread via GitHub


ChrisHegarty opened a new issue, #13426:
URL: https://github.com/apache/lucene/issues/13426

   A recent change, #13406 , added an assertion that may be incorrect.
   
   The assertion asserts that the number of entries matches that of the number 
of inputs processed. This may not be the case then a duplicate entry is passed 
in. For example,
   
   
   Add a duplicate entry:
   ```
   $ git diff
   diff --git 
a/lucene/analysis/nori/src/test/org/apache/lucene/analysis/ko/userdict.txt 
b/lucene/analysis/nori/src/test/org/apache/lucene/analysis/ko/userdict.txt
   index 045b64eaa07..4513885a36b 100644
   --- 
a/lucene/analysis/nori/src/test/org/apache/lucene/analysis/ko/userdict.txt
   +++ 
b/lucene/analysis/nori/src/test/org/apache/lucene/analysis/ko/userdict.txt
   @@ -5,6 +5,7 @@ C샤프
세종시 세종 시
대한민국날씨
대한민국
   +대한민국
날씨
21세기대한민국
세기
   \ No newline at end of file
   ```
   
   ```
   $ ./gradlew :lucene:analysis:nori:test --tests 
"org.apache.lucene.analysis.ko.TestKoreanTokenizer"
   ```
   
   ```
   reproduce with: gradlew test --tests 
TestKoreanTokenizer.testPartOfSpeechsWithCompound -Dtests.seed=6445594235429961 
-Dtests.locale=bs-BA -Dtests.timezone=Atlantic/Faeroe -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
  > java.lang.AssertionError
  > at 
__randomizedtesting.SeedInfo.seed([6445594235429961:9CEA448B4AB5C1F2]:0)
  > at 
org.apache.lucene.analysis.ko.dict.UserDictionary.(UserDictionary.java:137)
  > at 
org.apache.lucene.analysis.ko.dict.UserDictionary.open(UserDictionary.java:69)
  > at 
org.apache.lucene.analysis.ko.TestKoreanTokenizer.readDict(TestKoreanTokenizer.java:51)
  > at 
org.apache.lucene.analysis.ko.TestKoreanTokenizer.setUp(TestKoreanTokenizer.java:63)
  > at 
java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  > at java.base/java.lang.reflect.Method.invoke(Method.java:580)
  > at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  > at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:980)
  > at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  > at 
org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  > at 
org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  > at 
org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  > at 
org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  > at 
org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  > at 
junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
  > at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  > at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  > at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  > at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  > at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  > at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  > at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  > at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  > at 
org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   ...
   ```
   
   I encountered this assertion firing when testing a snapshot of the Lucene 
branch with Elasticsearch. The 
testNoriAnalyzerDuplicateUserDictRuleWithLegacyVersion test fails (hits the 
assertion), see 
https://github.com/elastic/elasticsearch/blob/main/plugins/analysis-nori/src/test/java/org/elasti

Re: [I] Examine assertion in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]

2024-05-27 Thread via GitHub


ChrisHegarty commented on issue #13426:
URL: https://github.com/apache/lucene/issues/13426#issuecomment-2133707953

   @bruno-roustant Unless I'm mistaken, this issue relates to your change in 
#13406. I think that the assertion is just incorrect?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Examine assertion in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]

2024-05-27 Thread via GitHub


bruno-roustant commented on issue #13426:
URL: https://github.com/apache/lucene/issues/13426#issuecomment-2133732763

   Thanks for the heads up. I look at that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove incorrect assertion in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]

2024-05-27 Thread via GitHub


ChrisHegarty commented on PR #13427:
URL: https://github.com/apache/lucene/pull/13427#issuecomment-2133741247

   Unless I'm mistaken, this recently added assertion is incorrect, so should 
simply be removed. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove incorrect assertion in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]

2024-05-27 Thread via GitHub


bruno-roustant commented on code in PR #13427:
URL: https://github.com/apache/lucene/pull/13427#discussion_r1616229789


##
lucene/analysis/nori/src/java/org/apache/lucene/analysis/ko/dict/UserDictionary.java:
##
@@ -134,7 +134,6 @@ private UserDictionary(List entries) throws 
IOException {
 this.fst =
 new TokenInfoFST(FST.fromFSTReader(fstCompiler.compile(), 
fstCompiler.getFSTReader()));
 int[][] segmentations = _segmentations.toArray(new 
int[_segmentations.size()][]);
-assert entryIndex == rightIds.length;

Review Comment:
   The assertion was incorrect, but it detected a problem that needs a code fix.
   Can you replace the assertion by
   ```
   if (entryIndex < rightIds.length) {
 rightIds = ArrayUtil.copyOfSubArray(rightIds, 0, entryIndex);
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove incorrect assertion in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]

2024-05-27 Thread via GitHub


ChrisHegarty commented on code in PR #13427:
URL: https://github.com/apache/lucene/pull/13427#discussion_r1616231384


##
lucene/analysis/nori/src/java/org/apache/lucene/analysis/ko/dict/UserDictionary.java:
##
@@ -134,7 +134,6 @@ private UserDictionary(List entries) throws 
IOException {
 this.fst =
 new TokenInfoFST(FST.fromFSTReader(fstCompiler.compile(), 
fstCompiler.getFSTReader()));
 int[][] segmentations = _segmentations.toArray(new 
int[_segmentations.size()][]);
-assert entryIndex == rightIds.length;

Review Comment:
   ha! snap! I just realised same!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove incorrect assertion in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]

2024-05-27 Thread via GitHub


bruno-roustant commented on PR #13427:
URL: https://github.com/apache/lucene/pull/13427#issuecomment-2133778754

   The recent code change replaced the previous List by a primitive 
structure.
   To avoid yet another hppc primitive list in the Lucene fork, this time I 
replaced by a basic short array. But I missed the duplicate case, and it was 
not tested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove incorrect assertion in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]

2024-05-27 Thread via GitHub


bruno-roustant commented on PR #13427:
URL: https://github.com/apache/lucene/pull/13427#issuecomment-2133781188

   I'm glad to have added this assertion :) It caught a misunderstanding.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix duplicate values in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]

2024-05-27 Thread via GitHub


uschindler commented on PR #13427:
URL: https://github.com/apache/lucene/pull/13427#issuecomment-2133787102

   Do we have the same issue in kuromoji's UserDictionary? I just want to make 
sure about this because both nori and kuromoji use similar code.
   
   In @bruno-roustant 's PR where he changed this, he only touched Nor, but 
just to be sure.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix duplicate values in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]

2024-05-27 Thread via GitHub


ChrisHegarty commented on PR #13427:
URL: https://github.com/apache/lucene/pull/13427#issuecomment-2133790710

   org.apache.lucene.analysis.ja.dict.UserDictionary looks ok to me - no issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] MemorySegment scorer should ensure that the values is of the correct type [lucene]

2024-05-27 Thread via GitHub


ChrisHegarty merged PR #13423:
URL: https://github.com/apache/lucene/pull/13423


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix duplicate values in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]

2024-05-27 Thread via GitHub


bruno-roustant commented on PR #13427:
URL: https://github.com/apache/lucene/pull/13427#issuecomment-2133794952

   The Japanese UserDictionary is quite different, and I didn't touch it. 
Although I plan later to replace the TreeMap of Integer if possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix duplicate values in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]

2024-05-27 Thread via GitHub


uschindler commented on code in PR #13427:
URL: https://github.com/apache/lucene/pull/13427#discussion_r1616243372


##
lucene/analysis/nori/src/java/org/apache/lucene/analysis/ko/dict/UserDictionary.java:
##
@@ -134,7 +135,9 @@ private UserDictionary(List entries) throws 
IOException {
 this.fst =
 new TokenInfoFST(FST.fromFSTReader(fstCompiler.compile(), 
fstCompiler.getFSTReader()));
 int[][] segmentations = _segmentations.toArray(new 
int[_segmentations.size()][]);
-assert entryIndex == rightIds.length;
+if (entryIndex < rightIds.length) {

Review Comment:
   I would move that up to behind line 103, because that's where rightIds is 
built and the shorter version should be there.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix duplicate values in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]

2024-05-27 Thread via GitHub


uschindler commented on code in PR #13427:
URL: https://github.com/apache/lucene/pull/13427#discussion_r1616243372


##
lucene/analysis/nori/src/java/org/apache/lucene/analysis/ko/dict/UserDictionary.java:
##
@@ -134,7 +135,9 @@ private UserDictionary(List entries) throws 
IOException {
 this.fst =
 new TokenInfoFST(FST.fromFSTReader(fstCompiler.compile(), 
fstCompiler.getFSTReader()));
 int[][] segmentations = _segmentations.toArray(new 
int[_segmentations.size()][]);
-assert entryIndex == rightIds.length;
+if (entryIndex < rightIds.length) {

Review Comment:
   I would move that up to behind line 103, because that's where rightIds is 
built and the (possibly) shorter array should be allocated there. After that 
all is done with rightIds.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix duplicate values in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]

2024-05-27 Thread via GitHub


ChrisHegarty commented on code in PR #13427:
URL: https://github.com/apache/lucene/pull/13427#discussion_r1616246358


##
lucene/analysis/nori/src/java/org/apache/lucene/analysis/ko/dict/UserDictionary.java:
##
@@ -134,7 +135,9 @@ private UserDictionary(List entries) throws 
IOException {
 this.fst =
 new TokenInfoFST(FST.fromFSTReader(fstCompiler.compile(), 
fstCompiler.getFSTReader()));
 int[][] segmentations = _segmentations.toArray(new 
int[_segmentations.size()][]);
-assert entryIndex == rightIds.length;
+if (entryIndex < rightIds.length) {

Review Comment:
   I moved it up a couple of lines ( which makes sense ), but it cannot go any 
further. We only wanna copy once after iterating the values.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add prefetching support to stored fields. [lucene]

2024-05-27 Thread via GitHub


rmuir commented on code in PR #13424:
URL: https://github.com/apache/lucene/pull/13424#discussion_r1616250356


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsReader.java:
##
@@ -609,6 +622,23 @@ public void skipBytes(long numBytes) throws IOException {
 }
   }
 
+  @Override
+  public void prefetch(int docID) throws IOException {
+final long blockID = indexReader.getBlockID(docID);
+
+for (long prefetchedBlockID : prefetchedBlockIDCache) {
+  if (prefetchedBlockID == blockID) {
+return;
+  }
+}

Review Comment:
   can we use Arrays.binarysearch instead of scanning?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add prefetching support to stored fields. [lucene]

2024-05-27 Thread via GitHub


rmuir commented on code in PR #13424:
URL: https://github.com/apache/lucene/pull/13424#discussion_r1616256454


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsReader.java:
##
@@ -609,6 +622,23 @@ public void skipBytes(long numBytes) throws IOException {
 }
   }
 
+  @Override
+  public void prefetch(int docID) throws IOException {
+final long blockID = indexReader.getBlockID(docID);
+
+for (long prefetchedBlockID : prefetchedBlockIDCache) {
+  if (prefetchedBlockID == blockID) {
+return;
+  }
+}

Review Comment:
   or maybe even simpler, is it enough to just cache the last blockid?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix duplicate values in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]

2024-05-27 Thread via GitHub


uschindler commented on code in PR #13427:
URL: https://github.com/apache/lucene/pull/13427#discussion_r1616257319


##
lucene/analysis/nori/src/java/org/apache/lucene/analysis/ko/dict/UserDictionary.java:
##
@@ -134,7 +135,9 @@ private UserDictionary(List entries) throws 
IOException {
 this.fst =
 new TokenInfoFST(FST.fromFSTReader(fstCompiler.compile(), 
fstCompiler.getFSTReader()));
 int[][] segmentations = _segmentations.toArray(new 
int[_segmentations.size()][]);
-assert entryIndex == rightIds.length;
+if (entryIndex < rightIds.length) {

Review Comment:
   Ah correct. I thought the loop ended after 103.
   
   The original code looked like this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix duplicate values in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]

2024-05-27 Thread via GitHub


ChrisHegarty merged PR #13427:
URL: https://github.com/apache/lucene/pull/13427


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Examine assertion in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]

2024-05-27 Thread via GitHub


ChrisHegarty closed issue #13426: Examine assertion in 
org.apache.lucene.analysis.ko.dict.UserDictionary
URL: https://github.com/apache/lucene/issues/13426


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add support for reloading the SPI for KnnVectorsFormat class [lucene]

2024-05-27 Thread via GitHub


navneet1v commented on PR #13394:
URL: https://github.com/apache/lucene/pull/13394#issuecomment-2133829356

   @ChrisHegarty , @uschindler can merge the code and add the backport label to 
backport the code to branch_9x


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add support for reloading the SPI for KnnVectorsFormat class [lucene]

2024-05-27 Thread via GitHub


uschindler merged PR #13394:
URL: https://github.com/apache/lucene/pull/13394


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add support for reloading the SPI for KnnVectorsFormat class [lucene]

2024-05-27 Thread via GitHub


uschindler commented on PR #13394:
URL: https://github.com/apache/lucene/pull/13394#issuecomment-2133833594

   Done! Thanks, @navneet1v 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Add support for reloading the SPI for KnnVectorsFormat class [lucene]

2024-05-27 Thread via GitHub


uschindler commented on issue #13393:
URL: https://github.com/apache/lucene/issues/13393#issuecomment-2133837367

   Thanks for bringing this up.
   
   Unfortunately we have no way to automatically enforce this for our NamedSPI 
provides (as it is static methods), but we should keep all of them in sync.
   
   At moment there's also missing the getter for list of names. I argued about 
this in the other PR already. The testframework uses expensive extra walk 
though ServiceLoader to do get those. This should be also brought in line with 
the other codec/analysis components. I will open a PR for this. The code in 
test-framework is inconsistent and a duplicates existing logic in a different 
way (the one that randomly chooses a KNN format).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Add support for reloading the SPI for KnnVectorsFormat class [lucene]

2024-05-27 Thread via GitHub


uschindler closed issue #13393: Add support for reloading the SPI for 
KnnVectorsFormat class
URL: https://github.com/apache/lucene/issues/13393


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Remove incorrect/expensive use of ServiceLoader for choosing random format [lucene]

2024-05-27 Thread via GitHub


uschindler opened a new pull request, #13428:
URL: https://github.com/apache/lucene/pull/13428

   This cleans up the code:
   - Remove expensive usage of ServiceLoader which might be inconsistent with 
the NamedSPI code
   - Add a better readable version with comment about the special case 
regarding bit vectors


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Add support for reloading the SPI for KnnVectorsFormat class [lucene]

2024-05-27 Thread via GitHub


uschindler commented on issue #13393:
URL: https://github.com/apache/lucene/issues/13393#issuecomment-2133868504

   see #13428 
   
   This also fixed the concern in other PR with SPI named similarity functions


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove incorrect/expensive use of ServiceLoader for choosing random format [lucene]

2024-05-27 Thread via GitHub


uschindler commented on PR #13428:
URL: https://github.com/apache/lucene/pull/13428#issuecomment-2133868897

   This also make the method static to be in conformance with other 
LuceneTestCase methods


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use SPI instead of Enum for VectorSimilarityFunctions [lucene]

2024-05-27 Thread via GitHub


uschindler commented on code in PR #13401:
URL: https://github.com/apache/lucene/pull/13401#discussion_r1616290305


##
lucene/test-framework/src/java/org/apache/lucene/tests/util/LuceneTestCase.java:
##
@@ -3216,10 +3215,11 @@ public static BytesRef newBytesRef(byte[] bytesIn, int 
offset, int length) {
   }
 
   protected KnnVectorsFormat randomVectorFormat(VectorEncoding vectorEncoding) 
{

Review Comment:
   I fixed this already in https://github.com/apache/lucene/pull/13428



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use SPI instead of Enum for VectorSimilarityFunctions [lucene]

2024-05-27 Thread via GitHub


uschindler commented on code in PR #13401:
URL: https://github.com/apache/lucene/pull/13401#discussion_r1616291449


##
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##
@@ -16,104 +16,88 @@
  */
 package org.apache.lucene.index;
 
-import static org.apache.lucene.util.VectorUtil.cosine;
-import static org.apache.lucene.util.VectorUtil.dotProduct;
-import static org.apache.lucene.util.VectorUtil.dotProductScore;
-import static org.apache.lucene.util.VectorUtil.scaleMaxInnerProductScore;
-import static org.apache.lucene.util.VectorUtil.squareDistance;
+import java.util.Iterator;
+import java.util.List;
+import org.apache.lucene.util.NamedSPILoader;
 
 /**
  * Vector similarity function; used in search to return top K most similar 
vectors to a target
- * vector. This is a label describing the method used during indexing and 
searching of the vectors
- * in order to determine the nearest neighbors.
+ * vector.
  */
-public enum VectorSimilarityFunction {
+public abstract class VectorSimilarityFunction implements 
NamedSPILoader.NamedSPI {
 
-  /** Euclidean distance */
-  EUCLIDEAN {
-@Override
-public float compare(float[] v1, float[] v2) {
-  return 1 / (1 + squareDistance(v1, v2));
-}
-
-@Override
-public float compare(byte[] v1, byte[] v2) {
-  return 1 / (1f + squareDistance(v1, v2));
-}
-  },
+  private static class Holder {
+private static final NamedSPILoader LOADER =
+new NamedSPILoader<>(VectorSimilarityFunction.class);
 
-  /**
-   * Dot product. NOTE: this similarity is intended as an optimized way to 
perform cosine
-   * similarity. In order to use it, all vectors must be normalized, including 
both document and
-   * query vectors. Using dot product with vectors that are not normalized can 
result in errors or
-   * poor search results. Floating point vectors must be normalized to be of 
unit length, while byte
-   * vectors should simply all have the same norm.
-   */
-  DOT_PRODUCT {
-@Override
-public float compare(float[] v1, float[] v2) {
-  return Math.max((1 + dotProduct(v1, v2)) / 2, 0);
+static NamedSPILoader getLoader() {
+  if (LOADER == null) {
+throw new IllegalStateException(
+"You tried to lookup a VectorSimilarityFunction by name before all 
formats could be initialize. "
++ "This likely happens if you call 
VectorSimilarityFunction#forName "
++ "from a VectorSimilarityFunction's ctor.");
+  }
+  return LOADER;
 }
+  }
 
-@Override
-public float compare(byte[] v1, byte[] v2) {
-  return dotProductScore(v1, v2);
-}
-  },
+  /** Holds name of Vector Similarity Function */
+  public final String name;
 
   /**
-   * Cosine similarity. NOTE: the preferred way to perform cosine similarity 
is to normalize all
-   * vectors to unit length, and instead use {@link 
VectorSimilarityFunction#DOT_PRODUCT}. You
-   * should only use this function if you need to preserve the original 
vectors and cannot normalize
-   * them in advance. The similarity score is normalised to assure it is 
positive.
+   * Holds integer value of Vector Similarity Function to be used while 
reading and writing
+   * field-info in the index
*/
-  COSINE {
-@Override
-public float compare(float[] v1, float[] v2) {
-  return Math.max((1 + cosine(v1, v2)) / 2, 0);
-}
+  public final int ordinal;
 
-@Override
-public float compare(byte[] v1, byte[] v2) {
-  return (1 + cosine(v1, v2)) / 2;
-}
-  },
+  /** Construct object with function name and ordinal value */
+  protected VectorSimilarityFunction(String name, int ordinal) {
+NamedSPILoader.checkServiceName(name);
+this.name = name;
+this.ordinal = ordinal;
+  }
 
-  /**
-   * Maximum inner product. This is like {@link 
VectorSimilarityFunction#DOT_PRODUCT}, but does not
-   * require normalization of the inputs. Should be used when the embedding 
vectors store useful
-   * information within the vector magnitude
-   */
-  MAXIMUM_INNER_PRODUCT {
-@Override
-public float compare(float[] v1, float[] v2) {
-  return scaleMaxInnerProductScore(dotProduct(v1, v2));
-}
+  /** Get name of VectorSimilarityFunction used by the object */
+  @Override
+  public String getName() {
+return name;
+  }
 
-@Override
-public float compare(byte[] v1, byte[] v2) {
-  return scaleMaxInnerProductScore(dotProduct(v1, v2));
-}
-  };
+  /** Get ordinal of VectorSimilarityFunction used by the object */
+  public int getOrdinal() {
+return ordinal;
+  }
 
-  /**
-   * Calculates a similarity score between the two vectors with a specified 
function. Higher
-   * similarity scores correspond to closer vectors.
-   *
-   * @param v1 a vector
-   * @param v2 another vector, of the same dimension
-   * @return the value of the similarity function applied to the two vectors
-   */
+  /** Compares two float vector */
   publ

Re: [PR] Use SPI instead of Enum for VectorSimilarityFunctions [lucene]

2024-05-27 Thread via GitHub


uschindler commented on code in PR #13401:
URL: https://github.com/apache/lucene/pull/13401#discussion_r1616292043


##
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##
@@ -16,104 +16,88 @@
  */
 package org.apache.lucene.index;
 
-import static org.apache.lucene.util.VectorUtil.cosine;
-import static org.apache.lucene.util.VectorUtil.dotProduct;
-import static org.apache.lucene.util.VectorUtil.dotProductScore;
-import static org.apache.lucene.util.VectorUtil.scaleMaxInnerProductScore;
-import static org.apache.lucene.util.VectorUtil.squareDistance;
+import java.util.Iterator;
+import java.util.List;
+import org.apache.lucene.util.NamedSPILoader;
 
 /**
  * Vector similarity function; used in search to return top K most similar 
vectors to a target
- * vector. This is a label describing the method used during indexing and 
searching of the vectors
- * in order to determine the nearest neighbors.
+ * vector.
  */
-public enum VectorSimilarityFunction {
+public abstract class VectorSimilarityFunction implements 
NamedSPILoader.NamedSPI {
 
-  /** Euclidean distance */
-  EUCLIDEAN {
-@Override
-public float compare(float[] v1, float[] v2) {
-  return 1 / (1 + squareDistance(v1, v2));
-}
-
-@Override
-public float compare(byte[] v1, byte[] v2) {
-  return 1 / (1f + squareDistance(v1, v2));
-}
-  },
+  private static class Holder {
+private static final NamedSPILoader LOADER =
+new NamedSPILoader<>(VectorSimilarityFunction.class);
 
-  /**
-   * Dot product. NOTE: this similarity is intended as an optimized way to 
perform cosine
-   * similarity. In order to use it, all vectors must be normalized, including 
both document and
-   * query vectors. Using dot product with vectors that are not normalized can 
result in errors or
-   * poor search results. Floating point vectors must be normalized to be of 
unit length, while byte
-   * vectors should simply all have the same norm.
-   */
-  DOT_PRODUCT {
-@Override
-public float compare(float[] v1, float[] v2) {
-  return Math.max((1 + dotProduct(v1, v2)) / 2, 0);
+static NamedSPILoader getLoader() {
+  if (LOADER == null) {
+throw new IllegalStateException(
+"You tried to lookup a VectorSimilarityFunction by name before all 
formats could be initialize. "
++ "This likely happens if you call 
VectorSimilarityFunction#forName "
++ "from a VectorSimilarityFunction's ctor.");
+  }
+  return LOADER;
 }
+  }
 
-@Override
-public float compare(byte[] v1, byte[] v2) {
-  return dotProductScore(v1, v2);
-}
-  },
+  /** Holds name of Vector Similarity Function */
+  public final String name;
 
   /**
-   * Cosine similarity. NOTE: the preferred way to perform cosine similarity 
is to normalize all
-   * vectors to unit length, and instead use {@link 
VectorSimilarityFunction#DOT_PRODUCT}. You
-   * should only use this function if you need to preserve the original 
vectors and cannot normalize
-   * them in advance. The similarity score is normalised to assure it is 
positive.
+   * Holds integer value of Vector Similarity Function to be used while 
reading and writing
+   * field-info in the index
*/
-  COSINE {
-@Override
-public float compare(float[] v1, float[] v2) {
-  return Math.max((1 + cosine(v1, v2)) / 2, 0);
-}
+  public final int ordinal;
 
-@Override
-public float compare(byte[] v1, byte[] v2) {
-  return (1 + cosine(v1, v2)) / 2;
-}
-  },
+  /** Construct object with function name and ordinal value */
+  protected VectorSimilarityFunction(String name, int ordinal) {
+NamedSPILoader.checkServiceName(name);
+this.name = name;
+this.ordinal = ordinal;
+  }
 
-  /**
-   * Maximum inner product. This is like {@link 
VectorSimilarityFunction#DOT_PRODUCT}, but does not
-   * require normalization of the inputs. Should be used when the embedding 
vectors store useful
-   * information within the vector magnitude
-   */
-  MAXIMUM_INNER_PRODUCT {
-@Override
-public float compare(float[] v1, float[] v2) {
-  return scaleMaxInnerProductScore(dotProduct(v1, v2));
-}
+  /** Get name of VectorSimilarityFunction used by the object */
+  @Override
+  public String getName() {
+return name;
+  }
 
-@Override
-public float compare(byte[] v1, byte[] v2) {
-  return scaleMaxInnerProductScore(dotProduct(v1, v2));
-}
-  };
+  /** Get ordinal of VectorSimilarityFunction used by the object */
+  public int getOrdinal() {
+return ordinal;
+  }
 
-  /**
-   * Calculates a similarity score between the two vectors with a specified 
function. Higher
-   * similarity scores correspond to closer vectors.
-   *
-   * @param v1 a vector
-   * @param v2 another vector, of the same dimension
-   * @return the value of the similarity function applied to the two vectors
-   */
+  /** Compares two float vector */
   publ

Re: [PR] Add support for reloading the SPI for KnnVectorsFormat class [lucene]

2024-05-27 Thread via GitHub


navneet1v commented on PR #13394:
URL: https://github.com/apache/lucene/pull/13394#issuecomment-2133879878

   @uschindler do i need to create a separate PR for porting this change to 
branch_9x or there is any automated way to do this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] hunspell: speed up "compress"; minimize the number of the generated entries; don't even consider "forbidden" entries anymore [lucene]

2024-05-27 Thread via GitHub


donnerpeter commented on PR #13429:
URL: https://github.com/apache/lucene/pull/13429#issuecomment-2133893031

   > I admit I've no idea what improvement I'm looking at here but I trust you 
know!
   
   I can add some comments with explanations :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add support for reloading the SPI for KnnVectorsFormat class [lucene]

2024-05-27 Thread via GitHub


uschindler commented on PR #13394:
URL: https://github.com/apache/lucene/pull/13394#issuecomment-2133895659

   thats already merged (see above @ "asfgit pushed").


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] hunspell: speed up "compress"; minimize the number of the generated entries; don't even consider "forbidden" entries anymore [lucene]

2024-05-27 Thread via GitHub


dweiss commented on PR #13429:
URL: https://github.com/apache/lucene/pull/13429#issuecomment-2133896366

   > I can add some comments with explanations :)
   
   No need. I glanced through the code and it doesn't seem to be a backdoor. :) 
You know better what works here anyway.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] hunspell: speed up "compress"; minimize the number of the generated entries; don't even consider "forbidden" entries anymore [lucene]

2024-05-27 Thread via GitHub


uschindler commented on PR #13429:
URL: https://github.com/apache/lucene/pull/13429#issuecomment-2133898015

   @rmuir can you have a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add support for reloading the SPI for KnnVectorsFormat class [lucene]

2024-05-27 Thread via GitHub


navneet1v commented on PR #13394:
URL: https://github.com/apache/lucene/pull/13394#issuecomment-2133913068

   @uschindler thanks. I didn't realize it. Generally I see labels added or 
last time I raised the PR manually for backport. Thanks for confirming this 
code is merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Add support for reloading the SPI for KnnVectorsFormat class [lucene]

2024-05-27 Thread via GitHub


navneet1v commented on issue #13393:
URL: https://github.com/apache/lucene/issues/13393#issuecomment-2133915854

   > At moment there's also missing the getter for list of names.
   
   @uschindler, when raising the PR I saw this and thought is it needed or not. 
To ensure the scope of the PR was minimal I didn't add the getter method. I see 
that you have raised the PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add test for parsing brackets in range queries [lucene]

2024-05-27 Thread via GitHub


dweiss merged PR #13323:
URL: https://github.com/apache/lucene/pull/13323


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add test for parsing brackets in range queries [lucene]

2024-05-27 Thread via GitHub


dweiss commented on PR #13323:
URL: https://github.com/apache/lucene/pull/13323#issuecomment-2133921546

   Apologies for the delay, @benchaplin - I've merged this test into the main 
branch.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Add minimum number of segments to TieredMergePolicy [lucene]

2024-05-27 Thread via GitHub


carlosdelest opened a new pull request, #13430:
URL: https://github.com/apache/lucene/pull/13430

   Closes https://github.com/apache/lucene/issues/12877
   
   Adds a minimum number of segments to `TieredMergePolicy`. This allows to 
ensure that a minimum search concurrency parallelism is achievable using 
inter-segment parallelism.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add prefetching support to stored fields. [lucene]

2024-05-27 Thread via GitHub


jpountz commented on code in PR #13424:
URL: https://github.com/apache/lucene/pull/13424#discussion_r1616375903


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsReader.java:
##
@@ -609,6 +622,23 @@ public void skipBytes(long numBytes) throws IOException {
 }
   }
 
+  @Override
+  public void prefetch(int docID) throws IOException {
+final long blockID = indexReader.getBlockID(docID);
+
+for (long prefetchedBlockID : prefetchedBlockIDCache) {
+  if (prefetchedBlockID == blockID) {
+return;
+  }
+}

Review Comment:
   The cache is currently an unsorted rolling buffer, this is why I'm linearly 
scanning all entries, and also why I'm keeping it small (16 entries).
   
   I hesitated with only caching the last blockID, which would have been 
slightly simpler. To have a good hit ratio though, callers would need to fetch 
stored documents in doc ID order. I'm not sure how common it is today. I could 
switch to only caching the last block ID and add javadocs suggesting that 
applications fetch stored documents in doc ID order, what do you think? Maybe 
even add a helper method that takes a ScoreDoc[] array and returns the 
associated stored documents doing the right thing, ie. sorting doc IDs and 
doing prefetching so that I/O is parallelized?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove incorrect/expensive use of ServiceLoader for choosing random format [lucene]

2024-05-27 Thread via GitHub


uschindler merged PR #13428:
URL: https://github.com/apache/lucene/pull/13428


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove incorrect/expensive use of ServiceLoader for choosing random format [lucene]

2024-05-27 Thread via GitHub


uschindler commented on PR #13428:
URL: https://github.com/apache/lucene/pull/13428#issuecomment-2134102574

   Merged. I also removed the useless extra "uses" clause in test-frameworks 
module-info.java. Missed to do this here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add timeout support to AbstractVectorSimilarityQuery [lucene]

2024-05-27 Thread via GitHub


github-actions[bot] commented on PR #13285:
URL: https://github.com/apache/lucene/pull/13285#issuecomment-2134164108

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reuse BitSet when there are deleted documents in the index instead of creating new BitSet [lucene]

2024-05-27 Thread via GitHub


github-actions[bot] commented on PR #12857:
URL: https://github.com/apache/lucene/pull/12857#issuecomment-2134164315

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org