[PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]
bruno-roustant opened a new pull request, #13422: URL: https://github.com/apache/lucene/pull/13422 - Add the required classes to the hppc fork. - Remove the hppc dependency from the facet, join, spatial modules. - Remove the hppc version from versions.props and versions.lock. It remains to move the hppc fork to oal.internal. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]
uschindler commented on PR #13422: URL: https://github.com/apache/lucene/pull/13422#issuecomment-2132806647 Ah I see you mentioned the missing move in the PR description, sorry! So except the `@lucene.internal` javadocs (should be added always, also when in private package), +1 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]
dweiss commented on PR #13422: URL: https://github.com/apache/lucene/pull/13422#issuecomment-2132824844 I'll do it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]
bruno-roustant commented on PR #13422: URL: https://github.com/apache/lucene/pull/13422#issuecomment-2132834604 Ok, Dawid I let you move the fork. I added the @lucene.internal annotations. I like that this hppc integration is cleaner and well defined now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]
bruno-roustant commented on PR #13422: URL: https://github.com/apache/lucene/pull/13422#issuecomment-2132844966 Do you want to move in this PR, or in another PR once this one is merged? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move bulkScorer() from Weight to ScorerSupplier [lucene]
jpountz merged PR #13408: URL: https://github.com/apache/lucene/pull/13408 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]
dweiss commented on PR #13422: URL: https://github.com/apache/lucene/pull/13422#issuecomment-2132876361 I'll push to this PR, if you don't mind? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Avoid SegmentTermsEnumFrame reload block. [lucene]
vsop-479 commented on code in PR #13253: URL: https://github.com/apache/lucene/pull/13253#discussion_r1615645890 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java: ## @@ -287,6 +287,68 @@ void rewind() { */ } + // Only rewind, don't force reload block. + // Reset readers' position, don't read, decompress. + // Current term greater than target, reduce endCount. + void rewindWithoutReload() { +// Set nextEnt to 0, to prevent force load. +nextEnt = 0; +suffixesReader.setPosition(0); +suffixLengthsReader.setPosition(0); +statsReader.setPosition(0); +bytesReader.setPosition(0); + +// TODO: Since we only rewind without reload for fist floor(currentFrame.fp == Review Comment: Done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Avoid SegmentTermsEnumFrame reload block. [lucene]
vsop-479 commented on code in PR #13253: URL: https://github.com/apache/lucene/pull/13253#discussion_r1615650665 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java: ## @@ -710,8 +761,28 @@ public SeekStatus seekCeil(BytesRef target) throws IOException { // System.out.println(" target is before current (shares prefixLen=" + targetUpto + "); // rewind frame ord=" + lastFrame.ord); // } + +// We got lastFrame by comparing target and term, and target less than last seeked term in +// currentFrame. If lastFrame's fp is same with currentFrame's fp, we can reduce entCount to +// nextEnt. Review Comment: > We don't seem to reduce entCount anymore? We can reduce entCount to nextEnt if we finally seek the same block. This has been already implemented in: // We still stay on withOutReload frame, reduce entCount to nextEnt. int origEntCount = -1; if (currentFrame.fp == withOutReloadFp && origNextEnt != 0) { origEntCount = currentFrame.entCount; currentFrame.entCount = origNextEnt; } ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java: ## @@ -518,7 +541,20 @@ public boolean seekExact(BytesRef target) throws IOException { currentFrame.loadBlock(); +// We still stay on withOutReload frame, reduce entCount to nextEnt. +int origEntCount = -1; +if (currentFrame.fp == withOutReloadFp && origNextEnt != 0) { + origEntCount = currentFrame.entCount; + currentFrame.entCount = origNextEnt; +} + final SeekStatus result = currentFrame.scanToTerm(target, true); + +// Revert entCount to origEntCount. Review Comment: Done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Avoid SegmentTermsEnumFrame reload block. [lucene]
vsop-479 commented on code in PR #13253: URL: https://github.com/apache/lucene/pull/13253#discussion_r1615651266 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java: ## @@ -573,7 +609,20 @@ public boolean seekExact(BytesRef target) throws IOException { currentFrame.loadBlock(); +// We still stay on withOutReload frame, reduce entCount to nextEnt. +int origEntCount = -1; +if (currentFrame.fp == withOutReloadFp && origNextEnt != 0) { + origEntCount = currentFrame.entCount; + currentFrame.entCount = origNextEnt; +} + final SeekStatus result = currentFrame.scanToTerm(target, true); + +// Revert entCount to origEntCount. Review Comment: Done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Avoid SegmentTermsEnumFrame reload block. [lucene]
vsop-479 commented on code in PR #13253: URL: https://github.com/apache/lucene/pull/13253#discussion_r1615655198 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java: ## @@ -434,8 +436,29 @@ public boolean seekExact(BytesRef target) throws IOException { // System.out.println(" target is before current (shares prefixLen=" + targetUpto + "); // rewind frame ord=" + lastFrame.ord); // } + +// We got lastFrame by comparing target and term, and target less than last seeked term in +// currentFrame. If lastFrame's fp is same with currentFrame's fp, we can reduce entCount to +// nextEnt. +boolean currentIsLast = currentFrame.fp == lastFrame.fp; currentFrame = lastFrame; -currentFrame.rewind(); + +// Only rewindWithoutReload for non-floor block or first floor block. +// TODO: We need currentFrame's first entry to judge whether we can rewindWithoutReload for +// non-first floor blocks. +if (currentFrame.fp != currentFrame.fpOrig Review Comment: > Could you add comment for each of these reasons to still reload the frame? Added. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use `IndexInput#prefetch` for terms dictionary lookups. [lucene]
jpountz commented on PR #13359: URL: https://github.com/apache/lucene/pull/13359#issuecomment-2132902792 Now that #13408 has been merged, I could update the benchmark to simply call IndexSearcher#search. ```java import java.io.IOException; import java.io.UncheckedIOException; import java.nio.file.Path; import java.nio.file.Paths; import java.util.ArrayList; import java.util.List; import java.util.Random; import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; import java.util.concurrent.ThreadLocalRandom; import java.util.concurrent.TimeUnit; import java.util.concurrent.atomic.AtomicLong; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field.Store; import org.apache.lucene.document.StringField; import org.apache.lucene.index.DirectoryReader; import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.IndexWriter; import org.apache.lucene.index.IndexWriterConfig; import org.apache.lucene.index.Term; import org.apache.lucene.index.TieredMergePolicy; import org.apache.lucene.search.BooleanClause.Occur; import org.apache.lucene.search.BooleanQuery; import org.apache.lucene.search.IndexSearcher; import org.apache.lucene.search.Sort; import org.apache.lucene.search.TermQuery; import org.apache.lucene.store.Directory; import org.apache.lucene.store.FSDirectory; public class TermsEnumPrefetchBench { private static final int NUM_TERMS = 3; public static int DUMMY; public static void main(String[] args) throws Exception { Path dirPath = Paths.get(args[0]); Directory dir = FSDirectory.open(dirPath); if (DirectoryReader.indexExists(dir) == false) { TieredMergePolicy mp = new TieredMergePolicy(); mp.setSegmentsPerTier(100); mp.setMaxMergeAtOnce(100); mp.setMaxMergedSegmentMB(1024); try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig() .setMergePolicy(mp) .setRAMBufferSizeMB(1024))) { ExecutorService executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors()); AtomicLong indexed = new AtomicLong(0); for (int task = 0; task < 1000; ++task) { executor.execute(() -> { Random r = ThreadLocalRandom.current(); for (int i = 0; i < 1_000; ++i) { Document doc = new Document(); for (int j = 0; j < 10_000; ++j) { doc.add(new StringField("f", Long.toString(r.nextLong(20_000_000_000L)), Store.NO)); } try { w.addDocument(doc); } catch (IOException e) { throw new UncheckedIOException(e); } final long actualIndexed = indexed.incrementAndGet(); if (actualIndexed % 10_000 == 0) { System.out.println("Indexed: " + actualIndexed); } } }); } executor.shutdown(); executor.awaitTermination(1, TimeUnit.DAYS); w.commit(); System.out.println("Start force merging"); w.forceMerge(1); System.out.println("Done force merging"); w.commit(); } } List latencies = new ArrayList<>(); try (IndexReader reader = DirectoryReader.open(dir)) { IndexSearcher searcher = new IndexSearcher(reader); Random r = ThreadLocalRandom.current(); for (int i = 0; i < 10_000; ++i) { long start = System.nanoTime(); BooleanQuery.Builder query = new BooleanQuery.Builder(); for (int t = 0; t < NUM_TERMS; ++t) { query.add(new TermQuery(new Term("f", Long.toString(r.nextLong(20_000_000_000L, Occur.SHOULD); } DUMMY += searcher.search(query.build(), 1, Sort.INDEXORDER).totalHits.value; long end = System.nanoTime(); latencies.add((end - start) / 1000); } } latencies.sort(null); System.out.println("P50: " + latencies.get(latencies.size() / 2)); System.out.println("P90: " + latencies.get(latencies.size() * 9 / 10)); System.out.println("P99: " + latencies.get(latencies.size() * 99 / 100)); } } ``` Results still look good. Before the change: P50: 282 P90: 387 P99: 537 After the change: P50: 161 P90: 253 P99: 379 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] Remove Scorer#getWeight. [lucene]
jpountz commented on issue #13410: URL: https://github.com/apache/lucene/issues/13410#issuecomment-2132905389 @navneet1v Since a `Scorer` must be created from a `Weight`, callers would be expected to keep track of the `Weight` that created the `Scorer` if they need it instead of relying on Lucene to do this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]
dweiss commented on PR #13422: URL: https://github.com/apache/lucene/pull/13422#issuecomment-2132910636 So compilation works, but tests won't fly with this change. Recall we run tests in "classpath mode" so even though modules compile, for tests they're on classpath and then can't access internal packages from the core. Here's an example: ``` > java.lang.IllegalAccessError: class org.apache.lucene.search.join.DiversifyingNearestChildrenKnnCollector$NodeIdCachingHeap (in unnamed module @0x68472863) cannot access class org.apache.lucene.util.hppc.IntIntHashMap (in module org.apache.lucene.core) because module org.apache.lucene.core does not export org.apache.lucene.util.hppc to unnamed module @0x68472863 > at __randomizedtesting.SeedInfo.seed([9A5633A49FE14BF7:FC682151F0E8CD34]:0) > at org.apache.lucene.search.join.DiversifyingNearestChildrenKnnCollector$NodeIdCachingHeap.(DiversifyingNearestChildrenKnnCollector.java:132) > at org.apache.lucene.search.join.DiversifyingNearestChildrenKnnCollector.(DiversifyingNearestChildrenKnnCollector.java:47) > at org.apache.lucene.search.join.DiversifyingNearestChildrenKnnCollectorManager.newCollector(DiversifyingNearestChildrenKnnCollectorManager.java:68) > at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.search.TimeLimitingKnnCollectorManager.newCollector(TimeLimitingKnnCollectorManager.java:41) ``` Fixing this requires going down the rabbit hole of compiling and running tests with module patching and/or moving tests to a separate module and reorganizing their internals so that there are no split packages. Invasive... I wonder if the first baby step shouldn't be to move the hppc forked classes to .internal. (with proper annotation), but - sadly - export that package until we figure out how to compile and run tests in full modular mode. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Avoid SegmentTermsEnumFrame reload block. [lucene]
vsop-479 commented on code in PR #13253: URL: https://github.com/apache/lucene/pull/13253#discussion_r1615706277 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java: ## @@ -434,8 +436,29 @@ public boolean seekExact(BytesRef target) throws IOException { // System.out.println(" target is before current (shares prefixLen=" + targetUpto + "); // rewind frame ord=" + lastFrame.ord); // } + +// We got lastFrame by comparing target and term, and target less than last seeked term in +// currentFrame. If lastFrame's fp is same with currentFrame's fp, we can reduce entCount to +// nextEnt. +boolean currentIsLast = currentFrame.fp == lastFrame.fp; currentFrame = lastFrame; -currentFrame.rewind(); + +// Only rewindWithoutReload for non-floor block or first floor block. +// TODO: We need currentFrame's first entry to judge whether we can rewindWithoutReload for +// non-first floor blocks. Review Comment: > Let's leave it as a future TODO? OK. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use SPI instead of Enum for VectorSimilarityFunctions [lucene]
ChrisHegarty commented on PR #13401: URL: https://github.com/apache/lucene/pull/13401#issuecomment-2132962554 > The main motivation behind this change was to get rid of ENUM implementation which is tightly coupled to field-info. This has caused inconvenience in deprecating the COSINE function from the list. Not sure what's the best approach then. Will take a look at your PR and comments to understand more about this. This is not a very compelling reason supporting the proposed change. I agree with @benwtrent, there are several challenges that would need to be ironed out before we could consider moving this forward. The recent addition of FlatVectorsScorer has certainly improved the extensibility in this area. For now at least, it appears to offer what is needed to plugin in new implementations. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fixes failing test case for TestOrdinalMap.testRamBytesUsed [lucene]
pseudo-nymous commented on code in PR #13421: URL: https://github.com/apache/lucene/pull/13421#discussion_r1615714149 ## lucene/core/src/test/org/apache/lucene/index/TestOrdinalMap.java: ## @@ -52,7 +53,9 @@ public long accumulateObject( long shallowSize, java.util.Map fieldValues, java.util.Collection queue) { - if (o == LongValues.ZEROES || o == LongValues.IDENTITY) { + if (o == LongValues.ZEROES + || o == LongValues.IDENTITY + || o == PackedInts.NullReader.DEFAULT_PACKED_LONG_VALUES_PAGE_SIZE) { Review Comment: Sure. Thanks for quick review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]
bruno-roustant commented on PR #13422: URL: https://github.com/apache/lucene/pull/13422#issuecomment-2132983280 I agree with the baby step, and this PR is already massive. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Avoid SegmentTermsEnumFrame reload block. [lucene]
vsop-479 commented on code in PR #13253: URL: https://github.com/apache/lucene/pull/13253#discussion_r1615722587 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java: ## @@ -434,8 +436,29 @@ public boolean seekExact(BytesRef target) throws IOException { // System.out.println(" target is before current (shares prefixLen=" + targetUpto + "); // rewind frame ord=" + lastFrame.ord); // } + +// We got lastFrame by comparing target and term, and target less than last seeked term in +// currentFrame. If lastFrame's fp is same with currentFrame's fp, we can reduce entCount to +// nextEnt. +boolean currentIsLast = currentFrame.fp == lastFrame.fp; currentFrame = lastFrame; -currentFrame.rewind(); + +// Only rewindWithoutReload for non-floor block or first floor block. +// TODO: We need currentFrame's first entry to judge whether we can rewindWithoutReload for +// non-first floor blocks. +if (currentFrame.fp != currentFrame.fpOrig Review Comment: > Maybe currentFrame.nextEnt == -1 is enough for both unloaded block, I will confirm that. We may meet a block `nextEnt` is -1 and `entCount` is not 0( set by `SegmentTermsEnumFrame.scanToFloorFrame`), but we should not meet a block `entCount` is 0 and `nextEnt` is not -1. So `currentFrame.nextEnt == -1` is enough for unloaded block. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]
dweiss commented on PR #13422: URL: https://github.com/apache/lucene/pull/13422#issuecomment-2133003069 I think the only thing missing is perhaps a changes entry and migration entry warning people that the dependency is gone and that the currently exposed internal package will be gone without a warning in the future (if we want to apply it to 9x as well). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]
dweiss commented on PR #13422: URL: https://github.com/apache/lucene/pull/13422#issuecomment-2133074149 I can't remember what the problem was, to be honest. Will have to return to it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]
bruno-roustant commented on PR #13422: URL: https://github.com/apache/lucene/pull/13422#issuecomment-2133113969 Thank you Dawid! I can work on the backport to 9x. How long should we wait for this PR before merging it? Given it is large, it has higher probabilities of conflicts, in main and 9x, so I would prefer to merge it quickly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]
dweiss commented on PR #13422: URL: https://github.com/apache/lucene/pull/13422#issuecomment-2133132538 I'd merge it right away, I don't see the point in waiting. If there are conflicts - they should be easy to resolve. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]
bruno-roustant merged PR #13422: URL: https://github.com/apache/lucene/pull/13422 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]
uschindler commented on PR #13422: URL: https://github.com/apache/lucene/pull/13422#issuecomment-2133150944 Go ahead! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove the HPPC dependency from all modules and add the required classes to the hppc fork. [lucene]
bruno-roustant commented on PR #13422: URL: https://github.com/apache/lucene/pull/13422#issuecomment-2133194059 Backport to branch_9x complete! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]
RS146BIJAY commented on issue #13387: URL: https://github.com/apache/lucene/issues/13387#issuecomment-2133354180 > I agree that better organizing data across segments yields significant benefits, I'm only advocating for doing this by maintaining a separate IndexWriter for each group instead of doing it inside of DocumentsWriter. Sorry missed answering this part in my earlier response. We did explore this approach of creating an IndexWriter/Lucene Index (or OpenSearch shard) for each group. However, implementing this approach would require significant changes on the client (OpenSearch) side. Besides, having multiple additional shards will also lead to huge overhead due to task like metadata management. On the other hand, maintaining separate DWPT pools for different groups would require minimal changes inside Lucene. The overhead will be lesser here as Lucene shard will still be maintained as a single physical unit. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Add prefetching support to stored fields. [lucene]
jpountz opened a new pull request, #13424: URL: https://github.com/apache/lucene/pull/13424 This adds `StoredFields#prefetch(int)`, which mostly delegates to `IndexInput#prefetch`. Callers can take advantage of this API to parallelize I/O across multiple stored documents by first calling `StoredFields#prefetch` on all doc IDs before calling `StoredFields#document` on all doc IDs. I added a cache of recently prefetched blocks to the default codec, in order to avoid prefetching the same block multiple times in a short period of time. This felt sensible given that doc ID reordering via recursive graph bisection or index sorting are likely to result in search results being clustered. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Rewrite newSlowRangeQuery to MatchNoDocsQuery when upper > lower [lucene]
ioanatia opened a new pull request, #13425: URL: https://github.com/apache/lucene/pull/13425 We already have a check in place that rewrites `SortedNumericDocValuesRangeQuery` to a `FieldExistsQuery` when the lookup range interval is `[ Long.MIN_VALUE, Long.MAX_VALUE]`. This PR adds another check to see whether the upper bound is smaller than the lower bound. If it is not, we can confidently rewrite `SortedNumericDocValuesRangeQuery` to a `MatchNoDocsQuery`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add prefetching support to stored fields. [lucene]
jpountz commented on PR #13424: URL: https://github.com/apache/lucene/pull/13424#issuecomment-2133420215 Like for previous changes, I wrote a synthetic benchmark to make sure that this new API actually helps. This benchmark simulates fetching 20 random stored documents in parallel. The index it creates is 39GB while my page cache only has a capacity of 25GB. ```java import java.io.IOException; import java.io.UncheckedIOException; import java.nio.file.Path; import java.nio.file.Paths; import java.util.ArrayList; import java.util.List; import java.util.Random; import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; import java.util.concurrent.ThreadLocalRandom; import java.util.concurrent.TimeUnit; import java.util.concurrent.atomic.AtomicLong; import org.apache.lucene.document.Document; import org.apache.lucene.document.StoredField; import org.apache.lucene.index.DirectoryReader; import org.apache.lucene.index.FilterMergePolicy; import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.IndexWriter; import org.apache.lucene.index.IndexWriterConfig; import org.apache.lucene.index.MergePolicy; import org.apache.lucene.index.SegmentCommitInfo; import org.apache.lucene.index.SegmentInfos; import org.apache.lucene.index.StoredFields; import org.apache.lucene.index.TieredMergePolicy; import org.apache.lucene.store.Directory; import org.apache.lucene.store.FSDirectory; public class StoredFieldsPrefetchBench { public static int DUMMY; public static void main(String[] args) throws Exception { Path dirPath = Paths.get(args[0]); Directory dir = FSDirectory.open(dirPath); if (DirectoryReader.indexExists(dir) == false) { MergePolicy mergePolicy = new FilterMergePolicy(new TieredMergePolicy()) { @Override public boolean useCompoundFile(SegmentInfos infos, SegmentCommitInfo mergedInfo, MergeContext mergeContext) throws IOException { return false; } }; try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig().setUseCompoundFile(false).setMergePolicy(mergePolicy))) { ExecutorService executor = Executors.newFixedThreadPool(4); AtomicLong indexed = new AtomicLong(0); for (int task = 0; task < 1000; ++task) { executor.execute(() -> { Random r = ThreadLocalRandom.current(); for (int i = 0; i < 40_000; ++i) { Document doc = new Document(); byte[] bytes = new byte[1024]; r.nextBytes(bytes); doc.add(new StoredField("content", bytes)); try { w.addDocument(doc); } catch (IOException e) { throw new UncheckedIOException(e); } final long actualIndexed = indexed.incrementAndGet(); if (actualIndexed % 1_000_000 == 0) { System.out.println("Indexed: " + actualIndexed); try { DirectoryReader.open(w).close(); } catch (IOException e) { throw new UncheckedIOException(e); } } } }); } executor.shutdown(); executor.awaitTermination(1, TimeUnit.DAYS); w.commit(); System.out.println("Done indexing"); } } List latencies = new ArrayList<>(); try (IndexReader reader = DirectoryReader.open(dir)) { Random r = ThreadLocalRandom.current(); for (int i = 0; i < 10_000; ++i) { StoredFields storedFields = reader.storedFields(); long start = System.nanoTime(); int[] ids = new int[20]; for (int j = 0; j < ids.length; ++j) { ids[j] = r.nextInt(reader.maxDoc()); } for (int doc : ids) { storedFields.prefetch(doc); } for (int doc : ids) { DUMMY += storedFields.document(doc).getBinaryValue("content").hashCode(); } long end = System.nanoTime(); latencies.add((end - start) / 1000); } } latencies.sort(null); System.out.println("P50: " + latencies.get(latencies.size() / 2)); System.out.println("P90: " + latencies.get(latencies.size() * 9 / 10)); System.out.println("P99: " + latencies.get(latencies.size() * 99 / 100)); } } ``` Before the change: P50: 2942 P90: 3900 P99: 4726 After the change: P50: 650 P90: 801 P99: 970 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
[I] Examine assertion in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]
ChrisHegarty opened a new issue, #13426: URL: https://github.com/apache/lucene/issues/13426 A recent change, #13406 , added an assertion that may be incorrect. The assertion asserts that the number of entries matches that of the number of inputs processed. This may not be the case then a duplicate entry is passed in. For example, Add a duplicate entry: ``` $ git diff diff --git a/lucene/analysis/nori/src/test/org/apache/lucene/analysis/ko/userdict.txt b/lucene/analysis/nori/src/test/org/apache/lucene/analysis/ko/userdict.txt index 045b64eaa07..4513885a36b 100644 --- a/lucene/analysis/nori/src/test/org/apache/lucene/analysis/ko/userdict.txt +++ b/lucene/analysis/nori/src/test/org/apache/lucene/analysis/ko/userdict.txt @@ -5,6 +5,7 @@ C샤프 세종시 세종 시 대한민국날씨 대한민국 +대한민국 날씨 21세기대한민국 세기 \ No newline at end of file ``` ``` $ ./gradlew :lucene:analysis:nori:test --tests "org.apache.lucene.analysis.ko.TestKoreanTokenizer" ``` ``` reproduce with: gradlew test --tests TestKoreanTokenizer.testPartOfSpeechsWithCompound -Dtests.seed=6445594235429961 -Dtests.locale=bs-BA -Dtests.timezone=Atlantic/Faeroe -Dtests.asserts=true -Dtests.file.encoding=UTF-8 > java.lang.AssertionError > at __randomizedtesting.SeedInfo.seed([6445594235429961:9CEA448B4AB5C1F2]:0) > at org.apache.lucene.analysis.ko.dict.UserDictionary.(UserDictionary.java:137) > at org.apache.lucene.analysis.ko.dict.UserDictionary.open(UserDictionary.java:69) > at org.apache.lucene.analysis.ko.TestKoreanTokenizer.readDict(TestKoreanTokenizer.java:51) > at org.apache.lucene.analysis.ko.TestKoreanTokenizer.setUp(TestKoreanTokenizer.java:63) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) > at java.base/java.lang.reflect.Method.invoke(Method.java:580) > at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758) > at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:980) > at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996) > at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48) > at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43) > at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45) > at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60) > at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44) > at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20) > at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390) > at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843) > at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490) > at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955) > at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840) > at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891) > at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902) > at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43) ... ``` I encountered this assertion firing when testing a snapshot of the Lucene branch with Elasticsearch. The testNoriAnalyzerDuplicateUserDictRuleWithLegacyVersion test fails (hits the assertion), see https://github.com/elastic/elasticsearch/blob/main/plugins/analysis-nori/src/test/java/org/elasti
Re: [I] Examine assertion in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]
ChrisHegarty commented on issue #13426: URL: https://github.com/apache/lucene/issues/13426#issuecomment-2133707953 @bruno-roustant Unless I'm mistaken, this issue relates to your change in #13406. I think that the assertion is just incorrect? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Examine assertion in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]
bruno-roustant commented on issue #13426: URL: https://github.com/apache/lucene/issues/13426#issuecomment-2133732763 Thanks for the heads up. I look at that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove incorrect assertion in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]
ChrisHegarty commented on PR #13427: URL: https://github.com/apache/lucene/pull/13427#issuecomment-2133741247 Unless I'm mistaken, this recently added assertion is incorrect, so should simply be removed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove incorrect assertion in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]
bruno-roustant commented on code in PR #13427: URL: https://github.com/apache/lucene/pull/13427#discussion_r1616229789 ## lucene/analysis/nori/src/java/org/apache/lucene/analysis/ko/dict/UserDictionary.java: ## @@ -134,7 +134,6 @@ private UserDictionary(List entries) throws IOException { this.fst = new TokenInfoFST(FST.fromFSTReader(fstCompiler.compile(), fstCompiler.getFSTReader())); int[][] segmentations = _segmentations.toArray(new int[_segmentations.size()][]); -assert entryIndex == rightIds.length; Review Comment: The assertion was incorrect, but it detected a problem that needs a code fix. Can you replace the assertion by ``` if (entryIndex < rightIds.length) { rightIds = ArrayUtil.copyOfSubArray(rightIds, 0, entryIndex); } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove incorrect assertion in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]
ChrisHegarty commented on code in PR #13427: URL: https://github.com/apache/lucene/pull/13427#discussion_r1616231384 ## lucene/analysis/nori/src/java/org/apache/lucene/analysis/ko/dict/UserDictionary.java: ## @@ -134,7 +134,6 @@ private UserDictionary(List entries) throws IOException { this.fst = new TokenInfoFST(FST.fromFSTReader(fstCompiler.compile(), fstCompiler.getFSTReader())); int[][] segmentations = _segmentations.toArray(new int[_segmentations.size()][]); -assert entryIndex == rightIds.length; Review Comment: ha! snap! I just realised same! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove incorrect assertion in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]
bruno-roustant commented on PR #13427: URL: https://github.com/apache/lucene/pull/13427#issuecomment-2133778754 The recent code change replaced the previous List by a primitive structure. To avoid yet another hppc primitive list in the Lucene fork, this time I replaced by a basic short array. But I missed the duplicate case, and it was not tested. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove incorrect assertion in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]
bruno-roustant commented on PR #13427: URL: https://github.com/apache/lucene/pull/13427#issuecomment-2133781188 I'm glad to have added this assertion :) It caught a misunderstanding. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix duplicate values in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]
uschindler commented on PR #13427: URL: https://github.com/apache/lucene/pull/13427#issuecomment-2133787102 Do we have the same issue in kuromoji's UserDictionary? I just want to make sure about this because both nori and kuromoji use similar code. In @bruno-roustant 's PR where he changed this, he only touched Nor, but just to be sure. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix duplicate values in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]
ChrisHegarty commented on PR #13427: URL: https://github.com/apache/lucene/pull/13427#issuecomment-2133790710 org.apache.lucene.analysis.ja.dict.UserDictionary looks ok to me - no issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] MemorySegment scorer should ensure that the values is of the correct type [lucene]
ChrisHegarty merged PR #13423: URL: https://github.com/apache/lucene/pull/13423 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix duplicate values in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]
bruno-roustant commented on PR #13427: URL: https://github.com/apache/lucene/pull/13427#issuecomment-2133794952 The Japanese UserDictionary is quite different, and I didn't touch it. Although I plan later to replace the TreeMap of Integer if possible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix duplicate values in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]
uschindler commented on code in PR #13427: URL: https://github.com/apache/lucene/pull/13427#discussion_r1616243372 ## lucene/analysis/nori/src/java/org/apache/lucene/analysis/ko/dict/UserDictionary.java: ## @@ -134,7 +135,9 @@ private UserDictionary(List entries) throws IOException { this.fst = new TokenInfoFST(FST.fromFSTReader(fstCompiler.compile(), fstCompiler.getFSTReader())); int[][] segmentations = _segmentations.toArray(new int[_segmentations.size()][]); -assert entryIndex == rightIds.length; +if (entryIndex < rightIds.length) { Review Comment: I would move that up to behind line 103, because that's where rightIds is built and the shorter version should be there. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix duplicate values in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]
uschindler commented on code in PR #13427: URL: https://github.com/apache/lucene/pull/13427#discussion_r1616243372 ## lucene/analysis/nori/src/java/org/apache/lucene/analysis/ko/dict/UserDictionary.java: ## @@ -134,7 +135,9 @@ private UserDictionary(List entries) throws IOException { this.fst = new TokenInfoFST(FST.fromFSTReader(fstCompiler.compile(), fstCompiler.getFSTReader())); int[][] segmentations = _segmentations.toArray(new int[_segmentations.size()][]); -assert entryIndex == rightIds.length; +if (entryIndex < rightIds.length) { Review Comment: I would move that up to behind line 103, because that's where rightIds is built and the (possibly) shorter array should be allocated there. After that all is done with rightIds. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix duplicate values in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]
ChrisHegarty commented on code in PR #13427: URL: https://github.com/apache/lucene/pull/13427#discussion_r1616246358 ## lucene/analysis/nori/src/java/org/apache/lucene/analysis/ko/dict/UserDictionary.java: ## @@ -134,7 +135,9 @@ private UserDictionary(List entries) throws IOException { this.fst = new TokenInfoFST(FST.fromFSTReader(fstCompiler.compile(), fstCompiler.getFSTReader())); int[][] segmentations = _segmentations.toArray(new int[_segmentations.size()][]); -assert entryIndex == rightIds.length; +if (entryIndex < rightIds.length) { Review Comment: I moved it up a couple of lines ( which makes sense ), but it cannot go any further. We only wanna copy once after iterating the values. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add prefetching support to stored fields. [lucene]
rmuir commented on code in PR #13424: URL: https://github.com/apache/lucene/pull/13424#discussion_r1616250356 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsReader.java: ## @@ -609,6 +622,23 @@ public void skipBytes(long numBytes) throws IOException { } } + @Override + public void prefetch(int docID) throws IOException { +final long blockID = indexReader.getBlockID(docID); + +for (long prefetchedBlockID : prefetchedBlockIDCache) { + if (prefetchedBlockID == blockID) { +return; + } +} Review Comment: can we use Arrays.binarysearch instead of scanning? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add prefetching support to stored fields. [lucene]
rmuir commented on code in PR #13424: URL: https://github.com/apache/lucene/pull/13424#discussion_r1616256454 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsReader.java: ## @@ -609,6 +622,23 @@ public void skipBytes(long numBytes) throws IOException { } } + @Override + public void prefetch(int docID) throws IOException { +final long blockID = indexReader.getBlockID(docID); + +for (long prefetchedBlockID : prefetchedBlockIDCache) { + if (prefetchedBlockID == blockID) { +return; + } +} Review Comment: or maybe even simpler, is it enough to just cache the last blockid? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix duplicate values in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]
uschindler commented on code in PR #13427: URL: https://github.com/apache/lucene/pull/13427#discussion_r1616257319 ## lucene/analysis/nori/src/java/org/apache/lucene/analysis/ko/dict/UserDictionary.java: ## @@ -134,7 +135,9 @@ private UserDictionary(List entries) throws IOException { this.fst = new TokenInfoFST(FST.fromFSTReader(fstCompiler.compile(), fstCompiler.getFSTReader())); int[][] segmentations = _segmentations.toArray(new int[_segmentations.size()][]); -assert entryIndex == rightIds.length; +if (entryIndex < rightIds.length) { Review Comment: Ah correct. I thought the loop ended after 103. The original code looked like this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix duplicate values in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]
ChrisHegarty merged PR #13427: URL: https://github.com/apache/lucene/pull/13427 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Examine assertion in org.apache.lucene.analysis.ko.dict.UserDictionary [lucene]
ChrisHegarty closed issue #13426: Examine assertion in org.apache.lucene.analysis.ko.dict.UserDictionary URL: https://github.com/apache/lucene/issues/13426 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add support for reloading the SPI for KnnVectorsFormat class [lucene]
navneet1v commented on PR #13394: URL: https://github.com/apache/lucene/pull/13394#issuecomment-2133829356 @ChrisHegarty , @uschindler can merge the code and add the backport label to backport the code to branch_9x -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add support for reloading the SPI for KnnVectorsFormat class [lucene]
uschindler merged PR #13394: URL: https://github.com/apache/lucene/pull/13394 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add support for reloading the SPI for KnnVectorsFormat class [lucene]
uschindler commented on PR #13394: URL: https://github.com/apache/lucene/pull/13394#issuecomment-2133833594 Done! Thanks, @navneet1v -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Add support for reloading the SPI for KnnVectorsFormat class [lucene]
uschindler commented on issue #13393: URL: https://github.com/apache/lucene/issues/13393#issuecomment-2133837367 Thanks for bringing this up. Unfortunately we have no way to automatically enforce this for our NamedSPI provides (as it is static methods), but we should keep all of them in sync. At moment there's also missing the getter for list of names. I argued about this in the other PR already. The testframework uses expensive extra walk though ServiceLoader to do get those. This should be also brought in line with the other codec/analysis components. I will open a PR for this. The code in test-framework is inconsistent and a duplicates existing logic in a different way (the one that randomly chooses a KNN format). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Add support for reloading the SPI for KnnVectorsFormat class [lucene]
uschindler closed issue #13393: Add support for reloading the SPI for KnnVectorsFormat class URL: https://github.com/apache/lucene/issues/13393 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Remove incorrect/expensive use of ServiceLoader for choosing random format [lucene]
uschindler opened a new pull request, #13428: URL: https://github.com/apache/lucene/pull/13428 This cleans up the code: - Remove expensive usage of ServiceLoader which might be inconsistent with the NamedSPI code - Add a better readable version with comment about the special case regarding bit vectors -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Add support for reloading the SPI for KnnVectorsFormat class [lucene]
uschindler commented on issue #13393: URL: https://github.com/apache/lucene/issues/13393#issuecomment-2133868504 see #13428 This also fixed the concern in other PR with SPI named similarity functions -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove incorrect/expensive use of ServiceLoader for choosing random format [lucene]
uschindler commented on PR #13428: URL: https://github.com/apache/lucene/pull/13428#issuecomment-2133868897 This also make the method static to be in conformance with other LuceneTestCase methods -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use SPI instead of Enum for VectorSimilarityFunctions [lucene]
uschindler commented on code in PR #13401: URL: https://github.com/apache/lucene/pull/13401#discussion_r1616290305 ## lucene/test-framework/src/java/org/apache/lucene/tests/util/LuceneTestCase.java: ## @@ -3216,10 +3215,11 @@ public static BytesRef newBytesRef(byte[] bytesIn, int offset, int length) { } protected KnnVectorsFormat randomVectorFormat(VectorEncoding vectorEncoding) { Review Comment: I fixed this already in https://github.com/apache/lucene/pull/13428 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use SPI instead of Enum for VectorSimilarityFunctions [lucene]
uschindler commented on code in PR #13401: URL: https://github.com/apache/lucene/pull/13401#discussion_r1616291449 ## lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java: ## @@ -16,104 +16,88 @@ */ package org.apache.lucene.index; -import static org.apache.lucene.util.VectorUtil.cosine; -import static org.apache.lucene.util.VectorUtil.dotProduct; -import static org.apache.lucene.util.VectorUtil.dotProductScore; -import static org.apache.lucene.util.VectorUtil.scaleMaxInnerProductScore; -import static org.apache.lucene.util.VectorUtil.squareDistance; +import java.util.Iterator; +import java.util.List; +import org.apache.lucene.util.NamedSPILoader; /** * Vector similarity function; used in search to return top K most similar vectors to a target - * vector. This is a label describing the method used during indexing and searching of the vectors - * in order to determine the nearest neighbors. + * vector. */ -public enum VectorSimilarityFunction { +public abstract class VectorSimilarityFunction implements NamedSPILoader.NamedSPI { - /** Euclidean distance */ - EUCLIDEAN { -@Override -public float compare(float[] v1, float[] v2) { - return 1 / (1 + squareDistance(v1, v2)); -} - -@Override -public float compare(byte[] v1, byte[] v2) { - return 1 / (1f + squareDistance(v1, v2)); -} - }, + private static class Holder { +private static final NamedSPILoader LOADER = +new NamedSPILoader<>(VectorSimilarityFunction.class); - /** - * Dot product. NOTE: this similarity is intended as an optimized way to perform cosine - * similarity. In order to use it, all vectors must be normalized, including both document and - * query vectors. Using dot product with vectors that are not normalized can result in errors or - * poor search results. Floating point vectors must be normalized to be of unit length, while byte - * vectors should simply all have the same norm. - */ - DOT_PRODUCT { -@Override -public float compare(float[] v1, float[] v2) { - return Math.max((1 + dotProduct(v1, v2)) / 2, 0); +static NamedSPILoader getLoader() { + if (LOADER == null) { +throw new IllegalStateException( +"You tried to lookup a VectorSimilarityFunction by name before all formats could be initialize. " ++ "This likely happens if you call VectorSimilarityFunction#forName " ++ "from a VectorSimilarityFunction's ctor."); + } + return LOADER; } + } -@Override -public float compare(byte[] v1, byte[] v2) { - return dotProductScore(v1, v2); -} - }, + /** Holds name of Vector Similarity Function */ + public final String name; /** - * Cosine similarity. NOTE: the preferred way to perform cosine similarity is to normalize all - * vectors to unit length, and instead use {@link VectorSimilarityFunction#DOT_PRODUCT}. You - * should only use this function if you need to preserve the original vectors and cannot normalize - * them in advance. The similarity score is normalised to assure it is positive. + * Holds integer value of Vector Similarity Function to be used while reading and writing + * field-info in the index */ - COSINE { -@Override -public float compare(float[] v1, float[] v2) { - return Math.max((1 + cosine(v1, v2)) / 2, 0); -} + public final int ordinal; -@Override -public float compare(byte[] v1, byte[] v2) { - return (1 + cosine(v1, v2)) / 2; -} - }, + /** Construct object with function name and ordinal value */ + protected VectorSimilarityFunction(String name, int ordinal) { +NamedSPILoader.checkServiceName(name); +this.name = name; +this.ordinal = ordinal; + } - /** - * Maximum inner product. This is like {@link VectorSimilarityFunction#DOT_PRODUCT}, but does not - * require normalization of the inputs. Should be used when the embedding vectors store useful - * information within the vector magnitude - */ - MAXIMUM_INNER_PRODUCT { -@Override -public float compare(float[] v1, float[] v2) { - return scaleMaxInnerProductScore(dotProduct(v1, v2)); -} + /** Get name of VectorSimilarityFunction used by the object */ + @Override + public String getName() { +return name; + } -@Override -public float compare(byte[] v1, byte[] v2) { - return scaleMaxInnerProductScore(dotProduct(v1, v2)); -} - }; + /** Get ordinal of VectorSimilarityFunction used by the object */ + public int getOrdinal() { +return ordinal; + } - /** - * Calculates a similarity score between the two vectors with a specified function. Higher - * similarity scores correspond to closer vectors. - * - * @param v1 a vector - * @param v2 another vector, of the same dimension - * @return the value of the similarity function applied to the two vectors - */ + /** Compares two float vector */ publ
Re: [PR] Use SPI instead of Enum for VectorSimilarityFunctions [lucene]
uschindler commented on code in PR #13401: URL: https://github.com/apache/lucene/pull/13401#discussion_r1616292043 ## lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java: ## @@ -16,104 +16,88 @@ */ package org.apache.lucene.index; -import static org.apache.lucene.util.VectorUtil.cosine; -import static org.apache.lucene.util.VectorUtil.dotProduct; -import static org.apache.lucene.util.VectorUtil.dotProductScore; -import static org.apache.lucene.util.VectorUtil.scaleMaxInnerProductScore; -import static org.apache.lucene.util.VectorUtil.squareDistance; +import java.util.Iterator; +import java.util.List; +import org.apache.lucene.util.NamedSPILoader; /** * Vector similarity function; used in search to return top K most similar vectors to a target - * vector. This is a label describing the method used during indexing and searching of the vectors - * in order to determine the nearest neighbors. + * vector. */ -public enum VectorSimilarityFunction { +public abstract class VectorSimilarityFunction implements NamedSPILoader.NamedSPI { - /** Euclidean distance */ - EUCLIDEAN { -@Override -public float compare(float[] v1, float[] v2) { - return 1 / (1 + squareDistance(v1, v2)); -} - -@Override -public float compare(byte[] v1, byte[] v2) { - return 1 / (1f + squareDistance(v1, v2)); -} - }, + private static class Holder { +private static final NamedSPILoader LOADER = +new NamedSPILoader<>(VectorSimilarityFunction.class); - /** - * Dot product. NOTE: this similarity is intended as an optimized way to perform cosine - * similarity. In order to use it, all vectors must be normalized, including both document and - * query vectors. Using dot product with vectors that are not normalized can result in errors or - * poor search results. Floating point vectors must be normalized to be of unit length, while byte - * vectors should simply all have the same norm. - */ - DOT_PRODUCT { -@Override -public float compare(float[] v1, float[] v2) { - return Math.max((1 + dotProduct(v1, v2)) / 2, 0); +static NamedSPILoader getLoader() { + if (LOADER == null) { +throw new IllegalStateException( +"You tried to lookup a VectorSimilarityFunction by name before all formats could be initialize. " ++ "This likely happens if you call VectorSimilarityFunction#forName " ++ "from a VectorSimilarityFunction's ctor."); + } + return LOADER; } + } -@Override -public float compare(byte[] v1, byte[] v2) { - return dotProductScore(v1, v2); -} - }, + /** Holds name of Vector Similarity Function */ + public final String name; /** - * Cosine similarity. NOTE: the preferred way to perform cosine similarity is to normalize all - * vectors to unit length, and instead use {@link VectorSimilarityFunction#DOT_PRODUCT}. You - * should only use this function if you need to preserve the original vectors and cannot normalize - * them in advance. The similarity score is normalised to assure it is positive. + * Holds integer value of Vector Similarity Function to be used while reading and writing + * field-info in the index */ - COSINE { -@Override -public float compare(float[] v1, float[] v2) { - return Math.max((1 + cosine(v1, v2)) / 2, 0); -} + public final int ordinal; -@Override -public float compare(byte[] v1, byte[] v2) { - return (1 + cosine(v1, v2)) / 2; -} - }, + /** Construct object with function name and ordinal value */ + protected VectorSimilarityFunction(String name, int ordinal) { +NamedSPILoader.checkServiceName(name); +this.name = name; +this.ordinal = ordinal; + } - /** - * Maximum inner product. This is like {@link VectorSimilarityFunction#DOT_PRODUCT}, but does not - * require normalization of the inputs. Should be used when the embedding vectors store useful - * information within the vector magnitude - */ - MAXIMUM_INNER_PRODUCT { -@Override -public float compare(float[] v1, float[] v2) { - return scaleMaxInnerProductScore(dotProduct(v1, v2)); -} + /** Get name of VectorSimilarityFunction used by the object */ + @Override + public String getName() { +return name; + } -@Override -public float compare(byte[] v1, byte[] v2) { - return scaleMaxInnerProductScore(dotProduct(v1, v2)); -} - }; + /** Get ordinal of VectorSimilarityFunction used by the object */ + public int getOrdinal() { +return ordinal; + } - /** - * Calculates a similarity score between the two vectors with a specified function. Higher - * similarity scores correspond to closer vectors. - * - * @param v1 a vector - * @param v2 another vector, of the same dimension - * @return the value of the similarity function applied to the two vectors - */ + /** Compares two float vector */ publ
Re: [PR] Add support for reloading the SPI for KnnVectorsFormat class [lucene]
navneet1v commented on PR #13394: URL: https://github.com/apache/lucene/pull/13394#issuecomment-2133879878 @uschindler do i need to create a separate PR for porting this change to branch_9x or there is any automated way to do this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] hunspell: speed up "compress"; minimize the number of the generated entries; don't even consider "forbidden" entries anymore [lucene]
donnerpeter commented on PR #13429: URL: https://github.com/apache/lucene/pull/13429#issuecomment-2133893031 > I admit I've no idea what improvement I'm looking at here but I trust you know! I can add some comments with explanations :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add support for reloading the SPI for KnnVectorsFormat class [lucene]
uschindler commented on PR #13394: URL: https://github.com/apache/lucene/pull/13394#issuecomment-2133895659 thats already merged (see above @ "asfgit pushed"). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] hunspell: speed up "compress"; minimize the number of the generated entries; don't even consider "forbidden" entries anymore [lucene]
dweiss commented on PR #13429: URL: https://github.com/apache/lucene/pull/13429#issuecomment-2133896366 > I can add some comments with explanations :) No need. I glanced through the code and it doesn't seem to be a backdoor. :) You know better what works here anyway. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] hunspell: speed up "compress"; minimize the number of the generated entries; don't even consider "forbidden" entries anymore [lucene]
uschindler commented on PR #13429: URL: https://github.com/apache/lucene/pull/13429#issuecomment-2133898015 @rmuir can you have a look? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add support for reloading the SPI for KnnVectorsFormat class [lucene]
navneet1v commented on PR #13394: URL: https://github.com/apache/lucene/pull/13394#issuecomment-2133913068 @uschindler thanks. I didn't realize it. Generally I see labels added or last time I raised the PR manually for backport. Thanks for confirming this code is merged. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Add support for reloading the SPI for KnnVectorsFormat class [lucene]
navneet1v commented on issue #13393: URL: https://github.com/apache/lucene/issues/13393#issuecomment-2133915854 > At moment there's also missing the getter for list of names. @uschindler, when raising the PR I saw this and thought is it needed or not. To ensure the scope of the PR was minimal I didn't add the getter method. I see that you have raised the PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add test for parsing brackets in range queries [lucene]
dweiss merged PR #13323: URL: https://github.com/apache/lucene/pull/13323 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add test for parsing brackets in range queries [lucene]
dweiss commented on PR #13323: URL: https://github.com/apache/lucene/pull/13323#issuecomment-2133921546 Apologies for the delay, @benchaplin - I've merged this test into the main branch. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Add minimum number of segments to TieredMergePolicy [lucene]
carlosdelest opened a new pull request, #13430: URL: https://github.com/apache/lucene/pull/13430 Closes https://github.com/apache/lucene/issues/12877 Adds a minimum number of segments to `TieredMergePolicy`. This allows to ensure that a minimum search concurrency parallelism is achievable using inter-segment parallelism. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add prefetching support to stored fields. [lucene]
jpountz commented on code in PR #13424: URL: https://github.com/apache/lucene/pull/13424#discussion_r1616375903 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsReader.java: ## @@ -609,6 +622,23 @@ public void skipBytes(long numBytes) throws IOException { } } + @Override + public void prefetch(int docID) throws IOException { +final long blockID = indexReader.getBlockID(docID); + +for (long prefetchedBlockID : prefetchedBlockIDCache) { + if (prefetchedBlockID == blockID) { +return; + } +} Review Comment: The cache is currently an unsorted rolling buffer, this is why I'm linearly scanning all entries, and also why I'm keeping it small (16 entries). I hesitated with only caching the last blockID, which would have been slightly simpler. To have a good hit ratio though, callers would need to fetch stored documents in doc ID order. I'm not sure how common it is today. I could switch to only caching the last block ID and add javadocs suggesting that applications fetch stored documents in doc ID order, what do you think? Maybe even add a helper method that takes a ScoreDoc[] array and returns the associated stored documents doing the right thing, ie. sorting doc IDs and doing prefetching so that I/O is parallelized? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove incorrect/expensive use of ServiceLoader for choosing random format [lucene]
uschindler merged PR #13428: URL: https://github.com/apache/lucene/pull/13428 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove incorrect/expensive use of ServiceLoader for choosing random format [lucene]
uschindler commented on PR #13428: URL: https://github.com/apache/lucene/pull/13428#issuecomment-2134102574 Merged. I also removed the useless extra "uses" clause in test-frameworks module-info.java. Missed to do this here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add timeout support to AbstractVectorSimilarityQuery [lucene]
github-actions[bot] commented on PR #13285: URL: https://github.com/apache/lucene/pull/13285#issuecomment-2134164108 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Reuse BitSet when there are deleted documents in the index instead of creating new BitSet [lucene]
github-actions[bot] commented on PR #12857: URL: https://github.com/apache/lucene/pull/12857#issuecomment-2134164315 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org