[PR] Tighten up IndexSortSortedNumericDocValuesRangeQuery index sort validation [lucene]
gsmiller opened a new pull request, #13973: URL: https://github.com/apache/lucene/pull/13973 ### Description Small refactoring to remove some duplicated code around index sort loading/validation and to improve readability (IMO anyway). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Avoid reload block when seeking backward in SegmentTermsEnum. [lucene]
github-actions[bot] commented on PR #13253: URL: https://github.com/apache/lucene/pull/13253#issuecomment-2452749686 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] A few small tweaks to VectorUtil#findNextGEQ [lucene]
gsmiller commented on code in PR #13972: URL: https://github.com/apache/lucene/pull/13972#discussion_r1826036633 ## lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene912/Lucene912PostingsReader.java: ## @@ -1755,13 +1755,13 @@ static long readVLong15(DataInput in) throws IOException { } } - private static int findNextGEQ(long[] buffer, int length, long target, int from) { -for (int i = from; i < length; ++i) { + private static int findNextGEQ(long[] buffer, long target, int from, int to) { Review Comment: Made this change for consistency but don't have any strong feelings about it. If there's a preference to just leave it be, I'm fine reverting this file. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] A few small tweaks to VectorUtil#findNextGEQ [lucene]
gsmiller commented on code in PR #13972: URL: https://github.com/apache/lucene/pull/13972#discussion_r1826037886 ## lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java: ## @@ -601,7 +599,7 @@ public int advance(int target) throws IOException { } } - int next = VECTOR_SUPPORT.findNextGEQ(docBuffer, docBufferSize, target, docBufferUpto); + int next = VectorUtil.findNextGEQ(docBuffer, target, docBufferUpto, docBufferSize); Review Comment: @jpountz I made this change so that everything goes through `VectorUtil` and the `assert`. It also seemed a little more consistent with how we leverage `VectorUtil` vs. `VectorUtilSupport` elsewhere, but I'm not sure if you had a good reason for writing it the way you did. If there's a performance (or other) consideration here, I'd be happy to understand that better. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] minor javadoc correction on VectorUtilSupport#findNextGEQ [lucene]
gsmiller commented on PR #13969: URL: https://github.com/apache/lucene/pull/13969#issuecomment-2452189489 Added an assertion and tweaked parameter naming as discussed here over in #13972. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Revert "Disjunction as CompetitiveIterator for numeric dynamic pruning (#13221)" [lucene]
javanna commented on code in PR #13971: URL: https://github.com/apache/lucene/pull/13971#discussion_r1825756810 ## lucene/core/src/java/org/apache/lucene/util/IntArrayDocIdSet.java: ## @@ -34,23 +34,15 @@ public final class IntArrayDocIdSet extends DocIdSet { private final int[] docs; private final int length; - /** - * Build an IntArrayDocIdSet by an int array and len. - * - * @param docs A docs array whose length need to be greater than the param len. It needs to be - * sorted from 0(inclusive) to the len(exclusive), and the len-th doc in docs need to be - * {@link DocIdSetIterator#NO_MORE_DOCS}. - * @param len The valid docs length in array. - */ - public IntArrayDocIdSet(int[] docs, int len) { -if (docs[len] != DocIdSetIterator.NO_MORE_DOCS) { + public IntArrayDocIdSet(int[] docs, int length) { Review Comment: @jpountz this class used to be package private. With the revert that this PR applies to main, `TestBooleanOr` would be the only caller of this constructor outside of `o.a.l.util`, requiring it to be public. That looks odd, any chance we could update `TestBooleanOr` to not rely on this class being public? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Revert "Disjunction as CompetitiveIterator for numeric dynamic pruning (#13221)" [lucene]
javanna commented on PR #13971: URL: https://github.com/apache/lucene/pull/13971#issuecomment-2451858027 @gf2121 pinging you explicitly because you worked on #13221. Sadly, we observed some severe regressions caused by it, and we were not able to somehow adjust the updated algorithm to avoid those. We already reverted that commit in branch_10_0 for the Lucene 10.0 release, and I hereby propose to do the same on main and branch_10x until we manage to fill the gap around the search after regressions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]
msokolov commented on code in PR #13572: URL: https://github.com/apache/lucene/pull/13572#discussion_r1825833969 ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java: ## @@ -291,25 +296,125 @@ private float squareDistanceBody(float[] a, float[] b, int limit) { return res1.add(res2).reduceLanes(ADD); } - // Binary functions, these all follow a general pattern like this: - // - // short intermediate = a * b; - // int accumulator = (int)accumulator + (int)intermediate; - // - // 256 or 512 bit vectors can process 64 or 128 bits at a time, respectively - // intermediate results use 128 or 256 bit vectors, respectively - // final accumulator uses 256 or 512 bit vectors, respectively - // - // We also support 128 bit vectors, going 32 bits at a time. - // This is slower but still faster than not vectorizing at all. - + /** + * This method SHOULD NOT be used directly when the native dot-product is enabled. This is because + * it allocates off-heap memory in the hot code-path and copies the input byte-vectors onto it. + * This is necessary even with Panama APIs because native code is given a pointer to a memory + * address and if the content at that address can be moved at anytime (by Java GC in case of heap + * allocated memory) then safety of the native computation cannot be guaranteed. If we try to pass + * MemorySegment backing on-heap memory to native code we get + * "java.lang.UnsupportedOperationException: Not a native address" + * + * Stack overflow thread: Review Comment: Thanks this is helpful, but let's not permanently record the SO thread in the code comments ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java: ## @@ -291,25 +296,125 @@ private float squareDistanceBody(float[] a, float[] b, int limit) { return res1.add(res2).reduceLanes(ADD); } - // Binary functions, these all follow a general pattern like this: - // - // short intermediate = a * b; - // int accumulator = (int)accumulator + (int)intermediate; - // - // 256 or 512 bit vectors can process 64 or 128 bits at a time, respectively - // intermediate results use 128 or 256 bit vectors, respectively - // final accumulator uses 256 or 512 bit vectors, respectively - // - // We also support 128 bit vectors, going 32 bits at a time. - // This is slower but still faster than not vectorizing at all. - + /** + * This method SHOULD NOT be used directly when the native dot-product is enabled. This is because + * it allocates off-heap memory in the hot code-path and copies the input byte-vectors onto it. + * This is necessary even with Panama APIs because native code is given a pointer to a memory + * address and if the content at that address can be moved at anytime (by Java GC in case of heap + * allocated memory) then safety of the native computation cannot be guaranteed. If we try to pass + * MemorySegment backing on-heap memory to native code we get + * "java.lang.UnsupportedOperationException: Not a native address" + * + * Stack overflow thread: + * https://stackoverflow.com/questions/69521289/jep-412-pass-a-on-heap-byte-array-to-native-code-getting-unsupportedoperatione + * explains the issue in more detail. + * + * Q1. Why did we enable the native code path here if its inefficient ? A1. So that it can be + * exercised in unit-tests to test correctness of underlying vectorized implementation in C + * without directly relying on preview Panama APIs. Without this, unit-tests would need to use + * reflection to 1) Get a method handle of {@link #dotProduct(MemorySegment, MemorySegment)} as it + * is not part of {@link VectorUtilSupport} 2) Get a method handle of a to-be-defined method that + * would copy byte[] to off-heap {@link MemorySegment} and return them as Objects 3) Invoke the + * method handle retrieved in 1) with Objects in 2) All doable but adds unnecessary complexity in + * the unit tests suite. + * + * Q2. Which method should HNSW scoring components use for computing dot-product in native code + * on byte-vectors? A2. They should use {@link #dotProduct(MemorySegment, MemorySegment)} directly + * and control the creation and reuse of off-heap MemorySegments as dotProduct is in the hot code + * path for both indexing and searching. It is easy to see that stored-vectors residing in + * Memory-mapped files can simply be accessed using {@link MemorySegment#asSlice(long, long, + * long)} which does not allocate new memory and does not require copying vector data onto JVM + * heap. For target-vector, copying to off-heap memory will still be needed but allocation can + * happen once per scorer. + * + * Q3. Should JMH benchmarks measure the performance of this method? A3. No, because they would + * be measuring the combined cost of memory-allocations and nati
Re: [PR] DocValuesSkipper implementation in IndexSortSorted [lucene]
gsmiller commented on code in PR #13886: URL: https://github.com/apache/lucene/pull/13886#discussion_r1826241555 ## lucene/core/src/java/org/apache/lucene/search/IndexSortSortedNumericDocValuesRangeQuery.java: ## @@ -397,106 +413,80 @@ private boolean matchAll(PointValues points, byte[] queryLowerPoint, byte[] quer } private IteratorAndCount getDocIdSetIteratorOrNullFromBkd( - LeafReaderContext context, DocIdSetIterator delegate) throws IOException { -Sort indexSort = context.reader().getMetaData().sort(); + LeafReader reader, NumericDocValues numericDocValues, DocValuesSkipper skipper) + throws IOException { +if (skipper.docCount() != reader.maxDoc()) { + return null; +} +final Sort indexSort = reader.getMetaData().sort(); if (indexSort == null || indexSort.getSort().length == 0 || indexSort.getSort()[0].getField().equals(field) == false) { return null; } +final int minDocID; +final int maxDocID; final boolean reverse = indexSort.getSort()[0].getReverse(); - -PointValues points = context.reader().getPointValues(field); Review Comment: Looks like this comment still isn't addressed. Is this something you're still working on? If you have questions (or if you disagree with the suggestion) please let me know. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] DocValuesSkipper implementation in IndexSortSorted [lucene]
gsmiller commented on code in PR #13886: URL: https://github.com/apache/lucene/pull/13886#discussion_r1826238079 ## lucene/core/src/java/org/apache/lucene/search/IndexSortSortedNumericDocValuesRangeQuery.java: ## @@ -397,106 +413,80 @@ private boolean matchAll(PointValues points, byte[] queryLowerPoint, byte[] quer } private IteratorAndCount getDocIdSetIteratorOrNullFromBkd( - LeafReaderContext context, DocIdSetIterator delegate) throws IOException { -Sort indexSort = context.reader().getMetaData().sort(); + LeafReader reader, NumericDocValues numericDocValues, DocValuesSkipper skipper) + throws IOException { +if (skipper.docCount() != reader.maxDoc()) { + return null; +} +final Sort indexSort = reader.getMetaData().sort(); if (indexSort == null || indexSort.getSort().length == 0 || indexSort.getSort()[0].getField().equals(field) == false) { return null; } +final int minDocID; +final int maxDocID; final boolean reverse = indexSort.getSort()[0].getReverse(); - -PointValues points = context.reader().getPointValues(field); -if (points == null) { - return null; -} - -if (points.getNumDimensions() != 1) { - return null; -} - -if (points.getBytesPerDimension() != Long.BYTES -&& points.getBytesPerDimension() != Integer.BYTES) { - return null; -} - -if (points.size() != points.getDocCount()) { - return null; -} - -assert lowerValue <= upperValue; -byte[] queryLowerPoint; -byte[] queryUpperPoint; -if (points.getBytesPerDimension() == Integer.BYTES) { - queryLowerPoint = IntPoint.pack((int) lowerValue).bytes; - queryUpperPoint = IntPoint.pack((int) upperValue).bytes; -} else { - queryLowerPoint = LongPoint.pack(lowerValue).bytes; - queryUpperPoint = LongPoint.pack(upperValue).bytes; -} -if (matchNone(points, queryLowerPoint, queryUpperPoint)) { - return IteratorAndCount.empty(); -} -if (matchAll(points, queryLowerPoint, queryUpperPoint)) { - int maxDoc = context.reader().maxDoc(); - if (points.getDocCount() == maxDoc) { -return IteratorAndCount.all(maxDoc); - } else { -return IteratorAndCount.sparseRange(0, maxDoc, delegate); - } -} - -int minDocId, maxDocId; -final ByteArrayComparator comparator = -ArrayUtil.getUnsignedComparator(points.getBytesPerDimension()); - if (reverse) { - minDocId = nextDoc(points.getPointTree(), queryUpperPoint, false, comparator, true) + 1; -} else { - minDocId = nextDoc(points.getPointTree(), queryLowerPoint, true, comparator, false); - if (minDocId == -1) { -// No matches -return IteratorAndCount.empty(); + if (skipper.maxValue() <= upperValue) { +minDocID = 0; + } else { +skipper.advance(Long.MIN_VALUE, upperValue); +minDocID = nextDocSkipper(skipper.minDocID(0), numericDocValues, l -> l <= upperValue); } -} - -if (reverse) { - maxDocId = nextDoc(points.getPointTree(), queryLowerPoint, true, comparator, true) + 1; - if (maxDocId == 0) { -// No matches -return IteratorAndCount.empty(); + if (skipper.minValue() >= lowerValue) { +maxDocID = skipper.docCount(); + } else { +skipper.advance(Long.MIN_VALUE, lowerValue); +maxDocID = nextDocSkipper(skipper.minDocID(0), numericDocValues, l -> l < lowerValue); } } else { - maxDocId = nextDoc(points.getPointTree(), queryUpperPoint, false, comparator, false); - if (maxDocId == -1) { -maxDocId = context.reader().maxDoc(); + if (skipper.minValue() >= lowerValue) { +minDocID = 0; + } else { +skipper.advance(lowerValue, Long.MAX_VALUE); +minDocID = nextDocSkipper(skipper.minDocID(0), numericDocValues, l -> l >= lowerValue); + } + if (skipper.maxValue() <= upperValue) { +maxDocID = skipper.docCount(); + } else { +skipper.advance(upperValue, Long.MAX_VALUE); +maxDocID = nextDocSkipper(skipper.minDocID(0), numericDocValues, l -> l > upperValue); } } -if (minDocId == maxDocId) { - return IteratorAndCount.empty(); -} - -if ((points.getDocCount() == context.reader().maxDoc())) { - return IteratorAndCount.denseRange(minDocId, maxDocId); -} else { - return IteratorAndCount.sparseRange(minDocId, maxDocId, delegate); -} +return minDocID == maxDocID +? IteratorAndCount.empty() +: IteratorAndCount.denseRange(minDocID, maxDocID); } private IteratorAndCount getDocIdSetIteratorOrNull(LeafReaderContext context) throws IOException { if (lowerValue > upperValue) { return IteratorAndCount.empty(); } -SortedNumericDocValues sortedNumericValues = -DocValues.getSortedNumeric(context.reader(), field); -NumericDocValu
Re: [I] [Discuss] What should we name the new aggregation engine? [lucene]
msokolov commented on issue #13965: URL: https://github.com/apache/lucene/issues/13965#issuecomment-2452668213 copy-pasting from the polling site: ``` Facetron Agamemnon Analytics Count Dracula Aggie Bucketeer Gator Fusion Slurper Lustre Match-time Aggregates Maggregates The Maggregator Laggregator Ruby Gatheriffic ChopChop Muncher Magma Summator Faceting engine ``` These are ordered by number of votes / entry order, but there were very few votes. I think mostly people just like saying names on the internet? I think it might help if we said where the name would be used. Are we looking for a java package name? A name to use when saying - should I convert from facets to "xxx"? Clearly if it's the latter one, "Faceting engine" would be super confusing, although of course "smart+fast-facets" would be great. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Rename NodeHash to FSTSuffixNodeCache + Javadoc improvement [lucene]
msokolov commented on PR #13259: URL: https://github.com/apache/lucene/pull/13259#issuecomment-2452660151 Thanks, this does seem much more explicit and helpful -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] [Discuss] What should we name the new aggregation engine? [lucene]
stefanvodita commented on issue #13965: URL: https://github.com/apache/lucene/issues/13965#issuecomment-2452683167 > Are we looking for a java package name? A name to use when saying - should I convert from facets to "xxx"? The second onw is what I have in mind. Whether this new aggregation engine lives in its own package is up for debate. @epotyom was pointing out that it might end up having a dependency on the facets package in that case. I'd say let's leave aside that question until we're actually ready to promote it outside sandbox. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Rename NodeHash to FSTSuffixNodeCache + Javadoc improvement [lucene]
msokolov merged PR #13259: URL: https://github.com/apache/lucene/pull/13259 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Significant performance regression with search after [lucene]
javanna commented on issue #13856: URL: https://github.com/apache/lucene/issues/13856#issuecomment-2451717823 I opened https://github.com/apache/lucene/pull/13971 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Significant performance regression with search after [lucene]
javanna commented on issue #13856: URL: https://github.com/apache/lucene/issues/13856#issuecomment-2451714216 There has been little activity on this issue, I am leaning towards applying the revert of the change that caused the regression to branch_10x and main as well. I am anxious that we will forget and end up in the same situation as we were with the Lucene 10.0 release. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org