Re: [PR] Remove IndexSearcher#search(List, Weight, Collector) [lucene]
javanna merged PR #13780: URL: https://github.com/apache/lucene/pull/13780 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]
uschindler commented on PR #13782: URL: https://github.com/apache/lucene/pull/13782#issuecomment-2348202070 Backport or not? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]
jpountz commented on PR #13782: URL: https://github.com/apache/lucene/pull/13782#issuecomment-2348207344 I would backport as it looks pretty safe, though I doubt we have anyone using this postings format, so it likely doesn't matter much in practice. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]
vsop-479 commented on PR #13782: URL: https://github.com/apache/lucene/pull/13782#issuecomment-2348218644 > can you merge main back into your branch? Merged. > can we find other occurrences using some regex searches? I searched code with `final int targetUptoMid = targetUpto` which these iterating compare start with. There is only two places remains in `org.apache.lucene.backward_codecs.lucene40.blocktree.SegmentTermsEnum`, since this is a bakward class, I am not sure we should modify it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]
uschindler commented on PR #13782: URL: https://github.com/apache/lucene/pull/13782#issuecomment-2348293165 > I would backport as it looks pretty safe, though I doubt we have anyone using this postings format, so it likely doesn't matter much in practice. The we should move the changes entry. Do we have a corresponding 9.x section already? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] similarities: provide default computeNorm implementation; remove remaining discountOverlaps setters; [lucene]
cpoerschke merged PR #13757: URL: https://github.com/apache/lucene/pull/13757 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]
vsop-479 commented on PR #13782: URL: https://github.com/apache/lucene/pull/13782#issuecomment-2348387100 > The we should move the changes entry. Do we have a corresponding 9.x section already? We already have a similar change entry for https://github.com/apache/lucene/pull/13252 under 9.11.0, so we should move this change entry? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add unit-of-least-precision float comparison [lucene]
mikemccand commented on code in PR #13723: URL: https://github.com/apache/lucene/pull/13723#discussion_r1758531891 ## lucene/test-framework/src/test/org/apache/lucene/tests/util/TestFloatingPointComparisons.java: ## @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.tests.util; + +public class TestFloatingPointComparisons extends LuceneTestCase { + public void test() { +// Test adjacent numbers +assertFalse(doubleEquals(Double.longBitsToDouble(1L), Double.longBitsToDouble(2L), 0)); +assertTrue(doubleEquals(Double.longBitsToDouble(1L), Double.longBitsToDouble(2L), 1)); +assertFalse(floatEquals(Float.intBitsToFloat(1), Float.intBitsToFloat(2), (short) 0)); +assertTrue(floatEquals(Float.intBitsToFloat(1), Float.intBitsToFloat(2), (short) 1)); + +// Test signed zeros +assertTrue(doubleEquals(0.0d, -0.0d, 0)); +assertTrue(floatEquals(0.0f, -0.0f, (short) 0)); + +// Test NaNs +assertFalse(doubleEquals(Double.NaN, Double.NaN, Integer.MAX_VALUE)); +assertFalse(floatEquals(Float.NaN, Float.NaN, (short) 32767)); Review Comment: You could maybe also use `Math.nextAfter` to add one or two ulps to a random float/double and then `assert` that `float/doubleEquals` agrees? ## lucene/CHANGES.txt: ## @@ -422,7 +422,9 @@ Build Other -(No changes) + +* GITHUB#13720: Add float comparison based on unit of least precision and use it to stop test failures caused by float Review Comment: Maybe add `IEEE 754 float summation implemented by Java not being commutative` or so? Mathematically float summation is fine :) ## lucene/test-framework/src/test/org/apache/lucene/tests/util/TestFloatingPointComparisons.java: ## @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.tests.util; + +public class TestFloatingPointComparisons extends LuceneTestCase { + public void test() { +// Test adjacent numbers +assertFalse(doubleEquals(Double.longBitsToDouble(1L), Double.longBitsToDouble(2L), 0)); +assertTrue(doubleEquals(Double.longBitsToDouble(1L), Double.longBitsToDouble(2L), 1)); +assertFalse(floatEquals(Float.intBitsToFloat(1), Float.intBitsToFloat(2), (short) 0)); +assertTrue(floatEquals(Float.intBitsToFloat(1), Float.intBitsToFloat(2), (short) 1)); + +// Test signed zeros +assertTrue(doubleEquals(0.0d, -0.0d, 0)); +assertTrue(floatEquals(0.0f, -0.0f, (short) 0)); Review Comment: Fun. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add unit-of-least-precision float comparison [lucene]
mikemccand commented on PR #13723: URL: https://github.com/apache/lucene/pull/13723#issuecomment-2348502055 > > I don't like the last commit because it changes from a assert-like method to a boolean returning method. > > I changed it away from an assertion because I liked this more. It makes it so you can assert on floats _not_ being equal or use their equality in a condition, without making an assertion statement much longer. Do you not like it because it's too generic? We could move it to `TestUtil` or we could provide assertion methods alongside the equality methods, if that helps. Maybe we could do both? I.e. add the `TestUtil` boolean-returning methods (`float/doubleEquals`) to `TestUtil` but then also add the sugar methods `void assertFloat/DoubleUlpEquals`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[I] The "PatternCaptureGroupTokenFilter" generates identical offsets, which causes issues with highlighting the string. [lucene]
shikhasharma3708 opened a new issue, #13783: URL: https://github.com/apache/lucene/issues/13783 ### Description I am implementing the PatternCaptureGroupTokenFilter in my code to generate tokens based on multiple regular expressions, with the goal of highlighting any matches found within the string. Currently, I am working with Lucene 9, but I am encountering the following error during execution. I'm using one of the latest Lucene jars (9.11.1) for PatternCaptureGroupTokenFilter, but the token positions are not as expected, making it difficult to accurately highlight the search results. Here are the tokens generated for the string: `{ "tokens" : [ { "token" : "test:data", "start_offset" : 0, "end_offset" : 9, "type" : "word", "position" : 0 }, { "token" : "test", "start_offset" : 0, "end_offset" : 9, "type" : "word", "position" : 0 }, { "token" : ":data", "start_offset" : 0, "end_offset" : 9, "type" : "word", "position" : 0 }, { "token" : "test:", "start_offset" : 0, "end_offset" : 9, "type" : "word", "position" : 0 }, { "token" : "test", "start_offset" : 10, "end_offset" : 14, "type" : "word", "position" : 1 } ] }` The generated offsets are not compatible for highlighting the searchedQuery. To generate different offsets, i tried using the `org.apache.lucene.analysis.tokenattributes.OffsetAttribute` Lucene package. which is giving me the different offsets but i am encountering another error while indexing the document. I am using the below java code. `public final class PatternCaptureGroupTokenFilter extends TokenFilter { private final CharTermAttribute charTermAttr = addAttribute(CharTermAttribute.class); private final PositionIncrementAttribute posAttr = addAttribute(PositionIncrementAttribute.class); private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class); private final TypeAttribute typeAttribute = addAttribute(TypeAttribute.class); private State state; private final Matcher[] matchers; private final CharsRefBuilder spare = new CharsRefBuilder(); private final int[] groupCounts; private final boolean preserveOriginal; private int[] currentGroup; private int currentMatcher; private int main_token_start; private int main_token_end; public PatternCaptureGroupTokenFilter(TokenStream input, boolean preserveOriginal, Pattern... patterns) { super(input); this.preserveOriginal = preserveOriginal; this.matchers = new Matcher[patterns.length]; this.groupCounts = new int[patterns.length]; this.currentGroup = new int[patterns.length]; for (int i = 0; i < patterns.length; i++) { this.matchers[i] = patterns[i].matcher(""); this.groupCounts[i] = this.matchers[i].groupCount(); this.currentGroup[i] = -1; } } private boolean nextCapture() { int min_offset = Integer.MAX_VALUE; currentMatcher = -1; Matcher matcher; for (int i = 0; i < matchers.length; i++) { matcher = matchers[i]; if (currentGroup[i] == -1) { currentGroup[i] = matcher.find() ? 1 : 0; } if (currentGroup[i] != 0) { while (currentGroup[i] < groupCounts[i] + 1) { final int start = matcher.start(currentGroup[i]); final int end = matcher.end(currentGroup[i]); if (start == end || preserveOriginal && start == 0 && spare.length() == end) { currentGroup[i]++; continue; } if (start < min_offset) { min_offset = start; currentMatcher = i; } break; } if (currentGroup[i] == groupCounts[i] + 1) { currentGroup[i] = -1; i--; } } } return currentMatcher != -1; } @Override public boolean incrementToken() throws IOException { if (currentMatcher != -1 && nextCapture()) { assert state != null; clearAttributes(); restoreState(state); final int start = matchers[currentMatcher] .start(currentGroup[currentMatcher]);
[PR] Change docValuesSkipIndex from a boolean to an enum. [lucene]
jpountz opened a new pull request, #13784: URL: https://github.com/apache/lucene/pull/13784 At the moment, our skip indexes record min/max ordinal/value per range of doc IDs. It would be natural to extend it to other pre-aggregated data such as a sum and value count, which facets could take advantage of. This change switches `docValuesSkipIndex` from a boolean to an enum so that we could release such changes in the future in an additive fashion, by adding constants to this enum and new methods to `DocValuesSkipper`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] First-class random access API for KnnVectorValues [lucene]
jpountz commented on PR #13779: URL: https://github.com/apache/lucene/pull/13779#issuecomment-2348633015 > think you had said 9/22 would be a feature freeze date I was thinking of doing it next week, but we can backport this PR even though the branch has been cut if it looks ready/safe. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] First-class random access API for KnnVectorValues [lucene]
jpountz commented on code in PR #13779: URL: https://github.com/apache/lucene/pull/13779#discussion_r1758641965 ## lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java: ## @@ -0,0 +1,281 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.index; + +import java.io.IOException; +import org.apache.lucene.codecs.lucene90.IndexedDISI; +import org.apache.lucene.document.KnnByteVectorField; +import org.apache.lucene.document.KnnFloatVectorField; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.util.Bits; + +/** + * This class abstracts addressing of document vector values indexed as {@link KnnFloatVectorField} + * or {@link KnnByteVectorField}. + * + * @lucene.experimental + */ +public abstract class KnnVectorValues { + + /** The iterator associated with these values. */ + protected DocIndexIterator iterator; + + /** Return the dimension of the vectors */ + public abstract int dimension(); + + /** + * Return the number of vectors for this field. + * + * @return the number of vectors returned by this iterator + */ + public abstract int size(); + + /** + * Return the docid of the document indexed with the given vector ordinal. This default + * implementation returns the argument and is appropriate for dense values implementations where + * every doc has a single value. + */ + public int ordToDoc(int ord) { +return ord; + } + + /** + * Creates a new copy of this {@link KnnVectorValues}. This is helpful when you need to access + * different values at once, to avoid overwriting the underlying vector returned. + */ + public abstract KnnVectorValues copy() throws IOException; + + /** Returns the vector byte length, defaults to dimension multiplied by float byte size */ + public int getVectorByteLength() { +return dimension() * getEncoding().byteSize; + } + + /** The vector encoding of these values. */ + public abstract VectorEncoding getEncoding(); + + /** Returns a Bits accepting docs accepted by the argument and having a vector value */ + public Bits getAcceptOrds(Bits acceptDocs) { +// FIXME: change default to return acceptDocs and provide this impl +// somewhere more specialized (in every non-dense impl). +if (acceptDocs == null) { + return null; +} +return new Bits() { + @Override + public boolean get(int index) { +return acceptDocs.get(ordToDoc(index)); + } + + @Override + public int length() { +return size(); + } +}; + } + + /** + * Return the iterator for this instance. If you need multiple iterators, call + * this.copy().iterator(). + */ + public DocIndexIterator iterator() { +if (iterator == null) { + iterator = createIterator(); +} +return iterator; + } + + /** + * Create an iterator for this instance; typically called once by iterator(). Wrapper + * value classes delegate to their inner instance's iterator and shouldn't implement this. + */ + protected DocIndexIterator createIterator() { +throw new UnsupportedOperationException(); + } + + /** + * A DocIdSetIterator that also provides an index() method tracking a distinct ordinal for a + * vector associated with each doc. + */ + public abstract static class DocIndexIterator extends DocIdSetIterator { + +/** return the value index (aka "ordinal" or "ord") corresponding to the current doc */ +public abstract int index(); + +@Override +public int advance(int target) throws IOException { + return slowAdvance(target); +} + +@Override +public long cost() { + throw new UnsupportedOperationException(); Review Comment: If we default cost() to returning size(), that would work for me. But I'm not comfortable with having implementations of DocIdSetIterator#cost that may throw, which means e.g. that they cannot be put in a Conjunction DISI(which will want to sort its clauses by cost). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apach
Re: [PR] Add unit-of-least-precision float comparison [lucene]
uschindler commented on PR #13723: URL: https://github.com/apache/lucene/pull/13723#issuecomment-2348681854 > > > I don't like the last commit because it changes from a assert-like method to a boolean returning method. > > > > > > I changed it away from an assertion because I liked this more. It makes it so you can assert on floats _not_ being equal or use their equality in a condition, without making an assertion statement much longer. Do you not like it because it's too generic? We could move it to `TestUtil` or we could provide assertion methods alongside the equality methods, if that helps. > > Maybe we could do both? > > I.e. add the `TestUtil` boolean-returning methods (`float/doubleEquals`) to `TestUtil` but then also add the sugar methods `void assertFloat/DoubleUlpEquals`? That was my intention. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]
uschindler commented on PR #13782: URL: https://github.com/apache/lucene/pull/13782#issuecomment-2348768324 > > The we should move the changes entry. Do we have a corresponding 9.x section already? > > We already have a similar change entry for #13252 under 9.11.0, so we should move this change entry? Yes, let's merge them an list both PRs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]
vsop-479 commented on PR #13782: URL: https://github.com/apache/lucene/pull/13782#issuecomment-2348774168 Ok, I will do it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add unit-of-least-precision float comparison [lucene]
stefanvodita commented on PR #13723: URL: https://github.com/apache/lucene/pull/13723#issuecomment-2349071396 I've moved the methods around and, as I was writing more tests, realised I'm not going to be as comprehensive as the originals tests, so I adapted those instead. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Replace Map with IntObjectHashMap for KnnVectorsReader [lucene]
bugmakerr commented on PR #13763: URL: https://github.com/apache/lucene/pull/13763#issuecomment-2349265068 @benwtrent @jpountz I have merged main branch into this one, can I get a review on this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Fix Flaky Test In TestBlockJoinBulkScorer [lucene]
Mikep86 opened a new pull request, #13785: URL: https://github.com/apache/lucene/pull/13785 Fix the `testSetMinCompetitiveScoreWithScoreModeMax` test, which sometimes failed due to randomizations in how the docs were scored -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] First-class random access API for KnnVectorValues [lucene]
ChrisHegarty commented on code in PR #13779: URL: https://github.com/apache/lucene/pull/13779#discussion_r1759099771 ## lucene/core/src/java/org/apache/lucene/codecs/lucene95/HasIndexSlice.java: ## @@ -14,23 +14,16 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -package org.apache.lucene.util.quantization; +package org.apache.lucene.codecs.lucene95; -import java.io.IOException; -import org.apache.lucene.util.hnsw.RandomAccessVectorValues; +import org.apache.lucene.store.IndexInput; /** - * Random access values for byte[], but also includes accessing the score correction - * constant for the current vector in the buffer. - * - * @lucene.experimental + * Implementors can return the IndexInput from which their values are read. For use by vector + * quantizers. */ -public interface RandomAccessQuantizedByteVectorValues extends RandomAccessVectorValues.Bytes { - - ScalarQuantizer getScalarQuantizer(); - - float getScoreCorrectionConstant(int vectorOrd) throws IOException; +public interface HasIndexSlice { - @Override - RandomAccessQuantizedByteVectorValues copy() throws IOException; + /** Returns an IndexInput from which to read this instance's values. */ + IndexInput getSlice(); Review Comment: I very much like this, and had something similar in a past unmarked PR. š ## lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java: ## @@ -0,0 +1,281 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.index; + +import java.io.IOException; +import org.apache.lucene.codecs.lucene90.IndexedDISI; +import org.apache.lucene.document.KnnByteVectorField; +import org.apache.lucene.document.KnnFloatVectorField; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.util.Bits; + +/** + * This class abstracts addressing of document vector values indexed as {@link KnnFloatVectorField} + * or {@link KnnByteVectorField}. + * + * @lucene.experimental + */ +public abstract class KnnVectorValues { + + /** The iterator associated with these values. */ + protected DocIndexIterator iterator; + + /** Return the dimension of the vectors */ + public abstract int dimension(); + + /** + * Return the number of vectors for this field. + * + * @return the number of vectors returned by this iterator + */ + public abstract int size(); + + /** + * Return the docid of the document indexed with the given vector ordinal. This default + * implementation returns the argument and is appropriate for dense values implementations where + * every doc has a single value. + */ + public int ordToDoc(int ord) { +return ord; + } + + /** + * Creates a new copy of this {@link KnnVectorValues}. This is helpful when you need to access + * different values at once, to avoid overwriting the underlying vector returned. + */ + public abstract KnnVectorValues copy() throws IOException; + + /** Returns the vector byte length, defaults to dimension multiplied by float byte size */ + public int getVectorByteLength() { Review Comment: yes. I think that it is needed. For example, for int4 the byte length depends on whether or not the vector data is compressed. Maybe update the javadoc - it's not always *float*. ## lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java: ## @@ -0,0 +1,281 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * l
Re: [PR] Use range optimizations for "slow" MultiTermQueries when terms happen to be contiguous [lucene]
iverase commented on PR #13693: URL: https://github.com/apache/lucene/pull/13693#issuecomment-2349297359 I have my doubts in this approach and I am unsure we should expose the scorer that way. On the other hand, I wonder if we can rewrite the query to a range query. I understand that in this case, you need to be able to apply the optimisation to all segments, but that's probably the most common case? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add unit-of-least-precision float comparison [lucene]
uschindler commented on code in PR #13723: URL: https://github.com/apache/lucene/pull/13723#discussion_r1759153816 ## lucene/CHANGES.txt: ## @@ -422,7 +422,9 @@ Build Other -(No changes) + +* GITHUB#13720: Add float comparison based on unit of least precision and use it to stop test failures caused by float + summation not being commutative in Java's IEEE 754 implementation. (Alex Herbert, Stefan Vodita) Review Comment: It is always not commutative regardless of implementation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add unit-of-least-precision float comparison [lucene]
uschindler commented on code in PR #13723: URL: https://github.com/apache/lucene/pull/13723#discussion_r1759155851 ## lucene/test-framework/src/java/org/apache/lucene/tests/util/LuceneTestCase.java: ## @@ -864,6 +864,14 @@ public static void assumeNoException(String msg, Exception e) { RandomizedTest.assumeNoException(msg, e); } + public static void assertFloatUlpEquals(final float x, final float y, final short maxUlps) { +assertTrue(TestUtil.floatUlpEquals(x, y, maxUlps)); Review Comment: Maybe add a message showing string of both values. The problem with assert true is that you get no useful output. This is why I wanted to have the assert methods. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Can we remove `compress` option for quantized KNN vector indexing? [lucene]
mikemccand commented on issue #13768: URL: https://github.com/apache/lucene/issues/13768#issuecomment-2349362383 >> But I don't think we should block removing compress option due to non-SIMD results? Actually, thinking about this more ... I'm changing my mind. I don't fully understand how poor our Panama/SIMD coverage is across CPU types/versions, "typically" in use by our users. E.g. for ARM CPUs (various versions of NEON instructions). What %tg of our users would hit the non-SIMD (non-Panama) path? It's spooky that the likes of OpenSearch, Elasticsearch, Solr are needing to pull in their own Panama FMA wrappers around native code to better optimize for certain vectorized instruction cases (see discussion on #13572). Ideally such optimizations would be in Lucene so we could make decisions like this (remove `compress` option to simplify our API / reduce surface area) with more confidence. I'd like to run benchmarks across many more CPUs before rushing to a decision here, and I think for now we should otherwise respect the non-SIMD results? I love our new `aws-jmh` dev tool (thank you @rmuir)! I looked at its `playbook.yml` to figure out if I could also add "go check out `luceneutil`, download this massive 95 GB `.vec` file, and run `knnPerfTest.py` and summarize the results" but I haven't made much progress so far ... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C⦠[lucene]
ChrisHegarty commented on PR #13572: URL: https://github.com/apache/lucene/pull/13572#issuecomment-2349442717 > > Anyways: At moment we do not want to have native code in Lucene Core. > .. > Having the likes of OpenSearch, Elasticsearch, and Solr implement their own (high performance) direct native SIMD access seems not ideal to me. It really should somehow be an option in Lucene, even if it's not the default. As someone who wrote and maintains the native scorer in Elasticsearch, it gives great speed ups for Scalar Quantised vectors on ARM. Since Panama Vector is horrible for arithmetic operations on byte-sized values; both widen and accessing various parts of the vector. We have an AVX implementation too, but it gives less improvements. I don't see Panama Vector improving in this area any time soon :-(. For floats we don't do any native. In the latest Binary Quantization https://github.com/apache/lucene/pull/13651, Panama Vector performs very nicely. It might be that some of the motivation for a native scorer may just go away. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add unit-of-least-precision float comparison [lucene]
stefanvodita commented on code in PR #13723: URL: https://github.com/apache/lucene/pull/13723#discussion_r1759505470 ## lucene/CHANGES.txt: ## @@ -422,7 +422,9 @@ Build Other -(No changes) + +* GITHUB#13720: Add float comparison based on unit of least precision and use it to stop test failures caused by float + summation not being commutative in Java's IEEE 754 implementation. (Alex Herbert, Stefan Vodita) Review Comment: Now that I think about it, it's not even a commutativity issue. It's an associativity issue. `a + b == b + a` But `(a + b) + c != a + (b + c)` I'll fix the entry. ## lucene/test-framework/src/java/org/apache/lucene/tests/util/LuceneTestCase.java: ## @@ -864,6 +864,14 @@ public static void assumeNoException(String msg, Exception e) { RandomizedTest.assumeNoException(msg, e); } + public static void assertFloatUlpEquals(final float x, final float y, final short maxUlps) { +assertTrue(TestUtil.floatUlpEquals(x, y, maxUlps)); Review Comment: Good point, will do. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]
vsop-479 commented on PR #13782: URL: https://github.com/apache/lucene/pull/13782#issuecomment-2350779495 > let's merge them an list both PRs. I merged them into one entry. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Remove recurse into sub block when scan leaf block in IDVersionSegmentTermsEnumFrame. [lucene]
vsop-479 opened a new pull request, #13786: URL: https://github.com/apache/lucene/pull/13786 ### Description -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove usage of IndexSearcher#search(Query, Collector) from join package [lucene]
msfroh commented on code in PR #13747: URL: https://github.com/apache/lucene/pull/13747#discussion_r1759648681 ## lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java: ## @@ -0,0 +1,26 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search.join; + +import java.io.IOException; +import org.apache.lucene.search.Collector; + +interface MergeableCollector> extends Collector { + + /** Folds the results from another collector of the same type into this collector. */ + void merge(C collector) throws IOException; +} Review Comment: Yeah... I think you're right. I'll still need an abstraction to describe what the Collector produces, but then either that "thing" will know how to merge themselves, or it can be an anonymous function passed to the CollectorManager. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]
vsop-479 commented on PR #13782: URL: https://github.com/apache/lucene/pull/13782#issuecomment-2350800984 > can we find other occurrences using some regex searches? Hmm, I also find some suffix's loop comparing in `IDVersionSegmentTermsEnumFrame` and `OrdsSegmentTermsEnumFrame`, similar to `SegmentTermsEnumFrame`, these code maybe could replaced by `Arrays#compareUnsigned` too. But, firstly I find we attempt to load sub block when scanning a leaf block: https://github.com/apache/lucene/pull/13786. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Fix comment on compare suffix and target. [lucene]
vsop-479 opened a new pull request, #13787: URL: https://github.com/apache/lucene/pull/13787 ### Description Since we do not loop over bytes (hand-written) in the suffix, to comaring to the target anymore. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove usage of IndexSearcher#search(Query, Collector) from join package [lucene]
msfroh commented on code in PR #13747: URL: https://github.com/apache/lucene/pull/13747#discussion_r1759668285 ## lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java: ## @@ -0,0 +1,26 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search.join; + +import java.io.IOException; +import org.apache.lucene.search.Collector; + +interface MergeableCollector> extends Collector { + + /** Folds the results from another collector of the same type into this collector. */ + void merge(C collector) throws IOException; +} Review Comment: Oh, wait, I'm a moron. I should just more or less turn all of these Collectors into CollectorManagers. Then I'm not duplicating anything. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Reduce number of calculations in FSTCompiler [lucene]
mrhbj opened a new pull request, #13788: URL: https://github.com/apache/lucene/pull/13788 ### Description -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org