Re: [PR] Backport to 9x: Initialize facet counting data structures lazily #12408 [lucene]
stefanvodita merged PR #13300: URL: https://github.com/apache/lucene/pull/13300 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Datacube format changes to support materialized views [lucene]
Bukhtawar commented on code in PR #13342: URL: https://github.com/apache/lucene/pull/13342#discussion_r1588928997 ## lucene/core/src/java/org/apache/lucene/index/SegmentInfo.java: ## @@ -124,7 +125,8 @@ public SegmentInfo( Map diagnostics, byte[] id, Map attributes, - Sort indexSort) { + Sort indexSort, + DataCubesConfig dataCubesConfig) { Review Comment: Should we overload this ctor to avoid changes across the board, unless you think this should be mandatory?
Re: [PR] Datacube format changes to support materialized views [lucene]
bharath-techie commented on code in PR #13342: URL: https://github.com/apache/lucene/pull/13342#discussion_r1589030697 ## lucene/core/src/java/org/apache/lucene/index/SegmentInfo.java: ## @@ -124,7 +125,8 @@ public SegmentInfo( Map diagnostics, byte[] id, Map attributes, - Sort indexSort) { + Sort indexSort, + DataCubesConfig dataCubesConfig) { Review Comment: Yeah, will revert the second commit, which removed the overload. I think the overload is good; we can remove it if needed based on feedback.
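The overload discussed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `SegmentInfoSketch` is a hypothetical, heavily reduced stand-in for `SegmentInfo`, the `DataCubesConfig` class here is a placeholder with an invented `name` field, and passing `null` for "no data cube configured" is an assumption.

```java
import java.util.Map;

// Hypothetical placeholder for the DataCubesConfig type introduced in the PR.
final class DataCubesConfig {
  final String name;

  DataCubesConfig(String name) {
    this.name = name;
  }
}

// Hypothetical, reduced stand-in for SegmentInfo; the real constructor takes many more arguments.
final class SegmentInfoSketch {
  final String name;
  final Map<String, String> attributes;
  final DataCubesConfig dataCubesConfig; // assumption: null means no data cube is configured

  // Existing signature, kept so that current callers compile unchanged.
  SegmentInfoSketch(String name, Map<String, String> attributes) {
    this(name, attributes, null);
  }

  // New overload for callers that want to supply a data cube configuration.
  SegmentInfoSketch(String name, Map<String, String> attributes, DataCubesConfig dataCubesConfig) {
    this.name = name;
    this.attributes = Map.copyOf(attributes);
    this.dataCubesConfig = dataCubesConfig;
  }
}
```

Keeping the old signature as a delegating overload limits the change to call sites that actually need the new parameter; if the config later becomes mandatory, the shorter overload can be removed.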
Re: [PR] Add new VectorScorer interface to vector value iterators [lucene]
benwtrent commented on code in PR #13181: URL: https://github.com/apache/lucene/pull/13181#discussion_r1589212230 ## lucene/core/src/java/org/apache/lucene/search/VectorScorer.java: ## @@ -18,64 +18,39 @@ import java.io.IOException; import org.apache.lucene.index.ByteVectorValues; -import org.apache.lucene.index.FieldInfo; import org.apache.lucene.index.FloatVectorValues; -import org.apache.lucene.index.LeafReaderContext; import org.apache.lucene.index.VectorSimilarityFunction; /** * Computes the similarity score between a given query vector and different document vectors. This - * is primarily used by {@link KnnFloatVectorQuery} to run an exact, exhaustive search over the - * vectors. + * is used for exact searching and scoring + * + * @lucene.experimental */ -abstract class VectorScorer { - protected final VectorSimilarityFunction similarity; +public interface VectorScorer { /** - * Create a new vector scorer instance. + * Compute the score for the current document ID. * - * @param context the reader context - * @param fi the FieldInfo for the field containing document vectors - * @param query the query vector to compute the similarity for + * @return the score for the current document ID + * @throws IOException if an exception occurs during score computation */ - static FloatVectorScorer create(LeafReaderContext context, FieldInfo fi, float[] query) - throws IOException { -FloatVectorValues values = context.reader().getFloatVectorValues(fi.name); -if (values == null) { - FloatVectorValues.checkField(context.reader(), fi.name); - return null; -} -final VectorSimilarityFunction similarity = fi.getVectorSimilarityFunction(); -return new FloatVectorScorer(values, query, similarity); - } - - static ByteVectorScorer create(LeafReaderContext context, FieldInfo fi, byte[] query) - throws IOException { -ByteVectorValues values = context.reader().getByteVectorValues(fi.name); -if (values == null) { - ByteVectorValues.checkField(context.reader(), fi.name); - return null; 
-} -VectorSimilarityFunction similarity = fi.getVectorSimilarityFunction(); -return new ByteVectorScorer(values, query, similarity); - } - - VectorScorer(VectorSimilarityFunction similarity) { -this.similarity = similarity; - } + float score() throws IOException; - /** Compute the similarity score for the current document. */ - abstract float score() throws IOException; - - abstract boolean advanceExact(int doc) throws IOException; + /** + * @return a {@link DocIdSetIterator} over the documents. + */ + DocIdSetIterator iterator(); Review Comment: I have been thinking more. I think it would be good for the VectorScorer to use a "copy" of the iterator. This way the ONLY way to iterate is the iterator returned by the scorer. Allowing both the VectorValues and an iterator returned from the scorer to refer to the same internal iterator seems trappy. Requiring iteration with VectorScorer#iterator seems more natural and safer. I will update the PR accordingly soon.
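One way to picture the "scorer owns its own iterator" idea is the sketch below. It is an illustration only: `DocIterator`, `VectorScorerSketch` and `VectorValuesSketch` are hypothetical, simplified stand-ins (no `IOException`, a plain dot product as the similarity), not Lucene's `DocIdSetIterator`, `VectorScorer` or vector values classes.

```java
// Hypothetical, simplified stand-in for Lucene's DocIdSetIterator.
interface DocIterator {
  int NO_MORE_DOCS = Integer.MAX_VALUE;

  int docID();

  int nextDoc();
}

// Simplified shape of the proposed interface (IOException omitted for brevity).
interface VectorScorerSketch {
  float score();

  DocIterator iterator();
}

// Hypothetical vector storage: docs[i] is the doc ID that owns vectors[i].
final class VectorValuesSketch {
  private final int[] docs;
  private final float[][] vectors;

  VectorValuesSketch(int[] docs, float[][] vectors) {
    this.docs = docs;
    this.vectors = vectors;
  }

  // Each call returns a scorer with a private cursor, so advancing one scorer
  // never moves another scorer or the underlying values.
  VectorScorerSketch scorer(float[] query) {
    return new VectorScorerSketch() {
      private int idx = -1;

      private final DocIterator it =
          new DocIterator() {
            @Override
            public int docID() {
              if (idx < 0) return -1;
              return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
            }

            @Override
            public int nextDoc() {
              if (idx < docs.length) idx++;
              return docID();
            }
          };

      @Override
      public float score() {
        if (idx < 0 || idx >= docs.length) {
          throw new IllegalStateException("iterator is not positioned on a document");
        }
        float sum = 0f;
        for (int i = 0; i < query.length; i++) {
          sum += query[i] * vectors[idx][i];
        }
        return sum; // dot product used as a placeholder similarity
      }

      @Override
      public DocIterator iterator() {
        return it;
      }
    };
  }
}
```

Because the cursor is private to each scorer, the only way to change which document `score()` refers to is the iterator that scorer hands out, which is the property the comment is asking for.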
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
uschindler commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1589285702 ## lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java: ## @@ -198,6 +201,11 @@ private static void ensureCaller() { private static final class Holder { private Holder() {} -static final VectorizationProvider INSTANCE = lookup(false); +// TODO: this is not quite right. But we should be able to run tests with Panama Vector +static boolean testMode() { + return TESTS_VECTOR_SIZE.isPresent() || TESTS_FORCE_INTEGER_VECTORS; Review Comment: Actually at the moment the easiest is to pass CI=true as environment variable to Gradle. ## lucene/core/src/java/org/apache/lucene/codecs/hnsw/FlatVectorScorerProvider.java: ## @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.codecs.hnsw; + +import java.lang.invoke.MethodHandles; +import java.lang.invoke.MethodType; +import org.apache.lucene.internal.vectorization.VectorizationProvider; + +/** + * A utility class that provides access to the default FlatVectorsScorer. + * + * @lucene.experimental + */ +public class FlatVectorScorerProvider { + + /** Returns the default FlatVectorsScorer. 
*/ + public static FlatVectorsScorer createDefault() { +if (isPanamaVectorUtilSupportEnabled()) { + // we only enable this scorer if the Panama vector provider is also enabled + return lookup(); +} +return new DefaultFlatVectorScorer(); + } + + public static FlatVectorsScorer lookup() { Review Comment: I don't like the additional code here to look up the FlatVectorScorerProvider. This should be done in VectorizationProvider, so we should add a method `newFlatVectorScorer()` to VectorizationProvider. The default provider returns only the default scorer; the other one returns the wrapped one. This is how the VectorizationProvider interface was designed: it should get methods to create "vectorized" instances or return static instances. So this whole class should be removed and the code should just call VectorizationProvider newDefaultFlatVectorScorer().
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
uschindler commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1589290658 ## lucene/core/src/java/org/apache/lucene/codecs/hnsw/FlatVectorScorerProvider.java: ## @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.codecs.hnsw; + +import java.lang.invoke.MethodHandles; +import java.lang.invoke.MethodType; +import org.apache.lucene.internal.vectorization.VectorizationProvider; + +/** + * A utility class that provides access to the default FlatVectorsScorer. + * + * @lucene.experimental + */ +public class FlatVectorScorerProvider { + + /** Returns the default FlatVectorsScorer. */ + public static FlatVectorsScorer createDefault() { +if (isPanamaVectorUtilSupportEnabled()) { + // we only enable this scorer if the Panama vector provider is also enabled + return lookup(); +} +return new DefaultFlatVectorScorer(); + } + + public static FlatVectorsScorer lookup() { Review Comment: I don't like the additional code here to lookup the FlatVectorScorerProvider. This should be done in VectorizationProvider. So we should add a method `newFlatVectorScorer()` to VectorizationProvider. 
The default provider returns only the default scorer; the other one returns the wrapped one. This is how the VectorizationProvider interface was designed: it should get methods to create or return instances of "vectorized" implementations, e.g. VectorUtilSupport or, here, a FlatVectorScorer. So this whole class should be removed and the code should just call VectorizationProvider newDefaultFlatVectorScorer().
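The provider design described in this comment might look roughly like the sketch below. All names are hypothetical stand-ins (`ProviderSketch` for `VectorizationProvider`, `FlatScorerSketch` for `FlatVectorsScorer`), and the boolean passed to `lookup` is an assumption replacing the real runtime checks; the actual method name and wiring in the PR may differ.

```java
// Hypothetical stand-in for FlatVectorsScorer.
interface FlatScorerSketch {
  String describe();
}

// Hypothetical stand-in for VectorizationProvider: one factory method per
// component that may have a vectorized implementation.
abstract class ProviderSketch {
  abstract FlatScorerSketch newFlatVectorScorer();

  // The single place that decides which implementation family is in use.
  // Assumption: a boolean replaces the real JVM and module checks.
  static ProviderSketch lookup(boolean vectorizationAvailable) {
    return vectorizationAvailable ? new VectorizedProviderSketch() : new DefaultProviderSketch();
  }
}

final class DefaultProviderSketch extends ProviderSketch {
  @Override
  FlatScorerSketch newFlatVectorScorer() {
    return () -> "default";
  }
}

final class VectorizedProviderSketch extends ProviderSketch {
  @Override
  FlatScorerSketch newFlatVectorScorer() {
    // Wraps the default scorer so it can fall back where no fast path applies.
    FlatScorerSketch fallback = new DefaultProviderSketch().newFlatVectorScorer();
    return () -> "vectorized(" + fallback.describe() + ")";
  }
}
```

With this shape, callers never check for Panama support themselves; they ask the provider for a scorer and receive whichever implementation the provider selected.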
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
uschindler commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1589312414 ## lucene/core/src/java/org/apache/lucene/codecs/hnsw/FlatVectorScorerProvider.java: ## @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.codecs.hnsw; + +import java.lang.invoke.MethodHandles; +import java.lang.invoke.MethodType; +import org.apache.lucene.internal.vectorization.VectorizationProvider; + +/** + * A utility class that provides access to the default FlatVectorsScorer. + * + * @lucene.experimental + */ +public class FlatVectorScorerProvider { + + /** Returns the default FlatVectorsScorer. */ + public static FlatVectorsScorer createDefault() { +if (isPanamaVectorUtilSupportEnabled()) { + // we only enable this scorer if the Panama vector provider is also enabled + return lookup(); +} +return new DefaultFlatVectorScorer(); + } + + public static FlatVectorsScorer lookup() { Review Comment: When this code is integrated into the VectorizationProvider, all those checks below also become obsolete, because everything that returns components with vectorization support is in one place. I may also change this here in your branch if required.
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
uschindler commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1589314731 ## lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java: ## @@ -198,6 +201,11 @@ private static void ensureCaller() { private static final class Holder { private Holder() {} -static final VectorizationProvider INSTANCE = lookup(false); +// TODO: this is not quite right. But we should be able to run tests with Panama Vector +static boolean testMode() { + return TESTS_VECTOR_SIZE.isPresent() || TESTS_FORCE_INTEGER_VECTORS; Review Comment: We must remove this again. The problem with tests is that they use a special JVM flag to disable tiered compilation, so C2 never gets active. With the `CI=true` trick we enable it. We should document this better, or maybe add another sysprop. The code added here makes no sense; it's just a workaround to enforce slowness. It works if you have a very fast machine or only run a few tests.
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
ChrisHegarty commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1589316210 ## lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java: ## @@ -198,6 +201,11 @@ private static void ensureCaller() { private static final class Holder { private Holder() {} -static final VectorizationProvider INSTANCE = lookup(false); +// TODO: this is not quite right. But we should be able to run tests with Panama Vector +static boolean testMode() { + return TESTS_VECTOR_SIZE.isPresent() || TESTS_FORCE_INTEGER_VECTORS; Review Comment: ha! yes. I forgot. Thanks for the reminder. I'll use that for testing. ``` export CI=true; ./gradlew :lucene:core:test ```
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
ChrisHegarty commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1589317289 ## lucene/core/src/java/org/apache/lucene/codecs/hnsw/FlatVectorScorerProvider.java: ## @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.codecs.hnsw; + +import java.lang.invoke.MethodHandles; +import java.lang.invoke.MethodType; +import org.apache.lucene.internal.vectorization.VectorizationProvider; + +/** + * A utility class that provides access to the default FlatVectorsScorer. + * + * @lucene.experimental + */ +public class FlatVectorScorerProvider { + + /** Returns the default FlatVectorsScorer. */ + public static FlatVectorsScorer createDefault() { +if (isPanamaVectorUtilSupportEnabled()) { + // we only enable this scorer if the Panama vector provider is also enabled + return lookup(); +} +return new DefaultFlatVectorScorer(); + } + + public static FlatVectorsScorer lookup() { Review Comment: This is much cleaner - done. Maybe you have further improvements, but I think it looks much better now.
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
ChrisHegarty commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1589323816 ## lucene/core/src/test/org/apache/lucene/search/TestKnnByteVectorQueryMMap.java: ## @@ -0,0 +1,36 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search; + +import java.io.IOException; +import java.io.UncheckedIOException; +import org.apache.lucene.store.MMapDirectory; +import org.apache.lucene.tests.store.BaseDirectoryWrapper; +import org.apache.lucene.tests.store.MockDirectoryWrapper; + +public class TestKnnByteVectorQueryMMap extends TestKnnByteVectorQuery { Review Comment: I know that we don't strictly need this, but I do find it helpful to have testing over an mmap memory segment index input on every run. And also have it programmatically available, rather than through a system property.
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
ChrisHegarty commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1589316210 ## lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java: ## @@ -198,6 +201,11 @@ private static void ensureCaller() { private static final class Holder { private Holder() {} -static final VectorizationProvider INSTANCE = lookup(false); +// TODO: this is not quite right. But we should be able to run tests with Panama Vector +static boolean testMode() { + return TESTS_VECTOR_SIZE.isPresent() || TESTS_FORCE_INTEGER_VECTORS; Review Comment: ha! yes. I forgot. Thanks for the reminder. I'll use that for testing. Reverted the source change. ``` export CI=true; ./gradlew :lucene:core:test ```
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
msokolov commented on PR #13339: URL: https://github.com/apache/lucene/pull/13339#issuecomment-2093195536 So excited to see this finally come to fruition! No more double-buffering!
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
uschindler commented on PR #13339: URL: https://github.com/apache/lucene/pull/13339#issuecomment-2093209660 How can i change the review to "undecided"?
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
ChrisHegarty commented on PR #13339: URL: https://github.com/apache/lucene/pull/13339#issuecomment-2093212149 > How can i change the review to "undecided"? I re-requested ur review - so there is no official reviewer yet. Take ur time. I have some luceneutil benchmarks to run, etc.
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
uschindler commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1589339242 ## lucene/core/src/java/org/apache/lucene/codecs/hnsw/FlatVectorScorerProvider.java: ## @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.codecs.hnsw; + +import java.lang.invoke.MethodHandles; +import java.lang.invoke.MethodType; +import org.apache.lucene.internal.vectorization.VectorizationProvider; + +/** + * A utility class that provides access to the default FlatVectorsScorer. + * + * @lucene.experimental + */ +public class FlatVectorScorerProvider { + + /** Returns the default FlatVectorsScorer. */ + public static FlatVectorsScorer createDefault() { +if (isPanamaVectorUtilSupportEnabled()) { + // we only enable this scorer if the Panama vector provider is also enabled + return lookup(); +} +return new DefaultFlatVectorScorer(); + } + + public static FlatVectorsScorer lookup() { Review Comment: Yes, very nice now. The whole construct is still a bit strange, but at least in the main branch it is quite clean, because we know that MMapDir is always available with memory segments, so the vector provider does not need to differentiate. If we want to backport, there are more knobs that could change.
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
uschindler commented on PR #13339: URL: https://github.com/apache/lucene/pull/13339#issuecomment-2093214942 > > How can i change the review to "undecided"? > > I re-requested ur review - so there is no official reviewer yet. Take ur time. I have some luceneutil benchmarks to run, etc. The benchmark code seems broken after your changes.
Re: [I] IndexWriter loses track of parent field when index is empty [lucene]
msokolov commented on issue #13340: URL: https://github.com/apache/lucene/issues/13340#issuecomment-2093214232 @simonw if you get a moment, your perspective would be helpful. Should we be writing index metadata somewhere outside of a segment? Or tweak the hack we have...
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
uschindler commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1589359449 ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/MemorySegmentByteVectorScorerSupplier.java: ## @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.internal.vectorization; + +import java.io.IOException; +import java.lang.foreign.MemorySegment; +import java.util.Optional; +import org.apache.lucene.index.VectorSimilarityFunction; +import org.apache.lucene.store.FilterIndexInput; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.store.MemorySegmentAccess; +import org.apache.lucene.util.Bits; +import org.apache.lucene.util.hnsw.RandomAccessVectorValues; +import org.apache.lucene.util.hnsw.RandomVectorScorer; +import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier; + +/** A scorer of vectors whose element size is byte. 
*/ +public abstract sealed class MemorySegmentByteVectorScorerSupplier +implements RandomVectorScorerSupplier, RandomVectorScorer +permits DotProductByteVectorScorerSupplier, EuclideanByteVectorScorerSupplier { + final int vectorByteSize; + final int dims; + final int maxOrd; + final IndexInput input; + final MemorySegmentAccess memorySegmentAccess; + + final RandomAccessVectorValues values; // to support ordToDoc/getAcceptOrds + final byte[] scratch1, scratch2; + + MemorySegment first; + + /** + * Return an optional whose value, if present, is the scorer. Otherwise, an empty optional is + * returned. + */ + public static Optional create( + int dims, + int maxOrd, + int vectorByteSize, + VectorSimilarityFunction type, + IndexInput input, + RandomAccessVectorValues values) { +input = FilterIndexInput.unwrap(input); +if (!(input instanceof MemorySegmentAccess)) { + return Optional.empty(); +} +checkInvariants(maxOrd, vectorByteSize, input); +return switch (type) { + case DOT_PRODUCT -> Optional.of( + new DotProductByteVectorScorerSupplier(dims, maxOrd, vectorByteSize, input, values)); + case EUCLIDEAN -> Optional.of( + new EuclideanByteVectorScorerSupplier(dims, maxOrd, vectorByteSize, input, values)); + case MAXIMUM_INNER_PRODUCT -> Optional.empty(); // TODO: implement MAXIMUM_INNER_PRODUCT + case COSINE -> Optional.empty(); // TODO: implement Cosine +}; + } + + MemorySegmentByteVectorScorerSupplier( + int dims, int maxOrd, int vectorByteSize, IndexInput input, RandomAccessVectorValues values) { Review Comment: `input` should be typed as `MemorySegmentAccess` from the beginning, so the cast below can be removed. The cast belongs in the `create` method, where we already know what the type is.
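The suggestion above (do the type check and cast once in the factory, and give the constructor the narrower type) can be sketched roughly as follows. This is only an illustration under stated assumptions: `Input`, `SegmentInput`, `ArraySegmentInput`, `PlainInput` and `TypedScorer` are hypothetical stand-ins and are not Lucene's `IndexInput`, `MemorySegmentAccess` or the scorer classes from the PR.

```java
import java.util.Optional;

// Hypothetical stand-ins for illustration only; these are NOT Lucene's types.
interface Input {
  long length();
}

// Hypothetical narrower interface that exposes the fast-path access.
interface SegmentInput extends Input {
  byte byteAt(long pos);
}

final class ArraySegmentInput implements SegmentInput {
  private final byte[] data;

  ArraySegmentInput(byte[] data) {
    this.data = data;
  }

  @Override
  public long length() {
    return data.length;
  }

  @Override
  public byte byteAt(long pos) {
    return data[(int) pos];
  }
}

// An input that does not support the fast path.
final class PlainInput implements Input {
  @Override
  public long length() {
    return 0;
  }
}

final class TypedScorer {
  // Already narrowed by the factory, so no cast is needed after construction.
  private final SegmentInput input;

  private TypedScorer(SegmentInput input) {
    this.input = input;
  }

  // The only place that checks and casts: callers get an empty Optional
  // when the input does not support the fast path.
  static Optional<TypedScorer> create(Input input) {
    if (!(input instanceof SegmentInput)) {
      return Optional.empty();
    }
    return Optional.of(new TypedScorer((SegmentInput) input));
  }

  byte firstByte() {
    return input.byteAt(0);
  }
}
```

With this shape, `TypedScorer.create(new PlainInput())` yields an empty `Optional`, while an `ArraySegmentInput` yields a scorer whose field needs no further casting.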
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
ChrisHegarty commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1589375462 ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/MemorySegmentByteVectorScorerSupplier.java: ## @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.internal.vectorization; + +import java.io.IOException; +import java.lang.foreign.MemorySegment; +import java.util.Optional; +import org.apache.lucene.index.VectorSimilarityFunction; +import org.apache.lucene.store.FilterIndexInput; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.store.MemorySegmentAccess; +import org.apache.lucene.util.Bits; +import org.apache.lucene.util.hnsw.RandomAccessVectorValues; +import org.apache.lucene.util.hnsw.RandomVectorScorer; +import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier; + +/** A scorer of vectors whose element size is byte. 
*/ +public abstract sealed class MemorySegmentByteVectorScorerSupplier +implements RandomVectorScorerSupplier, RandomVectorScorer +permits DotProductByteVectorScorerSupplier, EuclideanByteVectorScorerSupplier { + final int vectorByteSize; + final int dims; + final int maxOrd; + final IndexInput input; + final MemorySegmentAccess memorySegmentAccess; + + final RandomAccessVectorValues values; // to support ordToDoc/getAcceptOrds + final byte[] scratch1, scratch2; + + MemorySegment first; + + /** + * Return an optional whose value, if present, is the scorer. Otherwise, an empty optional is + * returned. + */ + public static Optional create( + int dims, + int maxOrd, + int vectorByteSize, + VectorSimilarityFunction type, + IndexInput input, + RandomAccessVectorValues values) { +input = FilterIndexInput.unwrap(input); +if (!(input instanceof MemorySegmentAccess)) { + return Optional.empty(); +} +checkInvariants(maxOrd, vectorByteSize, input); +return switch (type) { + case DOT_PRODUCT -> Optional.of( + new DotProductByteVectorScorerSupplier(dims, maxOrd, vectorByteSize, input, values)); + case EUCLIDEAN -> Optional.of( + new EuclideanByteVectorScorerSupplier(dims, maxOrd, vectorByteSize, input, values)); + case MAXIMUM_INNER_PRODUCT -> Optional.empty(); // TODO: implement MAXIMUM_INNER_PRODUCT + case COSINE -> Optional.empty(); // TODO: implement Cosine +}; + } + + MemorySegmentByteVectorScorerSupplier( + int dims, int maxOrd, int vectorByteSize, IndexInput input, RandomAccessVectorValues values) { Review Comment: yeah, there is a clear tension here. I created MemorySegmentAccess so as to avoid making MemorySegmentIndexInput public. We can keep it, and check and cast `input` to both types we need `IndexInput` and `MemorySegmentAccess`. Or we can try to unify the types. I don't have a strong preference, other than we should do one or the other. -- This is an automated message from the Apache Git Service. 
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
ChrisHegarty commented on PR #13339: URL: https://github.com/apache/lucene/pull/13339#issuecomment-2093300534 > Dismissing Uwe's review, since he is undecided. Can be explicitly added later, when we convince him ;-) (Oh, this looks harsh!) I hope that I did this right. If not, I apologise. No offence intended. I just want to reflect @uschindler's comment above about being currently undecided.
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
ChrisHegarty commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1589410048 ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/MemorySegmentByteVectorScorerSupplier.java: ## @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.internal.vectorization; + +import java.io.IOException; +import java.lang.foreign.MemorySegment; +import java.util.Optional; +import org.apache.lucene.index.VectorSimilarityFunction; +import org.apache.lucene.store.FilterIndexInput; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.store.MemorySegmentAccess; +import org.apache.lucene.util.Bits; +import org.apache.lucene.util.hnsw.RandomAccessVectorValues; +import org.apache.lucene.util.hnsw.RandomVectorScorer; +import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier; + +/** A scorer of vectors whose element size is byte. 
*/ +public abstract sealed class MemorySegmentByteVectorScorerSupplier +implements RandomVectorScorerSupplier, RandomVectorScorer +permits DotProductByteVectorScorerSupplier, EuclideanByteVectorScorerSupplier { + final int vectorByteSize; + final int dims; + final int maxOrd; + final IndexInput input; + final MemorySegmentAccess memorySegmentAccess; + + final RandomAccessVectorValues values; // to support ordToDoc/getAcceptOrds + final byte[] scratch1, scratch2; + + MemorySegment first; + + /** + * Return an optional whose value, if present, is the scorer. Otherwise, an empty optional is + * returned. + */ + public static Optional create( + int dims, + int maxOrd, + int vectorByteSize, + VectorSimilarityFunction type, + IndexInput input, + RandomAccessVectorValues values) { +input = FilterIndexInput.unwrap(input); +if (!(input instanceof MemorySegmentAccess)) { + return Optional.empty(); +} +checkInvariants(maxOrd, vectorByteSize, input); +return switch (type) { + case DOT_PRODUCT -> Optional.of( + new DotProductByteVectorScorerSupplier(dims, maxOrd, vectorByteSize, input, values)); + case EUCLIDEAN -> Optional.of( + new EuclideanByteVectorScorerSupplier(dims, maxOrd, vectorByteSize, input, values)); + case MAXIMUM_INNER_PRODUCT -> Optional.empty(); // TODO: implement MAXIMUM_INNER_PRODUCT + case COSINE -> Optional.empty(); // TODO: implement Cosine +}; + } + + MemorySegmentByteVectorScorerSupplier( + int dims, int maxOrd, int vectorByteSize, IndexInput input, RandomAccessVectorValues values) { Review Comment: I cleaned this up a bit. It's probably ok now. But feel free to refactor it if you have a better idea, or just wanna make MemorySegmentIndexInput public. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
uschindler commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1589472836 ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/DotProductByteVectorScorerSupplier.java: ## @@ -0,0 +1,46 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.internal.vectorization; + +import java.io.IOException; +import org.apache.lucene.store.MemorySegmentAccessInput; +import org.apache.lucene.util.hnsw.RandomAccessVectorValues; + +final class DotProductByteVectorScorerSupplier extends MemorySegmentByteVectorScorerSupplier { + + DotProductByteVectorScorerSupplier( + int dims, + int maxOrd, + int vectorByteSize, + MemorySegmentAccessInput input, + RandomAccessVectorValues values) { +super(dims, maxOrd, vectorByteSize, input, values); + } + + @Override + public float score(int node) throws IOException { +// divide by 2 * 2^14 (maximum absolute value of product of 2 signed bytes) * len +float raw = PanamaVectorUtilSupport.dotProduct(first, getSegment(node, scratch2)); +return 0.5f + raw / (float) (dims * (1 << 15)); + } + + @Override + public DotProductByteVectorScorerSupplier copy() throws IOException { +return new DotProductByteVectorScorerSupplier( +dims, maxOrd, vectorByteSize, input.clone(), values); Review Comment: this was the reason why we needed the original input! ## lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/VectorScorerBenchmark.java: ## @@ -0,0 +1,114 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.benchmark.jmh; + +import static org.apache.lucene.index.VectorSimilarityFunction.DOT_PRODUCT; + +import java.io.IOException; +import java.nio.file.Files; +import java.util.concurrent.ThreadLocalRandom; +import java.util.concurrent.TimeUnit; +import org.apache.lucene.codecs.lucene95.OffHeapByteVectorValues; +import org.apache.lucene.internal.vectorization.VectorizationProvider; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.store.MMapDirectory; +import org.apache.lucene.util.IOUtils; +import org.apache.lucene.util.hnsw.RandomAccessVectorValues; +import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier; +import org.openjdk.jmh.annotations.*; + +@BenchmarkMode(Mode.Throughput) +@OutputTimeUnit(TimeUnit.MICROSECONDS) +@State(Scope.Benchmark) +// first iteration is complete garbage, so make sure we really warmup +@Warmup(iterations = 4, time = 1) +// real iterations. not useful to spend tons of time here, better to fork more +@Measurement(iterations = 5, time = 1) +// engage some noise reduction +@Fork( +value = 3, +jvmArgsAppend = {"-Xmx2g", "-Xms2g", "-XX:+AlwaysPreTouch"}) +public class VectorScorerBenchmark { + + @Param({"1", "128", "207", "256", "300", "512", "702", "1024"}) + int size; + + Directory dir; + IndexInput in; + RandomAccessVectorValues vectorValues; + byte[] vec1, vec2; + RandomVectorScorerSupplier scorer; + + @Setup(Level.Iteration) + public void init() throws IOException { +vec1 = new byte[size]; +vec2 = new byte[size]; +ThreadLocalRandom.current().nextBytes(vec1); +ThreadLocalRandom.current().nextBytes(vec2); + +dir = new MMapDirectory(Files.createTempDirectory("VectorScor
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
uschindler commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1589477380 ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/DotProductByteVectorScorerSupplier.java: ## @@ -0,0 +1,46 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.internal.vectorization; + +import java.io.IOException; +import org.apache.lucene.store.MemorySegmentAccessInput; +import org.apache.lucene.util.hnsw.RandomAccessVectorValues; + +final class DotProductByteVectorScorerSupplier extends MemorySegmentByteVectorScorerSupplier { + + DotProductByteVectorScorerSupplier( + int dims, + int maxOrd, + int vectorByteSize, + MemorySegmentAccessInput input, + RandomAccessVectorValues values) { +super(dims, maxOrd, vectorByteSize, input, values); + } + + @Override + public float score(int node) throws IOException { +// divide by 2 * 2^14 (maximum absolute value of product of 2 signed bytes) * len +float raw = PanamaVectorUtilSupport.dotProduct(first, getSegment(node, scratch2)); +return 0.5f + raw / (float) (dims * (1 << 15)); + } + + @Override + public DotProductByteVectorScorerSupplier copy() throws IOException { +return new DotProductByteVectorScorerSupplier( +dims, maxOrd, vectorByteSize, input.clone(), values); Review Comment: it's cool that the interface automatically adds the bridge method... thanks javac! :-) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
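For readers unfamiliar with the bridge methods mentioned above, here is a minimal sketch of the general language feature, not of the PR's code. When an implementation overrides a method with a more specific (covariant) return type, javac emits a synthetic bridge method carrying the original return type so that callers going through the supertype still link. The names `Copyable`, `DotScorer` and `BridgeDemo` are hypothetical.

```java
import java.lang.reflect.Method;

// Hypothetical interface whose copy() returns the interface type.
interface Copyable {
  Copyable copy();
}

// The implementation narrows the return type (covariant return).
final class DotScorer implements Copyable {
  @Override
  public DotScorer copy() {
    return new DotScorer();
  }
}

final class BridgeDemo {
  // Looks for the compiler-generated bridge: a method named copy() that is
  // flagged as a bridge and still returns the interface type.
  static boolean hasBridgeCopy() {
    for (Method m : DotScorer.class.getDeclaredMethods()) {
      if (m.getName().equals("copy")
          && m.isBridge()
          && m.getReturnType() == Copyable.class) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println("bridge copy() present: " + hasBridgeCopy());
  }
}
```

Callers holding a `DotScorer` get a `DotScorer` back from `copy()` without casting, while code written against `Copyable` keeps working through the bridge.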
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
ChrisHegarty commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1589486120 ## lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/VectorScorerBenchmark.java: ## @@ -0,0 +1,114 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.benchmark.jmh; + +import static org.apache.lucene.index.VectorSimilarityFunction.DOT_PRODUCT; + +import java.io.IOException; +import java.nio.file.Files; +import java.util.concurrent.ThreadLocalRandom; +import java.util.concurrent.TimeUnit; +import org.apache.lucene.codecs.lucene95.OffHeapByteVectorValues; +import org.apache.lucene.internal.vectorization.VectorizationProvider; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.store.MMapDirectory; +import org.apache.lucene.util.IOUtils; +import org.apache.lucene.util.hnsw.RandomAccessVectorValues; +import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier; +import org.openjdk.jmh.annotations.*; + +@BenchmarkMode(Mode.Throughput) +@OutputTimeUnit(TimeUnit.MICROSECONDS) +@State(Scope.Benchmark) +// first iteration is complete garbage, so make sure we really warmup +@Warmup(iterations = 4, time = 1) +// real iterations. 
not useful to spend tons of time here, better to fork more +@Measurement(iterations = 5, time = 1) +// engage some noise reduction +@Fork( +value = 3, +jvmArgsAppend = {"-Xmx2g", "-Xms2g", "-XX:+AlwaysPreTouch"}) +public class VectorScorerBenchmark { + + @Param({"1", "128", "207", "256", "300", "512", "702", "1024"}) + int size; + + Directory dir; + IndexInput in; + RandomAccessVectorValues vectorValues; + byte[] vec1, vec2; + RandomVectorScorerSupplier scorer; + + @Setup(Level.Iteration) + public void init() throws IOException { +vec1 = new byte[size]; +vec2 = new byte[size]; +ThreadLocalRandom.current().nextBytes(vec1); +ThreadLocalRandom.current().nextBytes(vec2); + +dir = new MMapDirectory(Files.createTempDirectory("VectorScorerBenchmark")); +try (IndexOutput out = dir.createOutput("vector.data", IOContext.DEFAULT)) { + out.writeBytes(vec1, 0, vec1.length); + out.writeBytes(vec2, 0, vec2.length); +} +in = dir.openInput("vector.data", IOContext.DEFAULT); +vectorValues = vectorValues(size, 2, in); +scorer = +VectorizationProvider.getInstance() +.newFlatVectorScorer() +.getRandomVectorScorerSupplier(DOT_PRODUCT, vectorValues); + +// Ensure we're using the right vector scorer Review Comment: Was mainly for my own sanity. Can be removed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
uschindler commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1589486399 ## lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/VectorScorerBenchmark.java: ## @@ -0,0 +1,114 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.benchmark.jmh; + +import static org.apache.lucene.index.VectorSimilarityFunction.DOT_PRODUCT; + +import java.io.IOException; +import java.nio.file.Files; +import java.util.concurrent.ThreadLocalRandom; +import java.util.concurrent.TimeUnit; +import org.apache.lucene.codecs.lucene95.OffHeapByteVectorValues; +import org.apache.lucene.internal.vectorization.VectorizationProvider; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.store.MMapDirectory; +import org.apache.lucene.util.IOUtils; +import org.apache.lucene.util.hnsw.RandomAccessVectorValues; +import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier; +import org.openjdk.jmh.annotations.*; + +@BenchmarkMode(Mode.Throughput) +@OutputTimeUnit(TimeUnit.MICROSECONDS) +@State(Scope.Benchmark) +// first iteration is complete garbage, so make sure we really warmup +@Warmup(iterations = 4, time = 1) +// real iterations. 
not useful to spend tons of time here, better to fork more +@Measurement(iterations = 5, time = 1) +// engage some noise reduction +@Fork( +value = 3, +jvmArgsAppend = {"-Xmx2g", "-Xms2g", "-XX:+AlwaysPreTouch"}) +public class VectorScorerBenchmark { + + @Param({"1", "128", "207", "256", "300", "512", "702", "1024"}) + int size; + + Directory dir; + IndexInput in; + RandomAccessVectorValues vectorValues; + byte[] vec1, vec2; + RandomVectorScorerSupplier scorer; + + @Setup(Level.Iteration) + public void init() throws IOException { +vec1 = new byte[size]; +vec2 = new byte[size]; +ThreadLocalRandom.current().nextBytes(vec1); +ThreadLocalRandom.current().nextBytes(vec2); + +dir = new MMapDirectory(Files.createTempDirectory("VectorScorerBenchmark")); +try (IndexOutput out = dir.createOutput("vector.data", IOContext.DEFAULT)) { + out.writeBytes(vec1, 0, vec1.length); + out.writeBytes(vec2, 0, vec2.length); +} +in = dir.openInput("vector.data", IOContext.DEFAULT); +vectorValues = vectorValues(size, 2, in); +scorer = +VectorizationProvider.getInstance() +.newFlatVectorScorer() +.getRandomVectorScorerSupplier(DOT_PRODUCT, vectorValues); + +// Ensure we're using the right vector scorer Review Comment: Anyway, if you have the wrong CPU, all the other benchmarks will also be only as fast as the non-vectorized ones. In that case the assertion simply fails, but it is not strictly needed. We print a log message when vectorization is enabled anyway, and you can see it in the benchmark output.
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
uschindler commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1589508058 ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/MemorySegmentByteVectorScorerSupplier.java: ## @@ -0,0 +1,134 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.internal.vectorization; + +import java.io.IOException; +import java.lang.foreign.MemorySegment; +import java.util.Optional; +import org.apache.lucene.index.VectorSimilarityFunction; +import org.apache.lucene.store.FilterIndexInput; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.store.MemorySegmentAccessInput; +import org.apache.lucene.util.Bits; +import org.apache.lucene.util.hnsw.RandomAccessVectorValues; +import org.apache.lucene.util.hnsw.RandomVectorScorer; +import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier; + +/** A scorer of vectors whose element size is byte. 
*/ +public abstract sealed class MemorySegmentByteVectorScorerSupplier +implements RandomVectorScorerSupplier, RandomVectorScorer +permits DotProductByteVectorScorerSupplier, EuclideanByteVectorScorerSupplier { + final int vectorByteSize; + final int dims; + final int maxOrd; + final MemorySegmentAccessInput input; + + final RandomAccessVectorValues values; // to support ordToDoc/getAcceptOrds + final byte[] scratch1, scratch2; + + MemorySegment first; + + /** + * Return an optional whose value, if present, is the scorer. Otherwise, an empty optional is + * returned. + */ + public static Optional create( + int dims, + int maxOrd, + int vectorByteSize, + VectorSimilarityFunction type, + IndexInput input, + RandomAccessVectorValues values) { +input = FilterIndexInput.unwrap(input); Review Comment: this is my only problem: if we unwrap all index inputs we may break some custom wrapper code of users? We should not do this. If we can get rid of that I am fine. Do we wrap the readers at other places, too!? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
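The concern above about unwrapping can be illustrated with a small sketch: if a user wraps an input to add behaviour (here, counting reads) and library code silently unwraps it, the wrapper's behaviour is bypassed. Everything in this example is hypothetical (`ByteSource`, `ArraySource`, `CountingSource`, `UnwrapDemo`); it does not use Lucene's `FilterIndexInput` or model its actual semantics.

```java
// Hypothetical minimal input abstraction, for illustration only.
interface ByteSource {
  byte read(int pos);
}

final class ArraySource implements ByteSource {
  private final byte[] data;

  ArraySource(byte[] data) {
    this.data = data;
  }

  @Override
  public byte read(int pos) {
    return data[pos];
  }
}

// Hypothetical user-supplied wrapper that observes every read.
final class CountingSource implements ByteSource {
  final ByteSource delegate;
  int reads;

  CountingSource(ByteSource delegate) {
    this.delegate = delegate;
  }

  @Override
  public byte read(int pos) {
    reads++;
    return delegate.read(pos);
  }
}

final class UnwrapDemo {
  // Strips all wrappers and returns the innermost source.
  static ByteSource unwrap(ByteSource source) {
    while (source instanceof CountingSource) {
      source = ((CountingSource) source).delegate;
    }
    return source;
  }

  // Sums the first n bytes through whatever source it is given.
  static int sum(ByteSource source, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
      total += source.read(i);
    }
    return total;
  }
}
```

Reading through `UnwrapDemo.unwrap(wrapped)` produces the same sum but leaves `wrapped.reads` untouched, which is the kind of silently skipped wrapper logic the comment warns about. Reading through `wrapped` directly keeps the wrapper in the loop.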
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
uschindler commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1589514775 ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/MemorySegmentByteVectorScorerSupplier.java: ## @@ -0,0 +1,134 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.internal.vectorization; + +import java.io.IOException; +import java.lang.foreign.MemorySegment; +import java.util.Optional; +import org.apache.lucene.index.VectorSimilarityFunction; +import org.apache.lucene.store.FilterIndexInput; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.store.MemorySegmentAccessInput; +import org.apache.lucene.util.Bits; +import org.apache.lucene.util.hnsw.RandomAccessVectorValues; +import org.apache.lucene.util.hnsw.RandomVectorScorer; +import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier; + +/** A scorer of vectors whose element size is byte. 
*/ +public abstract sealed class MemorySegmentByteVectorScorerSupplier +implements RandomVectorScorerSupplier, RandomVectorScorer +permits DotProductByteVectorScorerSupplier, EuclideanByteVectorScorerSupplier { + final int vectorByteSize; + final int dims; + final int maxOrd; + final MemorySegmentAccessInput input; + + final RandomAccessVectorValues values; // to support ordToDoc/getAcceptOrds + final byte[] scratch1, scratch2; + + MemorySegment first; + + /** + * Return an optional whose value, if present, is the scorer. Otherwise, an empty optional is + * returned. + */ + public static Optional create( + int dims, + int maxOrd, + int vectorByteSize, + VectorSimilarityFunction type, + IndexInput input, + RandomAccessVectorValues values) { +input = FilterIndexInput.unwrap(input); Review Comment: If we wrap some CFS files, it breaks, too.
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
ChrisHegarty commented on code in PR #13339:
URL: https://github.com/apache/lucene/pull/13339#discussion_r1589535535

## lucene/core/src/java21/org/apache/lucene/internal/vectorization/MemorySegmentByteVectorScorerSupplier.java: ##
@@ -0,0 +1,134 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.internal.vectorization;
+
+import java.io.IOException;
+import java.lang.foreign.MemorySegment;
+import java.util.Optional;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.store.FilterIndexInput;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.MemorySegmentAccessInput;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.hnsw.RandomAccessVectorValues;
+import org.apache.lucene.util.hnsw.RandomVectorScorer;
+import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier;
+
+/** A scorer of vectors whose element size is byte. */
+public abstract sealed class MemorySegmentByteVectorScorerSupplier
+    implements RandomVectorScorerSupplier, RandomVectorScorer
+    permits DotProductByteVectorScorerSupplier, EuclideanByteVectorScorerSupplier {
+  final int vectorByteSize;
+  final int dims;
+  final int maxOrd;
+  final MemorySegmentAccessInput input;
+
+  final RandomAccessVectorValues values; // to support ordToDoc/getAcceptOrds
+  final byte[] scratch1, scratch2;
+
+  MemorySegment first;
+
+  /**
+   * Return an optional whose value, if present, is the scorer. Otherwise, an empty optional is
+   * returned.
+   */
+  public static Optional create(
+      int dims,
+      int maxOrd,
+      int vectorByteSize,
+      VectorSimilarityFunction type,
+      IndexInput input,
+      RandomAccessVectorValues values) {
+    input = FilterIndexInput.unwrap(input);

Review Comment:
Argh!! We don’t have a pattern for unwrapping here??
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
uschindler commented on code in PR #13339:
URL: https://github.com/apache/lucene/pull/13339#discussion_r1589596339

## lucene/core/src/java21/org/apache/lucene/internal/vectorization/MemorySegmentByteVectorScorerSupplier.java: ##
@@ -0,0 +1,134 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.internal.vectorization;
+
+import java.io.IOException;
+import java.lang.foreign.MemorySegment;
+import java.util.Optional;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.store.FilterIndexInput;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.MemorySegmentAccessInput;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.hnsw.RandomAccessVectorValues;
+import org.apache.lucene.util.hnsw.RandomVectorScorer;
+import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier;
+
+/** A scorer of vectors whose element size is byte. */
+public abstract sealed class MemorySegmentByteVectorScorerSupplier
+    implements RandomVectorScorerSupplier, RandomVectorScorer
+    permits DotProductByteVectorScorerSupplier, EuclideanByteVectorScorerSupplier {
+  final int vectorByteSize;
+  final int dims;
+  final int maxOrd;
+  final MemorySegmentAccessInput input;
+
+  final RandomAccessVectorValues values; // to support ordToDoc/getAcceptOrds
+  final byte[] scratch1, scratch2;
+
+  MemorySegment first;
+
+  /**
+   * Return an optional whose value, if present, is the scorer. Otherwise, an empty optional is
+   * returned.
+   */
+  public static Optional create(
+      int dims,
+      int maxOrd,
+      int vectorByteSize,
+      VectorSimilarityFunction type,
+      IndexInput input,
+      RandomAccessVectorValues values) {
+    input = FilterIndexInput.unwrap(input);

Review Comment:
Can we simply remove the unwrapping here?
Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]
ChrisHegarty commented on code in PR #13339:
URL: https://github.com/apache/lucene/pull/13339#discussion_r1589680684

## lucene/core/src/java21/org/apache/lucene/internal/vectorization/MemorySegmentByteVectorScorerSupplier.java: ##
@@ -0,0 +1,134 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.internal.vectorization;
+
+import java.io.IOException;
+import java.lang.foreign.MemorySegment;
+import java.util.Optional;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.store.FilterIndexInput;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.MemorySegmentAccessInput;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.hnsw.RandomAccessVectorValues;
+import org.apache.lucene.util.hnsw.RandomVectorScorer;
+import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier;
+
+/** A scorer of vectors whose element size is byte. */
+public abstract sealed class MemorySegmentByteVectorScorerSupplier
+    implements RandomVectorScorerSupplier, RandomVectorScorer
+    permits DotProductByteVectorScorerSupplier, EuclideanByteVectorScorerSupplier {
+  final int vectorByteSize;
+  final int dims;
+  final int maxOrd;
+  final MemorySegmentAccessInput input;
+
+  final RandomAccessVectorValues values; // to support ordToDoc/getAcceptOrds
+  final byte[] scratch1, scratch2;
+
+  MemorySegment first;
+
+  /**
+   * Return an optional whose value, if present, is the scorer. Otherwise, an empty optional is
+   * returned.
+   */
+  public static Optional create(
+      int dims,
+      int maxOrd,
+      int vectorByteSize,
+      VectorSimilarityFunction type,
+      IndexInput input,
+      RandomAccessVectorValues values) {
+    input = FilterIndexInput.unwrap(input);

Review Comment:
Hmmm.. yeah maybe. The unwrapping was initially for an instanceof check, which we still do. Don’t we still need to unwrap to do this? Otherwise, it’ll not be executed much (at all?) in tests?
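
For readers following the thread, here is a minimal self-contained sketch of the trade-off being discussed: a factory that takes its fast path only when the input passes an `instanceof` check will fall back whenever the input is wrapped, unless the wrapper layers are peeled off first. The types `Input`, `SegmentInput`, and `WrappingInput` are hypothetical stand-ins, not Lucene classes, and the `unwrapFirst` flag exists only to show both outcomes side by side.

```java
import java.util.Optional;

/** Illustration only: none of the nested types below are Lucene classes. */
class UnwrapSketch {

  /** Hypothetical stand-in for an index input. */
  interface Input {}

  /** Hypothetical stand-in for an input that can expose its backing memory directly. */
  static final class SegmentInput implements Input {}

  /** Hypothetical stand-in for a delegating wrapper around another input. */
  static final class WrappingInput implements Input {
    final Input delegate;

    WrappingInput(Input delegate) {
      this.delegate = delegate;
    }
  }

  /** Peels off wrapper layers until a non-wrapper input is reached. */
  static Input unwrap(Input in) {
    while (in instanceof WrappingInput) {
      in = ((WrappingInput) in).delegate;
    }
    return in;
  }

  /**
   * Returns the fast-path input if the capability check passes, otherwise an empty optional,
   * in which case a caller would fall back to its default implementation.
   */
  static Optional<SegmentInput> create(Input in, boolean unwrapFirst) {
    Input candidate = unwrapFirst ? unwrap(in) : in;
    if (candidate instanceof SegmentInput) {
      return Optional.of((SegmentInput) candidate);
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    Input wrapped = new WrappingInput(new WrappingInput(new SegmentInput()));
    System.out.println(create(wrapped, true).isPresent()); // prints true
    System.out.println(create(wrapped, false).isPresent()); // prints false
  }
}
```

In this sketch, dropping the unwrap step means a wrapped input never reaches the fast path, which matches the concern above that the code would rarely be exercised in tests. It does not model the CFS breakage mentioned earlier in the thread.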