[GitHub] [lucene] javanna commented on pull request #12183: Make TermStates#build concurrent
javanna commented on PR #12183: URL: https://github.com/apache/lucene/pull/12183#issuecomment-1730930146 Great to see this merged, thanks @shubhamvishu for all the work as well as patience as we were figuring out a way forward! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] vsop-479 commented on pull request #12528: Early terminate visit BKD leaf when current value greater than upper point in sorted dim.
vsop-479 commented on PR #12528: URL: https://github.com/apache/lucene/pull/12528#issuecomment-1730932421 @iverase I replaced the int values with static variables. Please take a look. Actually, I used an enum to define the match states in a previous version, but it degraded performance a little. Static variables are fine, but do you think it is OK to use an enum to make the code more graceful, even though there is a small performance cost?
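For context, the trade-off being weighed can be sketched like this (a hypothetical standalone example, not the PR's actual code): plain `static final int` constants keep a hot-loop comparison cheap, while the enum variant trades a little indirection for type safety.

```java
// Hypothetical sketch of the two styles discussed above; the names mirror
// the PR's MatchState constants but this is not the actual Lucene code.
public class MatchStates {
  // int-constant style: cheap comparisons, no ordinal/valueOf indirection
  public static final int MATCH = 0;
  public static final int LOW = 1;
  public static final int HIGH_IN_SORTED_DIM = 2;

  // enum alternative: type-safe, slightly more indirection in hot loops
  public enum MatchState { MATCH, LOW, HIGH_IN_SORTED_DIM }

  // hot-loop check written against the int constants
  public static boolean shouldStop(int state) {
    return state == HIGH_IN_SORTED_DIM;
  }

  public static void main(String[] args) {
    if (!shouldStop(HIGH_IN_SORTED_DIM) || shouldStop(MATCH)) {
      throw new AssertionError();
    }
    System.out.println("ok");
  }
}
```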
[GitHub] [lucene] iverase commented on a diff in pull request #12528: Early terminate visit BKD leaf when current value greater than upper point in sorted dim.
iverase commented on code in PR #12528: URL: https://github.com/apache/lucene/pull/12528#discussion_r1334028608 ## lucene/core/src/java/org/apache/lucene/index/PointValues.java: ## @@ -228,6 +228,22 @@ public enum Relation { CELL_CROSSES_QUERY }; + /** Match states for current value. */ + public static final class MatchState { +private MatchState() {} + +/** Invalid state */ +public static final int INVALID = -1; +/** Packed value matches the range in this dimension */ +public static final int MATCH = 0; +/** Packed value is too low in this SORTED or NON-SORTED dimension */ +public static final int LOW = 1; +/** Packed value is too high in SORTED dimension */ +public static final int HIGH_IN_SORTED_DIM = 2; +/** Packed value is too high in NON-SORTED dimension */ +public static final int HIGH_IN_NON_SORTED_DIM = 3; + } Review Comment: My main concern here is that the concept of a SORTED dimension does not exist in the PointValues API. If you have a look at the javadocs for visiting a leaf node: ``` /** * Called for all documents in a leaf cell that crosses the query. The consumer should * scrutinize the packedValue to decide whether to accept it. In the 1D case, values are visited * in increasing order, and in the case of ties, in increasing docID order. */ ``` It only constrains the 1D case; in higher dimensions there is no constraint on how data is visited. The concept of a SORTED dimension sounds to me like an implementation detail that should not be leaked into the public API.
[GitHub] [lucene] iverase commented on a diff in pull request #12528: Early terminate visit BKD leaf when current value greater than upper point in sorted dim.
iverase commented on code in PR #12528: URL: https://github.com/apache/lucene/pull/12528#discussion_r1334030455 ## lucene/core/src/java/org/apache/lucene/index/PointValues.java: ## @@ -281,6 +297,12 @@ public interface PointTree extends Cloneable { * @lucene.experimental */ public interface IntersectVisitor { + +/** return true if this is an inverse visitor. */ +default boolean isInverse() { + return false; +} Review Comment: This method is difficult to grasp and sounds to me like an implementation detail.
[GitHub] [lucene] iverase commented on a diff in pull request #12528: Early terminate visit BKD leaf when current value greater than upper point in sorted dim.
iverase commented on code in PR #12528: URL: https://github.com/apache/lucene/pull/12528#discussion_r1334077045 ## lucene/core/src/java/org/apache/lucene/index/PointValues.java: ## @@ -317,6 +329,18 @@ default void visit(DocIdSetIterator iterator, byte[] packedValue) throws IOExcep } } +/** + * Similar to {@link IntersectVisitor#visit(DocIdSetIterator, byte[])} but return a match state. + */ +default int visitWithState(DocIdSetIterator iterator, byte[] packedValue, int sortedDim) Review Comment: I am wondering if we need to return different values. At the end of the day, we only need to know whether we should visit more points on the leaf. Have you tried something simpler, like: ``` /** Similar to {@link IntersectVisitor#visit(int, byte[])} but data is visited in * increasing order on the {@sortedDim}, and in the case of ties, in increasing docID order. * Implementers can stop processing points on the leaf by returning false when, for example, the * sorted dimension value is too high to be matched by the query. * * @return true if the visitor should continue visiting points on this leaf, otherwise false. * */ default boolean visitWithSortedDim(int docID, byte[] packedValue, int sortedDim) throws IOException { visit(docID, packedValue); return true; } ```
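The boolean-return idea can be illustrated with a stripped-down sketch (hypothetical types, not the real PointValues API): because leaf values arrive in increasing order on the sorted dimension, the caller can abandon the leaf the moment the visitor reports a value above the query's upper bound.

```java
import java.util.List;

// Simplified sketch of early termination on a sorted-dimension leaf; the
// Visitor interface and scanLeaf loop are illustrative, not Lucene's API.
public class SortedLeafScan {
  interface Visitor {
    /** @return true to keep visiting points on this leaf, false to stop */
    boolean visitWithSortedDim(int docID, int sortedDimValue);
  }

  /** Visits docs until the visitor asks to stop; returns how many were visited. */
  static int scanLeaf(List<int[]> docsAndValues, Visitor visitor) {
    int visited = 0;
    for (int[] dv : docsAndValues) {
      visited++;
      if (!visitor.visitWithSortedDim(dv[0], dv[1])) {
        break; // sorted order guarantees no later value can match
      }
    }
    return visited;
  }

  public static void main(String[] args) {
    int upper = 10; // query upper bound on the sorted dimension
    // leaf values sorted on the sorted dimension: 3, 7, 12, 15
    List<int[]> leaf =
        List.of(new int[] {1, 3}, new int[] {2, 7}, new int[] {3, 12}, new int[] {4, 15});
    int visited = scanLeaf(leaf, (docID, v) -> v <= upper);
    System.out.println(visited); // stops at the first too-high value: visits 3 of 4
  }
}
```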
[GitHub] [lucene] jpountz commented on pull request #12526: Speed up disjunctions by computing estimations of the score of the k-th top hit up-front.
jpountz commented on PR #12526: URL: https://github.com/apache/lucene/pull/12526#issuecomment-1731162359 > Maybe we should add OrHighVeryLow to nightly benchy too? @mikemccand I started looking into this, but my enwiki export (`enwiki-20120502-lines-with-random-label.txt`) seems to have slightly different frequencies than those reported in wikinightly.tasks. Are the nightly benchmarks using the same export or a different one? I think it could make sense to add two new tasks: `OrHighLow110`, where the low-frequency term always has a frequency of 110 (greater than k), and `OrHighLow90`, where the low-frequency term always has a frequency of 90.
[GitHub] [lucene] rmuir commented on a diff in pull request #12583: Fix hidden range embedded in UAX29URLEmail grammar
rmuir commented on code in PR #12583: URL: https://github.com/apache/lucene/pull/12583#discussion_r1334328967 ## lucene/analysis/common/src/test/org/apache/lucene/analysis/email/TestUAX29URLEmailAnalyzer.java: ## @@ -433,9 +433,9 @@ public void testMailtoSchemeEmails() throws Exception { new String[] { "mailto", "pers...@example.com", - // TODO: recognize ',' address delimiter. Also, see examples of ';' delimiter use at: + // Also, see examples of ';' delimiter use at: Review Comment: Yeah, I don't know; I just tried to preserve these comments, and there are other similar TODOs in the test. Especially this one: https://github.com/apache/lucene/blob/53ba27a63be6849d5383b8bfc6d1508dd7b66f0c/lucene/analysis/common/src/test/org/apache/lucene/analysis/email/TestUAX29URLEmailAnalyzer.java#L428C8-L428C96
[GitHub] [lucene] jpountz commented on a diff in pull request #12582: Add new int8 scalar quantization to HNSW codec
jpountz commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1334309792 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/QuantizedVectorsWriter.java: ## @@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.codecs.lucene99; + +import java.io.Closeable; +import java.io.IOException; +import org.apache.lucene.codecs.KnnFieldVectorsWriter; +import org.apache.lucene.index.FieldInfo; +import org.apache.lucene.index.MergeState; +import org.apache.lucene.index.Sorter; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.util.Accountable; + +/** Quantized vector reader */ Review Comment: ```suggestion /** Quantized vector writer */ ```
[GitHub] [lucene] benwtrent commented on a diff in pull request #12582: Add new int8 scalar quantization to HNSW codec
benwtrent commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1334388920 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java: ## @@ -0,0 +1,851 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.lucene.codecs.lucene99; + +import static org.apache.lucene.codecs.lucene99.Lucene99ScalarQuantizedVectorsFormat.DIRECT_MONOTONIC_BLOCK_SHIFT; +import static org.apache.lucene.codecs.lucene99.Lucene99ScalarQuantizedVectorsFormat.calculateDefaultQuantile; +import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.List; +import org.apache.lucene.codecs.CodecUtil; +import org.apache.lucene.codecs.KnnFieldVectorsWriter; +import org.apache.lucene.codecs.KnnVectorsReader; +import org.apache.lucene.codecs.KnnVectorsWriter; +import org.apache.lucene.codecs.lucene90.IndexedDISI; +import org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat; +import org.apache.lucene.index.DocIDMerger; +import org.apache.lucene.index.DocsWithFieldSet; +import org.apache.lucene.index.FieldInfo; +import org.apache.lucene.index.FloatVectorValues; +import org.apache.lucene.index.IndexFileNames; +import org.apache.lucene.index.MergeState; +import org.apache.lucene.index.SegmentWriteState; +import org.apache.lucene.index.Sorter; +import org.apache.lucene.index.VectorEncoding; +import org.apache.lucene.index.VectorSimilarityFunction; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.IOUtils; +import org.apache.lucene.util.RamUsageEstimator; +import org.apache.lucene.util.ScalarQuantizer; +import org.apache.lucene.util.VectorUtil; +import org.apache.lucene.util.packed.DirectMonotonicWriter; + +/** + * Writes quantized vector values and metadata to index segments. 
+ * + * @lucene.experimental + */ +public final class Lucene99ScalarQuantizedVectorsWriter implements QuantizedVectorsWriter { + + private static final long BASE_RAM_BYTES_USED = + RamUsageEstimator.shallowSizeOfInstance(Lucene99ScalarQuantizedVectorsWriter.class); + + private static final float QUANTIZATION_RECOMPUTE_LIMIT = 32; + private final SegmentWriteState segmentWriteState; + private final IndexOutput meta, quantizedVectorData; + private final Float quantile; + private final List fields = new ArrayList<>(); + + private boolean finished; + + Lucene99ScalarQuantizedVectorsWriter(SegmentWriteState state, Float quantile) throws IOException { +this.quantile = quantile; +segmentWriteState = state; +String metaFileName = +IndexFileNames.segmentFileName( +state.segmentInfo.name, +state.segmentSuffix, + Lucene99ScalarQuantizedVectorsFormat.QUANTIZED_VECTOR_META_EXTENSION); + +String quantizedVectorDataFileName = +IndexFileNames.segmentFileName( +state.segmentInfo.name, +state.segmentSuffix, + Lucene99ScalarQuantizedVectorsFormat.QUANTIZED_VECTOR_DATA_EXTENSION); + +boolean success = false; +try { + meta = state.directory.createOutput(metaFileName, state.context); + quantizedVectorData = + state.directory.createOutput(quantizedVectorDataFileName, state.context); + + CodecUtil.writeIndexHeader( + meta, + Lucene99ScalarQuantizedVectorsFormat.META_CODEC_NAME, + Lucene99ScalarQuantizedVectorsFormat.VERSION_CURRENT, + state.segmentInfo.getId(), + state.segmentSuffix); + CodecUtil.writeIndexHeader( + quantizedVectorData, + Lucene99ScalarQuantizedVectorsFormat.QUANTIZED_VECTOR_DATA_CODEC_NAME, + Lucene99ScalarQuantizedVectorsFormat.VERSION_CURRENT, + state.segmentInfo.getId(), + state.segmentSuffix); + success = true; +} finally { + if (success == false) { +IOUtils.closeWhileHandlingException(this); + } +} + } + + @Override + public KnnFieldVectorsWriter addField(FieldInfo fieldInfo)
[GitHub] [lucene] benwtrent commented on pull request #12582: Add new int8 scalar quantization to HNSW codec
benwtrent commented on PR #12582: URL: https://github.com/apache/lucene/pull/12582#issuecomment-1731463225 > Do we know why search is faster? Is it mostly because working on the quantized vectors requires a lower memory bandwi[d]th? Search is faster in two regards: - PanamaVector allows for more `byte` operations to occur at once than `float32` (should be major) - Reading `byte[]` off of a buffer doesn't require decoding floats (very minor change) IMO, we should be seeing WAY better search numbers. I need to do more testing to triple check. > Do you know how recall degrades compared to without quantization? I saw the numbers you shared but I don't have a good sense of what recall we usually had until now. ++ I want to graph the two together to compare so it's clearer. > I don't feel great about the logic that merges quantiles at merge time and only requantizes if the merged quantiles don't differ too much from the input quantiles. It feels like quantiles could slowly change over multiple merging rounds and we'd end up in a state where the quantized vectors would be different from requantizing the raw vectors with the quantization state that is stored in the segment, which feels wrong. Am I missing something? The quantization buckets could change slightly over time, but since we are bucketing `float32` into `int8`, the error bounds are comparatively large. The cost of requantization is almost never worth it. In my testing, quantiles over random data from the same data set show that segments differ by only around `1e-4`, which is tiny and shouldn't require requantization. @tveasey helped me do some empirical analysis here and can provide some numbers. > Related to the above, it looks like we ignore deletions when merging quantiles. It would probably be ok in practice most of the time but I worry that there might be corner cases? A corner case in what way? That we potentially include deletions when computing quantiles, or that re-quantization is required?
We can easily exclude them, as conceptually the "new" doc (if it were an update) would exist in another segment. It could be we are double counting a vector, and we probably shouldn't do that. > > Do we want to have a new "flat" vector codec that HNSW (or other complicated vector indexing methods) can use? Detractor here is that now HNSW codec relies on another pluggable thing that is a "flat" vector index (just provides mechanisms for reading, writing, merging vectors in a flat index). > I don't have a strong opinion on this. Making it a codec though has the downside that it would require more files since two codecs can't write to the same file. Maybe having utility methods around reading/writing flat vectors is good enough? Utility methods are honestly what I am leaning towards. It's then a discussion around how a codec (like HNSW) is configured to use it. > > Should "quantization" just be a thing that is provided to vector codecs? > I might be misunderstanding the question, but to me this is what the byte[] encoding is about. And this quantization that's getting added here is more powerful because it's adaptive and will change over time depending on what vectors get indexed or deleted? If it needs to adapt to the data then it belongs to the codec. We could have utility code to make it easier to write codecs that quantize their data though (maybe this is what your question suggested?). Yeah, it needs to adapt over time. There are adverse cases (indexing vectors sorted by relative clusters is one) that need to be handled. But they can be handled easily at merge time by recomputing quantiles and potentially re-quantizing. > > Should the "quantizer" keep the raw vectors around itself? > My understanding is that we have to, as the accuracy of the quantization could otherwise degrade over time in an unbounded fashion.
After a period of time, if vectors are part of the same corpus and created via the same model, the quantiles actually level out and re-quantizing will rarely or never occur, since the calculated quantiles are statistically equivalent, especially given the binning into `int8`.
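A rough sketch of the merge-time decision described above (illustrative only; the record fields, size-weighted averaging, and tolerance value are assumptions, not the PR's actual logic): merge per-segment quantiles and reuse the stored quantized vectors only when each segment's quantiles sit within a small tolerance of the merged ones, otherwise requantize from the raw floats.

```java
// Hypothetical sketch of merge-time quantile handling; not the Lucene code.
public class MergeQuantiles {
  record Quantiles(float min, float max, int size) {}

  /** Size-weighted average of two segments' quantile ranges. */
  static Quantiles merge(Quantiles a, Quantiles b) {
    float total = a.size() + b.size();
    return new Quantiles(
        (a.min() * a.size() + b.min() * b.size()) / total,
        (a.max() * a.size() + b.max() * b.size()) / total,
        a.size() + b.size());
  }

  /** true if a segment's quantiles are within the tolerance of the merged ones. */
  static boolean canReuse(Quantiles segment, Quantiles merged, float tolerance) {
    return Math.abs(segment.min() - merged.min()) <= tolerance
        && Math.abs(segment.max() - merged.max()) <= tolerance;
  }

  public static void main(String[] args) {
    Quantiles s1 = new Quantiles(-0.98f, 0.99f, 10_000);
    Quantiles s2 = new Quantiles(-0.99f, 0.98f, 30_000);
    Quantiles merged = merge(s1, s2);
    // quantiles from the same corpus barely differ, so reuse wins
    System.out.println(canReuse(s1, merged, 0.05f));
  }
}
```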
[GitHub] [lucene] uschindler commented on a diff in pull request #12582: Add new int8 scalar quantization to HNSW codec
uschindler commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1334448931 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,209 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.util; + +import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS; + +import java.io.IOException; +import java.util.Arrays; +import java.util.Random; +import java.util.stream.IntStream; +import org.apache.lucene.index.FloatVectorValues; +import org.apache.lucene.index.VectorSimilarityFunction; + +/** Will scalar quantize float vectors into `int8` byte values */ +public class ScalarQuantizer { + + public static final int SCALAR_QUANTIZATION_SAMPLE_SIZE = 25_000; + + private final float alpha; + private final float offset; + private final float minQuantile, maxQuantile; + + public ScalarQuantizer(float minQuantile, float maxQuantile) { +assert maxQuantile >= maxQuantile; +this.minQuantile = minQuantile; +this.maxQuantile = maxQuantile; +this.alpha = (maxQuantile - minQuantile) / 127f; +this.offset = minQuantile; + } + + public void quantize(float[] src, byte[] dest) { +assert src.length == dest.length; +for (int i = 0; i < src.length; i++) { + dest[i] = + (byte) + Math.round( + (Math.max(minQuantile, Math.min(maxQuantile, src[i])) - minQuantile) / alpha); +} + } + + public void deQuantize(byte[] src, float[] dest) { +assert src.length == dest.length; +for (int i = 0; i < src.length; i++) { + dest[i] = (alpha * src[i]) + offset; +} + } + + public float calculateVectorOffset(byte[] vector, VectorSimilarityFunction similarityFunction) { +if (similarityFunction != VectorSimilarityFunction.EUCLIDEAN) { + int sum = 0; + for (byte b : vector) { Review Comment: Can't we use VectorUtil here for SIMD dotProduct? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
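To make the quoted ScalarQuantizer math concrete: with `alpha = (maxQuantile - minQuantile) / 127f`, quantization snaps each clamped component to one of 128 levels, so the round-trip error per component is bounded by `alpha / 2`. A minimal restatement of that math (a standalone demo, not the Lucene class itself):

```java
// Minimal restatement of the quantize/deQuantize math from the quoted
// ScalarQuantizer, showing the per-component round-trip error bound.
public class ScalarQuantDemo {
  final float minQuantile, maxQuantile, alpha;

  ScalarQuantDemo(float minQuantile, float maxQuantile) {
    this.minQuantile = minQuantile;
    this.maxQuantile = maxQuantile;
    this.alpha = (maxQuantile - minQuantile) / 127f; // bin width
  }

  byte quantize(float v) {
    // clamp into the quantile range, then snap to the nearest of 128 levels
    float clamped = Math.max(minQuantile, Math.min(maxQuantile, v));
    return (byte) Math.round((clamped - minQuantile) / alpha);
  }

  float deQuantize(byte b) {
    return alpha * b + minQuantile; // offset == minQuantile
  }

  public static void main(String[] args) {
    ScalarQuantDemo q = new ScalarQuantDemo(-1f, 1f); // alpha = 2/127 ~ 0.0157
    float v = 0.3f;
    byte b = q.quantize(v);       // round((0.3 + 1) / alpha) = 83
    float back = q.deQuantize(b); // ~ 0.3071
    if (Math.abs(back - v) > q.alpha / 2 + 1e-6f) throw new AssertionError();
    System.out.println(b + " " + back);
  }
}
```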
[GitHub] [lucene] uschindler commented on a diff in pull request #12582: Add new int8 scalar quantization to HNSW codec
uschindler commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1334450128 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,209 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.util; + +import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS; + +import java.io.IOException; +import java.util.Arrays; +import java.util.Random; +import java.util.stream.IntStream; +import org.apache.lucene.index.FloatVectorValues; +import org.apache.lucene.index.VectorSimilarityFunction; + +/** Will scalar quantize float vectors into `int8` byte values */ +public class ScalarQuantizer { + + public static final int SCALAR_QUANTIZATION_SAMPLE_SIZE = 25_000; + + private final float alpha; + private final float offset; + private final float minQuantile, maxQuantile; + + public ScalarQuantizer(float minQuantile, float maxQuantile) { +assert maxQuantile >= maxQuantile; +this.minQuantile = minQuantile; +this.maxQuantile = maxQuantile; +this.alpha = (maxQuantile - minQuantile) / 127f; +this.offset = minQuantile; + } + + public void quantize(float[] src, byte[] dest) { +assert src.length == dest.length; +for (int i = 0; i < src.length; i++) { + dest[i] = + (byte) + Math.round( + (Math.max(minQuantile, Math.min(maxQuantile, src[i])) - minQuantile) / alpha); +} + } + + public void deQuantize(byte[] src, float[] dest) { +assert src.length == dest.length; +for (int i = 0; i < src.length; i++) { + dest[i] = (alpha * src[i]) + offset; +} + } + + public float calculateVectorOffset(byte[] vector, VectorSimilarityFunction similarityFunction) { +if (similarityFunction != VectorSimilarityFunction.EUCLIDEAN) { + int sum = 0; + for (byte b : vector) { Review Comment: Ah sorry it just sums up. But we could add this to VectorUtil... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [lucene] tveasey commented on pull request #12582: Add new int8 scalar quantization to HNSW codec
tveasey commented on PR #12582: URL: https://github.com/apache/lucene/pull/12582#issuecomment-1731530040 > @tveasey helped me do some empirical analysis here and can provide some numbers. The rationale is quite simple, as Ben said. If you change the upper and lower quantiles very little, then re-quantising doesn't change the quantized vectors much at all. In particular, you expect values to be roughly uniform in each bin, and unless a value is near a snapping boundary you simply map it to the same integer. Therefore, if the difference in the upper and lower quantile is "bin width" / n, you have roughly a 1 / n probability of changing any given value, by at most one, and only when the impact on the error is marginal (< "bin width" / n). In practice, even if the odd component, where the snapping decision is marginal, changes by +/- 1, the effect is dwarfed by all the other snapping going on when you quantize. I measured this for a few different datasets (using different SOTA embedding models), and for each dataset over 100 merges the effect was always less than 0.05 * "quantisation error". I note as well that this error magnitude is pretty consistent with the theory above (when properly formalised). Finally, this is all completely in the noise in terms of impact on recall for nearest neighbour retrieval. I'll follow up with a link to a repo with a more detailed discussion and the code used for these experiments.
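The 1/n argument is easy to check numerically. The sketch below (illustrative only, not the code used for the experiments above) shifts the quantile range by binWidth / n and counts how many quantized components land in a different bucket; for uniform data roughly a 1/n fraction change, each by exactly one step.

```java
import java.util.Random;

// Numerical check of the quantile-drift argument: a shift of binWidth / n
// changes ~1/n of quantized values, each by at most one bucket.
public class QuantileDriftDemo {
  static int quantize(float v, float min, float max) {
    float alpha = (max - min) / 127f; // bin width
    float c = Math.max(min, Math.min(max, v));
    return Math.round((c - min) / alpha);
  }

  public static void main(String[] args) {
    Random rnd = new Random(42);
    int n = 16; // quantile shift = binWidth / n
    float min = -1f, max = 1f;
    float binWidth = (max - min) / 127f;
    float shift = binWidth / n;
    int changed = 0, samples = 100_000;
    for (int i = 0; i < samples; i++) {
      float v = rnd.nextFloat() * 2f - 1f;
      int a = quantize(v, min, max);
      int b = quantize(v, min + shift, max + shift);
      if (a != b) {
        changed++;
        // shift < binWidth, so a bucket can move by at most one step
        if (Math.abs(a - b) > 1) throw new AssertionError();
      }
    }
    // expect roughly samples / n changed components (~6% here)
    System.out.println(changed + " of " + samples);
  }
}
```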
[GitHub] [lucene] rmuir commented on a diff in pull request #12582: Add new int8 scalar quantization to HNSW codec
rmuir commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1334477512 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,209 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.util; + +import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS; + +import java.io.IOException; +import java.util.Arrays; +import java.util.Random; +import java.util.stream.IntStream; +import org.apache.lucene.index.FloatVectorValues; +import org.apache.lucene.index.VectorSimilarityFunction; + +/** Will scalar quantize float vectors into `int8` byte values */ +public class ScalarQuantizer { + + public static final int SCALAR_QUANTIZATION_SAMPLE_SIZE = 25_000; + + private final float alpha; + private final float offset; + private final float minQuantile, maxQuantile; + + public ScalarQuantizer(float minQuantile, float maxQuantile) { +assert maxQuantile >= maxQuantile; +this.minQuantile = minQuantile; +this.maxQuantile = maxQuantile; +this.alpha = (maxQuantile - minQuantile) / 127f; +this.offset = minQuantile; + } + + public void quantize(float[] src, byte[] dest) { +assert src.length == dest.length; +for (int i = 0; i < src.length; i++) { + dest[i] = + (byte) + Math.round( + (Math.max(minQuantile, Math.min(maxQuantile, src[i])) - minQuantile) / alpha); +} + } + + public void deQuantize(byte[] src, float[] dest) { +assert src.length == dest.length; +for (int i = 0; i < src.length; i++) { + dest[i] = (alpha * src[i]) + offset; +} + } + + public float calculateVectorOffset(byte[] vector, VectorSimilarityFunction similarityFunction) { +if (similarityFunction != VectorSimilarityFunction.EUCLIDEAN) { + int sum = 0; + for (byte b : vector) { Review Comment: summing bytes across array like this should work with autovectorization, or its seriously broke. there is no pesky floating point order of operations restriction. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
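As a standalone illustration of the quantization math in the `ScalarQuantizer` excerpt above (clamp to `[minQuantile, maxQuantile]`, scale by `alpha = (max - min) / 127`, then invert on de-quantization), the following sketch mirrors the quoted formulas. It is not the Lucene class itself; the class and variable names are illustrative only.

```java
public class QuantizeSketch {
  public static void main(String[] args) {
    float minQuantile = -1f, maxQuantile = 1f;
    float alpha = (maxQuantile - minQuantile) / 127f; // size of one quantization step
    float offset = minQuantile;

    float[] src = {-2.0f, -0.5f, 0.0f, 0.7f, 3.0f};
    byte[] quantized = new byte[src.length];
    float[] restored = new float[src.length];

    for (int i = 0; i < src.length; i++) {
      // clamp into the quantile range, shift to zero, scale into 0..127
      float clamped = Math.max(minQuantile, Math.min(maxQuantile, src[i]));
      quantized[i] = (byte) Math.round((clamped - minQuantile) / alpha);
      // de-quantize: invert the scale and shift
      restored[i] = alpha * quantized[i] + offset;
    }

    // values outside the quantile range saturate at the byte extremes
    System.out.println(quantized[0] + " " + quantized[4]); // 0 127
    // in-range values round-trip to within one quantization step
    System.out.println(Math.abs(restored[2] - src[2]) <= alpha); // true
  }
}
```

Note that the quantiles, not the raw min/max of the data, define the representable range, so outliers are saturated rather than stretching the scale.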
[GitHub] [lucene] benwtrent commented on a diff in pull request #12582: Add new int8 scalar quantization to HNSW codec
benwtrent commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1334508429

## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java (same excerpt as quoted in rmuir's comment above)

Review Comment: @rmuir, exactly. Since it isn't floating point addition, I didn't think it necessary for VectorUtil to get involved.
[GitHub] [lucene] easyice commented on pull request #12557: Improve refresh speed with softdelete enable
easyice commented on PR #12557: URL: https://github.com/apache/lucene/pull/12557#issuecomment-1731767546

Update: when we call `softUpdateDocument` for a segment that already has some deleted docs, it iterates over all the deleted docs using `ReadersAndUpdates#MergedDocValues#onDiskDocValues`, but it has to iterate the array twice: the first time in `Lucene90DocValuesConsumer#writeValues`, which computes the gcd, min, and max, and the second time in `IndexedDISI#writeBitSet`. This creates some waste; we can remove the first iteration for soft deletes, which speeds up updates by about 53%. Benchmark code:

```
public static void main(final String[] args) throws Exception {
  long min = Long.MAX_VALUE;
  for (int i = 0; i < 5; i++) {
    min = Math.min(doWrite(), min);
  }
  System.out.println("BEST:" + min);
}

static long doWrite() throws IOException {
  Random rand = new Random(5);
  Directory dir = new ByteBuffersDirectory();
  IndexWriter writer =
      new IndexWriter(
          dir,
          new IndexWriterConfig(null)
              .setSoftDeletesField("_soft_deletes")
              .setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH));
  int maxDoc = 4096 * 100;
  for (int i = 0; i < maxDoc; i++) {
    Document doc = new Document();
    doc.add(new StringField("id", String.valueOf(i), Field.Store.NO));
    writer.addDocument(doc);
    if (i > 0 && i % 5000 == 0) {
      writer.commit();
    }
  }
  System.out.println("start update");
  long t0 = System.currentTimeMillis();
  for (int i = 0; i < maxDoc; i += 2) {
    Document doc = new Document();
    writer.softUpdateDocument(
        new Term("id", String.valueOf(i)),
        doc,
        new NumericDocValuesField("_soft_deletes", 1));
    if (i > 0 && i % 100 == 0) {
      writer.commit();
    }
  }
  long tookMs = System.currentTimeMillis() - t0;
  System.out.println("update took:" + tookMs);
  IOUtils.close(writer, dir);
  return tookMs;
}
```
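The redundancy described above can be sketched in isolation. This is a hedged illustration, not Lucene's actual code: the two loops stand in for the stats pass in `Lucene90DocValuesConsumer#writeValues` and the bitset pass in `IndexedDISI#writeBitSet`. For a soft-deletes field every written value is the constant 1, so the stats pass always yields the same trivial answer and is pure overhead.

```java
import java.util.Arrays;
import java.util.BitSet;

public class DoublePassSketch {
  static long gcd(long a, long b) {
    return b == 0 ? a : gcd(b, a % b);
  }

  public static void main(String[] args) {
    long[] softDeleteValues = new long[100_000];
    Arrays.fill(softDeleteValues, 1L); // soft deletes always write the value 1

    // Pass 1: stats. For a constant-1 field this always yields min=max=gcd=1.
    long min = Long.MAX_VALUE, max = Long.MIN_VALUE, g = 0;
    for (long v : softDeleteValues) {
      min = Math.min(min, v);
      max = Math.max(max, v);
      g = gcd(g, v);
    }
    System.out.println(min + " " + max + " " + g); // 1 1 1

    // Pass 2: the docs-with-field bitset -- the only pass actually needed
    // when the stats are known to be trivial up front.
    BitSet docsWithField = new BitSet();
    for (int doc = 0; doc < softDeleteValues.length; doc++) {
      docsWithField.set(doc);
    }
    System.out.println(docsWithField.cardinality()); // 100000
  }
}
```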
[GitHub] [lucene] gsmiller commented on pull request #12560: Defer #advanceExact on expression dependencies until their values are needed
gsmiller commented on PR #12560: URL: https://github.com/apache/lucene/pull/12560#issuecomment-1731814078

Circling back on this: for Amazon's Product Search engine, we make fairly heavy use of these expression implementations. I pulled this change into our Lucene fork early (currently on 9.7) and ran our internal benchmarks, and it produced a ~23% redline QPS improvement. Mileage may vary, of course, but the impact was significant, so other heavy expression users may find a nice win as well.
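The idea behind deferring `#advanceExact` can be illustrated with a small sketch. This is a hedged stand-in, not Lucene's `DoubleValues`/expressions API: `Source` and `LazyValue` are hypothetical names. Instead of advancing every dependency for each document up front, the wrapper records the target doc cheaply and only advances (once per doc) when the value is actually read, so short-circuiting expressions can skip dependencies entirely.

```java
public class LazySketch {
  interface Source {
    double valueAt(int doc); // stand-in for advanceExact + reading the value
  }

  static final class LazyValue {
    private final Source source;
    private int targetDoc = -1;   // doc we were positioned to, lazily
    private int advancedDoc = -1; // doc we actually advanced to
    private double value;
    int advanceCount = 0;

    LazyValue(Source source) {
      this.source = source;
    }

    void setDoc(int doc) { // cheap: just records the target
      this.targetDoc = doc;
    }

    double get() { // advances at most once per document
      if (advancedDoc != targetDoc) {
        advancedDoc = targetDoc;
        advanceCount++;
        value = source.valueAt(targetDoc);
      }
      return value;
    }
  }

  public static void main(String[] args) {
    LazyValue cheap = new LazyValue(d -> d * 2.0);
    LazyValue expensive = new LazyValue(d -> d * 100.0); // imagine a costly doc-values seek

    // An expression such as "cheap > 0 ? cheap : expensive" may never read
    // 'expensive' at all, so deferring its advance skips that work entirely.
    cheap.setDoc(7);
    expensive.setDoc(7);
    double v = cheap.get() > 0 ? cheap.get() : expensive.get();
    System.out.println(v);                      // 14.0
    System.out.println(cheap.advanceCount);     // 1 (memoized across the two reads)
    System.out.println(expensive.advanceCount); // 0 (never advanced)
  }
}
```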
[GitHub] [lucene] jimczi commented on a diff in pull request #12582: Add new int8 scalar quantization to HNSW codec
jimczi commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1334738549

## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java: @@ -0,0 +1,851 @@

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.lucene.codecs.lucene99;

import static org.apache.lucene.codecs.lucene99.Lucene99ScalarQuantizedVectorsFormat.DIRECT_MONOTONIC_BLOCK_SHIFT;
import static org.apache.lucene.codecs.lucene99.Lucene99ScalarQuantizedVectorsFormat.calculateDefaultQuantile;
import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.codecs.KnnFieldVectorsWriter;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.codecs.lucene90.IndexedDISI;
import org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat;
import org.apache.lucene.index.DocIDMerger;
import org.apache.lucene.index.DocsWithFieldSet;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.index.IndexFileNames;
import org.apache.lucene.index.MergeState;
import org.apache.lucene.index.SegmentWriteState;
import org.apache.lucene.index.Sorter;
import org.apache.lucene.index.VectorEncoding;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.util.ArrayUtil;
import org.apache.lucene.util.IOUtils;
import org.apache.lucene.util.RamUsageEstimator;
import org.apache.lucene.util.ScalarQuantizer;
import org.apache.lucene.util.VectorUtil;
import org.apache.lucene.util.packed.DirectMonotonicWriter;

/**
 * Writes quantized vector values and metadata to index segments.
 *
 * @lucene.experimental
 */
public final class Lucene99ScalarQuantizedVectorsWriter implements QuantizedVectorsWriter {

  private static final long BASE_RAM_BYTES_USED =
      RamUsageEstimator.shallowSizeOfInstance(Lucene99ScalarQuantizedVectorsWriter.class);

  private static final float QUANTIZATION_RECOMPUTE_LIMIT = 32;
  private final SegmentWriteState segmentWriteState;
  private final IndexOutput meta, quantizedVectorData;
  private final Float quantile;
  private final List fields = new ArrayList<>();

  private boolean finished;

  Lucene99ScalarQuantizedVectorsWriter(SegmentWriteState state, Float quantile) throws IOException {
    this.quantile = quantile;
    segmentWriteState = state;
    String metaFileName =
        IndexFileNames.segmentFileName(
            state.segmentInfo.name,
            state.segmentSuffix,
            Lucene99ScalarQuantizedVectorsFormat.QUANTIZED_VECTOR_META_EXTENSION);

    String quantizedVectorDataFileName =
        IndexFileNames.segmentFileName(
            state.segmentInfo.name,
            state.segmentSuffix,
            Lucene99ScalarQuantizedVectorsFormat.QUANTIZED_VECTOR_DATA_EXTENSION);

    boolean success = false;
    try {
      meta = state.directory.createOutput(metaFileName, state.context);
      quantizedVectorData =
          state.directory.createOutput(quantizedVectorDataFileName, state.context);

      CodecUtil.writeIndexHeader(
          meta,
          Lucene99ScalarQuantizedVectorsFormat.META_CODEC_NAME,
          Lucene99ScalarQuantizedVectorsFormat.VERSION_CURRENT,
          state.segmentInfo.getId(),
          state.segmentSuffix);
      CodecUtil.writeIndexHeader(
          quantizedVectorData,
          Lucene99ScalarQuantizedVectorsFormat.QUANTIZED_VECTOR_DATA_CODEC_NAME,
          Lucene99ScalarQuantizedVectorsFormat.VERSION_CURRENT,
          state.segmentInfo.getId(),
          state.segmentSuffix);
      success = true;
    } finally {
      if (success == false) {
        IOUtils.closeWhileHandlingException(this);
      }
    }
  }

  @Override
  public KnnFieldVectorsWriter addField(FieldInfo fieldInfo) thr
```
[GitHub] [lucene] jimczi commented on a diff in pull request #12582: Add new int8 scalar quantization to HNSW codec
jimczi commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1334746185

## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java (same excerpt as quoted above)
[GitHub] [lucene] jimczi commented on a diff in pull request #12582: Add new int8 scalar quantization to HNSW codec
jimczi commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1334758274

## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java (same excerpt as quoted above)