[GitHub] [lucene] javanna commented on pull request #12183: Make TermStates#build concurrent

2023-09-22 Thread via GitHub


javanna commented on PR #12183:
URL: https://github.com/apache/lucene/pull/12183#issuecomment-1730930146

   Great to see this merged, thanks @shubhamvishu for all the work as well as 
patience as we were figuring out a way forward!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] vsop-479 commented on pull request #12528: Early terminate visit BKD leaf when current value greater than upper point in sorted dim.

2023-09-22 Thread via GitHub


vsop-479 commented on PR #12528:
URL: https://github.com/apache/lucene/pull/12528#issuecomment-1730932421

   @iverase I replaced the int values with static variables. Please take a look.
   Actually, I used an enum to define the match states in a previous version, but it degraded performance a little.
   Static variables work well, but do you think it is OK to use an enum to make the code more graceful, even though there is a small performance cost?





[GitHub] [lucene] iverase commented on a diff in pull request #12528: Early terminate visit BKD leaf when current value greater than upper point in sorted dim.

2023-09-22 Thread via GitHub


iverase commented on code in PR #12528:
URL: https://github.com/apache/lucene/pull/12528#discussion_r1334028608


##
lucene/core/src/java/org/apache/lucene/index/PointValues.java:
##
@@ -228,6 +228,22 @@ public enum Relation {
 CELL_CROSSES_QUERY
   };
 
+  /** Match states for the current value. */
+  public static final class MatchState {
+private MatchState() {}
+
+/** Invalid state */
+public static final int INVALID = -1;
+/** Packed value matches the range in this dimension */
+public static final int MATCH = 0;
+/** Packed value is too low in this SORTED or NON-SORTED dimension */
+public static final int LOW = 1;
+/** Packed value is too high in SORTED dimension */
+public static final int HIGH_IN_SORTED_DIM = 2;
+/** Packed value is too high in NON-SORTED dimension */
+public static final int HIGH_IN_NON_SORTED_DIM = 3;
+  }

Review Comment:
   My main concern here is that the concept of a SORTED dimension does not exist in the PointValues API. If you have a look at the javadocs for visiting a leaf node:
   ```
   /**
* Called for all documents in a leaf cell that crosses the query. The 
consumer should
* scrutinize the packedValue to decide whether to accept it. In the 1D 
case, values are visited
* in increasing order, and in the case of ties, in increasing docID 
order.
*/
   ```
   It only constrains the 1D case, but in higher dimensions there is no constraint on how data is visited. The concept of a SORTED dimension sounds to me like an implementation detail that should not be leaked to the public API.






[GitHub] [lucene] iverase commented on a diff in pull request #12528: Early terminate visit BKD leaf when current value greater than upper point in sorted dim.

2023-09-22 Thread via GitHub


iverase commented on code in PR #12528:
URL: https://github.com/apache/lucene/pull/12528#discussion_r1334030455


##
lucene/core/src/java/org/apache/lucene/index/PointValues.java:
##
@@ -281,6 +297,12 @@ public interface PointTree extends Cloneable {
* @lucene.experimental
*/
   public interface IntersectVisitor {
+
+/** return true if this is an inverse visitor. */
+default boolean isInverse() {
+  return false;
+}

Review Comment:
   This method is difficult to grasp and sounds to me like an implementation detail.






[GitHub] [lucene] iverase commented on a diff in pull request #12528: Early terminate visit BKD leaf when current value greater than upper point in sorted dim.

2023-09-22 Thread via GitHub


iverase commented on code in PR #12528:
URL: https://github.com/apache/lucene/pull/12528#discussion_r1334077045


##
lucene/core/src/java/org/apache/lucene/index/PointValues.java:
##
@@ -317,6 +329,18 @@ default void visit(DocIdSetIterator iterator, byte[] 
packedValue) throws IOExcep
   }
 }
 
+/**
+ * Similar to {@link IntersectVisitor#visit(DocIdSetIterator, byte[])} but 
return a match state.
+ */
+default int visitWithState(DocIdSetIterator iterator, byte[] packedValue, 
int sortedDim)

Review Comment:
   I am wondering if we need to return different values. At the end of the day we only need to know whether we should visit more points on the leaf. Have you tried something simpler, like:
   
   ```
   /** Similar to {@link IntersectVisitor#visit(int, byte[])} but ensures that data is visited in
    * increasing order on {@code sortedDim}, and in the case of ties, in increasing docID order.
    * Implementors can stop processing points on the leaf by returning false, when for example the
    * sorted dimension value is too high to be matched by the query.
    *
    * @return true if the visitor should continue visiting points on this leaf, otherwise false.
    */
   default boolean visitWithSortedDim(int docID, byte[] packedValue, int sortedDim) throws IOException {
     visit(docID, packedValue);
     return true;
   }
   ```
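   As a self-contained illustration of the hook being sketched here, a range visitor could stop a leaf scan as soon as the sorted-dimension value passes the query's upper bound. The interface below is a stripped-down stand-in, not Lucene's actual `IntersectVisitor`, and the names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class SortedDimVisitorSketch {
  /** Minimal stand-in for the proposed hook on IntersectVisitor. */
  interface Visitor {
    void visit(int docID, byte[] packedValue);

    /** Values arrive in increasing order on sortedDim; return false to stop the leaf scan. */
    default boolean visitWithSortedDim(int docID, byte[] packedValue, int sortedDim) {
      visit(docID, packedValue);
      return true;
    }
  }

  /** Collects docs whose sortedDim byte lies in [lower, upper]; stops once past upper. */
  static final class RangeVisitor implements Visitor {
    final int lower, upper;
    final List<Integer> hits = new ArrayList<>();

    RangeVisitor(int lower, int upper) {
      this.lower = lower;
      this.upper = upper;
    }

    @Override
    public void visit(int docID, byte[] packedValue) {
      int v = packedValue[0] & 0xFF;
      if (v >= lower && v <= upper) {
        hits.add(docID);
      }
    }

    @Override
    public boolean visitWithSortedDim(int docID, byte[] packedValue, int sortedDim) {
      int v = packedValue[sortedDim] & 0xFF;
      if (v > upper) {
        return false; // every later value in this leaf is >= v, so none can match
      }
      visit(docID, packedValue);
      return true;
    }
  }

  /** Simulates a leaf scan over values sorted on dimension 0. */
  static int scanLeaf(Visitor visitor, byte[][] leafValues) {
    int visited = 0;
    for (int doc = 0; doc < leafValues.length; doc++) {
      visited++;
      if (!visitor.visitWithSortedDim(doc, leafValues[doc], 0)) {
        break;
      }
    }
    return visited;
  }

  public static void main(String[] args) {
    byte[][] leaf = {{1}, {3}, {5}, {7}, {9}};
    RangeVisitor v = new RangeVisitor(2, 5);
    int visited = scanLeaf(v, leaf);
    // Stops at value 7: visits docs 0..3, collects docs 1 and 2.
    if (visited != 4 || !v.hits.equals(List.of(1, 2))) throw new AssertionError();
    System.out.println("visited=" + visited + " hits=" + v.hits);
  }
}
```

   The point of the boolean return is visible in `main`: the scan touches four of the five values instead of all five, and only the matching docs are collected.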









[GitHub] [lucene] jpountz commented on pull request #12526: Speed up disjunctions by computing estimations of the score of the k-th top hit up-front.

2023-09-22 Thread via GitHub


jpountz commented on PR #12526:
URL: https://github.com/apache/lucene/pull/12526#issuecomment-1731162359

   > Maybe we should add OrHighVeryLow to nightly benchy too?
   
   @mikemccand I started looking into this, but my enwiki (`enwiki-20120502-lines-with-random-label.txt`) seems to have slightly different frequencies compared to the frequencies reported in wikinightly.tasks. Are the nightly benchmarks using the same export or a different one? I think it could make sense to have two new tasks: `OrHighLow110`, where the low-frequency term always has a frequency of 110 (> k), and `OrHighLow90`, where the low-frequency term always has a frequency of 90.

[GitHub] [lucene] rmuir commented on a diff in pull request #12583: Fix hidden range embedded in UAX29URLEmail grammar

2023-09-22 Thread via GitHub


rmuir commented on code in PR #12583:
URL: https://github.com/apache/lucene/pull/12583#discussion_r1334328967


##
lucene/analysis/common/src/test/org/apache/lucene/analysis/email/TestUAX29URLEmailAnalyzer.java:
##
@@ -433,9 +433,9 @@ public void testMailtoSchemeEmails() throws Exception {
 new String[] {
   "mailto",
   "pers...@example.com",
-  // TODO: recognize ',' address delimiter. Also, see examples of ';' 
delimiter use at:
+  // Also, see examples of ';' delimiter use at:

Review Comment:
   Yeah, I don't know; I just tried to preserve these comments, and there are other similar TODOs in the test. Especially this one: 
https://github.com/apache/lucene/blob/53ba27a63be6849d5383b8bfc6d1508dd7b66f0c/lucene/analysis/common/src/test/org/apache/lucene/analysis/email/TestUAX29URLEmailAnalyzer.java#L428C8-L428C96






[GitHub] [lucene] jpountz commented on a diff in pull request #12582: Add new int8 scalar quantization to HNSW codec

2023-09-22 Thread via GitHub


jpountz commented on code in PR #12582:
URL: https://github.com/apache/lucene/pull/12582#discussion_r1334309792


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/QuantizedVectorsWriter.java:
##
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene99;
+
+import java.io.Closeable;
+import java.io.IOException;
+import org.apache.lucene.codecs.KnnFieldVectorsWriter;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.Sorter;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.util.Accountable;
+
+/** Quantized vector reader */

Review Comment:
   ```suggestion
   /** Quantized vector writer */
   ```






[GitHub] [lucene] benwtrent commented on a diff in pull request #12582: Add new int8 scalar quantization to HNSW codec

2023-09-22 Thread via GitHub


benwtrent commented on code in PR #12582:
URL: https://github.com/apache/lucene/pull/12582#discussion_r1334388920


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java:
##
@@ -0,0 +1,851 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene99;
+
+import static 
org.apache.lucene.codecs.lucene99.Lucene99ScalarQuantizedVectorsFormat.DIRECT_MONOTONIC_BLOCK_SHIFT;
+import static 
org.apache.lucene.codecs.lucene99.Lucene99ScalarQuantizedVectorsFormat.calculateDefaultQuantile;
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.codecs.KnnFieldVectorsWriter;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat;
+import org.apache.lucene.index.DocIDMerger;
+import org.apache.lucene.index.DocsWithFieldSet;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.FloatVectorValues;
+import org.apache.lucene.index.IndexFileNames;
+import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.index.Sorter;
+import org.apache.lucene.index.VectorEncoding;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.RamUsageEstimator;
+import org.apache.lucene.util.ScalarQuantizer;
+import org.apache.lucene.util.VectorUtil;
+import org.apache.lucene.util.packed.DirectMonotonicWriter;
+
+/**
+ * Writes quantized vector values and metadata to index segments.
+ *
+ * @lucene.experimental
+ */
+public final class Lucene99ScalarQuantizedVectorsWriter implements 
QuantizedVectorsWriter {
+
+  private static final long BASE_RAM_BYTES_USED =
+  
RamUsageEstimator.shallowSizeOfInstance(Lucene99ScalarQuantizedVectorsWriter.class);
+
+  private static final float QUANTIZATION_RECOMPUTE_LIMIT = 32;
+  private final SegmentWriteState segmentWriteState;
+  private final IndexOutput meta, quantizedVectorData;
+  private final Float quantile;
+  private final List fields = new ArrayList<>();
+
+  private boolean finished;
+
+  Lucene99ScalarQuantizedVectorsWriter(SegmentWriteState state, Float 
quantile) throws IOException {
+this.quantile = quantile;
+segmentWriteState = state;
+String metaFileName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name,
+state.segmentSuffix,
+
Lucene99ScalarQuantizedVectorsFormat.QUANTIZED_VECTOR_META_EXTENSION);
+
+String quantizedVectorDataFileName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name,
+state.segmentSuffix,
+
Lucene99ScalarQuantizedVectorsFormat.QUANTIZED_VECTOR_DATA_EXTENSION);
+
+boolean success = false;
+try {
+  meta = state.directory.createOutput(metaFileName, state.context);
+  quantizedVectorData =
+  state.directory.createOutput(quantizedVectorDataFileName, 
state.context);
+
+  CodecUtil.writeIndexHeader(
+  meta,
+  Lucene99ScalarQuantizedVectorsFormat.META_CODEC_NAME,
+  Lucene99ScalarQuantizedVectorsFormat.VERSION_CURRENT,
+  state.segmentInfo.getId(),
+  state.segmentSuffix);
+  CodecUtil.writeIndexHeader(
+  quantizedVectorData,
+  
Lucene99ScalarQuantizedVectorsFormat.QUANTIZED_VECTOR_DATA_CODEC_NAME,
+  Lucene99ScalarQuantizedVectorsFormat.VERSION_CURRENT,
+  state.segmentInfo.getId(),
+  state.segmentSuffix);
+  success = true;
+} finally {
+  if (success == false) {
+IOUtils.closeWhileHandlingException(this);
+  }
+}
+  }
+
+  @Override
+  public KnnFieldVectorsWriter addField(FieldInfo fieldInfo) 

[GitHub] [lucene] benwtrent commented on pull request #12582: Add new int8 scalar quantization to HNSW codec

2023-09-22 Thread via GitHub


benwtrent commented on PR #12582:
URL: https://github.com/apache/lucene/pull/12582#issuecomment-1731463225

   > Do we know why search is faster? Is it mostly because working on the 
quantized vectors requires a lower memory bandwi[d]th?
   
   Search is faster in two regards:
   
- PanamaVector allows for more `byte` actions to occur at once than 
`float32` (should be major)
- Reading `byte[]` off of a buffer doesn't require decoding floats (very 
minor change)
   
   IMO, we should be seeing WAY better search numbers. I need to do more 
testing to triple check.
   
   > Do you know how recall degrades compared to without quantization? I saw 
the numbers you shared but I don't have a good sense of what recall we usually 
had until now.
   
   ++ I want to graph the two together to compare so it's clearer.
   
   
   
   > I don't feel great about the logic that merges quantiles at merge time and 
only requantizes if the merged quantiles don't differ too much from the input 
quantiles. It feels like quantiles could slowly change over multiple merging 
rounds and we'd end up in a state where the quantized vectors would be 
different from requantizing the raw vectors with the quantization state that is 
stored in the segment, which feels wrong. Am I missing something?
   
   The quantization buckets could change slightly over time, but since we are bucketing `float32` into `int8`, the error bounds are comparatively large.
   
   The cost of requantization is almost never worth it. In my testing, quantiles computed over random data from the same data set show that segments differ by only around `1e-4`, which is tiny and shouldn't require requantization.
   
   @tveasey helped me do some empirical analysis here and can provide some 
numbers.
   
   
   > Related to the above, it looks like we ignore deletions when merging 
quantiles. It would probably be ok in practice most of the time but I worry 
that there might be corner cases?
   
   A corner case in what way? That we potentially include deletions when 
computing quantiles or if re-quantization is required?
   
   We can easily exclude them as conceptually, the "new" doc (if it were an 
update) would exist in another segment. It could be we are double counting a 
vector and we probably shouldn't do that.
   
   > > Do we want to have a new "flat" vector codec that HNSW (or other 
complicated vector indexing methods), can use? Detractor here is that now HNSW 
codec relies on another pluggable thing that is a "flat" vector index (just 
provides mechanisms for reading, writing, merging vectors in a flat index).
   
   > I don't have a strong opinion on this. Making it a codec though has the 
downside that it would require more files since two codecs can't write to the 
same file. Maybe having utility methods around reading/writing flat vectors is 
good enough?
   
   Utility methods are honestly what I am leaning towards. It's then a discussion around how a codec (like HNSW) is configured to use them.
   
   > > Should "quantization" just be a thing that is provided to vector codecs?
   
   > I might be misunderstanding the question, but to me this is what the 
byte[] encoding is about. And this quantization that's getting added here is 
more powerful because it's adaptative and will change over time depending on 
what vectors get indexed or deleted? If it needs to adapt to the data then it 
belongs to the codec. We could have utility code to make it easier to write 
codecs that quantize their data though (maybe this is what your question 
suggested?).
   
   Yeah, it needs to adapt over time. There are adverse cases (indexing vectors 
sorted by relative clusters is one) that need to be handled. But, they can be 
handled easily at merge time by recomputing quantiles and potentially 
re-quantizing.
   
   > > Should the "quantizer" keep the raw vectors around itself?
   
   > My understanding is that we have to, as the accuracy of the quantization 
could otherwise degrade over time in an unbounded fashion.
   
   After a period of time, if vectors are part of the same corpus and created 
via the same model, the quantiles actually level out and re-quantizing will 
rarely or never occur since the calculated quantiles are statistically 
equivalent. Especially given the binning into `int8`.





[GitHub] [lucene] uschindler commented on a diff in pull request #12582: Add new int8 scalar quantization to HNSW codec

2023-09-22 Thread via GitHub


uschindler commented on code in PR #12582:
URL: https://github.com/apache/lucene/pull/12582#discussion_r1334448931


##
lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java:
##
@@ -0,0 +1,209 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Random;
+import java.util.stream.IntStream;
+import org.apache.lucene.index.FloatVectorValues;
+import org.apache.lucene.index.VectorSimilarityFunction;
+
+/** Will scalar quantize float vectors into `int8` byte values */
+public class ScalarQuantizer {
+
+  public static final int SCALAR_QUANTIZATION_SAMPLE_SIZE = 25_000;
+
+  private final float alpha;
+  private final float offset;
+  private final float minQuantile, maxQuantile;
+
+  public ScalarQuantizer(float minQuantile, float maxQuantile) {
+assert maxQuantile >= minQuantile;
+this.minQuantile = minQuantile;
+this.maxQuantile = maxQuantile;
+this.alpha = (maxQuantile - minQuantile) / 127f;
+this.offset = minQuantile;
+  }
+
+  public void quantize(float[] src, byte[] dest) {
+assert src.length == dest.length;
+for (int i = 0; i < src.length; i++) {
+  dest[i] =
+  (byte)
+  Math.round(
+  (Math.max(minQuantile, Math.min(maxQuantile, src[i])) - 
minQuantile) / alpha);
+}
+  }
+
+  public void deQuantize(byte[] src, float[] dest) {
+assert src.length == dest.length;
+for (int i = 0; i < src.length; i++) {
+  dest[i] = (alpha * src[i]) + offset;
+}
+  }
+
+  public float calculateVectorOffset(byte[] vector, VectorSimilarityFunction 
similarityFunction) {
+if (similarityFunction != VectorSimilarityFunction.EUCLIDEAN) {
+  int sum = 0;
+  for (byte b : vector) {

Review Comment:
   Can't we use VectorUtil here for SIMD dotProduct?






[GitHub] [lucene] uschindler commented on a diff in pull request #12582: Add new int8 scalar quantization to HNSW codec

2023-09-22 Thread via GitHub


uschindler commented on code in PR #12582:
URL: https://github.com/apache/lucene/pull/12582#discussion_r1334450128


##
lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java:
##
@@ -0,0 +1,209 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Random;
+import java.util.stream.IntStream;
+import org.apache.lucene.index.FloatVectorValues;
+import org.apache.lucene.index.VectorSimilarityFunction;
+
+/** Will scalar quantize float vectors into `int8` byte values */
+public class ScalarQuantizer {
+
+  public static final int SCALAR_QUANTIZATION_SAMPLE_SIZE = 25_000;
+
+  private final float alpha;
+  private final float offset;
+  private final float minQuantile, maxQuantile;
+
+  public ScalarQuantizer(float minQuantile, float maxQuantile) {
+assert maxQuantile >= minQuantile;
+this.minQuantile = minQuantile;
+this.maxQuantile = maxQuantile;
+this.alpha = (maxQuantile - minQuantile) / 127f;
+this.offset = minQuantile;
+  }
+
+  public void quantize(float[] src, byte[] dest) {
+assert src.length == dest.length;
+for (int i = 0; i < src.length; i++) {
+  dest[i] =
+  (byte)
+  Math.round(
+  (Math.max(minQuantile, Math.min(maxQuantile, src[i])) - 
minQuantile) / alpha);
+}
+  }
+
+  public void deQuantize(byte[] src, float[] dest) {
+assert src.length == dest.length;
+for (int i = 0; i < src.length; i++) {
+  dest[i] = (alpha * src[i]) + offset;
+}
+  }
+
+  public float calculateVectorOffset(byte[] vector, VectorSimilarityFunction 
similarityFunction) {
+if (similarityFunction != VectorSimilarityFunction.EUCLIDEAN) {
+  int sum = 0;
+  for (byte b : vector) {

Review Comment:
   Ah sorry it just sums up. But we could add this to VectorUtil...
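   For reference, the loop in question just accumulates the byte components; a scalar version of the kind of helper being suggested for VectorUtil (the name is hypothetical, and the real one would presumably get a SIMD implementation) would be:

```java
public class ByteSumSketch {
  /** Hypothetical VectorUtil-style helper: scalar sum of the components of a byte[]. */
  public static int sumOfComponents(byte[] v) {
    int sum = 0;
    for (byte b : v) {
      sum += b; // bytes are signed, so components in [-128, 127] contribute as-is
    }
    return sum;
  }

  public static void main(String[] args) {
    if (sumOfComponents(new byte[] {1, 2, 3}) != 6) throw new AssertionError();
    if (sumOfComponents(new byte[] {-1, 1}) != 0) throw new AssertionError();
    System.out.println(sumOfComponents(new byte[] {1, 2, 3})); // prints 6
  }
}
```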






[GitHub] [lucene] tveasey commented on pull request #12582: Add new int8 scalar quantization to HNSW codec

2023-09-22 Thread via GitHub


tveasey commented on PR #12582:
URL: https://github.com/apache/lucene/pull/12582#issuecomment-1731530040

   > @tveasey helped me do some empirical analysis here and can provide some 
numbers.
   
   So the rationale is quite simple, as Ben said. If you change the upper and lower quantiles very little, then re-quantising doesn't change the quantized vectors much at all. In particular, you expect values to be roughly uniform in each bin, and unless you are near a snapping boundary you simply map the value to the same integer. Therefore, if the difference in the upper and lower quantile is "bin width" / n, you have roughly a 1 / n probability of changing any given value, by at most one, and only when the impact on the error is marginal (< "bin width" / n). In practice, even if the odd component, where the snapping decision is marginal, changes by +/- 1, the effect is dwarfed by all the other snapping going on when you quantize.
   
   I measured this for a few different datasets (using different SOTA embedding models), and for each dataset, over 100 merges, the effect was always less than 0.05 * "quantisation error". I note as well that this error magnitude is pretty consistent with the theory above (when properly formalised). Finally, this is all completely in the noise in terms of impact on recall for nearest-neighbour retrieval.
   
   I'll follow up with a link to a repo with a more detailed discussion and the 
code used for these experiments.
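   The argument above can be sanity-checked numerically with the quantization formula quoted earlier in this thread (alpha = (max - min) / 127, round after clamping): shifting both quantiles by a small fraction of a bin width only moves codes that happen to sit near a snapping boundary, and each by at most one. This is a standalone sketch over synthetic uniform data, not the PR's code:

```java
import java.util.Random;

public class RequantizeDriftSketch {
  // Same bucketing as the quantize() excerpt above: clamp to [min, max], scale by alpha.
  static byte[] quantize(float[] src, float min, float max) {
    float alpha = (max - min) / 127f;
    byte[] dest = new byte[src.length];
    for (int i = 0; i < src.length; i++) {
      dest[i] = (byte) Math.round((Math.max(min, Math.min(max, src[i])) - min) / alpha);
    }
    return dest;
  }

  /** Counts components whose code changes when both quantiles shift by delta. */
  static int changedCodes(float[] v, float min, float max, float delta) {
    byte[] a = quantize(v, min, max);
    byte[] b = quantize(v, min + delta, max + delta);
    int changed = 0;
    for (int i = 0; i < a.length; i++) {
      if (a[i] != b[i]) changed++;
    }
    return changed;
  }

  public static void main(String[] args) {
    Random r = new Random(42);
    float[] v = new float[1024];
    for (int i = 0; i < v.length; i++) v[i] = r.nextFloat(); // roughly uniform in [0, 1)
    float binWidth = 1f / 127f;
    // Shift the quantiles by 1/100 of a bin: per the argument above, only
    // components within that distance of a snapping boundary should move.
    int changed = changedCodes(v, 0f, 1f, binWidth / 100f);
    System.out.println(changed + " of " + v.length + " codes changed");
  }
}
```

   With uniform data the expected fraction of changed codes matches the 1 / n heuristic (here about 1%), which is the sense in which small quantile drift leaves the quantized vectors essentially unchanged.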





[GitHub] [lucene] rmuir commented on a diff in pull request #12582: Add new int8 scalar quantization to HNSW codec

2023-09-22 Thread via GitHub


rmuir commented on code in PR #12582:
URL: https://github.com/apache/lucene/pull/12582#discussion_r1334477512


##
lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java:
##
@@ -0,0 +1,209 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Random;
+import java.util.stream.IntStream;
+import org.apache.lucene.index.FloatVectorValues;
+import org.apache.lucene.index.VectorSimilarityFunction;
+
+/** Will scalar quantize float vectors into `int8` byte values */
+public class ScalarQuantizer {
+
+  public static final int SCALAR_QUANTIZATION_SAMPLE_SIZE = 25_000;
+
+  private final float alpha;
+  private final float offset;
+  private final float minQuantile, maxQuantile;
+
+  public ScalarQuantizer(float minQuantile, float maxQuantile) {
+assert maxQuantile >= minQuantile;
+this.minQuantile = minQuantile;
+this.maxQuantile = maxQuantile;
+this.alpha = (maxQuantile - minQuantile) / 127f;
+this.offset = minQuantile;
+  }
+
+  public void quantize(float[] src, byte[] dest) {
+assert src.length == dest.length;
+for (int i = 0; i < src.length; i++) {
+  dest[i] =
+  (byte)
+  Math.round(
+  (Math.max(minQuantile, Math.min(maxQuantile, src[i])) - 
minQuantile) / alpha);
+}
+  }
+
+  public void deQuantize(byte[] src, float[] dest) {
+assert src.length == dest.length;
+for (int i = 0; i < src.length; i++) {
+  dest[i] = (alpha * src[i]) + offset;
+}
+  }
+
+  public float calculateVectorOffset(byte[] vector, VectorSimilarityFunction 
similarityFunction) {
+if (similarityFunction != VectorSimilarityFunction.EUCLIDEAN) {
+  int sum = 0;
+  for (byte b : vector) {

Review Comment:
   summing bytes across an array like this should work with autovectorization, 
or it's seriously broken. There is no pesky floating-point order-of-operations 
restriction.
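For reference, the loop under discussion is a plain integer reduction over a `byte[]`. The sketch below (illustrative only, not `VectorUtil` code) shows why it is vectorization-friendly: integer addition is associative, so the JIT is free to reorder it, unlike a float summation.

```java
public class ByteSum {
  // Plain int reduction over a byte[]; integer addition is associative,
  // so the JIT can reorder/vectorize it freely (no FP ordering constraint).
  static int sum(byte[] v) {
    int sum = 0;
    for (byte b : v) {
      sum += b;
    }
    return sum;
  }

  public static void main(String[] args) {
    byte[] v = new byte[256];
    for (int i = 0; i < v.length; i++) {
      v[i] = (byte) (i % 4); // 0,1,2,3 repeating 64 times
    }
    System.out.println(sum(v)); // 64 * (0+1+2+3) = 384
  }
}
```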






[GitHub] [lucene] benwtrent commented on a diff in pull request #12582: Add new int8 scalar quantization to HNSW codec

2023-09-22 Thread via GitHub


benwtrent commented on code in PR #12582:
URL: https://github.com/apache/lucene/pull/12582#discussion_r1334508429


##
lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java:
##

Review Comment:
   @rmuir, exactly. Since it isn't floating point addition, I didn't think it 
necessary for VectorUtil to get involved.






[GitHub] [lucene] easyice commented on pull request #12557: Improve refresh speed with softdelete enable

2023-09-22 Thread via GitHub


easyice commented on PR #12557:
URL: https://github.com/apache/lucene/pull/12557#issuecomment-1731767546

   Update:
   
   when we call `softUpdateDocument` on a segment that already has some deleted 
docs, it iterates over all the deleted docs using 
`ReadersAndUpdates#MergedDocValues#onDiskDocValues`, but it has to iterate the 
values twice: the first pass, `Lucene90DocValuesConsumer#writeValues`, computes 
the gcd, min and max; the second pass is `IndexedDISI#writeBitSet`. This 
creates some waste. We can skip the first pass for soft deletes, which speeds 
up updates by about 53%.
   
   Benchmark code:
   ```
   public static void main(final String[] args) throws Exception {
     long min = Long.MAX_VALUE;
     for (int i = 0; i < 5; i++) {
       min = Math.min(doWrite(), min);
     }
     System.out.println("BEST:" + min);
   }

   static long doWrite() throws IOException {
     Random rand = new Random(5);
     Directory dir = new ByteBuffersDirectory();
     IndexWriter writer =
         new IndexWriter(
             dir,
             new IndexWriterConfig(null)
                 .setSoftDeletesField("_soft_deletes")
                 .setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH));
     int maxDoc = 4096 * 100;

     for (int i = 0; i < maxDoc; i++) {
       Document doc = new Document();
       doc.add(new StringField("id", String.valueOf(i), Field.Store.NO));
       writer.addDocument(doc);
       if (i > 0 && i % 5000 == 0) {
         writer.commit();
       }
     }

     System.out.println("start update");
     long t0 = System.currentTimeMillis();

     for (int i = 0; i < maxDoc; i += 2) {
       Document doc = new Document();
       writer.softUpdateDocument(
           new Term("id", String.valueOf(i)),
           doc,
           new NumericDocValuesField("_soft_deletes", 1));
       if (i > 0 && i % 100 == 0) {
         writer.commit();
       }
     }
     long tookMs = System.currentTimeMillis() - t0;
     System.out.println("update took:" + tookMs);

     IOUtils.close(writer, dir);
     return tookMs;
   }
   ```
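As a hedged sketch (not `Lucene90DocValuesConsumer`'s actual code) of why that first stats pass is redundant for a soft-deletes field: the pass computes min/max/gcd over the values, but a soft-deletes field writes the same constant for every doc, so the result is known without iterating.

```java
public class StatsPass {
  // Stats the first pass gathers before encoding: {min, max, gcd}.
  static long[] stats(long[] values) {
    long min = Long.MAX_VALUE, max = Long.MIN_VALUE, gcd = 0;
    for (long v : values) {
      min = Math.min(min, v);
      max = Math.max(max, v);
      gcd = gcd(gcd, v);
    }
    return new long[] {min, max, gcd};
  }

  static long gcd(long a, long b) {
    while (b != 0) {
      long t = a % b;
      a = b;
      b = t;
    }
    return Math.abs(a);
  }

  public static void main(String[] args) {
    // For a soft-deletes field every value is the same constant (1), so the
    // whole pass is predictable up front: min = max = gcd = 1, no iteration
    // needed -- which is the waste the change above removes.
    long[] softDeletes = {1, 1, 1, 1};
    long[] s = stats(softDeletes);
    System.out.println(s[0] + " " + s[1] + " " + s[2]); // 1 1 1
  }
}
```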





[GitHub] [lucene] gsmiller commented on pull request #12560: Defer #advanceExact on expression dependencies until their values are needed

2023-09-22 Thread via GitHub


gsmiller commented on PR #12560:
URL: https://github.com/apache/lucene/pull/12560#issuecomment-1731814078

   Circling back on this: For Amazon's Product Search engine, we make fairly 
heavy use of these expression implementations. I pulled this change into our 
Lucene fork early (currently on 9.7) and ran our internal benchmarks, and this 
produced a ~23% redline QPS improvement. Mileage may vary of course, but the 
impact was significant, so other heavy expression users may find a nice win as 
well.
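The change being benchmarked defers `advanceExact` on an expression's dependencies until their values are actually read. A minimal sketch of that lazy pattern, with hypothetical names rather than the expressions module's real classes:

```java
public class LazyValue {
  interface Source {
    boolean advanceExact(int doc); // position the source on doc
    double value();
  }

  // Wraps a dependency and postpones advanceExact(doc) until value() is
  // actually called, so dependencies an expression never reads for a given
  // doc are never advanced at all.
  static final class Deferred {
    private final Source in;
    private int pendingDoc = -1;
    private boolean advanced;
    private int advanceCalls; // instrumentation for this sketch

    Deferred(Source in) {
      this.in = in;
    }

    void setDocument(int doc) { // cheap: just remember the target doc
      pendingDoc = doc;
      advanced = false;
    }

    double value() { // pay for advanceExact only on first read per doc
      if (!advanced) {
        advanceCalls++;
        in.advanceExact(pendingDoc);
        advanced = true;
      }
      return in.value();
    }

    int advanceCalls() {
      return advanceCalls;
    }
  }

  public static void main(String[] args) {
    Deferred d = new Deferred(new Source() {
      private int doc;
      public boolean advanceExact(int target) { doc = target; return true; }
      public double value() { return doc * 2.0; }
    });
    double last = 0;
    for (int doc = 0; doc < 100; doc++) { // visit 100 docs...
      d.setDocument(doc);
      if (doc == 42) last = d.value();    // ...but read only one value
    }
    // Only one advanceExact call happened for 100 visited docs.
    System.out.println(last + " after " + d.advanceCalls() + " advance call(s)");
  }
}
```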





[GitHub] [lucene] jimczi commented on a diff in pull request #12582: Add new int8 scalar quantization to HNSW codec

2023-09-22 Thread via GitHub


jimczi commented on code in PR #12582:
URL: https://github.com/apache/lucene/pull/12582#discussion_r1334738549


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java:
##
@@ -0,0 +1,851 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene99;
+
+import static 
org.apache.lucene.codecs.lucene99.Lucene99ScalarQuantizedVectorsFormat.DIRECT_MONOTONIC_BLOCK_SHIFT;
+import static 
org.apache.lucene.codecs.lucene99.Lucene99ScalarQuantizedVectorsFormat.calculateDefaultQuantile;
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.codecs.KnnFieldVectorsWriter;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat;
+import org.apache.lucene.index.DocIDMerger;
+import org.apache.lucene.index.DocsWithFieldSet;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.FloatVectorValues;
+import org.apache.lucene.index.IndexFileNames;
+import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.index.Sorter;
+import org.apache.lucene.index.VectorEncoding;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.RamUsageEstimator;
+import org.apache.lucene.util.ScalarQuantizer;
+import org.apache.lucene.util.VectorUtil;
+import org.apache.lucene.util.packed.DirectMonotonicWriter;
+
+/**
+ * Writes quantized vector values and metadata to index segments.
+ *
+ * @lucene.experimental
+ */
+public final class Lucene99ScalarQuantizedVectorsWriter implements 
QuantizedVectorsWriter {
+
+  private static final long BASE_RAM_BYTES_USED =
+  
RamUsageEstimator.shallowSizeOfInstance(Lucene99ScalarQuantizedVectorsWriter.class);
+
+  private static final float QUANTIZATION_RECOMPUTE_LIMIT = 32;
+  private final SegmentWriteState segmentWriteState;
+  private final IndexOutput meta, quantizedVectorData;
+  private final Float quantile;
+  private final List fields = new ArrayList<>();
+
+  private boolean finished;
+
+  Lucene99ScalarQuantizedVectorsWriter(SegmentWriteState state, Float 
quantile) throws IOException {
+this.quantile = quantile;
+segmentWriteState = state;
+String metaFileName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name,
+state.segmentSuffix,
+
Lucene99ScalarQuantizedVectorsFormat.QUANTIZED_VECTOR_META_EXTENSION);
+
+String quantizedVectorDataFileName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name,
+state.segmentSuffix,
+
Lucene99ScalarQuantizedVectorsFormat.QUANTIZED_VECTOR_DATA_EXTENSION);
+
+boolean success = false;
+try {
+  meta = state.directory.createOutput(metaFileName, state.context);
+  quantizedVectorData =
+  state.directory.createOutput(quantizedVectorDataFileName, 
state.context);
+
+  CodecUtil.writeIndexHeader(
+  meta,
+  Lucene99ScalarQuantizedVectorsFormat.META_CODEC_NAME,
+  Lucene99ScalarQuantizedVectorsFormat.VERSION_CURRENT,
+  state.segmentInfo.getId(),
+  state.segmentSuffix);
+  CodecUtil.writeIndexHeader(
+  quantizedVectorData,
+  
Lucene99ScalarQuantizedVectorsFormat.QUANTIZED_VECTOR_DATA_CODEC_NAME,
+  Lucene99ScalarQuantizedVectorsFormat.VERSION_CURRENT,
+  state.segmentInfo.getId(),
+  state.segmentSuffix);
+  success = true;
+} finally {
+  if (success == false) {
+IOUtils.closeWhileHandlingException(this);
+  }
+}
+  }
+
+  @Override
+  public KnnFieldVectorsWriter addField(FieldInfo fieldInfo) thr

[GitHub] [lucene] jimczi commented on a diff in pull request #12582: Add new int8 scalar quantization to HNSW codec

2023-09-22 Thread via GitHub


jimczi commented on code in PR #12582:
URL: https://github.com/apache/lucene/pull/12582#discussion_r1334746185


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java:
##

[GitHub] [lucene] jimczi commented on a diff in pull request #12582: Add new int8 scalar quantization to HNSW codec

2023-09-22 Thread via GitHub


jimczi commented on code in PR #12582:
URL: https://github.com/apache/lucene/pull/12582#discussion_r1334758274


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java:
##