[GitHub] [lucene] benwtrent commented on a diff in pull request #11946: add similarity threshold for hnsw

2022-12-06 Thread GitBox


benwtrent commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1040969450


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##
@@ -37,6 +37,7 @@
  * @param <T> the type of query vector
  */
 public class HnswGraphSearcher<T> {
+  private final int UNBOUNDED_QUEUE_INIT_SIZE = 10_000;

Review Comment:
   Any research to indicate why this number was chosen? It seems silly that if 
a user provides `k = 10_001` it would have a queue bigger than `k = 
Integer.MAX_VALUE`.
   
   Technically, the max value here should be something like 
`ArrayUtil.MAX_ARRAY_LENGTH`, but this eagerly allocates a `new 
long[heapSize];`, which is VERY costly.
   
   I would prefer a number with some significant reason behind it or some 
better way of queueing neighbors.
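
One way to avoid both the arbitrary constant and the eager `new long[heapSize]` allocation is to start the backing array small and grow it on demand, capped at `k`. The sketch below is a hypothetical stand-alone min-heap of `long` values, not Lucene's own heap or `NeighborQueue`; the class name `GrowableLongHeap`, the initial capacity of 16, and the doubling growth policy are all assumptions made for illustration.

```java
import java.util.Arrays;

/** Hypothetical min-heap of longs; the backing array grows on demand, up to maxSize. */
final class GrowableLongHeap {
  private static final int INITIAL_CAPACITY = 16; // assumed starting size, not a tuned value
  private static final int MAX_SIZE = Integer.MAX_VALUE - 16; // stay below JVM array limits

  private final int maxSize;
  private long[] heap; // 1-based; heap[0] is unused
  private int size;

  GrowableLongHeap(int maxSize) {
    if (maxSize < 1) {
      throw new IllegalArgumentException("maxSize must be >= 1, got " + maxSize);
    }
    this.maxSize = Math.min(maxSize, MAX_SIZE);
    // Allocate only a small array up front, never maxSize slots.
    this.heap = new long[Math.min(this.maxSize, INITIAL_CAPACITY) + 1];
  }

  /** Adds a value; once maxSize is reached, replaces the smallest value if the new one is larger. */
  boolean insertWithOverflow(long value) {
    if (size < maxSize) {
      if (size + 1 == heap.length) {
        // Double the array, but never beyond maxSize usable slots.
        int newLength = (int) Math.min((long) maxSize + 1, 2L * heap.length);
        heap = Arrays.copyOf(heap, newLength);
      }
      heap[++size] = value;
      siftUp(size);
      return true;
    }
    if (value <= heap[1]) {
      return false; // not better than the current minimum
    }
    heap[1] = value;
    siftDown(1);
    return true;
  }

  long top() {
    if (size == 0) {
      throw new IllegalStateException("heap is empty");
    }
    return heap[1];
  }

  int size() {
    return size;
  }

  int capacity() {
    return heap.length - 1;
  }

  private void siftUp(int i) {
    long v = heap[i];
    while (i > 1 && v < heap[i >>> 1]) {
      heap[i] = heap[i >>> 1];
      i >>>= 1;
    }
    heap[i] = v;
  }

  private void siftDown(int i) {
    long v = heap[i];
    while (true) {
      int child = i << 1;
      if (child > size) {
        break;
      }
      if (child < size && heap[child + 1] < heap[child]) {
        child++;
      }
      if (heap[child] >= v) {
        break;
      }
      heap[i] = heap[child];
      i = child;
    }
    heap[i] = v;
  }
}
```

With this shape, `new GrowableLongHeap(Integer.MAX_VALUE)` costs the same up front as `new GrowableLongHeap(16)`, and memory use follows the number of neighbors actually collected rather than the requested `k`.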



##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##
@@ -235,7 +312,7 @@ private NeighborQueue searchLevel(
 while (candidates.size() > 0 && results.incomplete() == false) {
   // get the best candidate (closest or best scoring)
   float topCandidateSimilarity = candidates.topScore();
-  if (topCandidateSimilarity < minAcceptedSimilarity) {
+  if (topCandidateSimilarity < minAcceptedSimilarity && results.size() >= 
topK) {
 break;
   }

Review Comment:
   I am not sure about this. This stops gathering results once it's filled. This 
defeats the purpose of exploring the graph.
   
   Have you seen how this affects recall?



##
lucene/core/src/java/org/apache/lucene/index/LeafReader.java:
##
@@ -232,8 +232,48 @@ public final PostingsEnum postings(Term term) throws 
IOException {
* @return the k nearest neighbor documents, along with their 
(searchStrategy-specific) scores.
* @lucene.experimental
*/
+  public final TopDocs searchNearestVectors(
+  String field, float[] target, int k, Bits acceptDocs, int visitedLimit) 
throws IOException {
+return searchNearestVectors(
+field, target, k, Float.NEGATIVE_INFINITY, acceptDocs, visitedLimit);
+  }
+
+  /**
+   * Return the k nearest neighbor documents as determined by comparison of 
their vector values for
+   * this field, to the given vector, by the field's similarity function. The 
score of each document
+   * is derived from the vector similarity in a way that ensures scores are 
positive and that a
+   * larger score corresponds to a higher ranking.
+   *
+   * The search is allowed to be approximate, meaning the results are not 
guaranteed to be the
+   * true k closest neighbors. For large values of k (for example when k is 
close to the total
+   * number of documents), the search may also retrieve fewer than k documents.
+   *
+   * The returned {@link TopDocs} will contain a {@link ScoreDoc} for each 
nearest neighbor,
+   * sorted in order of their similarity to the query vector (decreasing 
scores). The {@link
+   * TotalHits} contains the number of documents visited during the search. If 
the search stopped
+   * early because it hit {@code visitedLimit}, it is indicated through the 
relation {@code
+   * TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO}.
+   *
+   * @param field the vector field to search
+   * @param target the vector-valued query
+   * @param k the number of docs to return (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:
   Would it be possible for this threshold to be an actual distance? My concern 
here is that for things like `byteVectors`, dot-product scores are insanely 
small (I think this is a design flaw in itself) and may be confusing to users 
who want a given "radius" but instead have to figure out a score related to 
their radius. 
   
   IF we do provide some filtering on a threshold within the search, it would 
be prudent for this threshold to reflect vector distance directly.
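
To illustrate the gap between a "radius" and a score: to the best of my knowledge, Lucene's Euclidean similarity at the time scored a pair of vectors as `1 / (1 + squaredDistance)`. Treat that formula as an assumption and check it against `VectorSimilarityFunction` in the Lucene version in use. Under that assumption, a user who thinks in distances would have to convert by hand, roughly as in this stand-alone sketch (the class and method names are hypothetical and nothing here calls Lucene):

```java
/** Hypothetical helper converting between a Euclidean radius and a similarity score. */
final class RadiusToScore {
  private RadiusToScore() {}

  /** Score for a Euclidean distance, assuming score = 1 / (1 + distance^2). */
  static float euclideanScore(float distance) {
    if (distance < 0f) {
      throw new IllegalArgumentException("distance must be >= 0, got " + distance);
    }
    return 1f / (1f + distance * distance);
  }

  /** Inverse mapping: the distance that corresponds to a score in (0, 1]. */
  static float euclideanDistance(float score) {
    if (!(score > 0f && score <= 1f)) {
      throw new IllegalArgumentException("score must be in (0, 1], got " + score);
    }
    return (float) Math.sqrt(1.0 / score - 1.0);
  }
}
```

Under the assumed formula a radius of `1` corresponds to a score threshold of `0.5`, and a radius of `3` to `0.1`. Other similarity functions would need their own mapping, which is the usability concern raised above.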



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on a diff in pull request #11946: add similarity threshold for hnsw

2022-12-06 Thread GitBox


rmuir commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1041043232


##
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java:
##
@@ -236,7 +236,13 @@ public VectorValues getVectorValues(String field) throws 
IOException {
   }
 
   @Override
-  public TopDocs search(String field, float[] target, int k, Bits acceptDocs, 
int visitedLimit)
+  public TopDocs search(
+  String field,
+  float[] target,
+  int k,
+  float similarityThreshold,
+  Bits acceptDocs,
+  int visitedLimit)

Review Comment:
   please overload the method, and tag all the APIs experimental. I'm really 
concerned about us locking ourselves into HNSW, and we must...must get away 
from it (it's like 1000x slower than it should be).
   
   the alternative is to feature-freeze vectors completely until they scale. so 
i think this is a reasonable compromise.






[GitHub] [lucene] rmuir commented on pull request #11946: add similarity threshold for hnsw

2022-12-06 Thread GitBox


rmuir commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1339474637

   > What I have in mind would be to implement entirely in the
   > KnnVectorQuery. Since results are sorted by score, they can easily be
   > post-filtered there: no need to implement anything at the codec layer
   > I think. Am I missing something?
   
   is there any possibility other than adding all these LeafReader/IndexReader 
signatures?
   
   Currently I'm -1 to the change from an API perspective. It is too invasive.





[GitHub] [lucene] rmuir commented on pull request #11998: Migrate away from per-segment-per-threadlocals on SegmentReader

2022-12-06 Thread GitBox


rmuir commented on PR #11998:
URL: https://github.com/apache/lucene/pull/11998#issuecomment-1339780615

   That's fine. or we could fix `newSearcher` to not wrap with crazy 
CodecReaders. or we could fix said CodecReaders (since they are only used for 
tests) to implement the deprecated document apis, like SegmentReader does. Or 
we could give CodecReader a default impl that isn't very performant, rather 
than throwing UOE.
   
   I wanted to throw the UOE, at least at first, to be sure i knew exactly what 
was calling old .document API (e.g. in case i forgot to fix a filter-reader). 
But it doesn't have to stay.





[GitHub] [lucene] rmuir commented on a diff in pull request #11997: Add IntField, LongField, FloatField and DoubleField

2022-12-06 Thread GitBox


rmuir commented on code in PR #11997:
URL: https://github.com/apache/lucene/pull/11997#discussion_r1041309581


##
lucene/core/src/java/org/apache/lucene/document/DoubleField.java:
##
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.PointValues;
+import org.apache.lucene.search.IndexOrDocValuesQuery;
+import org.apache.lucene.search.PointRangeQuery;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.NumericUtils;
+
+/**
+ * Field that stores a per-document double value for scoring, 
sorting or value
+ * retrieval and index the field for fast range filters. If you also need to 
store the value, you
+ * should add a separate {@link StoredField} instance. If you need more 
fine-grained control you can
+ * use {@link DoublePoint} and {@link DoubleDocValuesField}.
+ *
+ * This field defines static factory methods for creating common queries:
+ *
+ * 
+ *   {@link #newExactQuery(String, double)} for matching an exact 1D point.
+ *   {@link #newRangeQuery(String, double, double)} for matching a 1D 
range.
+ * 
+ *
+ * @see PointValues
+ */
+public final class DoubleField extends Field {
+  /**
+   * Creates a new DoubleField, indexing the provided point and storing it as 
a DocValue
+   *
+   * @param name field name
+   * @param value the double value
+   * @param sorted configure the field to support multiple DocValues
+   * @throws IllegalArgumentException if the field name or value is null.
+   */
+  public DoubleField(String name, double value, boolean sorted) {
+this(name, Double.valueOf(value), sorted);
+  }
+
+  /**
+   * Creates a new DoubleField, indexing the provided point and storing it as 
a DocValue
+   *
+   * @param name field name
+   * @param value the double value
+   * @param sorted configure the field to support multiple DocValues
+   * @throws IllegalArgumentException if the field name or value is null.
+   */
+  public DoubleField(String name, Double value, boolean sorted) {

Review Comment:
   I don't think we need a boxed version of the ctor `Double` vs `double`?






[GitHub] [lucene] rmuir commented on a diff in pull request #11997: Add IntField, LongField, FloatField and DoubleField

2022-12-06 Thread GitBox


rmuir commented on code in PR #11997:
URL: https://github.com/apache/lucene/pull/11997#discussion_r1041312439


##
lucene/core/src/java/org/apache/lucene/document/DoubleField.java:
##
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.PointValues;
+import org.apache.lucene.search.IndexOrDocValuesQuery;
+import org.apache.lucene.search.PointRangeQuery;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.NumericUtils;
+
+/**
+ * Field that stores a per-document double value for scoring, 
sorting or value
+ * retrieval and index the field for fast range filters. If you also need to 
store the value, you
+ * should add a separate {@link StoredField} instance. If you need more 
fine-grained control you can
+ * use {@link DoublePoint} and {@link DoubleDocValuesField}.
+ *
+ * This field defines static factory methods for creating common queries:
+ *
+ * 
+ *   {@link #newExactQuery(String, double)} for matching an exact 1D point.
+ *   {@link #newRangeQuery(String, double, double)} for matching a 1D 
range.
+ * 
+ *
+ * @see PointValues
+ */
+public final class DoubleField extends Field {
+  /**
+   * Creates a new DoubleField, indexing the provided point and storing it as 
a DocValue
+   *
+   * @param name field name
+   * @param value the double value
+   * @param sorted configure the field to support multiple DocValues
+   * @throws IllegalArgumentException if the field name or value is null.
+   */
+  public DoubleField(String name, double value, boolean sorted) {

Review Comment:
   Should we just make the field multivalued always to simplify it? Or, make 
the default `true` and provide another ctor with `boolean multivalued`? I think 
i'd prefer `multivalued` to `sorted` as the name of the parameter.
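
The second option could look roughly like the sketch below: a two-argument constructor that defaults to multivalued, plus an overload taking `boolean multivalued`. Everything here is hypothetical and stands alone: `DoubleFieldSketch` and `DocValuesKind` are illustrative stand-ins, and the mapping of `multivalued` to a sorted-numeric doc-values type is an assumption about what the `sorted` flag in the PR controls.

```java
/** Hypothetical stand-in for the proposed field class; not Lucene's DoubleField. */
final class DoubleFieldSketch {
  /** Illustrative stand-in for a doc-values type. */
  enum DocValuesKind {
    NUMERIC, // one value per document
    SORTED_NUMERIC // multiple values per document
  }

  final String name;
  final double value;
  final DocValuesKind docValuesKind;

  /** Default constructor: multivalued, as suggested in the review. */
  DoubleFieldSketch(String name, double value) {
    this(name, value, true);
  }

  /** Overload for callers that want to opt out of multivalued doc values. */
  DoubleFieldSketch(String name, double value, boolean multivalued) {
    if (name == null) {
      throw new IllegalArgumentException("field name must not be null");
    }
    this.name = name;
    this.value = value;
    this.docValuesKind = multivalued ? DocValuesKind.SORTED_NUMERIC : DocValuesKind.NUMERIC;
  }
}
```

Taking a primitive `double` in both constructors also sidesteps the boxed `Double` overload questioned in the other comment on this PR.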



##
lucene/core/src/java/org/apache/lucene/document/DoubleField.java:
##
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.PointValues;
+import org.apache.lucene.search.IndexOrDocValuesQuery;
+import org.apache.lucene.search.PointRangeQuery;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.NumericUtils;
+
+/**
+ * Field that stores a per-document double value for scoring, 
sorting or value
+ * retrieval and index the field for fast range filters. If you also need to 
store the value, you
+ * should add a separate {@link StoredField} instance. If you need more 
fine-grained control you can
+ * use {@link DoublePoint} and {@link DoubleDocValuesField}.
+ *
+ * This field defines static factory methods for creating common queries:
+ *
+ * 
+ *   {@link #newExactQuery(String, double)} for matching an exact 1D point.
+ *   {@link #newRangeQuery(String, double, double)} for matching a 1D 
range.
+ * 
+ *
+ * @see PointValues
+ */
+public final class DoubleField extends Field {
+  /**
+   * Creates a new DoubleField, indexing the provided point and storing it as 
a DocValue
+   *
+   * @param name field name
+   * @param value the double value
+   * @param sorted configure the field to support multiple DocValues
+   * @throws IllegalArgumentException if the field name or value is null.
+  

[GitHub] [lucene] rmuir commented on pull request #11997: Add IntField, LongField, FloatField and DoubleField

2022-12-06 Thread GitBox


rmuir commented on PR #11997:
URL: https://github.com/apache/lucene/pull/11997#issuecomment-1339794352

   I like the idea too. Make the .document API simpler for typical use-cases! I 
added a couple cosmetic comments.





[GitHub] [lucene] rmuir commented on pull request #11998: Migrate away from per-segment-per-threadlocals on SegmentReader

2022-12-06 Thread GitBox


rmuir commented on PR #11998:
URL: https://github.com/apache/lucene/pull/11998#issuecomment-1339823026

   I got the tests happy for now with 12a5dfaeba954a049675830eabd54bd8f58b51c2
   
   Maybe not the right solution in the end, but makes it easier to iterate when 
you have passing tests at least.
   





[GitHub] [lucene] jpountz commented on pull request #11995: enable fully directly copy merge/flush fdt files when index sorting

2022-12-06 Thread GitBox


jpountz commented on PR #11995:
URL: https://github.com/apache/lucene/pull/11995#issuecomment-1339962667

   Thanks for the explanation of what this PR does. I'm not comfortable with 
the fact that with your change, stored fields are no longer stored in doc ID 
order. It's probably a good trade-off in your case and maybe something you can 
do in a custom codec, but I don't like doing it in the default codec as it 
would also mean that users can no longer leverage index sorting to improve data 
locality within stored fields, and that feeding this stored fields reader into 
another writer for a merge would trigger lots of random access.
   
   While it certainly wouldn't result in a speedup that is as good, I'll point 
out that using a merge policy that only merges adjacent segments, like 
`LogByteSizeMergePolicy`, should help better take advantage of 
https://github.com/apache/lucene/pull/134 and get slightly better merging 
performance when an index sort on timestamp is configured.





[GitHub] [lucene] jpountz commented on a diff in pull request #11999: Add support for stored fields to MemoryIndex

2022-12-06 Thread GitBox


jpountz commented on code in PR #11999:
URL: https://github.com/apache/lucene/pull/11999#discussion_r1041429957


##
lucene/memory/src/java/org/apache/lucene/index/memory/StoredValues.java:
##
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index.memory;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.IndexableField;
+import org.apache.lucene.index.StoredFieldVisitor;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.OutputStreamDataOutput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.NumericUtils;
+
+class StoredValues {
+
+  private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
+  private final DataOutput out = new OutputStreamDataOutput(bytes);

Review Comment:
   I think we'd generally use a ByteBuffersDataOutput instead?






[GitHub] [lucene] jpountz commented on a diff in pull request #11958: GITHUB-11868: Add FilterIndexInput and FilterIndexOutput wrapper classes

2022-12-06 Thread GitBox


jpountz commented on code in PR #11958:
URL: https://github.com/apache/lucene/pull/11958#discussion_r1041440269


##
lucene/core/src/java/org/apache/lucene/store/FilterIndexInput.java:
##
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+
+/**
+ * IndexInput implementation that delegates calls to another directory. This 
class can be used to
+ * add limitations on top of an existing {@link IndexInput} implementation or 
to add additional
+ * sanity checks for tests. However, if you plan to write your own {@link 
IndexInput}
+ * implementation, you should consider extending directly {@link IndexInput} 
or {@link DataInput}
+ * rather than try to reuse functionality of existing {@link IndexInput}s by 
extending this class.
+ *
+ * @lucene.internal
+ */
+public class FilterIndexInput extends IndexInput {
+
+  public static IndexInput unwrap(IndexInput in) {

Review Comment:
   Why are we adding these unwrap methods? They don't seem to be used anywhere. 
I know we have them on some other `Filter` classes but it's a bug IMO.



##
lucene/core/src/test/org/apache/lucene/index/TestFilterIndexInput.java:
##
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.lang.reflect.Method;
+import java.util.HashSet;
+import java.util.Random;
+import java.util.Set;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FSDirectory;
+import org.apache.lucene.store.FilterIndexInput;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.IndexOutput;
+import org.junit.Test;
+
+public class TestFilterIndexInput extends TestIndexInput {
+
+  @Override
+  public IndexInput getIndexInput(long len) {
+return new FilterIndexInput("wrapped foo", new 
InterceptingIndexInput("foo", len));
+  }
+
+  public void testRawFilterIndexInputRead() throws IOException {
+for (int i = 0; i < 10; i++) {
+  Random random = random();
+  final Directory dir = newDirectory();
+  IndexOutput os = dir.createOutput("foo", newIOContext(random));
+  os.writeBytes(READ_TEST_BYTES, READ_TEST_BYTES.length);
+  os.close();
+  IndexInput is =
+  new FilterIndexInput("wrapped foo", dir.openInput("foo", 
newIOContext(random)));
+  checkReads(is, IOException.class);
+  checkSeeksAndSkips(is, random);
+  is.close();
+
+  os = dir.createOutput("bar", newIOContext(random));
+  os.writeBytes(RANDOM_TEST_BYTES, RANDOM_TEST_BYTES.length);
+  os.close();
+  is = new FilterIndexInput("wrapped bar", dir.openInput("bar", 
newIOContext(random)));
+  checkRandomReads(is);
+  checkSeeksAndSkips(is, random);
+  is.close();
+  dir.close();
+}
+  }
+
+  @Test
+  public void testOverrides() throws Exception {
+// verify that all abstract methods of IndexInput/DataInput are overridden 
by FilterDirectory,
+// except those under the 'exclude' list
+Set exclude = new HashSet<>();
+
+exclude.add(IndexInput.class.getMethod("toString"));
+exclude.add(IndexInput.class.getMethod("skipBytes", long.class));
+exclude.add(IndexInput.class.getDeclaredMethod("getFullSliceDescription", 
String.class));
+exclude.add(IndexInput.class.getMethod("randomAcc

[GitHub] [lucene] agorlenko commented on pull request #11946: add similarity threshold for hnsw

2022-12-06 Thread GitBox


agorlenko commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1340151458

   I've done some experiments with real data and it seems that it really 
doesn't work as I expected. If the number of docs that exceed the threshold is 
significant (for example, 20% or more of previously accepted docs), the query 
is slow and it is better to perform an exact search. Unfortunately, this 
happens quite often. 
   
   So I agree with @msokolov and I think I should rewrite this PR with a 
post-filtering approach. It allows us to preserve predictable performance and 
not modify LeafReader/IndexReader (just filter TopDocs in KnnVectorQuery).
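
The post-filtering idea can be sketched without touching any reader API: since k-NN hits come back sorted by decreasing score, everything from the first hit below the threshold onward can simply be dropped. The `Hit` class and `filterByThreshold` method below are hypothetical stand-ins for `ScoreDoc`/`TopDocs` handling, written as a stand-alone illustration; they are not the code that was later proposed for `KnnVectorQuery`.

```java
import java.util.Arrays;

/** Hypothetical post-filter over k-NN hits that are already sorted by decreasing score. */
final class ThresholdPostFilter {
  private ThresholdPostFilter() {}

  /** Minimal stand-in for a scored hit. */
  static final class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) {
      this.doc = doc;
      this.score = score;
    }
  }

  /**
   * Returns the leading hits whose score is at least {@code threshold}.
   * Relies on {@code hits} being sorted by decreasing score, so the scan
   * can stop at the first hit that falls below the threshold.
   */
  static Hit[] filterByThreshold(Hit[] hits, float threshold) {
    int keep = 0;
    while (keep < hits.length && hits[keep].score >= threshold) {
      keep++;
    }
    return Arrays.copyOf(hits, keep);
  }
}
```

Passing `Float.NEGATIVE_INFINITY` as the threshold keeps every hit, so the unfiltered behavior is preserved as a special case. The graph search itself still runs with a fixed `k`, which is what keeps its cost predictable.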





[GitHub] [lucene] agorlenko commented on a diff in pull request #11946: add similarity threshold for hnsw

2022-12-06 Thread GitBox


agorlenko commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1041584638


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##
@@ -37,6 +37,7 @@
  * @param <T> the type of query vector
  */
 public class HnswGraphSearcher<T> {
+  private final int UNBOUNDED_QUEUE_INIT_SIZE = 10_000;

Review Comment:
   I wanted to set a fairly large initial heap size in order to reduce the 
number of times the heap has to grow. But it seems that post-filtering would 
be better: https://github.com/apache/lucene/pull/11946#issuecomment-1340151458 
   
   In this case we don't have to modify `HnswGraphSearcher` at all.






[GitHub] [lucene] wjp719 commented on pull request #11995: enable fully directly copy merge/flush fdt files when index sorting

2022-12-06 Thread GitBox


wjp719 commented on PR #11995:
URL: https://github.com/apache/lucene/pull/11995#issuecomment-1340308711

   > It's probably a good trade-off in your case and maybe something you can do 
in a custom codec
   
   Thanks for your reply. Does that mean I can add a new custom codec in 
Lucene?





[GitHub] [lucene] wjp719 closed pull request #11995: enable fully directly copy merge/flush fdt files when index sorting

2022-12-06 Thread GitBox


wjp719 closed pull request #11995: enable fully directly copy merge/flush fdt 
files when index sorting  
URL: https://github.com/apache/lucene/pull/11995





[GitHub] [lucene] jpountz merged pull request #11860: GITHUB-11830 Better optimize storage for vector connections

2022-12-06 Thread GitBox


jpountz merged PR #11860:
URL: https://github.com/apache/lucene/pull/11860





[GitHub] [lucene] jpountz closed issue #11830: Store HNSW graph connections more compactly

2022-12-06 Thread GitBox


jpountz closed issue #11830: Store HNSW graph connections more compactly
URL: https://github.com/apache/lucene/issues/11830

