[GitHub] [lucene] iverase opened a new pull request #685: LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate Weight#count.

2022-02-16 Thread GitBox


iverase opened a new pull request #685:
URL: https://github.com/apache/lucene/pull/685


   These query wrappers do not modify the set of matching documents, so they can 
delegate Weight#count.
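
A minimal sketch of the delegation idea, not the code from the PR. `SketchWeight` and the wrapper names are hypothetical stand-ins; the only behaviour borrowed from Lucene is the convention that `count` returns -1 when the number of matches cannot be computed cheaply.

```java
// Hypothetical, simplified stand-in for Lucene's Weight: count() returns the
// number of matching documents, or -1 when it cannot be computed cheaply.
interface SketchWeight {
  int count();
}

final class WeightCountSketch {
  /** A leaf weight that knows its match count up front. */
  static SketchWeight knownCount(int n) {
    return () -> n;
  }

  /** A wrapper that only changes scores: the set of matches is the inner one, so delegate. */
  static SketchWeight scoreOnlyWrapper(SketchWeight inner) {
    return inner::count;
  }

  /** A wrapper that may drop documents cannot delegate and reports "unknown". */
  static SketchWeight filteringWrapper(SketchWeight inner) {
    return () -> -1;
  }

  public static void main(String[] args) {
    SketchWeight base = knownCount(42);
    System.out.println(scoreOnlyWrapper(base).count());
    System.out.println(filteringWrapper(base).count());
  }
}
```

The design point is that delegation is only safe when the wrapper leaves the matching set untouched; anything that filters has to fall back to the "unknown" answer.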


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mayya-sharipova commented on a change in pull request #656: LUCENE-10382: Support filtering in KnnVectorQuery

2022-02-16 Thread GitBox


mayya-sharipova commented on a change in pull request #656:
URL: https://github.com/apache/lucene/pull/656#discussion_r807653039



##
File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java
##
@@ -24,19 +24,36 @@
 import java.util.Objects;
 import org.apache.lucene.codecs.KnnVectorsReader;
 import org.apache.lucene.document.KnnVectorField;
+import org.apache.lucene.index.FieldInfo;
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.util.BitSet;
+import org.apache.lucene.util.BitSetIterator;
 import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.FixedBitSet;
 
-/** Uses {@link KnnVectorsReader#search} to perform nearest neighbour search. */
+/**
+ * Uses {@link KnnVectorsReader#search} to perform nearest neighbour search.
+ *
+ * This query also allows for performing a kNN search subject to a filter. In this case, it first
+ * executes the filter for each leaf, then chooses a strategy dynamically:
+ *
+ * 
+ *   If the filter cost is less than k, just execute an exact search
+ *   Otherwise run a kNN search subject to the filter
+ *   the kNN search visits too many vectors without completing, stop and run an exact search

Review comment:
   Should this start with **If** ("If the kNN search visits too many vectors ...")?
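
The strategy described in the Javadoc under review can be sketched roughly as follows. This is an illustration only: `FilteredKnnSketch`, `ApproximateSearch`, the dot-product scoring, and the use of the filter cost as the visit limit are all assumptions made for the example, not the PR's implementation.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;

final class FilteredKnnSketch {
  /** Hypothetical hook: returns the top-k docs, or null if it gave up after visitedLimit vectors. */
  interface ApproximateSearch {
    List<Integer> search(float[] query, int k, BitSet accepted, int visitedLimit);
  }

  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Exact search: score every accepted doc and keep the k best by dot product. */
  static List<Integer> exactSearch(float[][] vectors, BitSet accepted, float[] query, int k) {
    List<Integer> docs = new ArrayList<>();
    for (int doc = accepted.nextSetBit(0); doc >= 0; doc = accepted.nextSetBit(doc + 1)) {
      docs.add(doc);
    }
    // Sort by descending score; the sort is stable, so ties keep doc ID order.
    docs.sort(Comparator.comparingDouble((Integer d) -> -dot(vectors[d], query)));
    return new ArrayList<>(docs.subList(0, Math.min(k, docs.size())));
  }

  static List<Integer> search(
      float[][] vectors, BitSet accepted, float[] query, int k, ApproximateSearch approx) {
    int filterCost = accepted.cardinality();
    if (filterCost < k) {
      // Restrictive filter: fewer candidates than k, so exact search is cheap.
      return exactSearch(vectors, accepted, query, k);
    }
    // Otherwise try the approximate search, bounded by an assumed visit limit.
    List<Integer> hits = approx.search(query, k, accepted, filterCost);
    // If it gave up before completing, fall back to exact search.
    return hits != null ? hits : exactSearch(vectors, accepted, query, k);
  }

  public static void main(String[] args) {
    float[][] vectors = {{1f}, {3f}, {2f}, {5f}};
    BitSet accepted = new BitSet();
    accepted.set(0, 3); // docs 0, 1, 2 pass the filter
    ApproximateSearch alwaysGivesUp = (q, k, acc, limit) -> null;
    System.out.println(search(vectors, accepted, new float[] {1f}, 2, alwaysGivesUp));
  }
}
```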

##
File path: lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java
##
@@ -455,6 +484,61 @@ public void testRandom() throws IOException {
 }
   }
 
+  /** Tests with random vectors and a random filter. Uses RandomIndexWriter. */
+  public void testRandomWithFilter() throws IOException {
+int numDocs = 200;
+int dimension = atLeast(5);
+int numIters = atLeast(10);
+try (Directory d = newDirectory()) {
+  RandomIndexWriter w = new RandomIndexWriter(random(), d);
+  for (int i = 0; i < numDocs; i++) {
+Document doc = new Document();
+doc.add(new KnnVectorField("field", randomVector(dimension)));
+doc.add(new NumericDocValuesField("tag", i));
+doc.add(new IntPoint("tag", i));
+w.addDocument(doc);
+  }
+  w.close();
+
+  try (IndexReader reader = DirectoryReader.open(d)) {
+IndexSearcher searcher = newSearcher(reader);
+for (int i = 0; i < numIters; i++) {
+  int lower = random().nextInt(50);
+
+  // Check that when filter is restrictive, we use exact search
+  Query filter = IntPoint.newRangeQuery("tag", lower, lower + 6);
+  KnnVectorQuery query = new KnnVectorQuery("field", randomVector(dimension), 5, filter);
+  TopDocs results = searcher.search(query, numDocs);
+  assertEquals(TotalHits.Relation.EQUAL_TO, results.totalHits.relation);
+  assertEquals(results.totalHits.value, 5);

Review comment:
   How do we know that the exact search was used?  Are we judging by 
`results.totalHits.value` being equal to `results.scoreDocs.length`?  I guess 
that holds in most cases.
   
   Another idea is to always use `TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO` 
for the approximate search results, as returned in `KnnVectorQuery.searchLeaf`:
   ```java
   TopDocs results = approximateSearch(ctx, acceptDocs, visitedLimit);
 if (results.totalHits.relation == TotalHits.Relation.EQUAL_TO) {
   return ;
 } else {
   ```







[GitHub] [lucene] rmuir opened a new pull request #686: LUCENE-10421: use Constant instead of relying upon timestamp

2022-02-16 Thread GitBox


rmuir opened a new pull request #686:
URL: https://github.com/apache/lucene/pull/686


   All the other uses of `System.currentTimeMillis` (in both Java and test code) 
are no good, but I'd rather tackle them in a follow-up issue (I will open a 
JIRA). Eventually, we can ban use of wall-clock time with forbidden-apis.
   
   But for now, I'd just like to have nightly benchmarks again :)
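
To illustrate why a constant seed helps here (a sketch, not the change in the PR): a generator seeded from a constant produces the same sequence on every run, while one seeded from the wall clock does not. `SeedSketch` and `FIXED_SEED` are hypothetical names, and 42 is only an illustrative value.

```java
import java.util.Arrays;
import java.util.Random;

final class SeedSketch {
  // Hypothetical constant; any fixed value gives reproducible sequences.
  static final long FIXED_SEED = 42L;

  /** Draws n pseudo-random ints from a generator seeded with the given value. */
  static int[] draw(long seed, int n) {
    Random random = new Random(seed);
    int[] out = new int[n];
    for (int i = 0; i < n; i++) {
      out[i] = random.nextInt(1000);
    }
    return out;
  }

  public static void main(String[] args) {
    // Same seed, same sequence: this comparison holds on every run.
    System.out.println(Arrays.equals(draw(FIXED_SEED, 5), draw(FIXED_SEED, 5)));
    // A wall-clock seed changes between runs, so results built on it are not reproducible.
    System.out.println(Arrays.toString(draw(System.currentTimeMillis(), 5)));
  }
}
```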





[jira] [Commented] (LUCENE-10391) Reuse data structures across HnswGraph invocations

2022-02-16 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493146#comment-17493146
 ] 

Robert Muir commented on LUCENE-10391:
--

Actually these nightly benchmarks have not even been running. See LUCENE-10421

> Reuse data structures across HnswGraph invocations
> --
>
> Key: LUCENE-10391
> URL: https://issues.apache.org/jira/browse/LUCENE-10391
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Julie Tibshirani
>Priority: Minor
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Creating HNSW graphs involves doing many repeated calls to HnswGraph#search. 
> Profiles from nightly benchmarks suggest that allocating data structures 
> incurs both lots of heap allocations 
> (http://people.apache.org/~mikemccand/lucenebench/2022.01.23.18.03.17.html#profiler_1kb_indexing_vectors_4_heap)
> and CPU usage 
> (http://people.apache.org/~mikemccand/lucenebench/2022.01.23.18.03.17.html#profiler_1kb_indexing_vectors_4_cpu).
> It looks like reusing data structures across invocations would be 
> low-hanging fruit that could help save significant CPU?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)




[jira] [Created] (LUCENE-10423) Remove uses of wall-clock time in codebase

2022-02-16 Thread Robert Muir (Jira)
Robert Muir created LUCENE-10423:


 Summary: Remove uses of wall-clock time in codebase
 Key: LUCENE-10423
 URL: https://issues.apache.org/jira/browse/LUCENE-10423
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Robert Muir


Followup to LUCENE-10421

Code in the library shouldn't rely on wall-clock time. If you look at all the 
places doing this, they are basically all bad news.

Most tests doing this are "iterating for some amount of wall-clock time" which 
causes them to instead just be non-reproducible. These should be changed to use 
a fixed number of loop iterations instead.

It would really be great to ban this stuff in forbidden apis. It is even in the 
configuration file, just currently commented out.
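
A small sketch of the difference between the two loop styles mentioned above. `LoopBoundSketch` and its methods are hypothetical; the point is only that a time-bounded loop runs a machine-dependent number of iterations, while a count-bounded loop with a fixed seed repeats exactly.

```java
import java.util.Random;

final class LoopBoundSketch {
  /** Non-reproducible: the number of iterations depends on how fast the machine is. */
  static int iterateForMillis(long millis) {
    long deadline = System.nanoTime() + millis * 1_000_000L;
    int iterations = 0;
    while (System.nanoTime() < deadline) {
      iterations++;
    }
    return iterations;
  }

  /** Reproducible: a fixed iteration count and a fixed seed give the same work every run. */
  static long iterateFixed(long seed, int iterations) {
    Random random = new Random(seed);
    long checksum = 0;
    for (int i = 0; i < iterations; i++) {
      checksum += random.nextInt(100);
    }
    return checksum;
  }

  public static void main(String[] args) {
    System.out.println(iterateForMillis(5));
    System.out.println(iterateFixed(7L, 1000) == iterateFixed(7L, 1000));
  }
}
```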






[GitHub] [lucene] rmuir commented on pull request #681: LUCENE-10322: Enable -Xlint:path and -Xlint:-exports

2022-02-16 Thread GitBox


rmuir commented on pull request #681:
URL: https://github.com/apache/lucene/pull/681#issuecomment-1041345085


   Yeah, those are actually API bugs? We have public methods that have 
non-public classes in their signature. Looks like this will be more complex to 
fix up. 
   
   In this example of `ByteBufferIndexInput.newInstance` and its 
`ByteBufferGuard` parameter, I think a better solution is to make 
`ByteBufferIndexInput.newInstance` package-private. The only callers are 
`ByteBuffersDirectory` and `MMapDirectory` which are in the same package. Then 
we don't need to make `ByteBufferGuard` public.
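
   A sketch of the shape of that fix, with hypothetical stand-in names (`Guard`, `Input`) rather than the real Lucene classes: the factory that takes the non-public helper type is itself package-private, so the helper never appears in a public signature.

```java
final class ExportsSketch {
  /** Helper type that is not meant to be part of the public API. */
  static final class Guard {
    private final boolean open = true;

    boolean isOpen() {
      return open;
    }
  }

  /** Stand-in for the public API type. */
  public static final class Input {
    private final Guard guard;

    private Input(Guard guard) {
      this.guard = guard;
    }

    // Package-private factory: callers in the same package can pass a Guard,
    // and Guard does not leak into any public method signature.
    static Input newInstance(Guard guard) {
      return new Input(guard);
    }

    public boolean isUsable() {
      return guard.isOpen();
    }
  }

  public static void main(String[] args) {
    System.out.println(Input.newInstance(new Guard()).isUsable());
  }
}
```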





[GitHub] [lucene] mogui commented on pull request #679: Monitor Improvements LUCENE-10422

2022-02-16 Thread GitBox


mogui commented on pull request #679:
URL: https://github.com/apache/lucene/pull/679#issuecomment-1041408992


   @romseygeek I've updated with the requested changes





[GitHub] [lucene] dweiss commented on pull request #681: LUCENE-10322: Enable -Xlint:path and -Xlint:-exports

2022-02-16 Thread GitBox


dweiss commented on pull request #681:
URL: https://github.com/apache/lucene/pull/681#issuecomment-1041410379


   > Yeah, those are actually API bugs?
   
   They do look like API issues to me. Useful warning, by the way.





[GitHub] [lucene] mayya-sharipova commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

2022-02-16 Thread GitBox


mayya-sharipova commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r807992105



##
File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java
##
@@ -206,14 +203,22 @@ private void writeMeta(
 meta.writeVLong(vectorIndexOffset);
 meta.writeVLong(vectorIndexLength);
 meta.writeInt(field.getVectorDimension());
-meta.writeInt(docIds.length);
-for (int docId : docIds) {
-  // TODO: delta-encode, or write as bitset
-  meta.writeVInt(docId);
+
+// write docIDs
+int count = docsWithField.cardinality();
+meta.writeInt(count);
+if (count == maxDoc) {
+  meta.writeByte((byte) -1);
+  ; // dense marker, each document has a vector value

Review comment:
   Addressed in 47042f2
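
   For readers following the diff, here is a self-contained sketch of the dense-marker idea: write the count, then either a one-byte marker when every document has a vector or an explicit doc ID list. Only the count and the `-1` dense marker come from the quoted hunk; the `0` sparse marker, the plain `int` doc IDs, and the `ByteBuffer` layout are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.util.BitSet;

final class DocIdEncodingSketch {
  static byte[] write(BitSet docsWithField, int maxDoc) {
    int count = docsWithField.cardinality();
    boolean dense = count == maxDoc;
    int size = Integer.BYTES + 1 + (dense ? 0 : count * Integer.BYTES);
    ByteBuffer buf = ByteBuffer.allocate(size);
    buf.putInt(count);
    if (dense) {
      buf.put((byte) -1); // dense marker: each document has a vector value
    } else {
      buf.put((byte) 0); // assumed sparse marker: explicit doc IDs follow
      for (int doc = docsWithField.nextSetBit(0); doc >= 0; doc = docsWithField.nextSetBit(doc + 1)) {
        buf.putInt(doc);
      }
    }
    return buf.array();
  }

  /** Returns the ord-to-doc mapping, or null in the dense case where ord == doc. */
  static int[] read(byte[] data) {
    ByteBuffer buf = ByteBuffer.wrap(data);
    int count = buf.getInt();
    if (buf.get() == -1) {
      return null;
    }
    int[] ordToDoc = new int[count];
    for (int i = 0; i < count; i++) {
      ordToDoc[i] = buf.getInt();
    }
    return ordToDoc;
  }

  public static void main(String[] args) {
    BitSet dense = new BitSet();
    dense.set(0, 4);
    System.out.println(write(dense, 4).length);
    BitSet sparse = new BitSet();
    sparse.set(1);
    sparse.set(3);
    System.out.println(java.util.Arrays.toString(read(write(sparse, 4))));
  }
}
```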







[GitHub] [lucene] mayya-sharipova commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

2022-02-16 Thread GitBox


mayya-sharipova commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r807992728



##
File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java
##
@@ -424,38 +448,45 @@ public int docID() {
 
 @Override
 public int nextDoc() {
-  if (++ord >= size()) {
+  if (++ord >= size) {
 doc = NO_MORE_DOCS;
   } else {
-doc = ordToDoc[ord];
+doc = ordToDocOperator.applyAsInt(ord);
   }
   return doc;
 }
 
 @Override
 public int advance(int target) {
   assert docID() < target;
-  ord = Arrays.binarySearch(ordToDoc, ord + 1, ordToDoc.length, target);
-  if (ord < 0) {
-ord = -(ord + 1);
+
+  if (ordToDoc == null) {
+ord = target;
+  } else {
+ord = Arrays.binarySearch(ordToDoc, ord + 1, ordToDoc.length, target);
+if (ord < 0) {
+  ord = -(ord + 1);
+}
   }
-  assert ord <= ordToDoc.length;
-  if (ord == ordToDoc.length) {
+
+  assert ord <= size;
+  if (ord == size) {
 doc = NO_MORE_DOCS;
   } else {
-doc = ordToDoc[ord];
+doc = ordToDocOperator.applyAsInt(ord);
+;

Review comment:
   Addressed in 47042f2
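
   The `advance` logic in the hunk above can be exercised in isolation with a sketch like the following. `OrdToDocSketch` is a hypothetical class; a `null` mapping stands for the dense case where ordinal and doc ID coincide. Note one deliberate difference: the sketch clamps the dense case to `size`, which the quoted diff does not show.

```java
import java.util.Arrays;

final class OrdToDocSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final int[] ordToDoc; // null means dense: ord == doc
  private final int size;
  private int ord = -1;

  OrdToDocSketch(int[] ordToDoc, int size) {
    this.ordToDoc = ordToDoc;
    this.size = size;
  }

  /** Moves to the first doc at or after target; must not be called once exhausted. */
  int advance(int target) {
    if (ordToDoc == null) {
      ord = Math.min(target, size); // clamp added in this sketch
    } else {
      ord = Arrays.binarySearch(ordToDoc, ord + 1, ordToDoc.length, target);
      if (ord < 0) {
        ord = -(ord + 1); // not found: use the insertion point, the first doc > target
      }
    }
    if (ord == size) {
      return NO_MORE_DOCS;
    }
    return ordToDoc == null ? ord : ordToDoc[ord];
  }

  public static void main(String[] args) {
    OrdToDocSketch sparse = new OrdToDocSketch(new int[] {2, 5, 9}, 3);
    System.out.println(sparse.advance(3));
    System.out.println(sparse.advance(9));
    System.out.println(sparse.advance(10) == NO_MORE_DOCS);
  }
}
```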







[GitHub] [lucene] mayya-sharipova commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

2022-02-16 Thread GitBox


mayya-sharipova commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r807993536



##
File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java
##
@@ -266,12 +268,12 @@ private Bits getAcceptOrds(Bits acceptDocs, FieldEntry fieldEntry) {
 return new Bits() {
   @Override
   public boolean get(int index) {
-return acceptDocs.get(fieldEntry.ordToDoc[index]);
+return acceptDocs.get(fieldEntry.ordToDoc(index));

Review comment:
   Great comment, addressed in 47042f2







[GitHub] [lucene] mayya-sharipova commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

2022-02-16 Thread GitBox


mayya-sharipova commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r807998586



##
File path: 
lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java
##
@@ -1018,4 +1020,57 @@ public void testAdvance() throws Exception {
   }
 }
   }
+
+  public void testVectorValuesReportCorrectDocs() throws Exception {
+final int numDocs = atLeast(1000);
+final int dim = random().nextInt(20) + 1;
+final VectorSimilarityFunction similarityFunction =
+VectorSimilarityFunction.values()[
+random().nextInt(VectorSimilarityFunction.values().length)];
+
+float fieldValuesCheckSum = 0f;
+int fieldDocCount = 0;
+long fieldSumDocIDs = 0;
+
+try (Directory dir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir, newIndexWriterConfig())) {
+  for (int i = 0; i < numDocs; i++) {
+Document doc = new Document();
+int docID = random().nextInt(numDocs);
+doc.add(new StoredField("id", docID));
+if (random().nextInt(4) == 3) {
+  float[] vector = randomVector(dim);
+  doc.add(new KnnVectorField("knn_vector", vector, similarityFunction));
+  fieldValuesCheckSum += vector[0];
+  fieldDocCount++;
+  fieldSumDocIDs += docID;
+}
+w.addDocument(doc);
+  }
+
+  if (random().nextBoolean()) {
+w.forceMerge(1);
+  }
+
+  try (IndexReader r = w.getReader()) {
+float checksum = 0;

Review comment:
   @jtibshirani Thanks for your feedback and comment.  What did you mean by 
"vectors were out of order"?  `VectorValues` extends `DocIdSetIterator`, so its 
values are expected to be accessed in increasing doc ID order.
   
   Or did you mean `RandomAccessVectorValues`?  I think this class doesn't 
concern itself with doc IDs.







[GitHub] [lucene-solr] janhoy commented on pull request #2641: SOLR-15965 Use proper signatures for SolrAuth

2022-02-16 Thread GitBox


janhoy commented on pull request #2641:
URL: https://github.com/apache/lucene-solr/pull/2641#issuecomment-1041761443


   So the benefit of backporting to 8.x is that we get a more secure PKI for the 
lifetime of 8.x (12+ months), and that you get an upgrade path 8.x -> 8.11.2 -> 
9.x where rolling upgrades will work out of the box without any param settings. 
Fair enough.
   
   Perhaps add to the 8.11.2 release notes (wiki) that this release makes 
rolling upgrades to 9.x easier.





[jira] [Commented] (LUCENE-10391) Reuse data structures across HnswGraph invocations

2022-02-16 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493305#comment-17493305
 ] 

Michael McCandless commented on LUCENE-10391:
-

Sorry for the nightly benchmarks downtime!  I think I pushed a fix just now 
(https://github.com/mikemccand/luceneutil/commit/36eec79e5ea3cb336c38d53bd4ea35bd6847b4c5)
that should get them running again ... fingers crossed!







[jira] [Commented] (LUCENE-10421) Non-deterministic results from KnnVectorQuery?

2022-02-16 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493306#comment-17493306
 ] 

Michael McCandless commented on LUCENE-10421:
-

+1 for a constant.  42 seems good?

> Non-deterministic results from KnnVectorQuery?
> --
>
> Key: LUCENE-10421
> URL: https://issues.apache.org/jira/browse/LUCENE-10421
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [Nightly benchmarks|https://home.apache.org/~mikemccand/lucenebench/] have 
> been upset for the past ~1.5 weeks because it looks like {{KnnVectorQuery}} 
> is giving slightly different results on every run, even on an identical 
> (deterministically constructed – single thread indexing, flush by doc count, 
> {{SerialMergeSchedule}}, {{LogDocCountMergePolicy}}, etc.) index each 
> night.  It produces failures like this, which then abort the benchmark to 
> help us catch any recent accidental bug that alters our precise top N search 
> hits and scores:
> {noformat}
> Traceback (most recent call last):
>   File "/l/util.nightly/src/python/nightlyBench.py", line 2177, in <module>
>     run()
>   File "/l/util.nightly/src/python/nightlyBench.py", line 1225, in run
>     raise RuntimeError('search result differences: %s' % str(errors))
> RuntimeError: search result differences: 
> ["query=KnnVectorQuery:vector[-0.07267512,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 4 has wrong field/score value ([20844660], 
> '0.92060816') vs ([25443806], '0.920046')", 
> "query=KnnVectorQuery:vector[-0.12073054,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 7 has wrong field/score value ([25501982], 
> '0.99630797') vs ([13688085], '0.9961489')", 
> "query=KnnVectorQuery:vector[0.02227773,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 0 has wrong field/score value ([4741915], 
> '0.9481132') vs ([14220828], '0.9579846')", 
> "query=KnnVectorQuery:vector[0.024077624,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 0 has wrong field/score value ([7472373], 
> '0.8460249') vs ([12577825], '0.8378446')"]
> {noformat}
> At first I thought this might be expected because of the recent (awesome!!) 
> improvements to HNSW, so I tried to simply "regold".  But the regold did not 
> "take", so it indeed looks like there is some non-determinism here.
> I pinged [~msoko...@gmail.com] and he found this random seeding that is most 
> likely the cause?
> {noformat}
> public final class HnswGraphBuilder {
>   /** Default random seed for level generation * */
>   private static final long DEFAULT_RAND_SEED = System.currentTimeMillis(); 
> {noformat}
> Can we somehow make this deterministic instead?  Or maybe the nightly 
> benchmarks could somehow pass something in to make results deterministic for 
> benchmarking?  Or ... we could also relax the benchmarks to accept 
> non-determinism for {{KnnVectorQuery}} task?






[jira] [Commented] (LUCENE-10421) Non-deterministic results from KnnVectorQuery?

2022-02-16 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493320#comment-17493320
 ] 

Robert Muir commented on LUCENE-10421:
--

42 patch is here: https://github.com/apache/lucene/pull/686

> Non-deterministic results from KnnVectorQuery?
> --
>
> Key: LUCENE-10421
> URL: https://issues.apache.org/jira/browse/LUCENE-10421
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [Nightly benchmarks|https://home.apache.org/~mikemccand/lucenebench/] have 
> been upset for the past ~1.5 weeks because it looks like {{KnnVectorQuery}} 
> is giving slightly different results on every run, even on an identical 
> (deterministically constructed – single thread indexing, flush by doc count, 
> {{{}SerialMergeSchedule{}}}, {{{}LogDocCountMergePolicy{}}}, etc.) index each 
> night.  It produces failures like this, which then abort the benchmark to 
> help us catch any recent accidental bug that alters our precise top N search 
> hits and scores:
> {noformat}
>  Traceback (most recent call last):
>  File "/l/util.nightly/src/python/nightlyBench.py", line 2177, in 
>   run()
>  File "/l/util.nightly/src/python/nightlyBench.py", line 1225, in run
>   raise RuntimeError('search result differences: %s' % str(errors))
> RuntimeError: search result differences: 
> ["query=KnnVectorQuery:vector[-0.07267512,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 4 has wrong field/score value ([20844660], 
> '0.92060816') vs ([25443806], '0.920046')", 
> "query=KnnVectorQuery:vector[-0.12073054,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 7 has wrong field/score value ([25501982], 
> '0.99630797') vs ([13688085], '0.9961489')", 
> "query=KnnVectorQuery:vector[0.02227773,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 0 has wrong field/score value ([4741915], 
> '0.9481132') vs ([14220828], '0.9579846')", 
> "query=KnnVectorQuery:vector[0.024077624,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 0 has wrong field/score value ([7472373], 
> '0.8460249') vs ([12577825], '0.8378446')"]{noformat}
> At first I thought this might be expected because of the recent (awesome!!) 
> improvements to HNSW, so I tried to simply "regold".  But the regold did not 
> "take", so it indeed looks like there is some non-determinism here.
> I pinged [~msoko...@gmail.com] and he found this random seeding that is most 
> likely the cause?
> {noformat}
> public final class HnswGraphBuilder {
>   /** Default random seed for level generation * */
>   private static final long DEFAULT_RAND_SEED = System.currentTimeMillis(); 
> {noformat}
> Can we somehow make this deterministic instead?  Or maybe the nightly 
> benchmarks could somehow pass something in to make results deterministic for 
> benchmarking?  Or ... we could also relax the benchmarks to accept 
> non-determinism for {{KnnVectorQuery}} task?
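
To illustrate the seeding concern in the description above, here is a minimal sketch. It is not the actual HnswGraphBuilder code: the class `SeededLevelSketch`, the method `drawLevels`, and the `ml` scaling parameter are hypothetical names chosen for illustration. It only shows that a `java.util.Random` built from a fixed seed yields the same level sequence on every run, whereas a seed taken from `System.currentTimeMillis()` would not.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch; not the actual HnswGraphBuilder implementation. */
public class SeededLevelSketch {

  /** Draws n pseudo-random levels from a generator seeded with the given value. */
  public static int[] drawLevels(long seed, int n, double ml) {
    Random random = new Random(seed);
    int[] levels = new int[n];
    for (int i = 0; i < n; i++) {
      // 1 - nextDouble() lies in (0, 1], so the logarithm is always finite
      double r = 1.0 - random.nextDouble();
      levels[i] = (int) (-Math.log(r) * ml);
    }
    return levels;
  }

  public static void main(String[] args) {
    int[] first = drawLevels(42L, 1000, 0.5);
    int[] second = drawLevels(42L, 1000, 0.5);
    // Same seed, same sequence: two "builds" agree on every level
    System.out.println(Arrays.equals(first, second)); // prints true
  }
}
```

With a fixed seed the benchmark could compare hits and scores exactly between nightly runs; whether the seed should be a constant or passed in by the caller is the open question in this issue.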



[jira] [Commented] (LUCENE-10391) Reuse data structures across HnswGraph invocations

2022-02-16 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493345#comment-17493345
 ] 

Julie Tibshirani commented on LUCENE-10391:
---

Oh okay thanks, ignore my analysis above then. Funny how I managed to see 
improvement even when they weren't running!

> Reuse data structures across HnswGraph invocations
> --
>
> Key: LUCENE-10391
> URL: https://issues.apache.org/jira/browse/LUCENE-10391
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Julie Tibshirani
>Priority: Minor
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Creating HNSW graphs involves doing many repeated calls to HnswGraph#search. 
> Profiles from nightly benchmarks suggest that allocating data-structures 
> incurs both lots of heap allocations 
> ([http://people.apache.org/~mikemccand/lucenebench/2022.01.23.18.03.17.html#profiler_1kb_indexing_vectors_4_heap])
>  and CPU usage 
> ([http://people.apache.org/~mikemccand/lucenebench/2022.01.23.18.03.17.html#profiler_1kb_indexing_vectors_4_cpu]).
>  It looks like reusing data structures across invocations would be a 
> low-hanging fruit that could help save significant CPU?
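
The reuse idea described above could look roughly like the following sketch. It does not reflect how HnswGraph is actually written: `ReusableVisitedSet` and its methods are hypothetical, and a plain `boolean[]` stands in for whatever structures the real search allocates. The point is only that a scratch structure owned by the caller is cleared between searches instead of being reallocated for each one.

```java
import java.util.Arrays;

/** Hypothetical sketch of scratch-structure reuse; not Lucene's HnswGraph code. */
public class ReusableVisitedSet {
  private boolean[] visited = new boolean[0];
  private int allocations = 0;

  /** Prepares the set for a search over numNodes nodes, reallocating only when it must grow. */
  public void reset(int numNodes) {
    if (visited.length < numNodes) {
      visited = new boolean[numNodes];
      allocations++;
    } else {
      // Clearing in place avoids a fresh heap allocation per search
      Arrays.fill(visited, 0, numNodes, false);
    }
  }

  /** Marks a node as visited; returns true if it had not been visited before. */
  public boolean visit(int node) {
    if (visited[node]) {
      return false;
    }
    visited[node] = true;
    return true;
  }

  /** Number of times the backing array had to be (re)allocated. */
  public int allocations() {
    return allocations;
  }

  public static void main(String[] args) {
    ReusableVisitedSet set = new ReusableVisitedSet();
    for (int search = 0; search < 1000; search++) {
      set.reset(10_000);
      set.visit(search);
    }
    System.out.println(set.allocations()); // prints 1
  }
}
```

The trade-off is that clearing costs time proportional to the graph size, so this only pays off when allocation and garbage collection dominate, which is what the profiles linked in the description suggest.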



[GitHub] [lucene] gsmiller commented on a change in pull request #678: LUCENE-10398: Add static method for getting Terms from LeafReader

2022-02-16 Thread GitBox


gsmiller commented on a change in pull request #678:
URL: https://github.com/apache/lucene/pull/678#discussion_r808256634



##
File path: lucene/core/src/java/org/apache/lucene/document/FeatureQuery.java
##
@@ -111,12 +111,9 @@ public Explanation explain(LeafReaderContext context, int 
doc) throws IOExceptio
 
   @Override
   public Scorer scorer(LeafReaderContext context) throws IOException {
-Terms terms = context.reader().terms(fieldName);
-if (terms == null) {
-  return null;
-}
+Terms terms = Terms.terms(context.reader(), fieldName);
 TermsEnum termsEnum = terms.iterator();
-if (termsEnum.seekExact(new BytesRef(featureName)) == false) {
+if (!termsEnum.seekExact(new BytesRef(featureName))) {

Review comment:
   As a side note, in case it's helpful, I know with IntelliJ at least you 
can disable the suggestion it likes to give to convert all these `== false` 
occurrences to `!` if that irritates you :)
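
For readers unfamiliar with the change in the diff above, which drops the `terms == null` check in favour of a static accessor, here is a minimal sketch of that null-object pattern. It uses hypothetical stand-in types (`FieldTerms`, `Reader`, `termsOrEmpty`) rather than Lucene's classes, and it assumes the accessor returns an empty instance instead of null when the field is missing; the diff itself does not show that behaviour.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a null-safe static accessor; not Lucene's Terms API. */
public class NullSafeTermsSketch {

  /** Stand-in for a per-field term dictionary. */
  public interface FieldTerms {
    Iterator<String> iterator();

    /** Shared empty instance returned instead of null. */
    FieldTerms EMPTY = () -> Collections.emptyIterator();
  }

  /** Stand-in for a reader that returns null for fields it does not know. */
  public interface Reader {
    FieldTerms terms(String field);
  }

  /** Returns the field's terms, or the empty instance when the field is absent. */
  public static FieldTerms termsOrEmpty(Reader reader, String field) {
    FieldTerms terms = reader.terms(field);
    return terms == null ? FieldTerms.EMPTY : terms;
  }

  public static void main(String[] args) {
    FieldTerms features = () -> List.of("pagerank").iterator();
    Map<String, FieldTerms> fields = Map.of("features", features);
    Reader reader = fields::get;

    // Callers iterate directly, with no null check at the call site
    System.out.println(termsOrEmpty(reader, "features").iterator().hasNext()); // prints true
    System.out.println(termsOrEmpty(reader, "missing").iterator().hasNext()); // prints false
  }
}
```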




[GitHub] [lucene] gsmiller commented on pull request #678: LUCENE-10398: Add static method for getting Terms from LeafReader

2022-02-16 Thread GitBox


gsmiller commented on pull request #678:
URL: https://github.com/apache/lucene/pull/678#issuecomment-1041899789


   Thanks for the quick iteration! This looks good to me now. As I mentioned 
before, I'm going to wait a couple days before merging in case anyone else 
wants to chime in with feedback or opposition to adding this functionality, but 
I'd consider this ready to go from my perspective.
   
   As a side note, in the future, it makes it a little easier to review if you 
avoid force pushing changes and leave the git commit history in place. That way 
I can easily look at what's changed since I last reviewed. I know a lot of 
people are in the habit of squashing commit history to keep it clean, but 
GitHub makes that super easy to do when actually merging your pull request, so 
no need to do that on your side. Just a future note.
   
   Thanks again!


[GitHub] [lucene] gsmiller commented on a change in pull request #685: LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate Weight#count.

2022-02-16 Thread GitBox


gsmiller commented on a change in pull request #685:
URL: https://github.com/apache/lucene/pull/685#discussion_r808281208



##
File path: lucene/CHANGES.txt
##
@@ -615,6 +615,8 @@ Improvements
 
 * LUCENE-10062: Switch taxonomy faceting to use numeric doc values for storing 
ordinals instead of binary doc values
   with its own custom encoding. (Greg Miller)
+ 
+* LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate 
Weight#count. (Ignacio Vera)  

Review comment:
   This should go under 9.1, right?




[GitHub] [lucene] jtibshirani commented on a change in pull request #656: LUCENE-10382: Support filtering in KnnVectorQuery

2022-02-16 Thread GitBox


jtibshirani commented on a change in pull request #656:
URL: https://github.com/apache/lucene/pull/656#discussion_r808308244



##
File path: build.gradle
##
@@ -183,3 +183,5 @@ apply from: file('gradle/hacks/turbocharge-jvm-opts.gradle')
 apply from: file('gradle/hacks/dummy-outputs.gradle')
 
 apply from: file('gradle/pylucene/pylucene.gradle')
+sourceCompatibility = JavaVersion.VERSION_16

Review comment:
   Definitely not! Somehow this file gets automatically changed, and I 
accidentally included it with `git add -u`.




[GitHub] [lucene] jtibshirani commented on a change in pull request #656: LUCENE-10382: Support filtering in KnnVectorQuery

2022-02-16 Thread GitBox


jtibshirani commented on a change in pull request #656:
URL: https://github.com/apache/lucene/pull/656#discussion_r808356718



##
File path: lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java
##
@@ -455,6 +484,61 @@ public void testRandom() throws IOException {
 }
   }
 
+  /** Tests with random vectors and a random filter. Uses RandomIndexWriter. */
+  public void testRandomWithFilter() throws IOException {
+int numDocs = 200;
+int dimension = atLeast(5);
+int numIters = atLeast(10);
+try (Directory d = newDirectory()) {
+  RandomIndexWriter w = new RandomIndexWriter(random(), d);
+  for (int i = 0; i < numDocs; i++) {
+Document doc = new Document();
+doc.add(new KnnVectorField("field", randomVector(dimension)));
+doc.add(new NumericDocValuesField("tag", i));
+doc.add(new IntPoint("tag", i));
+w.addDocument(doc);
+  }
+  w.close();
+
+  try (IndexReader reader = DirectoryReader.open(d)) {
+IndexSearcher searcher = newSearcher(reader);
+for (int i = 0; i < numIters; i++) {
+  int lower = random().nextInt(50);
+
+  // Check that when filter is restrictive, we use exact search
+  Query filter = IntPoint.newRangeQuery("tag", lower, lower + 6);
+  KnnVectorQuery query = new KnnVectorQuery("field", 
randomVector(dimension), 5, filter);
+  TopDocs results = searcher.search(query, numDocs);
+  assertEquals(TotalHits.Relation.EQUAL_TO, 
results.totalHits.relation);
+  assertEquals(results.totalHits.value, 5);

Review comment:
   Thanks for catching this. I actually got confused here and wrote test 
assertions that are misleading. Since `KnnVectorQuery` is rewritten to 
`DocAndScoreQuery`, none of the information about visited nodes is preserved. 
Therefore we can't tell if exact or approximate search was used. I will rework 
this test.
   
   I will open a follow-up issue to discuss this. I don't feel like we have a 
perfect grasp on what total hits should mean in the context of kNN search, 
especially since it differs between `LeafReader#searchNearestVectors` and the 
output of `KnnVectorQuery`.




[GitHub] [lucene] vigyasharma commented on pull request #677: LUCENE-10084: Rewrite DocValuesFieldExistsQuery to MatchAllDocsQuery when all docs have the field

2022-02-16 Thread GitBox


vigyasharma commented on pull request #677:
URL: https://github.com/apache/lucene/pull/677#issuecomment-1042264970


   > This looks great. I left a tiny comment related to tests. Could you also 
add an entry to `CHANGES.txt` under "Lucene 9.1.0"?
   
   Thank you for reviewing this PR, @jtibshirani. I've added the CHANGES entry 
and updated the unit tests to not assert on the search result.


[jira] [Commented] (LUCENE-10176) Remove VectorValues#size()

2022-02-16 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493518#comment-17493518
 ] 

Julie Tibshirani commented on LUCENE-10176:
---

Sorry to be jumping in late with a question. What is the motivation for 
removing VectorValues#size()? We have the information available and it could be 
helpful in some contexts. For example, https://github.com/apache/lucene/pull/656 
proposes to add a query KnnVectorFieldExistsQuery. This query could benefit 
from VectorValues#size() to try to rewrite to MatchAllDocsQuery when all docs 
have a vector.
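
As a rough illustration of the rewrite mentioned above, the check could be as simple as comparing the number of vectors with the number of documents in each segment. This is a sketch under stated assumptions, not code from any pull request: `ExistsRewriteSketch` and `canRewriteToMatchAll` are hypothetical names, each row of the input is a made-up `{valueCount, maxDoc}` pair for one segment, and it assumes a document holds at most one vector for the field.

```java
/** Hypothetical sketch of an "all docs have a value" check; not Lucene code. */
public class ExistsRewriteSketch {

  /**
   * Returns true when every segment reports as many values as it has documents.
   * Each row is {valueCount, maxDoc} for one segment (hypothetical layout).
   */
  public static boolean canRewriteToMatchAll(int[][] segments) {
    for (int[] segment : segments) {
      int valueCount = segment[0];
      int maxDoc = segment[1];
      if (valueCount != maxDoc) {
        return false; // at least one document in this segment lacks a value
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(canRewriteToMatchAll(new int[][] {{100, 100}, {40, 40}})); // prints true
    System.out.println(canRewriteToMatchAll(new int[][] {{100, 100}, {39, 40}})); // prints false
  }
}
```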

> Remove VectorValues#size()
> --
>
> Key: LUCENE-10176
> URL: https://issues.apache.org/jira/browse/LUCENE-10176
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Major
>
> This method doesn't seem to be used anywhere except by 
> SimpleTextKnnVectorsReader#search, which uses it in an incorrect way by using 
> it as the total number of hits matching a nearest-neighbor search (it is 
> incorrect because this number might be higher than the number of vectors 
> having a value because of deletes).



[GitHub] [lucene] jtibshirani commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

2022-02-16 Thread GitBox


jtibshirani commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r808504111



##
File path: 
lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java
##
@@ -1018,4 +1020,57 @@ public void testAdvance() throws Exception {
   }
 }
   }
+
+  public void testVectorValuesReportCorrectDocs() throws Exception {
+final int numDocs = atLeast(1000);
+final int dim = random().nextInt(20) + 1;
+final VectorSimilarityFunction similarityFunction =
+VectorSimilarityFunction.values()[
+random().nextInt(VectorSimilarityFunction.values().length)];
+
+float fieldValuesCheckSum = 0f;
+int fieldDocCount = 0;
+long fieldSumDocIDs = 0;
+
+try (Directory dir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir, 
newIndexWriterConfig())) {
+  for (int i = 0; i < numDocs; i++) {
+Document doc = new Document();
+int docID = random().nextInt(numDocs);
+doc.add(new StoredField("id", docID));
+if (random().nextInt(4) == 3) {
+  float[] vector = randomVector(dim);
+  doc.add(new KnnVectorField("knn_vector", vector, 
similarityFunction));
+  fieldValuesCheckSum += vector[0];
+  fieldDocCount++;
+  fieldSumDocIDs += docID;
+}
+w.addDocument(doc);
+  }
+
+  if (random().nextBoolean()) {
+w.forceMerge(1);
+  }
+
+  try (IndexReader r = w.getReader()) {
+float checksum = 0;

Review comment:
   Sorry I read this too fast and wrote a confusing comment :) This check 
looks good to me.

##
File path: lucene/CHANGES.txt
##
@@ -204,6 +204,8 @@ Optimizations
 * LUCENE-10367: Optimize CoveringQuery for the case when the minimum number of
   matching clauses is a constant. (LuYunCheng via Adrien Grand)
 
+* LUCENE-10408 Better encoding of doc Ids in vectors (Mayya Sharipova, Julie 
Tibshirani, Adrien Grand)

Review comment:
   Thanks for including me! I'm also fine if you omit me when I'm a 
reviewer.




[GitHub] [lucene] iverase commented on a change in pull request #685: LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate Weight#count.

2022-02-16 Thread GitBox


iverase commented on a change in pull request #685:
URL: https://github.com/apache/lucene/pull/685#discussion_r808708059



##
File path: lucene/CHANGES.txt
##
@@ -615,6 +615,8 @@ Improvements
 
 * LUCENE-10062: Switch taxonomy faceting to use numeric doc values for storing 
ordinals instead of binary doc values
   with its own custom encoding. (Greg Miller)
+ 
+* LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate 
Weight#count. (Ignacio Vera)  

Review comment:
   of course, what an oversight :)
   
   thanks @gsmiller!




[GitHub] [lucene] iverase merged pull request #685: LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate Weight#count.

2022-02-16 Thread GitBox


iverase merged pull request #685:
URL: https://github.com/apache/lucene/pull/685


   


[jira] [Commented] (LUCENE-10415) FunctionScoreQuery and IndexOrDocValuesQuery should delegate Weight#count

2022-02-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493705#comment-17493705
 ] 

ASF subversion and git services commented on LUCENE-10415:
--

Commit 84e34dc4683ba43a0ebe5e942ee117b64b29cdec in lucene's branch 
refs/heads/main from Ignacio Vera
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=84e34dc ]

LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate 
Weight#count. (#685)

These query wrappers do not modify the set of matching documents so they can 
delegate Weight#count.
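
The delegation described in this commit message can be sketched as follows. This is not the code from the commit: `SimpleWeight` and `ScoreOnlyWrapper` are hypothetical stand-ins, and the use of -1 to mean "count not cheaply known" reflects my understanding of the Weight#count contract rather than anything stated in this thread. The sketch shows why a wrapper that only changes scores can forward the count unchanged: its set of matching documents is the same as the inner weight's.

```java
/** Hypothetical sketch of count delegation; not the Lucene Weight API. */
public class DelegatingCountSketch {

  /** Stand-in for a per-segment weight; -1 means the count is not cheaply known. */
  public interface SimpleWeight {
    default int count() {
      return -1;
    }
  }

  /** Wrapper that changes scoring only, so it matches exactly what the inner weight matches. */
  public static final class ScoreOnlyWrapper implements SimpleWeight {
    private final SimpleWeight inner;

    public ScoreOnlyWrapper(SimpleWeight inner) {
      this.inner = inner;
    }

    @Override
    public int count() {
      // Same matching documents as the inner weight, so its count is ours too
      return inner.count();
    }
  }

  public static void main(String[] args) {
    SimpleWeight knownCount = new SimpleWeight() {
      @Override
      public int count() {
        return 1000;
      }
    };
    SimpleWeight unknownCount = new SimpleWeight() {};

    System.out.println(new ScoreOnlyWrapper(knownCount).count()); // prints 1000
    System.out.println(new ScoreOnlyWrapper(unknownCount).count()); // prints -1
  }
}
```

Without the override, the wrapper would fall back to the default and report -1 even when the inner weight knows its count, forcing callers to iterate over matches to count them.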

> FunctionScoreQuery and IndexOrDocValuesQuery should delegate Weight#count
> -
>
> Key: LUCENE-10415
> URL: https://issues.apache.org/jira/browse/LUCENE-10415
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We have a number of query wrappers that do not modify the set of matching 
> documents like FunctionScoreQuery and IndexOrDocValuesQuery. These queries 
> should delegate Weight#count.



[jira] [Commented] (LUCENE-10415) FunctionScoreQuery and IndexOrDocValuesQuery should delegate Weight#count

2022-02-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493709#comment-17493709
 ] 

ASF subversion and git services commented on LUCENE-10415:
--

Commit 423573759f74645e0f2cf4a092d8e2d51b75b559 in lucene's branch 
refs/heads/branch_9x from Ignacio Vera
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=4235737 ]

LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate 
Weight#count. (#685)

These query wrappers do not modify the set of matching documents so they can 
delegate Weight#count.

> FunctionScoreQuery and IndexOrDocValuesQuery should delegate Weight#count
> -
>
> Key: LUCENE-10415
> URL: https://issues.apache.org/jira/browse/LUCENE-10415
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We have a number of query wrappers that do not modify the set of matching 
> documents like FunctionScoreQuery and IndexOrDocValuesQuery. These queries 
> should delegate Weight#count.



[jira] [Resolved] (LUCENE-10415) FunctionScoreQuery and IndexOrDocValuesQuery should delegate Weight#count

2022-02-16 Thread Ignacio Vera (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ignacio Vera resolved LUCENE-10415.
---
Fix Version/s: 9.1
 Assignee: Ignacio Vera
   Resolution: Fixed

> FunctionScoreQuery and IndexOrDocValuesQuery should delegate Weight#count
> -
>
> Key: LUCENE-10415
> URL: https://issues.apache.org/jira/browse/LUCENE-10415
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Ignacio Vera
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We have a number of query wrappers that do not modify the set of matching 
> documents like FunctionScoreQuery and IndexOrDocValuesQuery. These queries 
> should delegate Weight#count.


