date:20240102

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-02 Thread via GitHub



tveasey commented on PR #12962:
URL: https://github.com/apache/lucene/pull/12962#issuecomment-1873862125

   IMO we shouldn't focus too much on recall, since the greediness of 
non-competitive search allows us to tune this. My main concern is whether 
contention on the queue updates causes a slowdown. This aside, I think the 
queue is strictly better.
   
   The search might wind up visiting fewer vertices with min score sharing, 
because earlier decisions might by chance give it transiently better bounds, 
but this should be low probability, particularly when the search has to visit 
many vertices. And indeed these are the cases where we see big wins from using 
a queue.
   
   There appears to be some evidence of contention. This is suggested by 
comparing the actual runtime with the runtime expected from the number of 
vertices visited, e.g.
   
   | scenario | QPS(score) / QPS(queue) | Visited(queue) / Visited(score) |
   | --- | --- | --- |
   | n=10M, dim=100, k = 100, fo = 900 | 0.83 | 0.65 |
   | n=10M, dim=768, k = 100, fo = 900 | 0.76 | 0.68 |
   
   Note that the direction of this effect is consistent, but the size is not 
(fo = 900 shows the largest effect). That said, we still get significant 
performance wins, so my vote would be to use the queue and work on strategies 
for reducing contention; we had various ideas for ways to achieve this.
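
To make the comparison in the table concrete, here is a minimal sketch of the arithmetic (not code from the PR). It assumes query cost is proportional to the number of vertices visited, in which case QPS(score) / QPS(queue) should match Visited(queue) / Visited(score); whatever is left over is a candidate for contention overhead. The class and method names are made up for illustration.

```java
// Sketch only: assumes per-query cost scales linearly with vertices visited.
final class ContentionEstimate {
  /**
   * @param qpsScoreOverQueue measured QPS(score) / QPS(queue)
   * @param visitedQueueOverScore measured Visited(queue) / Visited(score)
   * @return factor by which the queue run is slower than its visit count alone predicts
   */
  static double unexplainedSlowdown(double qpsScoreOverQueue, double visitedQueueOverScore) {
    // QPS(score)/QPS(queue) equals time(queue)/time(score). If time tracked only
    // the visit count, that ratio would equal Visited(queue)/Visited(score).
    return qpsScoreOverQueue / visitedQueueOverScore;
  }

  public static void main(String[] args) {
    // Values taken from the two table rows.
    System.out.printf("dim=100: %.2f%n", unexplainedSlowdown(0.83, 0.65));
    System.out.printf("dim=768: %.2f%n", unexplainedSlowdown(0.76, 0.68));
  }
}
```

For the two rows in the table this gives roughly 1.28 and 1.12, i.e. under that assumption the queue runs are about 28% and 12% slower than their visit counts alone would predict.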


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [I] Introduce Bloom Filter as non-experimental/core postings format [lucene]

2024-01-02 Thread via GitHub



mikemccand commented on issue #12986:
URL: https://github.com/apache/lucene/issues/12986#issuecomment-1873933126

   I agree with @rmuir -- promising backwards compatibility (API or index 
format) is a huge burden on Lucene developers, and it's hard enough with the 
default Codec today.
   
   Given that the bloom postings format is still very much in flux, let's wait 
on removing the experimental tag.  E.g. we are also [pursuing another 
experimental codec (inspired by 
Tantivy)](https://github.com/apache/lucene/pull/12688) that also seems to speed 
up the primary-key lookup use case.
   
   Note that OpenSearch devs could also choose to offer this backwards 
compatibility to their users.  The promise need not be implemented only in Lucene.
   
   Thank you for sharing those benchmark results.  That is indeed quite a 
sizable impact on indexing throughput / long-pole latencies, especially as you 
greatly increase the bloom filter size to lower the false-positive rate.  It 
looks like that test was 100% `updateDocument` calls, with 25% of the calls 
being true updates rather than appends?
   
   +1 to pursue the linked improvements (off-heap option) -- the [linked 
PR](https://github.com/opensearch-project/OpenSearch/pull/11027) looks 
interesting -- maybe open a PR here for that?  Or is that change somehow 
OpenSearch specific?
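
For a rough sense of how filter size trades against false-positive rate, the textbook Bloom filter sizing formula is m = -n ln(p) / (ln 2)^2 bits for n keys at false-positive rate p, assuming the optimal number of hash functions. The sketch below just evaluates that formula. It is an illustration only and not how Lucene's bloom postings format actually sizes its filters; `BloomSizing` and `bitsFor` are hypothetical names.

```java
// Sketch only: textbook sizing, independent of any Lucene implementation detail.
final class BloomSizing {
  /** Bits needed for n keys at target false-positive rate p (0 < p < 1). */
  static long bitsFor(long n, double p) {
    double ln2 = Math.log(2);
    return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
  }

  public static void main(String[] args) {
    long n = 1_000_000L;
    for (double p : new double[] {0.1, 0.01, 0.001}) {
      long bits = bitsFor(n, p);
      System.out.printf("p=%.3f -> %d bits (%.1f bits/key)%n", p, bits, (double) bits / n);
    }
  }
}
```

Each tenfold reduction in the false-positive rate costs about 4.8 more bits per key (roughly 9.6 bits/key at 1% and 14.4 at 0.1%), which is consistent with larger filters putting more pressure on memory and indexing.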



Re: [I] Introduce Bloom Filter as non-experimental/core postings format [lucene]

2024-01-02 Thread via GitHub



shwetathareja commented on issue #12986:
URL: https://github.com/apache/lucene/issues/12986#issuecomment-1873988375

   Thanks @mikemccand for the feedback. We can pursue the route to offer the 
backward compatibility in OpenSearch directly if there are no other takers 
among Lucene users.
   
   💯 for contributing the off heap implementation to Lucene to prevent memory 
overhead due to large bloom filter size.



Re: [I] org.apache.lucene.search.TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity fails intermittently [lucene]

2024-01-02 Thread via GitHub



benwtrent commented on issue #12955:
URL: https://github.com/apache/lucene/issues/12955#issuecomment-1874009723

   @kaivalnp this does indeed seem related to disconnectedness. That is a 
larger effort. I would suggest updating the graph parameters for this 
particular test to reduce the chance of the failure.
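
As an illustration of what disconnectedness means for search, here's a toy sketch (plain adjacency arrays, not Lucene's HNSW graph classes) that counts how many nodes a traversal from the entry point can reach. Any node outside that set can never be returned, no matter how similar it is to the query. `Reachability` and `reachableFrom` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch only: a breadth-first reachability check on a toy neighbor-list graph.
final class Reachability {
  /** Counts the nodes reachable from entryPoint by following neighbor lists. */
  static int reachableFrom(int[][] neighbors, int entryPoint) {
    boolean[] seen = new boolean[neighbors.length];
    Deque<Integer> queue = new ArrayDeque<>();
    seen[entryPoint] = true;
    queue.add(entryPoint);
    int count = 0;
    while (!queue.isEmpty()) {
      int node = queue.poll();
      count++;
      for (int next : neighbors[node]) {
        if (!seen[next]) {
          seen[next] = true;
          queue.add(next);
        }
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // Nodes 0-2 form one component; nodes 3-4 form another with no link to it.
    int[][] neighbors = {{1, 2}, {0}, {0}, {4}, {3}};
    int reached = reachableFrom(neighbors, 0);
    System.out.println(reached + " of " + neighbors.length + " nodes reachable from entry point 0");
  }
}
```

In this toy graph nodes 3 and 4 are invisible to any search that starts at node 0, which is the kind of situation a test with tight result expectations can trip over.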



[PR] Reduce number of dimensions for Test[Byte|Float]VectorSimilarityQuery [lucene]

2024-01-02 Thread via GitHub



kaivalnp opened a new pull request, #12988:
URL: https://github.com/apache/lucene/pull/12988

   ### Description
   
   Identified in #12955, where 
`TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity` fails because of a 
disconnected HNSW graph.
   
   This is a bigger issue, but we can reduce intermittent failures by keeping 
the number of docs and dimensions the same as in 
[`BaseKnnVectorQueryTestCase.testRandom`](https://github.com/apache/lucene/blob/dc9f154aa574e8cd0e60070a1814c1d221fbec5d/lucene/core/src/test/org/apache/lucene/search/BaseKnnVectorQueryTestCase.java#L470)
 (a similar test for KNN with random vectors).
   
   ### Command to reproduce
   
   ```
   ./gradlew :lucene:core:test --tests 
"org.apache.lucene.search.TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity"
 -Ptests.jvms=12 -Ptests.jvmargs= -Ptests.seed=1A1CDC0974AF361
   ```



Re: [I] org.apache.lucene.search.TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity fails intermittently [lucene]

2024-01-02 Thread via GitHub



kaivalnp commented on issue #12955:
URL: https://github.com/apache/lucene/issues/12955#issuecomment-1874198612

   Makes sense @benwtrent.. Opened #12988 to fix this



Re: [I] Refactoring: Rename Levenstein to Levenshtein [LUCENE-7370] [lucene]

2024-01-02 Thread via GitHub



shaikhu commented on issue #8424:
URL: https://github.com/apache/lucene/issues/8424#issuecomment-1874242117

   @mikemccand I think this can be closed now 😉 



Re: [PR] Update copyright year in NOTICE.txt file. [lucene]

2024-01-02 Thread via GitHub



cpoerschke closed pull request #12065: Update copyright year in NOTICE.txt file.
URL: https://github.com/apache/lucene/pull/12065



Re: [PR] Update copyright year in NOTICE.txt file. [lucene]

2024-01-02 Thread via GitHub



cpoerschke commented on PR #12065:
URL: https://github.com/apache/lucene/pull/12065#issuecomment-1874287516

   Happy New Year 2024!



Re: [PR] Reduce number of dimensions for Test[Byte|Float]VectorSimilarityQuery [lucene]

2024-01-02 Thread via GitHub



benwtrent merged PR #12988:
URL: https://github.com/apache/lucene/pull/12988



Re: [I] Refactoring: Rename Levenstein to Levenshtein [LUCENE-7370] [lucene]

2024-01-02 Thread via GitHub



mikemccand closed issue #8424: Refactoring: Rename Levenstein to Levenshtein 
[LUCENE-7370]
URL: https://github.com/apache/lucene/issues/8424



Re: [I] Refactoring: Rename Levenstein to Levenshtein [LUCENE-7370] [lucene]

2024-01-02 Thread via GitHub



mikemccand commented on issue #8424:
URL: https://github.com/apache/lucene/issues/8424#issuecomment-1874398060

   Great, thanks @shaikhu!



Re: [PR] Add support for index sorting with document blocks [lucene]

2024-01-02 Thread via GitHub



mikemccand commented on code in PR #12829:
URL: https://github.com/apache/lucene/pull/12829#discussion_r1439716624


##
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##
@@ -1176,34 +1176,44 @@ public static Status.IndexSortStatus testSort(
 comparators[i] = fields[i].getComparator(1, 
Pruning.NONE).getLeafComparator(readerContext);
   }
 
-  int maxDoc = reader.maxDoc();
-
   try {
-
-for (int docID = 1; docID < maxDoc; docID++) {
-
+LeafMetaData metaData = reader.getMetaData();
+FieldInfos fieldInfos = reader.getFieldInfos();
+if (metaData.hasBlocks()
+&& fieldInfos.getParentField() == null
+&& metaData.getCreatedVersionMajor() >= 
Version.LUCENE_10_0_0.major) {
+  throw new IllegalStateException(
+  "parent field is not set but the index has document blocks and 
was created with version: "
+  + metaData.getCreatedVersionMajor());
+}
+final DocIdSetIterator iter =

Review Comment:
   Hmm maybe un-ternary this one?



##
lucene/CHANGES.txt:
##
@@ -90,6 +90,11 @@ New Features
 * LUCENE-10626 Hunspell: add tools to aid dictionary editing:
   analysis introspection, stem expansion and stem/flag suggestion (Peter 
Gromov)
 
+* GITHUB#12829: IndexWriter now preserves document blocks indexed via 
IndexWriter#addDocuments
+  et.al. when index sorting is configured. Document blocks are maintained 
alongside their
+  parent documents during sort and merge. IndexWriterConfig requires a parent 
field to be specified
+  if index sorting is used together with documents blocks. (Simon Willnauer)

Review Comment:
   `documents blocks` -> `document blocks`



##
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java:
##
@@ -134,6 +137,8 @@ void abort() throws IOException {
   private final ReentrantLock lock = new ReentrantLock();
   private int[] deleteDocIDs = new int[0];
   private int numDeletedDocIds = 0;
+  private final int indexVersionCreated;

Review Comment:
   Rename to `indexMajorVersionCreated`?



##
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java:
##
@@ -231,7 +244,23 @@ long updateDocuments(
   final int docsInRamBefore = numDocsInRAM;
   boolean allDocsIndexed = false;
   try {
-for (Iterable doc : docs) {
+final Iterator> iterator 
= docs.iterator();
+while (iterator.hasNext()) {
+  Iterable doc = iterator.next();
+  if (parentField != null) {
+if (iterator.hasNext() == false) {
+  doc = addParentField(doc, parentField);
+}
+  } else if (segmentInfo.getIndexSort() != null
+  && iterator.hasNext()
+  && indexVersionCreated >= Version.LUCENE_10_0_0.major) {
+// sort is configured but parent field is missing, yet we have a 
doc-block
+// yet we must not fail if this index was created in an earlier 
version where this
+// behavior was permitted.
+throw new IllegalArgumentException(
+"a parent field must be set in order to use document blocks 
with index sorting; see IndexWriterConfig#getParentField");

Review Comment:
   Maybe `#setParentField` instead?  Or is the javadoc mostly on 
`getParentField`?  If so, leave this be.



##
lucene/core/src/java/org/apache/lucene/index/IndexingChain.java:
##
@@ -1512,4 +1557,77 @@ void assertSameSchema(FieldInfo fi) {
   assertSame("point num bytes", fi.getPointNumBytes(), pointNumBytes);
 }
   }
+
+  /**
+   * Wraps the given field in a reserved field and registers it as reserved. 
Only DWPT should do
+   * this to mark fields as private / reserved to prevent this fieldname to be 
used from the outside

Review Comment:
   +1 to tighten up the language.



##
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java:
##
@@ -245,10 +274,11 @@ long updateDocuments(
 onNewDocOnRAM.run();
   }
 }
-allDocsIndexed = true;
-if (numDocsInRAM - docsInRamBefore > 1) {
+final int numDocs = numDocsInRAM - docsInRamBefore;
+if (numDocs > 1) {
   segmentInfo.setHasBlocks();

Review Comment:
   This might mean some segments don't have `hasBlocks` set but some do, if all 
doc blocks in a given segment happened to have just one doc.  That should be 
OK?  Nothing checks that all segments have the same boolean value for this?  
Maybe add a test case to confirm?



##
lucene/core/src/java/org/apache/lucene/index/FieldInfos.java:
##
@@ -437,6 +488,31 @@ private void verifySoftDeletedFieldName(String fieldName, 
boolean isSoftDeletesF
   }
 }
 
+private void verifyParentFieldName(String fieldName, boolean 
isParentField) {
+  if (isParentField) {
+if (parentFieldName == null) {
+  throw new Il

Re: [PR] LUCENE-10641: IndexSearcher#setTimeout should also abort query rewrites, point ranges and vector searches [lucene]

2024-01-02 Thread via GitHub



mikemccand commented on code in PR #12345:
URL: https://github.com/apache/lucene/pull/12345#discussion_r1439744329


##
lucene/core/src/java/org/apache/lucene/index/ExitableIndexReader.java:
##
@@ -0,0 +1,499 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.automaton.CompiledAutomaton;
+
+/**
+ * The {@link ExitableIndexReader} is used to timeout I/O operation which is 
done during query
+ * rewrite. After this time is exceeded, the search thread is stopped by 
throwing a {@link
+ * ExitableIndexReader.TimeExceededException}
+ */
+public final class ExitableIndexReader extends IndexReader {
+  private final IndexReader indexReader;
+  private final QueryTimeout queryTimeout;
+
+  /**
+   * Create a ExitableIndexReader wrapper over another {@link IndexReader} 
with a specified timeout.
+   *
+   * @param indexReader the wrapped {@link IndexReader}
+   * @param queryTimeout max time allowed for collecting hits after which 
{@link
+   * ExitableIndexReader.TimeExceededException} is thrown
+   */
+  public ExitableIndexReader(IndexReader indexReader, QueryTimeout 
queryTimeout) {
+this.indexReader = indexReader;
+this.queryTimeout = queryTimeout;
+doWrapIndexReader(indexReader, queryTimeout);
+  }
+
+  /** Returns queryTimeout instance. */
+  public QueryTimeout getQueryTimeout() {
+return queryTimeout;
+  }
+
+  /** Thrown when elapsed search time exceeds allowed search time. */
+  @SuppressWarnings("serial")
+  static class TimeExceededException extends RuntimeException {
+private TimeExceededException() {
+  super("TimeLimit Exceeded");
+}
+
+private TimeExceededException(Exception e) {
+  super(e);
+}
+  }
+
+  @Override
+  public TermVectors termVectors() throws IOException {
+if (queryTimeout.shouldExit()) {
+  throw new ExitableIndexReader.TimeExceededException();
+}
+return indexReader.termVectors();
+  }
+
+  @Override
+  public int numDocs() {
+if (queryTimeout.shouldExit()) {
+  throw new ExitableIndexReader.TimeExceededException();
+}
+return indexReader.numDocs();
+  }
+
+  @Override
+  public int maxDoc() {
+if (queryTimeout.shouldExit()) {
+  throw new ExitableIndexReader.TimeExceededException();
+}
+return indexReader.maxDoc();
+  }
+
+  @Override
+  public StoredFields storedFields() throws IOException {
+if (queryTimeout.shouldExit()) {
+  throw new ExitableIndexReader.TimeExceededException();
+}
+return indexReader.storedFields();
+  }
+
+  @Override
+  protected void doClose() throws IOException {
+indexReader.doClose();
+  }
+
+  @Override
+  public IndexReaderContext getContext() {
+return indexReader.getContext();
+  }
+
+  @Override
+  public CacheHelper getReaderCacheHelper() {
+if (queryTimeout.shouldExit()) {
+  throw new ExitableIndexReader.TimeExceededException();
+}
+return indexReader.getReaderCacheHelper();
+  }
+
+  @Override
+  public int docFreq(Term term) throws IOException {
+if (queryTimeout.shouldExit()) {
+  throw new ExitableIndexReader.TimeExceededException();
+}
+return indexReader.docFreq(term);
+  }
+
+  @Override
+  public long totalTermFreq(Term term) throws IOException {
+if (queryTimeout.shouldExit()) {
+  throw new ExitableIndexReader.TimeExceededException();
+}
+return indexReader.totalTermFreq(term);
+  }
+
+  @Override
+  public long getSumDocFreq(String field) throws IOException {
+if (queryTimeout.shouldExit()) {
+  throw new ExitableIndexReader.TimeExceededException();
+}
+return indexReader.getSumDocFreq(field);
+  }
+
+  @Override
+  public int getDocCount(String field) throws IOException {
+if (queryTimeout.shouldExit()) {
+  throw new ExitableIndexReader.TimeExceededException();
+}
+return indexReader.getDocCount(field);
+  }
+
+  @Override
+  public long getSumTotalTermFreq(String fi

[I] Update package info for HNSW [lucene]

2024-01-02 Thread via GitHub



znnahiyan opened a new issue, #12990:
URL: https://github.com/apache/lucene/issues/12990

   According to PRs #608 and #629, the HNSW package was made hierarchical in 
Lucene 9.1.0, so it is no longer single-layer as the package info description 
claims:
   
   
https://github.com/apache/lucene/blob/248f067d5207da653f94751eb5d4e8a42ae1238e/lucene/core/src/java/org/apache/lucene/util/hnsw/package-info.java#L17-L22
   
   None of the javadoc pages from 9.1.0 up to the latest (9.9.1) reflect this 
change, so it would be preferable to merge the commit(s) into those earlier 
version branches as well.
   
   * 
https://lucene.apache.org/core/9_1_0/core/org/apache/lucene/util/hnsw/package-summary.html
   * 
https://lucene.apache.org/core/9_9_1/core/org/apache/lucene/util/hnsw/package-summary.html



[I] NullPointerException in IndexSearcher.search() when searching with SpanfirstQuery and a customized collector [lucene]

2024-01-02 Thread via GitHub



luozhuang opened a new issue, #12991:
URL: https://github.com/apache/lucene/issues/12991

   ### Description
   
   I encountered a NullPointerException when searching with SpanFirstQuery.  
The Lucene version is 8.10. 
   The example call stack is:
   ```
   A Java Exception: java.lang.NullPointerException
   at #1 org.apache.lucene.search.spans.SpanScorer.scoreCurrentDoc(SpanScorer.java:76)
   #2 org.apache.lucene.search.spans.SpanScorer.score(SpanScorer.java:134)
   #3 com.dummy.search.mySearcher.DocScoreCollector.collect(MyDocScoreCollector.java:59)
   #4 com.dummy.search.mySearcher.DocScoreCollector$1.collect(MyDocScoreCollector.java:108)
   #5 org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:283)
   #6 org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:232)
   #7 org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
   #8 org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:659)
   #9 org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:443)
   ```
   
   Frames #3 and #4 are within my customized collector.  This collector has 
a SpanScorer.  The example code of this class looks like:
   ```
   public class MyDocScoreCollector implements Collector {
       ScoreMode collectScore;
       Scorable scorer;
       ArrayList scoreDocs = new ArrayList(100);
       int docBase;

       public MyDocScoreCollector(boolean collectScore) {
           if (collectScore) {
               this.collectScore = ScoreMode.COMPLETE;
           } else {
               this.collectScore = ScoreMode.COMPLETE_NO_SCORES;
           }
       }

       public void collect(int doc) throws IOException {
           float score = 0.0f;
           if (collectScore == ScoreMode.COMPLETE)
               score = scorer.score();

           ScoreDoc sd = new ScoreDoc(docBase + doc, score);
           scoreDocs.add(sd);
       }

       public void setScorer(Scorable scorer) throws IOException {
           this.scorer = scorer;
       }

       @Override
       public ScoreMode scoreMode() {
           return ScoreMode.COMPLETE_NO_SCORES;
       }

       public void setNextReader(LeafReaderContext context) throws IOException {
           this.docBase = context.docBase;
       }

       abstract static class ScorerLeafCollector implements LeafCollector {

           public MyDocScoreCollector docScoreCollector = null;

           @Override
           public void setScorer(Scorable scorer) throws IOException {
               docScoreCollector.setScorer(scorer);
           }

           ScorerLeafCollector(MyDocScoreCollector collector) {
               this.docScoreCollector = collector;
           }
       }

       @Override
       public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {

           docBase = context.docBase;
           return new ScorerLeafCollector(this) {

               @Override
               public void setScorer(Scorable scorer) throws IOException {
                   super.setScorer(scorer);
               }

               @Override
               public void collect(int doc) throws IOException {
                   docScoreCollector.collect(doc);
               }

           };
       }

   }
   ```
   
   I found that the NullPointerException is caused by the empty **SimScorer** 
on **SpanWeight**.  
   Because the **SimScorer** on **SpanWeight** is empty, the **SpanScorer** 
obtained from it has a null **LeafSimScorer**, and scoring a doc with that 
SpanScorer then throws the NullPointerException. 
   The method SpanScorer.scoreCurrentDoc() doesn't check whether the 
**LeafSimScorer** field is null; there is only an assertion. 
   
   I also found another [similar 
issue](https://github.com/apache/lucene/issues/10564) related to the empty 
**SimScorer** on **SpanWeight**, but it has been fixed by adding some 
defensive code.  
   This seems to be another similar case.
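
The posted collector always returns `ScoreMode.COMPLETE_NO_SCORES` from `scoreMode()` but still calls `scorer.score()` when constructed with `collectScore = true`, which may be how the scorer ends up without a similarity scorer. The toy model below illustrates that failure mode and one possible defensive check. It is a sketch under that assumption, using hypothetical stand-in types (`ToySimScorer`, `ToyScorer`) rather than Lucene's actual `SpanWeight` / `SpanScorer` code.

```java
// Sketch only: hypothetical stand-ins, not Lucene classes.
final class NullScorerDemo {
  /** Stand-in for a similarity scorer. */
  interface ToySimScorer {
    float score(float freq);
  }

  /** Stand-in for a scorer whose similarity scorer may be absent. */
  static final class ToyScorer {
    private final ToySimScorer sim; // stays null when scores were declared unnecessary

    ToyScorer(boolean needsScores) {
      if (needsScores) {
        this.sim = freq -> freq * 2.0f; // arbitrary toy formula
      } else {
        this.sim = null;
      }
    }

    /** Mirrors the reported failure mode: dereferences sim without a null check. */
    float scoreUnchecked(float freq) {
      return sim.score(freq);
    }

    /** One possible defensive variant: fall back to 0 when there is nothing to score with. */
    float scoreDefensive(float freq) {
      return sim == null ? 0.0f : sim.score(freq);
    }
  }

  public static void main(String[] args) {
    ToyScorer scorer = new ToyScorer(false); // "no scores needed"
    System.out.println("defensive score: " + scorer.scoreDefensive(3.0f));
    try {
      scorer.scoreUnchecked(3.0f);
    } catch (NullPointerException e) {
      System.out.println("unchecked score threw NullPointerException");
    }
  }
}
```

If that assumption holds, having `scoreMode()` return the same mode the collector was constructed with would be the first thing to try on the caller's side, independent of any defensive check in the scorer.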
   
   

