Re: [PR] Rename TestWildcard to TestWildcardQuery. [lucene]

2024-03-08 Thread via GitHub


vsop-479 commented on PR #13159:
URL: https://github.com/apache/lucene/pull/13159#issuecomment-1985346755

   @rmuir 
   If there is no problem with this change, would you like to merge it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix NPE when LeafReader return null VectorValues [lucene]

2024-03-08 Thread via GitHub


bugmakerr commented on code in PR #13162:
URL: https://github.com/apache/lucene/pull/13162#discussion_r1517586857


##
lucene/core/src/java/org/apache/lucene/search/FloatVectorSimilarityValuesSource.java:
##
@@ -40,6 +40,9 @@ public FloatVectorSimilarityValuesSource(float[] vector, 
String fieldName) {
   @Override
   public DoubleValues getValues(LeafReaderContext ctx, DoubleValues scores) 
throws IOException {
 final FloatVectorValues vectorValues = 
ctx.reader().getFloatVectorValues(fieldName);
+if (vectorValues == null) {
+  return DoubleValues.EMPTY;
+}

Review Comment:
   Nice catch. I will fix it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Simplify BooleanWeight#optCount [lucene]

2024-03-08 Thread via GitHub


javanna opened a new pull request, #13167:
URL: https://github.com/apache/lucene/pull/13167

   Not a functional change: I think we can improve readability of optCount, and 
return a value as soon as we can instead of looping through clauses further for 
nothing.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Simplify BooleanWeight#optCount [lucene]

2024-03-08 Thread via GitHub


javanna closed pull request #13167: Simplify BooleanWeight#optCount
URL: https://github.com/apache/lucene/pull/13167


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Simplify BooleanWeight#optCount [lucene]

2024-03-08 Thread via GitHub


javanna commented on PR #13167:
URL: https://github.com/apache/lucene/pull/13167#issuecomment-1985593882

   I missed that we may still have count == numDocs hence return count although 
unknownCount is set to true, which means we cannot immediately return -1 like I 
was proposing. Closing. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Rename TestWildcard to TestWildcardQuery. [lucene]

2024-03-08 Thread via GitHub


rmuir merged PR #13159:
URL: https://github.com/apache/lucene/pull/13159


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Made DocIdsWriter use DISI when reading documents with an IntersectVisitor [lucene]

2024-03-08 Thread via GitHub


mikemccand commented on code in PR #13149:
URL: https://github.com/apache/lucene/pull/13149#discussion_r1517878022


##
lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java:
##
@@ -36,6 +36,66 @@ final class DocIdsWriter {
 
   private final int[] scratch;
 
+  private final ScratchDocIdSetIterator scratchDocIdSetIterator = new 
ScratchDocIdSetIterator();
+
+  /**
+   * DocIdSetIterator to be used to iterate over the scratch buffer. A single 
instance is reused to
+   * avoid re-allocating the object. The reset method should be called before 
each use with the
+   * count.
+   *
+   * The main reason for existing is to be able to call the {@link
+   * IntersectVisitor#visit(DocIdSetIterator)} method rather than the {@link
+   * IntersectVisitor#visit(int)} method. This seems to make a difference in 
performance, probably
+   * due to fewer virtual calls then happening (once per read call rather than 
once per doc).
+   */
+  private class ScratchDocIdSetIterator extends DocIdSetIterator {
+
+private int index = -1;
+private int count = -1;
+
+@Override
+public int docID() {
+  if (index < 0) {
+return -1;
+  }
+  if (index >= count) {
+return NO_MORE_DOCS;
+  }
+  return scratch[index];
+}
+
+@Override
+public int nextDoc() throws IOException {
+  index++;
+  if (index >= count) {
+return NO_MORE_DOCS;
+  }
+  return scratch[index];
+}
+
+@Override
+public int advance(int target) throws IOException {
+  while (index < count && scratch[index] < target) {

Review Comment:
   > It might be that a DISI is not the right interface for visiting docids 
within the BKD tree - but I'm guessing changing that would be larger and more 
controversial.
   
   Yeah indeed DISI may not be right, but let's not try to fix that here/now?  
Perhaps there are use cases that might want to call `.advance` on the DISI?  
Maybe some sort of curious combined filtering / points query?
   
   OK you convinced me!: let's leave the method as is.  You're right that if 
such a use case existed today, it is suffering through the linear scan of 
`.visit` calls already, so this is no worse.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] TestIDVersionPostingsFormat failure [lucene]

2024-03-08 Thread via GitHub


benwtrent commented on issue #13127:
URL: https://github.com/apache/lucene/issues/13127#issuecomment-1986293179

   Looking at the code in `DocumentsWriterDeleteQueue#close()`, we trip if 
`seqNo` is ever larger than `maxSeqNo`. 
   
   `maxSeqNo` is set in `DocumentsWriterDeleteQueue#advanceQueue(int)`, which 
is synchronized. Internally `maxSeqNo` is set to `getLastSequenceNumber() + 
maxNumPendingOps + 1;`
   
   From what I can see `maxNumPendingOps` is synchronized and unchanging as 
well as it is passed in via: `DocumentsWriterFlushControl#markForFullFlush`. 
   
   However, `getLastSequenceNumber()` is NOT synchronized with 
`getNextSequenceNumber()`. It seems to me there may be a race condition where:
   
- DocumentsWriterDeleteQueue#advanceQueue(int) is entered by one thread
- The line `long seqNo = getLastSequenceNumber() + maxNumPendingOps + 1;` 
is executed
- Then another thread calls `getNextSequenceNumber()`
   
   I have to trace up where all these are used, but this is the first thing I 
saw that seemed suspicious to me. If this happened, it seems possible to me 
that the new `DocumentsWriterDeleteQueue` returned from `advanceQueue` could 
have a `maxSeqNo` set to too few given a number of parallel calls to 
`getNextSequenceNumber()` during the `advanceQueue` action.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] TestIDVersionPostingsFormat failure [lucene]

2024-03-08 Thread via GitHub


benwtrent commented on issue #13127:
URL: https://github.com/apache/lucene/issues/13127#issuecomment-1986327081

   OK, looking at where they are used, it seems like they were attempted to be 
synchronized, but we aren't synchronizing on the same things, which could cause 
a race condition. 
   
   Particularly right here:
   
   
https://github.com/apache/lucene/blob/40060f8b7080d06a218518445a0a1dfc520c812a/lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java#L572-L576
   
   ```
   documentsWriter.resetDeleteQueue(newQueue);
   ```
   Is synchronized, and so is 
   
   ```
   documentsWriter.getNextSequenceNumber()
   ```
   
   However, in this window of resetting the queue & gathering seqNo, 
`documentsWriter` isn't locked. So `getNextSequenceNumber` could be called. 
Thus calling `getNextSequenceNumber()` on the old queue, after maxSeqNo is 
figured out for the NEW queue.
   
   I think the solution might be to encapsulate the `resetDeleteQueue` so that 
`deleteQueue.advanceQueue` is synchronized with `getNextSequenceNumber`. 
   
   I will push a PR to that effect. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Fix potential race condition in DocumentsWriter & DocumentsWriterDeleteQueue [lucene]

2024-03-08 Thread via GitHub


benwtrent opened a new pull request, #13169:
URL: https://github.com/apache/lucene/pull/13169

   13127 is failing due to seqNo being larger than `maxSeqNo` on `close()`. 
   
   `maxSeqNo` is set during `DocumentsWriterDeleteQueue#advanceQueue`, 
synchronized on self. It utilizes `getLastSequenceNumber()`, which reads the 
`AtomicLong` for `seqNo`. 
   
   However, `DocumentsWriterDeleteQueue#getNextSequenceNumber()` is not 
synchronized on self. Meaning after (or before) reading the `AtomicLong`, it 
could be incremented by this method. 
   
   `DocumentsWriterDeleteQueue#getNextSequenceNumber()` is called in other 
synchronized methods, the one external one being 
`DocumentsWriter#getNextSequenceNumber`, which is synchronized on self (e.g. 
DocumentsWriter). 
   
   When calling `DocumentsWriterFlushControl#markFullFlush`, neither the 
`DocumentsWriterDeleteQueue` nor `DocumentsWriterDeleteQueue` are locked, and 
it is possible for `documentsWriter.getNextSequenceNumber()` to be called after 
`documentsWriter.deleteQueue.advanceQueue` but before 
`documentsWriter.resetDeleteQueue`.
   
   This commit moves the `documentsWriter.deleteQueue.advanceQueue` call into 
the already synchronized `documentsWriter.resetDeleteQueue` to ensure that 
`getNextSequenceNumber` cannot be called while the queue is being reset.
   
   This is admittedly difficult to fully test. But I have seen many failures in 
continuous testing. So, if this commit does fix that test, I would expect those 
to disappear.
   
   closes: https://github.com/apache/lucene/issues/13127


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Made the UnifiedHighlighter's hasUnrecognizedQuery function processes FunctionQuery the same way as MatchAllDocsQuery and MatchNoDocsQuery queries for performance reasons. [lucene]

2024-03-08 Thread via GitHub


romseygeek commented on code in PR #13165:
URL: https://github.com/apache/lucene/pull/13165#discussion_r1518350668


##
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java:
##
@@ -1130,7 +1134,16 @@ public boolean acceptField(String field) {
   @Override
   public void visitLeaf(Query query) {
 if (MultiTermHighlighting.canExtractAutomataFromLeafQuery(query) 
== false) {
-  if (!(query instanceof MatchAllDocsQuery || query instanceof 
MatchNoDocsQuery)) {
+  boolean no_effect_query = false;

Review Comment:
   This would be simpler as 
`QUERIES_WITH_NO_HL_EFFECT.contains(query.getClass())`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Make `OneMerge#reorder` preserve blocks. [lucene]

2024-03-08 Thread via GitHub


github-actions[bot] commented on PR #13128:
URL: https://github.com/apache/lucene/pull/13128#issuecomment-1986591018

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Make BP work on indexes that have blocks. [lucene]

2024-03-08 Thread via GitHub


github-actions[bot] commented on PR #13125:
URL: https://github.com/apache/lucene/pull/13125#issuecomment-1986591048

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Improve set deletions percentage javadoc [lucene]

2024-03-08 Thread via GitHub


github-actions[bot] commented on PR #12828:
URL: https://github.com/apache/lucene/pull/12828#issuecomment-1986591236

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Add maximum clause count check to IndexSearcher rather than BooleanQuery [LUCENE-8811] [lucene]

2024-03-08 Thread via GitHub


dsmiley commented on issue #9855:
URL: https://github.com/apache/lucene/issues/9855#issuecomment-1986742857

   I don't think I follow the rationale for why this was done.  I believe the 
value of the limit is either (a) end-users abusing the system or (b) developer 
users not recognizing they should use TermInSetQuery.  Both of those are 
probably best addressed in a query parser as this is their touch-point with 
search.  This is where Solr applies such a limit, BTW.  Do we really need a 
totally comprehensive max number of clauses?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Made DocIdsWriter use DISI when reading documents with an IntersectVisitor [lucene]

2024-03-08 Thread via GitHub


antonha commented on code in PR #13149:
URL: https://github.com/apache/lucene/pull/13149#discussion_r1518509612


##
lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java:
##
@@ -36,6 +36,66 @@ final class DocIdsWriter {
 
   private final int[] scratch;
 
+  private final ScratchDocIdSetIterator scratchDocIdSetIterator = new 
ScratchDocIdSetIterator();
+
+  /**
+   * DocIdSetIterator to be used to iterate over the scratch buffer. A single 
instance is reused to
+   * avoid re-allocating the object. The reset method should be called before 
each use with the
+   * count.
+   *
+   * The main reason for existing is to be able to call the {@link
+   * IntersectVisitor#visit(DocIdSetIterator)} method rather than the {@link
+   * IntersectVisitor#visit(int)} method. This seems to make a difference in 
performance, probably
+   * due to fewer virtual calls then happening (once per read call rather than 
once per doc).
+   */
+  private class ScratchDocIdSetIterator extends DocIdSetIterator {
+
+private int index = -1;
+private int count = -1;
+
+@Override
+public int docID() {
+  if (index < 0) {
+return -1;
+  }
+  if (index >= count) {
+return NO_MORE_DOCS;
+  }
+  return scratch[index];
+}
+
+@Override
+public int nextDoc() throws IOException {
+  index++;
+  if (index >= count) {
+return NO_MORE_DOCS;
+  }
+  return scratch[index];
+}
+
+@Override
+public int advance(int target) throws IOException {
+  while (index < count && scratch[index] < target) {

Review Comment:
   > Yeah indeed DISI may not be right, but let's not try to fix that here/now? 
   Agreed, but I'm tempted to experiment with it in another PR :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Made DocIdsWriter use DISI when reading documents with an IntersectVisitor [lucene]

2024-03-08 Thread via GitHub


antonha commented on PR #13149:
URL: https://github.com/apache/lucene/pull/13149#issuecomment-1986780683

   Thanks for the approval! 
   
   I do however think that it would be good to prove this performance 
improvement in luceneutil before merging, to make the benchmark more easily 
reproducible than the benchmarking I have done. I have started looking into it, 
but will have more time for it next week. Does that sound reasonable? The only 
downside I see is a later merge of this PR.
   
   > @antonha could you add a `CHANGES.txt` entry explaining the optimization? 
Thanks!
   
   Yes, will do once I have run some luceneutil benchmarks (if we want to wait 
for that)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org