Re: [PR] Rename TestWildcard to TestWildcardQuery. [lucene]
vsop-479 commented on PR #13159: URL: https://github.com/apache/lucene/pull/13159#issuecomment-1985346755 @rmuir If there is no problem with this change, would you like to merge it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix NPE when LeafReader return null VectorValues [lucene]
bugmakerr commented on code in PR #13162: URL: https://github.com/apache/lucene/pull/13162#discussion_r1517586857 ## lucene/core/src/java/org/apache/lucene/search/FloatVectorSimilarityValuesSource.java: ## @@ -40,6 +40,9 @@ public FloatVectorSimilarityValuesSource(float[] vector, String fieldName) { @Override public DoubleValues getValues(LeafReaderContext ctx, DoubleValues scores) throws IOException { final FloatVectorValues vectorValues = ctx.reader().getFloatVectorValues(fieldName); +if (vectorValues == null) { + return DoubleValues.EMPTY; +} Review Comment: Nice catch. I will fix it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Simplify BooleanWeight#optCount [lucene]
javanna opened a new pull request, #13167: URL: https://github.com/apache/lucene/pull/13167 Not a functional change: I think we can improve readability of optCount, and return a value as soon as we can instead of looping through clauses further for nothing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Simplify BooleanWeight#optCount [lucene]
javanna closed pull request #13167: Simplify BooleanWeight#optCount URL: https://github.com/apache/lucene/pull/13167 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Simplify BooleanWeight#optCount [lucene]
javanna commented on PR #13167: URL: https://github.com/apache/lucene/pull/13167#issuecomment-1985593882 I missed that we may still have count == numDocs hence return count although unknownCount is set to true, which means we cannot immediately return -1 like I was proposing. Closing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Rename TestWildcard to TestWildcardQuery. [lucene]
rmuir merged PR #13159: URL: https://github.com/apache/lucene/pull/13159 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Made DocIdsWriter use DISI when reading documents with an IntersectVisitor [lucene]
mikemccand commented on code in PR #13149: URL: https://github.com/apache/lucene/pull/13149#discussion_r1517878022 ## lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java: ## @@ -36,6 +36,66 @@ final class DocIdsWriter { private final int[] scratch; + private final ScratchDocIdSetIterator scratchDocIdSetIterator = new ScratchDocIdSetIterator(); + + /** + * DocIdSetIterator to be used to iterate over the scratch buffer. A single instance is reused to + * avoid re-allocating the object. The reset method should be called before each use with the + * count. + * + * The main reason for existing is to be able to call the {@link + * IntersectVisitor#visit(DocIdSetIterator)} method rather than the {@link + * IntersectVisitor#visit(int)} method. This seems to make a difference in performance, probably + * due to fewer virtual calls then happening (once per read call rather than once per doc). + */ + private class ScratchDocIdSetIterator extends DocIdSetIterator { + +private int index = -1; +private int count = -1; + +@Override +public int docID() { + if (index < 0) { +return -1; + } + if (index >= count) { +return NO_MORE_DOCS; + } + return scratch[index]; +} + +@Override +public int nextDoc() throws IOException { + index++; + if (index >= count) { +return NO_MORE_DOCS; + } + return scratch[index]; +} + +@Override +public int advance(int target) throws IOException { + while (index < count && scratch[index] < target) { Review Comment: > It might be that a DISI is not the right interface for visiting docids within the BKD tree - but I'm guessing changing that would be larger and more controversial. Yeah indeed DISI may not be right, but let's not try to fix that here/now? Perhaps there are use cases that might want to call `.advance` on the DISI? Maybe some sort of curious combined filtering / points query? OK you convinced me!: let's leave the method as is. You're right that if such a use case existed today, it is suffering through the linear scan of `.visit` calls already, so this is no worse. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] TestIDVersionPostingsFormat failure [lucene]
benwtrent commented on issue #13127: URL: https://github.com/apache/lucene/issues/13127#issuecomment-1986293179 Looking at the code in `DocumentsWriterDeleteQueue#close()`, we trip if `seqNo` is ever larger than `maxSeqNo`. `maxSeqNo` is set in `DocumentsWriterDeleteQueue#advanceQueue(int)`, which is synchronized. Internally `maxSeqNo` is set to `getLastSequenceNumber() + maxNumPendingOps + 1;` From what I can see `maxNumPendingOps` is synchronized and unchanging as well as it is passed in via: `DocumentsWriterFlushControl#markForFullFlush`. However, `getLastSequenceNumber()` is NOT synchronized with `getNextSequenceNumber()`. It seems to me there may be a race condition where: - DocumentsWriterDeleteQueue#advanceQueue(int) is entered by one thread - The line `long seqNo = getLastSequenceNumber() + maxNumPendingOps + 1;` is executed - Then another thread calls `getNextSequenceNumber()` I have to trace up where all these are used, but this is the first thing I saw that seemed suspicious to me. If this happened, it seems possible to me that the new `DocumentsWriterDeleteQueue` returned from `advanceQueue` could have a `maxSeqNo` set to too few given a number of parallel calls to `getNextSequenceNumber()` during the `advanceQueue` action. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] TestIDVersionPostingsFormat failure [lucene]
benwtrent commented on issue #13127: URL: https://github.com/apache/lucene/issues/13127#issuecomment-1986327081 OK, looking at where they are used, it seems like they were attempted to be synchronized, but we aren't synchronizing on the same things, which could cause a race condition. Particularly right here: https://github.com/apache/lucene/blob/40060f8b7080d06a218518445a0a1dfc520c812a/lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java#L572-L576 ``` documentsWriter.resetDeleteQueue(newQueue); ``` Is synchronized, and so is ``` documentsWriter.getNextSequenceNumber() ``` However, in this window of resetting the queue & gathering seqNo, `documentsWriter` isn't locked. So `getNextSequenceNumber` could be called. Thus calling `getNextSequenceNumber()` on the old queue, after maxSeqNo is figured out for the NEW queue. I think the solution might be to encapsulate the `resetDeleteQueue` so that `deleteQueue.advanceQueue` is synchronized with `getNextSequenceNumber`. I will push a PR to that effect. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Fix potential race condition in DocumentsWriter & DocumentsWriterDeleteQueue [lucene]
benwtrent opened a new pull request, #13169: URL: https://github.com/apache/lucene/pull/13169 13127 is failing due to seqNo being larger than `maxSeqNo` on `close()`. `maxSeqNo` is set during `DocumentsWriterDeleteQueue#advanceQueue`, synchronized on self. It utilizes `getLastSequenceNumber()`, which reads the `AtomicLong` for `seqNo`. However, `DocumentsWriterDeleteQueue#getNextSequenceNumber()` is not synchronized on self. Meaning after (or before) reading the `AtomicLong`, it could be incremented by this method. `DocumentsWriterDeleteQueue#getNextSequenceNumber()` is called in other synchronized methods, the one external one being `DocumentsWriter#getNextSequenceNumber`, which is synchronized on self (e.g. DocumentsWriter). When calling `DocumentsWriterFlushControl#markFullFlush`, neither the `DocumentsWriterDeleteQueue` nor `DocumentsWriterDeleteQueue` are locked, and it is possible for `documentsWriter.getNextSequenceNumber()` to be called after `documentsWriter.deleteQueue.advanceQueue` but before `documentsWriter.resetDeleteQueue`. This commit moves the `documentsWriter.deleteQueue.advanceQueue` call into the already synchronized `documentsWriter.resetDeleteQueue` to ensure that `getNextSequenceNumber` cannot be called while the queue is being reset. This is admittedly difficult to fully test. But I have seen many failures in continuous testing. So, if this commit does fix that test, I would expect those to disappear. closes: https://github.com/apache/lucene/issues/13127 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Made the UnifiedHighlighter's hasUnrecognizedQuery function processes FunctionQuery the same way as MatchAllDocsQuery and MatchNoDocsQuery queries for performance reasons. [lucene]
romseygeek commented on code in PR #13165: URL: https://github.com/apache/lucene/pull/13165#discussion_r1518350668 ## lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java: ## @@ -1130,7 +1134,16 @@ public boolean acceptField(String field) { @Override public void visitLeaf(Query query) { if (MultiTermHighlighting.canExtractAutomataFromLeafQuery(query) == false) { - if (!(query instanceof MatchAllDocsQuery || query instanceof MatchNoDocsQuery)) { + boolean no_effect_query = false; Review Comment: This would be simpler as `QUERIES_WITH_NO_HL_EFFECT.contains(query.getClass())` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Make `OneMerge#reorder` preserve blocks. [lucene]
github-actions[bot] commented on PR #13128: URL: https://github.com/apache/lucene/pull/13128#issuecomment-1986591018 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Make BP work on indexes that have blocks. [lucene]
github-actions[bot] commented on PR #13125: URL: https://github.com/apache/lucene/pull/13125#issuecomment-1986591048 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Improve set deletions percentage javadoc [lucene]
github-actions[bot] commented on PR #12828: URL: https://github.com/apache/lucene/pull/12828#issuecomment-1986591236 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Add maximum clause count check to IndexSearcher rather than BooleanQuery [LUCENE-8811] [lucene]
dsmiley commented on issue #9855: URL: https://github.com/apache/lucene/issues/9855#issuecomment-1986742857 I don't think I follow the rationale for why this was done. I believe the value of the limit is either (a) end-users abusing the system or (b) developer users not recognizing they should use TermInSetQuery. Both of those are probably best addressed in a query parser as this is their touch-point with search. This is where Solr applies such a limit, BTW. Do we really need a totally comprehensive max number of clauses? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Made DocIdsWriter use DISI when reading documents with an IntersectVisitor [lucene]
antonha commented on code in PR #13149: URL: https://github.com/apache/lucene/pull/13149#discussion_r1518509612 ## lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java: ## @@ -36,6 +36,66 @@ final class DocIdsWriter { private final int[] scratch; + private final ScratchDocIdSetIterator scratchDocIdSetIterator = new ScratchDocIdSetIterator(); + + /** + * DocIdSetIterator to be used to iterate over the scratch buffer. A single instance is reused to + * avoid re-allocating the object. The reset method should be called before each use with the + * count. + * + * The main reason for existing is to be able to call the {@link + * IntersectVisitor#visit(DocIdSetIterator)} method rather than the {@link + * IntersectVisitor#visit(int)} method. This seems to make a difference in performance, probably + * due to fewer virtual calls then happening (once per read call rather than once per doc). + */ + private class ScratchDocIdSetIterator extends DocIdSetIterator { + +private int index = -1; +private int count = -1; + +@Override +public int docID() { + if (index < 0) { +return -1; + } + if (index >= count) { +return NO_MORE_DOCS; + } + return scratch[index]; +} + +@Override +public int nextDoc() throws IOException { + index++; + if (index >= count) { +return NO_MORE_DOCS; + } + return scratch[index]; +} + +@Override +public int advance(int target) throws IOException { + while (index < count && scratch[index] < target) { Review Comment: > Yeah indeed DISI may not be right, but let's not try to fix that here/now? Agreed, but I'm tempted to experiment with it in another PR :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Made DocIdsWriter use DISI when reading documents with an IntersectVisitor [lucene]
antonha commented on PR #13149: URL: https://github.com/apache/lucene/pull/13149#issuecomment-1986780683 Thanks for the approval! I do however think that it would be good to prove this performance improvement in luceneutil before merging, to make the benchmark more easily reproducible than the benchmarking I have done. I have started looking into it, but will have more time for it next week. Does that sound reasonable? The only downside I see is a later merge of this PR. > @antonha could you add a `CHANGES.txt` entry explaining the optimization? Thanks! Yes, will do once I have run some luceneutil benchmarks (if we want to wait for that) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org