Re: [I] Add a timeout for forceMergeDeletes in IndexWriter [lucene]
msokolov commented on issue #14431: URL: https://github.com/apache/lucene/issues/14431#issuecomment-2781449497

We're operating in a setup with an initial phase that builds an index while it is offline, not accepting query traffic. Once that is complete, we enable the index to take queries, and it continues to receive updates. We would like the initial phase to proceed with minimal merge activity, then run forceMergeDeletes, and then bring the index online. If forceMergeDeletes fails to complete in a timely fashion, we need to go ahead and bring the index online regardless. Agreed that no deletions will have been merged away at that point, but they will be eventually, and we would rather suffer with an index having a high delete ratio for some time than sacrifice the bounded time window (which we can usually live within). Basically, the timeout will give us a bound on the worst-case merge time.

I think that, in general, whenever one provides a blocking API it is good practice to offer a version with a timeout, so that callers have some control over whether to terminate, keep waiting, or perhaps allow the operation to continue running in the background.
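IndexWriter has no timeout-taking variant today (that is exactly what this issue requests), but for illustration, a caller can approximate the behavior described above by running the merge on a separate thread and bounding the wait. The helper below is a hypothetical sketch, not proposed API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.apache.lucene.index.IndexWriter;

public final class BoundedForceMergeDeletes {

  /**
   * Runs forceMergeDeletes on a background thread and waits up to the given
   * timeout. Returns true if the merge completed within the window; on timeout
   * it returns false and the merge keeps running while the caller proceeds to
   * bring the index online.
   */
  static boolean forceMergeDeletesWithTimeout(IndexWriter writer, long timeout, TimeUnit unit)
      throws Exception {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Future<?> merge =
          executor.submit(
              () -> {
                writer.forceMergeDeletes(); // blocks until deletes are merged away
                return null;
              });
      try {
        merge.get(timeout, unit);
        return true; // completed within the window
      } catch (TimeoutException e) {
        return false; // still merging; caller goes online anyway
      }
    } finally {
      executor.shutdown(); // does not cancel the in-flight merge task
    }
  }
}
```

Note that `IndexWriter.forceMergeDeletes(boolean doWait)` already exists for the fire-and-forget case; the difference in the sketch is waiting a bounded amount of time before giving up.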
Re: [PR] Adding profiling support for concurrent segment search [lucene]
jainankitk commented on code in PR #14413: URL: https://github.com/apache/lucene/pull/14413#discussion_r2030073871

## lucene/sandbox/src/java/org/apache/lucene/sandbox/search/QueryProfilerBreakdown.java:

```diff
@@ -17,46 +17,113 @@
 package org.apache.lucene.sandbox.search;
 
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collection;
 import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
 import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ConcurrentMap;
+import org.apache.lucene.search.Query;
 import org.apache.lucene.util.CollectionUtil;
 
 /**
  * A record of timings for the various operations that may happen during query execution. A node's
  * time may be composed of several internal attributes (rewriting, weighting, scoring, etc).
  */
 class QueryProfilerBreakdown {
-
-  /** The accumulated timings for this query node */
-  private final QueryProfilerTimer[] timers;
+  private static final Collection<QueryProfilerTimingType> QUERY_LEVEL_TIMING_TYPE =
+      Arrays.stream(QueryProfilerTimingType.values()).filter(t -> !t.isSliceLevel()).toList();
+  private final Map queryProfilerTimers;
+  private final ConcurrentMap threadToSliceBreakdown;
```

Review Comment: I have added some changes to make it more explicit that the breakdowns are per thread and not per slice, although the underlying class is still `QuerySliceProfileBreakdown`, as it can be used for measuring a slice-level breakdown:

```
"query": [                           <-- list of root queries
  {
    "type": "TermQuery",
    "description": "foo:bar",
    "startTime": 11972972,
    "totalTime": 354343,
    "queryLevelBreakdowns": {...},   <-- query-level breakdown, e.g. weight count and time
    "threadLevelBreakdowns": [
      {...},                         <-- first thread's information
      {...}                          <-- second thread's information
    ],
    "queryChildren": [
      {...},                         <-- recursive repetition of the above structure
      {...}
    ]
  }
]
```
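For readers following along, the per-thread registration pattern implied by `threadToSliceBreakdown` can be sketched roughly as below. This is an illustrative reconstruction, not the PR's actual code; a plain `long[]` of timing counters stands in for `QuerySliceProfileBreakdown`:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative sketch of per-thread (not per-slice) timing bookkeeping: each
// search thread lazily registers its own breakdown keyed by thread id, so
// concurrent segment search never shares a timer array between threads.
class PerThreadProfileBreakdowns {

  // In the PR the value type is QuerySliceProfileBreakdown; a long[] of
  // per-timing-type counters stands in for it here.
  private final ConcurrentMap<Long, long[]> threadToBreakdown = new ConcurrentHashMap<>();

  /** Returns the calling thread's breakdown, creating it on first use. */
  long[] breakdownForCurrentThread(int numTimingTypes) {
    return threadToBreakdown.computeIfAbsent(
        Thread.currentThread().getId(), id -> new long[numTimingTypes]);
  }
}
```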
[I] Term Query are slower post Lucene 9.12 for fields with IndexOptions.DOCS [lucene]
expani opened a new issue, #14445: URL: https://github.com/apache/lucene/issues/14445

### Description

A simple term query `searcher.search(new TermQuery(new Term(fieldName, fieldValue)), hits)` on a field that is indexed only with `IndexOptions.DOCS` is slower from 9.12.0 onwards.

Root cause:

- The collector sets the minimum competitive score on the DISI after it has gathered the required hits ([TopScoreDocCollector sets it here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java#L173)). It is an indication to the DISI that when the scorer calls `nextDoc` again, it should return a document with a score >= minCompetitiveScore.
- After the collector sets the minCompetitiveScore, ImpactsDISI does a [shallowAdvance via MaxScoreCache during nextDoc](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/MaxScoreCache.java#L58) (it doesn't load the docIds into the buffer; it just moves the file pointers and updates other internal variables in the underlying ImpactsEnum). It tries to move to the block in the postings that will contain the target (docId + 1) with a higher score. The shallowAdvance by `ImpactsSource` works as expected in all versions.
- The issue occurs when we try to [compute the max score at level zero after the shallowAdvance](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/MaxScoreCache.java#L59-L60). From 9.12.0 onwards, the `Impacts` returned are always `DUMMY_IMPACTS` for fields with `IndexOptions.DOCS`. This causes the maxScore calculated by [Similarity$SimScorer](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/MaxScoreCache.java#L72-L79) to always be greater than the minimum competitive score, which leads to all docIds being treated as competitive even though they aren't.

Possible solutions:

- Index the field with `IndexOptions.DOCS_AND_FREQS`, which doesn't return `DUMMY_IMPACTS` but actually reads the impact values (an indexing sketch follows the diff below).
- If we don't care about the score and are using a TermQuery as a pure filter, wrapping the query in a ConstantScoreQuery resolves the issue (a minimal sketch follows the diff below). This works because [ConstantScoreScorer sets its delegate as empty](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/ConstantScoreScorer.java#L124) once the minimum competitive score is set. Thanks @msfroh for suggesting this alternative.
- Another way of resolving the issue is [returning a term frequency of 1](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java#L74) instead of `NO_MORE_DOCS` in the dummy impacts returned by BlockPostingsEnum. Thanks @jpountz for suggesting this alternative. However, we would need to handle cases like [ExactPhraseMatcher, which depends on the frequency of the dummy impacts](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/ExactPhraseMatcher.java#L282-L284).
- I also tried a different approach: not returning `DUMMY_IMPACTS` for fields with `IndexOptions.DOCS`, since we already have the impact information required.
Git diff (truncated in the original message):

```diff
diff --git a/lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java b/lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java
index 1efcdf554dd..0a7c048ee95 100644
--- a/lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java
+++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java
@@ -422,7 +422,7 @@ public final class Lucene101PostingsReader extends PostingsReaderBase {
         Arrays.fill(freqBuffer, 1);
       }
 
-      if (needsFreq && needsImpacts) {
+      if (needsImpacts) {
         level0SerializedImpacts = new BytesRef(maxImpactNumBytesAtLevel0);
         level1SerializedImpacts = new BytesRef(maxImpactNumBytesAtLevel1);
         level0Impacts = new MutableImpactList(maxNumImpactsAtLevel0);
@@ -1107,9 +1107,6 @@ public final class Lucene101PostingsReader extends PostingsReaderBase {
 
           @Override
           public int getDocIdUpTo(int level) {
-            if (indexHasFreq == false) {
-              return NO_MORE_DOCS;
-            }
             if (level == 0) {
               return level0LastDocID;
             }
@@ -1118,14 +1115,12 @@ public final class Lucene101PostingsReader extends PostingsReaderBase {
 
           @Override
           public List<Impact> getImpacts(int level) {
-            if (indexHasFreq) {
             if (level == 0 && level0LastDocID != NO_MORE_DOCS) {
               return readImpacts(level0SerializedImpacts, level0Impacts);
```
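To make the ConstantScoreQuery workaround concrete, here is a minimal sketch; the wrapper class and method are illustrative, and `fieldName`, `fieldValue`, and `hits` are assumed to mean the same as in the description above:

```java
import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

class FilterOnlyTermQuery {
  // Wraps the term query so every match gets the same constant score; once the
  // collector raises the minimum competitive score, ConstantScoreScorer swaps in
  // an empty delegate and stops iterating the remaining, non-competitive docs.
  static TopDocs search(IndexSearcher searcher, String fieldName, String fieldValue, int hits)
      throws IOException {
    Query filterOnly = new ConstantScoreQuery(new TermQuery(new Term(fieldName, fieldValue)));
    return searcher.search(filterOnly, hits);
  }
}
```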
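And a minimal sketch of the `IndexOptions.DOCS_AND_FREQS` workaround at indexing time. The helper name and the keyword-style settings are assumptions; the essential line is the `setIndexOptions` call:

```java
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;

class DocsAndFreqsField {
  // Indexing with DOCS_AND_FREQS makes the postings writer record real
  // per-block impacts instead of DUMMY_IMPACTS, at the cost of storing freqs.
  static Field newKeywordField(String name, String value) {
    FieldType type = new FieldType();
    type.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
    type.setTokenized(false); // keyword-style: index the value as a single token
    type.setOmitNorms(true);  // optional: skip norms if length normalization isn't wanted
    type.freeze();
    return new Field(name, value, type);
  }
}
```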