Re: [I] Add a timeout for forceMergeDeletes in IndexWriter [lucene]

2025-04-06 Thread via GitHub


msokolov commented on issue #14431:
URL: https://github.com/apache/lucene/issues/14431#issuecomment-2781449497

   We're operating in a setup with an initial phase that builds an index while 
it is offline, not accepting query traffic. Once that phase is complete, we 
enable the index to take queries, and it continues to receive updates. We would 
like the initial phase to proceed with minimal merge activity, then run 
forceMergeDeletes, and then bring the index online. If forceMergeDeletes fails 
to complete in a timely fashion, we need to go ahead and bring the index online 
regardless. Agreed that no deletions will have been merged away at that point, 
but they will be eventually, and we would rather suffer an index with a high 
delete ratio for some time than sacrifice the bounded startup window (which we 
can usually live within). Basically, the timeout gives us a bound on the 
worst-case merge time. In general, whenever one provides a blocking API, it is 
good practice to offer a version with a timeout, so that callers have some 
control over whether to terminate, keep waiting, or perhaps let the operation 
continue running in the background.
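
   For illustration, the calling pattern we have in mind could look roughly 
like the sketch below. This is a hypothetical sketch only: IndexWriter does not 
currently expose a timeout-accepting overload, so here the bound is emulated 
from the outside by joining a merge thread with a deadline.

   ```java
   import java.util.concurrent.TimeUnit;
   import org.apache.lucene.index.IndexWriter;

   public final class BoundedForceMergeDeletes {
     public static void run(IndexWriter writer, long timeout, TimeUnit unit)
         throws InterruptedException {
       Thread merger = new Thread(() -> {
         try {
           writer.forceMergeDeletes(); // today's blocking call
         } catch (java.io.IOException e) {
           throw new java.io.UncheckedIOException(e);
         }
       }, "force-merge-deletes");
       merger.start();
       merger.join(unit.toMillis(timeout)); // bound the worst-case wait
       if (merger.isAlive()) {
         // Timeout elapsed: stop waiting and bring the index online; the merge
         // keeps running in the background and deletes are reclaimed eventually.
       }
     }
   }
   ```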





Re: [PR] Adding profiling support for concurrent segment search [lucene]

2025-04-06 Thread via GitHub


jainankitk commented on code in PR #14413:
URL: https://github.com/apache/lucene/pull/14413#discussion_r2030073871


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/QueryProfilerBreakdown.java:
##
@@ -17,46 +17,113 @@
 
 package org.apache.lucene.sandbox.search;
 
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collection;
 import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
 import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ConcurrentMap;
+import org.apache.lucene.search.Query;
 import org.apache.lucene.util.CollectionUtil;
 
 /**
  * A record of timings for the various operations that may happen during query execution. A node's
  * time may be composed of several internal attributes (rewriting, weighting, scoring, etc).
  */
 class QueryProfilerBreakdown {
-
-  /** The accumulated timings for this query node */
-  private final QueryProfilerTimer[] timers;
+  private static final Collection<QueryProfilerTimingType> QUERY_LEVEL_TIMING_TYPE =
+      Arrays.stream(QueryProfilerTimingType.values()).filter(t -> !t.isSliceLevel()).toList();
+  private final Map<QueryProfilerTimingType, QueryProfilerTimer> queryProfilerTimers;
+  private final ConcurrentMap<Long, QuerySliceProfileBreakdown> threadToSliceBreakdown;

Review Comment:
   I have added some changes to make it more explicit that the breakdowns are 
per thread and not per slice. The underlying class is still 
`QuerySliceProfileBreakdown`, though, as it can also be used for measuring 
slice-level breakdowns:
   
   ```
   "query": [                          <-- list of root queries
     {
       "type": "TermQuery",
       "description": "foo:bar",
       "startTime": 11972972,
       "totalTime": 354343,
       "queryLevelBreakdowns": {...},  <-- query-level breakdown, e.g. weight count and time
       "threadLevelBreakdowns": [
         {...},                        <-- first thread's information
         {...}],                       <-- second thread's information
       "queryChildren": [
         {...},                        <-- recursive repetition of the structure above
         {...}]
   ```
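
   A minimal sketch of the per-thread bookkeeping implied by that field (the 
method body and exact types here are assumptions, not necessarily the PR's 
actual code):

   ```java
   // One breakdown per executing thread, keyed by thread id and created lazily.
   private final ConcurrentMap<Long, QuerySliceProfileBreakdown> threadToSliceBreakdown =
       new ConcurrentHashMap<>();

   QuerySliceProfileBreakdown breakdownForCurrentThread() {
     // computeIfAbsent makes concurrent first use from multiple threads safe.
     return threadToSliceBreakdown.computeIfAbsent(
         Thread.currentThread().getId(), id -> new QuerySliceProfileBreakdown());
   }
   ```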






[I] Term Queries are slower post Lucene 9.12 for fields with IndexOptions.DOCS [lucene]

2025-04-06 Thread via GitHub


expani opened a new issue, #14445:
URL: https://github.com/apache/lucene/issues/14445

   ### Description
   
   
   A simple term query `searcher.search(new TermQuery(new Term(fieldName, 
fieldValue)), hits)` on a field that is indexed only with `IndexOptions.DOCS` 
is slower from 9.12.0 onwards.
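
   A minimal repro sketch (the field name and value are illustrative):

   ```java
   import java.io.IOException;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.Field;
   import org.apache.lucene.document.FieldType;
   import org.apache.lucene.index.IndexOptions;
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.index.Term;
   import org.apache.lucene.search.IndexSearcher;
   import org.apache.lucene.search.TermQuery;
   import org.apache.lucene.search.TopDocs;

   class DocsOnlyRepro {
     // Index the field with docs only: no frequencies, hence no real impacts.
     static void addDoc(IndexWriter writer, String value) throws IOException {
       FieldType ft = new FieldType();
       ft.setIndexOptions(IndexOptions.DOCS);
       ft.setTokenized(false);
       ft.freeze();
       Document doc = new Document();
       doc.add(new Field("status", value, ft));
       writer.addDocument(doc);
     }

     // The query shape that got slower from 9.12.0 onwards.
     static TopDocs query(IndexSearcher searcher) throws IOException {
       return searcher.search(new TermQuery(new Term("status", "active")), 10);
     }
   }
   ```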
   
   Root cause:
   
   - The collector sets the minimum competitive score on the DISI after it has 
gathered the required number of hits ([TopScoreDocCollector sets it 
here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java#L173)). 
This signals to the DISI that subsequent calls to `nextDoc` should only return 
documents with a score >= minCompetitiveScore.
   
   - Once the collector has set the minCompetitiveScore, ImpactsDISI does a 
[shallowAdvance via MaxScoreCache during 
nextDoc](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/MaxScoreCache.java#L58) 
(this doesn't load docIds into the buffer; it just moves the file pointers and 
updates other internal state in the underlying ImpactsEnum). It tries to move 
to the postings block that will contain the target (docId + 1) with a high 
enough score. This shallowAdvance by `ImpactsSource` works as expected in all 
versions.
   
   - The issue occurs when we [compute the max score at level zero after the 
shallowAdvance](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/MaxScoreCache.java#L59-L60). 
From 9.12.0 onwards, the `Impact` returned for fields with `IndexOptions.DOCS` 
is always `DUMMY_IMPACTS`, which causes the maxScore calculated by 
[Similarity$SimScorer](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/MaxScoreCache.java#L72-L79) 
to always be greater than the minimum competitive score. As a result, all 
docIds are treated as competitive even though they aren't (sketched below).
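
   In simplified form, the competitive check that breaks down looks like this 
(an illustrative sketch, not Lucene's actual internals; `impacts`, `simScorer`, 
and `minCompetitiveScore` stand in for the real fields):

   ```java
   // For each impact entry (freq, norm) recorded for the current block, compute
   // an upper bound on the score any document in the block can reach.
   float maxScore = Float.NEGATIVE_INFINITY;
   for (Impact impact : impacts.getImpacts(0)) { // level-zero impacts
     maxScore = Math.max(maxScore, simScorer.score(impact.freq, impact.norm));
   }
   if (maxScore < minCompetitiveScore) {
     // No doc in this block can be competitive: skip the whole block.
   }
   // With DUMMY_IMPACTS the recorded freq is effectively unbounded, so maxScore
   // always exceeds minCompetitiveScore and no block is ever skipped.
   ```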
   
   Possible Solutions:
   
   - Index the field with `IndexOptions.DOCS_AND_FREQS`, as this doesn't return 
`DUMMY_IMPACTS` but actually reads the impact values (see the sketch after this 
list).
   
   - If we don't care about the score and are using a TermQuery as a pure 
filter, wrapping the query in a ConstantScoreQuery resolves the issue (also 
shown in the sketch after this list). This works because [ConstantScoreScorer 
sets its delegate to 
empty](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/ConstantScoreScorer.java#L124) 
once the minimum competitive score is set. Thanks @msfroh for suggesting this 
alternative.
   
   - Another way of resolving the issue is [returning a term frequency of 
1](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java#L74) 
instead of `NO_MORE_DOCS` in the DummyImpacts returned by BlockPostingsEnum. 
Thanks @jpountz for suggesting this alternative. However, we would need to 
handle cases like [ExactPhraseMatcher, which depends on the frequency of dummy 
impacts](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/ExactPhraseMatcher.java#L282-L284).
   
   - I also tried a different approach: not returning `DUMMY_IMPACTS` for 
fields with `IndexOptions.DOCS`, since we already have the impact information 
required (see the git diff below).
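
   For reference, the first two workarounds look like this in code (a sketch; 
the field name and value are illustrative). The git diff that follows belongs 
to the last approach:

   ```java
   import org.apache.lucene.document.FieldType;
   import org.apache.lucene.index.IndexOptions;
   import org.apache.lucene.index.Term;
   import org.apache.lucene.search.ConstantScoreQuery;
   import org.apache.lucene.search.Query;
   import org.apache.lucene.search.TermQuery;

   // Workaround 1: index with frequencies so real (non-dummy) impacts are written.
   FieldType ft = new FieldType();
   ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
   ft.freeze();

   // Workaround 2: when the term is a pure filter, wrap it so every hit scores
   // the same; the scorer then swaps in an empty delegate once the minimum
   // competitive score is set, instead of scanning the remaining postings.
   Query filterOnly = new ConstantScoreQuery(new TermQuery(new Term("status", "active")));
   ```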
   
   
   
   Expand for Git Diff
   
   ```
   diff --git a/lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java b/lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java
   index 1efcdf554dd..0a7c048ee95 100644
   --- a/lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java
   +++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java
   @@ -422,7 +422,7 @@ public final class Lucene101PostingsReader extends PostingsReaderBase {
            Arrays.fill(freqBuffer, 1);
          }
   
   -      if (needsFreq && needsImpacts) {
   +      if (needsImpacts) {
            level0SerializedImpacts = new BytesRef(maxImpactNumBytesAtLevel0);
            level1SerializedImpacts = new BytesRef(maxImpactNumBytesAtLevel1);
            level0Impacts = new MutableImpactList(maxNumImpactsAtLevel0);
   @@ -1107,9 +1107,6 @@ public final class Lucene101PostingsReader extends PostingsReaderBase {
   
        @Override
        public int getDocIdUpTo(int level) {
   -      if (indexHasFreq == false) {
   -        return NO_MORE_DOCS;
   -      }
          if (level == 0) {
            return level0LastDocID;
          }
   @@ -1118,14 +1115,12 @@ public final class Lucene101PostingsReader extends PostingsReaderBase {
   
        @Override
        public List<Impact> getImpacts(int level) {
   -      if (indexHasFreq) {
          if (level == 0 && level0LastDocID != NO_MORE_DOCS) {
            return readImpacts(level0SerializedImpacts, level0Impacts);