[ https://issues.apache.org/jira/browse/LUCENE-10299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456221#comment-17456221 ]
Ignacio Vera commented on LUCENE-10299: --------------------------------------- Yikes, thanks for catching this up! Before reverting let me expose the reasons I think this change is correct: > I will admit, I'm a bit confused why we made this change since lucene docids > can only be {{{}int{}}}. I think it actually makes sense and it is consistent with other parts of the code. Here are my two reasons: 1)The cost() of a DocIdSetIterator is a long. I guess we accept DocIdsSetIterator that contains duplicated values and therefore they can hold more than Integer,MAX_VALUE values? {code:java} /** * Returns the estimated cost of this {@link DocIdSetIterator}. * * <p>This is generally an upper bound of the number of documents this iterator might match, but * may be a rough heuristic, hardcoded value, or otherwise completely inaccurate. */ public abstract long cost(); {code} 2) DocIdSetBuilder expects that #grow() can be called more than Integer.MAX_VALUE. We have an internal counter which is a long which is used to compute the final cost() of the produced iterator as we can be adding duplicated values. Note that the previous implementation is not considering cost bigger than Integer,.MAX_VALUE which looks wrong: {code:java} int cost = (int) Math.min(Integer.MAX_VALUE, iter.cost()); {code} All in all I think the change make sense. If the performance regression is doing to loop using a long we can check the cost of the iterator and if it can fit in an int the we can loop using an int, otherwise we loop using along? > investigate prefix/wildcard perf drop in nightly benchmarks > ----------------------------------------------------------- > > Key: LUCENE-10299 > URL: https://issues.apache.org/jira/browse/LUCENE-10299 > Project: Lucene - Core > Issue Type: Task > Reporter: Robert Muir > Priority: Major > > Recently the prefix/wildcard dropped. As these are super simple and not > impacted by cleanups being done around RegExp, I think instead the > perf-difference is in the guts of MultiTermQuery where it uses > DocIdSetBuilder? > *note that I haven't confirmed this and it is just a suspicion* > So I think it may be LUCENE-10289 changes? e.g. doing loops with {{long}} > instead of {{int}} like before, we know these are slower in java. > I will admit, I'm a bit confused why we made this change since lucene docids > can only be {{int}}. > Maybe we get the performance back for free, with JDK18/19 which are > optimizing loops on {{long}} better? So I'm not arguing that we burn a bunch > of time to fix this, but just opening the issue. > cc [~ivera] -- This message was sent by Atlassian Jira (v8.20.1#820001) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org