[jira] [Commented] (LUCENE-10299) investigate prefix/wildcard perf drop in nightly benchmarks

Ignacio Vera (Jira) Thu, 09 Dec 2021 00:08:06 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-10299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456221#comment-17456221
 ]


Ignacio Vera commented on LUCENE-10299:
---------------------------------------

Yikes, thanks for catching this up! Before reverting let me expose the reasons 
I think this change is correct:

> I will admit, I'm a bit confused why we made this change since lucene docids 
> can only be {{{}int{}}}.

I think it actually makes sense and it is consistent with other parts of the 
code. Here are my two reasons:

1)The cost() of a DocIdSetIterator is a long. I guess we accept 
DocIdsSetIterator that contains duplicated values and therefore they can hold 
more than Integer,MAX_VALUE values?
{code:java}
/**
 * Returns the estimated cost of this {@link DocIdSetIterator}.
 *
 * <p>This is generally an upper bound of the number of documents this iterator 
might match, but
 * may be a rough heuristic, hardcoded value, or otherwise completely 
inaccurate.
 */
public abstract long cost(); {code}
2) DocIdSetBuilder expects that #grow() can be called more than 
Integer.MAX_VALUE. We have an internal counter which is a long which is used to 
compute the final cost() of the produced iterator as we can be adding 
duplicated values. Note that the previous implementation is not considering 
cost bigger than Integer,.MAX_VALUE which looks wrong:
{code:java}
int cost = (int) Math.min(Integer.MAX_VALUE, iter.cost()); {code}
All in all I think the change make sense. If the performance regression is 
doing to loop using a long we can check the cost of the iterator and if it can 
fit in an int the we can loop using an int, otherwise we loop using along?

 

> investigate prefix/wildcard perf drop in nightly benchmarks
> -----------------------------------------------------------
>
>                 Key: LUCENE-10299
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10299
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Robert Muir
>            Priority: Major
>
> Recently the prefix/wildcard dropped. As these are super simple and not 
> impacted by cleanups being done around RegExp, I think instead the 
> perf-difference is in the guts of MultiTermQuery where it uses 
> DocIdSetBuilder?
> *note that I haven't confirmed this and it is just a suspicion*
> So I think it may be LUCENE-10289 changes? e.g. doing loops with {{long}} 
> instead of {{int}} like before, we know these are slower in java.
> I will admit, I'm a bit confused why we made this change since lucene docids 
> can only be {{int}}.
> Maybe we get the performance back for free, with JDK18/19 which are 
> optimizing loops on {{long}} better? So I'm not arguing that we burn a bunch 
> of time to fix this, but just opening the issue.
> cc [~ivera]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-10299) investigate prefix/wildcard perf drop in nightly benchmarks

Reply via email to