[ 
https://issues.apache.org/jira/browse/LUCENE-10299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456386#comment-17456386
 ] 

Ignacio Vera commented on LUCENE-10299:
---------------------------------------

I am concerned about the API as well. Let's try to think about the motivation.

We have the PointValues API that uses an IntersectVisitor to navigate the tree. 
The intersect visitor has a grow method that is just a way to tell the 
DocIdSetBuilder how many docs we are about to add.
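
For context, this is roughly how a query wires the visitor to the builder (a simplified sketch from memory of the pre-LUCENE-10289 API, not the exact code in PointRangeQuery):
{code:java}
// Sketch: a visitor backed by a DocIdSetBuilder; grow() reserves room in the
// builder and returns the BulkAdder that the visit() calls feed.
final DocIdSetBuilder result = new DocIdSetBuilder(reader.maxDoc()); // reader: some LeafReader

PointValues.IntersectVisitor visitor = new PointValues.IntersectVisitor() {
  private DocIdSetBuilder.BulkAdder adder;

  @Override
  public void grow(int count) {
    adder = result.grow(count); // announce how many add(int) calls are coming
  }

  @Override
  public void visit(int docID) {
    adder.add(docID);
  }

  @Override
  public void visit(int docID, byte[] packedValue) {
    // a real visitor would first check packedValue against the query here
    adder.add(docID);
  }

  @Override
  public PointValues.Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
    return PointValues.Relation.CELL_CROSSES_QUERY;
  }
};
{code}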

We don't know how many documents are below a tree node, only how many points, 
which is a long because a single document can have many points. When we want to 
call grow we check the number of points below the node, and if it does not fit 
in an int we navigate down the tree, hoping that at some level the number of 
points will fit in an int.
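
A rough sketch of that descent over the PointValues.PointTree cursor (method names as I recall them, just to illustrate the idea):
{code:java}
// Illustrative only: announce the subtree size via grow() when it fits in an
// int, otherwise descend into the children and announce them one by one.
static void addAll(PointValues.PointTree pointTree, PointValues.IntersectVisitor visitor) throws IOException {
  if (pointTree.size() <= Integer.MAX_VALUE) {
    visitor.grow((int) pointTree.size());
    pointTree.visitDocIDs(visitor);
  } else if (pointTree.moveToChild()) {
    do {
      addAll(pointTree, visitor);
    } while (pointTree.moveToSibling());
    pointTree.moveToParent();
  } else {
    // a single leaf with more than Integer.MAX_VALUE points: this is the case
    // where we are stuck, as described below
    visitor.grow(Integer.MAX_VALUE);
    pointTree.visitDocIDs(visitor);
  }
}
{code}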

Here is the real issue: the PointValues API does not limit the number of points 
we can have in a leaf, so theoretically we can have more than Integer.MAX_VALUE 
points and then we are stuck.

I have proposed calling grow with the following logic:
{code:java}
visitor.grow((int) Math.min(Integer.MAX_VALUE, pointTree.size())); {code}
or even better
{code:java}
visitor.grow((int) Math.min(getDocCount(), pointTree.size())); {code}
This is very similar to what we are doing with the iterator above. Still, the 
downside here is that we are using this grow to compute the cost of the 
iterator, so we will underestimate the number of points in these cases?

The "docidset" is expecting that we can grow it by more than Integer.MAX_VALUE 
(we have an internal counter that is a long), so we are saying you can grow it 
with longs but only add with ints? That is weird as well.
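
To make the mismatch concrete, a stripped-down illustration (not the real DocIdSetBuilder, just the shape of the API):
{code:java}
// Toy builder showing the asymmetry: the internal counter is a long and grow()
// accepts a long, yet every doc id that can actually be added is an int.
class ToyDocIdSetBuilder {
  private long counter; // can exceed Integer.MAX_VALUE in aggregate

  void grow(long numDocs) { // callers may announce a long...
    counter += numDocs;
  }

  void add(int doc) { // ...but each doc id they add is an int
    // record the doc id
  }
}
{code}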

 

The grow method is just telling us how many times we are going to call add(int), 
not how many documents we are going to add. Maybe we should update the javadocs 
to reflect that reality?
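
For example, something along these lines (just a possible wording, not a patch):
{code:java}
/**
 * Hint that the visitor is about to be fed approximately {@code count} calls
 * to {@code visit(int)} / {@code add(int)}. Note that this is the number of
 * points, not the number of distinct documents that will be added.
 */
default void grow(int count) {}
{code}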


> investigate prefix/wildcard perf drop in nightly benchmarks
> -----------------------------------------------------------
>
>                 Key: LUCENE-10299
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10299
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Robert Muir
>            Priority: Major
>
> Recently the prefix/wildcard performance dropped. As these are super simple and not 
> impacted by cleanups being done around RegExp, I think instead the 
> perf-difference is in the guts of MultiTermQuery where it uses 
> DocIdSetBuilder?
> *note that I haven't confirmed this and it is just a suspicion*
> So I think it may be LUCENE-10289 changes? e.g. doing loops with {{long}} 
> instead of {{int}} like before, we know these are slower in java.
> I will admit, I'm a bit confused why we made this change since lucene docids 
> can only be {{int}}.
> Maybe we get the performance back for free, with JDK18/19 which are 
> optimizing loops on {{long}} better? So I'm not arguing that we burn a bunch 
> of time to fix this, but just opening the issue.
> cc [~ivera]


