[ https://issues.apache.org/jira/browse/SOLR-13890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007449#comment-17007449 ]

Jason Gerlowski commented on SOLR-13890:
----------------------------------------

bq. If other lower cost queries are in play then TPI matches() won't be called if the document can be excluded by them.
Did some quick testing on this, and it looks like you're right (and my initial 
reading was wrong).  At least, with a caveat.  The behavior is as you described 
when run with {{cache=false}}.  When the query is run with {{cache=true}} (or 
unspecified) and the query isn't cached yet, {{matches}} _is_ called for every 
doc in the index so that the filter can be cached.  
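
For reference, here's roughly the kind of per-segment TPI check we're talking about.  This is only a sketch based on my reading of the Lucene API (the class name, ordinal bitset, and cost value are made up), not the actual {{DocValuesTermsQuery}} code:

{code:java}
import java.io.IOException;

import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.search.TwoPhaseIterator;
import org.apache.lucene.util.LongBitSet;

/**
 * Sketch of the per-segment two-phase check being discussed (not the actual
 * DocValuesTermsQuery code).  The approximation visits every doc that has a
 * value for the field; matches() does the actual ordinal check.
 */
class DocValuesTermsTwoPhase extends TwoPhaseIterator {
  private final SortedSetDocValues values;
  private final LongBitSet acceptedOrds; // per-segment ords of the query terms

  DocValuesTermsTwoPhase(SortedSetDocValues values, LongBitSet acceptedOrds) {
    super(values); // SortedSetDocValues is itself a DocIdSetIterator
    this.values = values;
    this.acceptedOrds = acceptedOrds;
  }

  @Override
  public boolean matches() throws IOException {
    // Called lazily: run un-cached next to cheaper clauses, this is only
    // reached for docs those clauses haven't already excluded.  When Solr
    // computes a cached filter, every doc with a value ends up here.
    for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
      if (acceptedOrds.get(ord)) {
        return true;
      }
    }
    return false;
  }

  @Override
  public float matchCost() {
    return 3; // rough per-doc cost estimate used to order two-phase checks
  }
}
{code}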

Now to the interesting bit.  I wrote a JUnit driver to perf-compare DVTQ's 
existing TPI implementation and the proposed postfilter implementation.  For 
details on how the perf test was set up, see the latest patch.  The driver 
indexes data and submits progressively larger {{terms}} queries, measuring the 
performance of both approaches.
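
To give a sense of the shape of the driver, it does something along these lines (illustrative only, the real setup is in the patch; the collection URL, field name, term values, and the cache=false/cost>=100 opt-in convention are assumptions on my part):

{code:java}
import java.util.StringJoiner;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TermsQueryPerfDriver {
  public static void main(String[] args) throws Exception {
    // Assumes a pre-populated "perf" collection with a docValues string field "id_s"
    try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/perf").build()) {
      for (int numTerms = 100; numTerms <= 5000; numTerms += 100) {
        StringJoiner terms = new StringJoiner(",");
        for (int i = 0; i < numTerms; i++) {
          terms.add("value" + i);
        }

        // TPI path: docValuesTermsFilter run as an ordinary (uncached) filter query
        SolrQuery tpi = new SolrQuery("*:*");
        tpi.addFilterQuery("{!terms f=id_s method=docValuesTermsFilter cache=false}" + terms);

        // Postfilter path: cache=false plus cost >= 100 opts in to post-filtering
        SolrQuery post = new SolrQuery("*:*");
        post.addFilterQuery("{!terms f=id_s method=docValuesTermsFilter cache=false cost=101}" + terms);

        QueryResponse tpiRsp = client.query(tpi);
        QueryResponse postRsp = client.query(post);
        System.out.println(numTerms + "," + tpiRsp.getQTime() + "," + postRsp.getQTime());
      }
    }
  }
}
{code}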

The graphs below show the results of a few of these runs, with numTerms 
increasing left to right and QTime measured on the Y axis.  DVTQ's per-segment 
TPI implementation is shown in blue, postfilter performance is in red.

In a "normal" run the TPI implementation starts out more performant for small 
"numTerms" values, but its QTime increases linearly as the size of the "terms" 
query increases.  At around 700 terms the postfilter implementation takes and 
keeps the lead.

 !Screen Shot 2020-01-02 at 2.25.12 PM.png! 

But after an optimize, with the whole index in one segment, this difference 
disappears, confirming what David suspected.  So if we wanted to replace the 
post-filter implementation with a top-level TPI implementation, we'd see the 
same performance.  

 !post_optimize_performance.png! 

The question remains whether we should stick with the postfilter or switch to 
TPI here.  Both approaches really suffer performance-wise when the query 
matches a large number of documents.  That makes me wonder whether we might be 
better off with the "postfilter", which has a very explicit switch that users 
can control (unlike DVTQ, which seems to always operate using TPI).  But 
that's just a thought.  I was a little surprised to see the post-optimize perf 
is exactly the same, so I'm a bit flatfooted about how to proceed.

> Add postfilter support to {!terms} queries
> ------------------------------------------
>
>                 Key: SOLR-13890
>                 URL: https://issues.apache.org/jira/browse/SOLR-13890
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: query parsers
>    Affects Versions: master (9.0)
>            Reporter: Jason Gerlowski
>            Assignee: Jason Gerlowski
>            Priority: Major
>         Attachments: SOLR-13890.patch, SOLR-13890.patch, SOLR-13890.patch, 
> SOLR-13890.patch, Screen Shot 2020-01-02 at 2.25.12 PM.png, 
> post_optimize_performance.png
>
>
> There are some use-cases where it'd be nice if the "terms" qparser created a 
> query that could be run as a postfilter.  Particularly, when users are 
> checking for hundreds or thousands of terms, a postfilter implementation can 
> be more performant than the standard processing.
> With this issue, I'd like to propose a post-filter implementation for the 
> {{docValuesTermsFilter}} "method".  Postfilter creation can use a 
> SortedSetDocValues object to populate a DV bitset with the "terms" being 
> checked for.  Each document run through the post-filter can look at its 
> doc-values for the field in question and check them efficiently against the 
> constructed bitset.
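
For illustration, a minimal sketch of the collector described above (not the code from the attached patches; the class and field names are hypothetical, and the real implementation may resolve terms to ordinals differently):

{code:java}
import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.LongBitSet;
import org.apache.solr.search.DelegatingCollector;

/** Illustrative sketch of the described postfilter collector; names are hypothetical. */
class TermsDocValuesPostFilterCollector extends DelegatingCollector {
  private final String field;
  private final List<BytesRef> terms;        // the "terms" being checked for
  private SortedSetDocValues segmentValues;  // per-segment doc values
  private LongBitSet acceptedOrds;           // per-segment ords of the query terms

  TermsDocValuesPostFilterCollector(String field, List<BytesRef> terms) {
    this.field = field;
    this.terms = terms;
  }

  @Override
  public void doSetNextReader(LeafReaderContext context) throws IOException {
    super.doSetNextReader(context);
    segmentValues = DocValues.getSortedSet(context.reader(), field);
    acceptedOrds = new LongBitSet(segmentValues.getValueCount());
    for (BytesRef term : terms) {
      long ord = segmentValues.lookupTerm(term);
      if (ord >= 0) {
        acceptedOrds.set(ord);
      }
    }
  }

  @Override
  public void collect(int doc) throws IOException {
    if (segmentValues.advanceExact(doc)) {
      for (long ord = segmentValues.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = segmentValues.nextOrd()) {
        if (acceptedOrds.get(ord)) {
          super.collect(doc); // pass the doc on to the wrapped collector
          return;
        }
      }
    }
    // otherwise the doc is filtered out
  }
}
{code}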


