[ https://issues.apache.org/jira/browse/SOLR-13890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007449#comment-17007449 ]
Jason Gerlowski commented on SOLR-13890:
----------------------------------------

bq. If other lower cost queries are in play then TPI matches() won't be called if the document can be excluded by them.

Did some quick testing on this, and it looks like you're right (and my initial reading was wrong), with one caveat. The behavior is as you described when the query is run with {{cache=false}}. When the query is run with {{cache=true}} (or with {{cache}} unspecified) and the query isn't cached yet, {{matches}} _is_ called for every doc in the index so that the filter can be cached.

Now to the interesting bit. I wrote a JUnit driver to perf-compare DVTQ's existing TPI implementation and the proposed postfilter implementation. For details on how the perf test was set up, see the latest patch. The driver indexes data and submits increasingly large {{terms}} queries, measuring the performance of both approaches. The graphs below show the results of a few of these runs, with numTerms increasing left to right on the X axis and QTime measured on the Y axis. DVTQ's per-segment TPI implementation is shown in blue; postfilter performance is in red.

In a "normal" run, the TPI implementation starts out more performant for small "numTerms" values, but its QTime increases linearly as the size of the "terms" query increases. At around 700 terms the postfilter implementation takes the lead and keeps it.

!Screen Shot 2020-01-02 at 2.25.12 PM.png!

But after an optimize, with the whole index in one segment, this difference disappears, confirming what David suspected. So if we wanted to replace the postfilter implementation with a top-level TPI implementation, we'd see the same performance.

!post_optimize_performance.png!

The question remains, though, whether we should stick with postfilter or switch to TPI here. Both approaches really suffer performance-wise when the query matches a large number of documents.
That makes me wonder whether we might be better off with "postfilter", which has a very explicit switch that users can control (unlike DVTQ, which seems to always operate using TPI). But that's just a thought. I was a little surprised to see that the post-optimize perf is exactly the same, so I'm a bit flat-footed on how to proceed.

> Add postfilter support to {!terms} queries
> ------------------------------------------
>
>                 Key: SOLR-13890
>                 URL: https://issues.apache.org/jira/browse/SOLR-13890
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: query parsers
>    Affects Versions: master (9.0)
>            Reporter: Jason Gerlowski
>            Assignee: Jason Gerlowski
>            Priority: Major
>         Attachments: SOLR-13890.patch, SOLR-13890.patch, SOLR-13890.patch,
> SOLR-13890.patch, Screen Shot 2020-01-02 at 2.25.12 PM.png,
> post_optimize_performance.png
>
>
> There are some use-cases where it'd be nice if the "terms" qparser created a
> query that could be run as a postfilter. Particularly, when users are
> checking for hundreds or thousands of terms, a postfilter implementation can
> be more performant than the standard processing.
> With this issue, I'd like to propose a postfilter implementation for the
> {{docValuesTermsFilter}} "method". Postfilter creation can use a
> SortedSetDocValues object to populate a DV bitset with the "terms" being
> checked for. Each document run through the postfilter can look at its
> doc-values for the field in question and check them efficiently against the
> constructed bitset.
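
For readers following along, the bitset check outlined in the issue description can be modeled roughly as below. This is only a minimal sketch and not the code from the attached patch: it uses plain Java arrays and java.util.BitSet as a stand-in for Lucene's SortedSetDocValues, and the names DocValuesBitsetSketch, buildTermBitset and matches are hypothetical, chosen for illustration.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Simplified stand-in for a field's doc-values: a sorted term dictionary plus,
// for each document, the ordinals of the terms that document contains.
// ASSUMPTION: this models the idea only; it is not Lucene's SortedSetDocValues API.
final class DocValuesBitsetSketch {
    private final String[] sortedTerms; // ordinal -> term, sorted ascending
    private final int[][] docOrds;      // docId -> ordinals of that doc's values

    DocValuesBitsetSketch(String[] sortedTerms, int[][] docOrds) {
        this.sortedTerms = sortedTerms;
        this.docOrds = docOrds;
    }

    // Done once per query: resolve each query term to its ordinal and set
    // that bit. Terms missing from the dictionary are skipped.
    BitSet buildTermBitset(List<String> queryTerms) {
        BitSet bits = new BitSet(sortedTerms.length);
        for (String term : queryTerms) {
            int ord = Arrays.binarySearch(sortedTerms, term);
            if (ord >= 0) {
                bits.set(ord);
            }
        }
        return bits;
    }

    // Done once per document that reaches the filter: the document matches
    // if any of its ordinals is set in the query bitset.
    boolean matches(int docId, BitSet termBits) {
        for (int ord : docOrds[docId]) {
            if (termBits.get(ord)) {
                return true;
            }
        }
        return false;
    }
}
```

In this model the per-document cost depends on how many values the document has for the field rather than on the number of query terms, since the query terms are resolved to ordinals once up front.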