[jira] [Commented] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support

Zach Chen (Jira) Thu, 04 Nov 2021 21:49:05 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17439028#comment-17439028
 ]


Zach Chen commented on LUCENE-10061:
------------------------------------

Hi [~jpountz], I've implemented a quick optimization that replaces the 
combinatorial calculation with an upper-bound approximation 
([commit|https://github.com/apache/lucene/pull/418/commits/2ba435e5c83f870be95662c951c9818111843a59]).

With this and other bug fixes and optimizations guided by CPU profiling, I got 
the following performance test results. (The perf test index was rebuilt to 
enable norms for the title field, the task file is attached, and the 
luceneutil integration is available at 
[https://github.com/mikemccand/luceneutil/pull/148].)
{code:java}
Run 1
           Task  QPS baseline   StdDev  QPS my_modified_version   StdDev                Pct diff  p-value
CFQHighHighHigh          4.64   (6.5%)                     3.30   (4.7%)  -29.0% ( -37% -  -19%)    0.000
    CFQHighHigh         11.09   (6.0%)                     9.61   (6.0%)  -13.3% ( -23% -   -1%)    0.000
       PKLookup        103.38   (4.4%)                   108.04   (4.3%)    4.5% (  -4% -   13%)    0.001
  CFQHighMedLow         10.58   (6.1%)                    12.30   (8.7%)   16.2% (   1% -   33%)    0.000
     CFQHighMed         10.70   (7.4%)                    15.51  (11.2%)   44.9% (  24% -   68%)    0.000
  CFQHighLowLow          8.18   (8.2%)                    12.87  (11.6%)   57.3% (  34% -   84%)    0.000
     CFQHighLow         14.57   (7.5%)                    30.81  (15.1%)  111.4% (  82% -  144%)    0.000

Run 2
           Task  QPS baseline   StdDev  QPS my_modified_version   StdDev                Pct diff  p-value
CFQHighHighHigh          5.33   (5.7%)                     4.02   (7.7%)  -24.4% ( -35% -  -11%)    0.000
  CFQHighLowLow         17.14   (6.2%)                    13.06   (5.4%)  -23.8% ( -33% -  -13%)    0.000
     CFQHighMed         17.37   (5.8%)                    14.38   (7.7%)  -17.2% ( -29% -   -3%)    0.000
       PKLookup        103.57   (5.5%)                   108.84   (5.9%)    5.1% (  -6% -   17%)    0.005
  CFQHighMedLow         11.25   (7.2%)                    12.70   (9.0%)   12.9% (  -3% -   31%)    0.000
    CFQHighHigh          5.00   (6.2%)                     7.54  (12.1%)   51.0% (  30% -   73%)    0.000
     CFQHighLow         21.60   (5.2%)                    34.57  (14.1%)   60.0% (  38% -   83%)    0.000

Run 3
           Task  QPS baseline   StdDev  QPS my_modified_version   StdDev                Pct diff  p-value
CFQHighHighHigh          5.40   (6.9%)                     4.06   (5.1%)  -24.8% ( -34% -  -13%)    0.000
  CFQHighMedLow          7.64   (7.4%)                     5.79   (6.3%)  -24.2% ( -35% -  -11%)    0.000
    CFQHighHigh         11.11   (7.0%)                     9.60   (5.9%)  -13.6% ( -24% -    0%)    0.000
  CFQHighLowLow         21.21   (7.6%)                    21.22   (6.6%)    0.0% ( -13% -   15%)    0.993
       PKLookup        103.15   (5.9%)                   107.60   (6.9%)    4.3% (  -8% -   18%)    0.034
     CFQHighLow         21.85   (8.1%)                    34.18  (13.5%)   56.4% (  32% -   84%)    0.000
     CFQHighMed         12.07   (8.4%)                    19.98  (16.7%)   65.5% (  37% -   98%)    0.000

Run 4
           Task  QPS baseline   StdDev  QPS my_modified_version   StdDev                Pct diff  p-value
    CFQHighHigh          8.50   (5.8%)                     6.85   (5.2%)  -19.5% ( -28% -   -8%)    0.000
  CFQHighMedLow         10.89   (5.7%)                     8.96   (5.4%)  -17.8% ( -27% -   -7%)    0.000
     CFQHighMed          8.41   (5.8%)                     7.74   (5.6%)   -7.9% ( -18% -    3%)    0.000
CFQHighHighHigh          3.45   (6.7%)                     3.38   (5.3%)   -2.0% ( -13% -   10%)    0.287
  CFQHighLowLow          7.82   (6.4%)                     8.20   (7.5%)    4.8% (  -8% -   20%)    0.030
       PKLookup        103.50   (5.0%)                   110.69   (5.4%)    6.9% (  -3% -   18%)    0.000
     CFQHighLow         11.46   (6.0%)                    13.16   (6.7%)   14.8% (   1% -   29%)    0.000
{code}
Overall, I think this shows that pruning is most effective when the terms' 
frequencies differ significantly, but slows things down when they are close, 
because the cost of pruning then outweighs the benefit of skipping. I'm 
wondering whether we should also gate the pruning on the terms' frequencies, 
but from some quick trials that check seems to be an expensive operation. Do 
you have any recommendation for this scenario?
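
To make the gating idea concrete, here is a minimal sketch, assuming "frequency" means each term's document frequency and that those values are already at hand. {{PruningGate}}, {{shouldPrune}}, and the {{minRatio}} threshold are hypothetical names for illustration only; they are not Lucene APIs and are not taken from the patch. The sketch does not address the cost of obtaining the frequencies, which is the concern raised above.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of Lucene or of the patch. */
final class PruningGate {
    private PruningGate() {}

    /**
     * Returns true when the most frequent term occurs in at least
     * minRatio times as many documents as the least frequent term.
     * The threshold is an assumption and would need tuning against benchmarks.
     */
    static boolean shouldPrune(long[] docFreqs, double minRatio) {
        if (docFreqs.length < 2) {
            // Assumption: with fewer than two terms there is no skew to exploit.
            return false;
        }
        long max = Arrays.stream(docFreqs).max().getAsLong();
        long min = Arrays.stream(docFreqs).min().getAsLong();
        if (min <= 0) {
            // Assumption: a term matching no documents counts as maximally skewed.
            return max > 0;
        }
        return (double) max / min >= minRatio;
    }

    public static void main(String[] args) {
        // High/Low style query: frequencies differ by a factor of 1000.
        System.out.println(shouldPrune(new long[] {1_000_000L, 1_000L}, 8.0)); // true
        // High/High style query: frequencies are close.
        System.out.println(shouldPrune(new long[] {1_000_000L, 900_000L}, 8.0)); // false
    }
}
```

A ratio test is only one possible gate. Whether it pays off depends on how cheaply the per-term frequencies can be read at query time.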

> CombinedFieldsQuery needs dynamic pruning support
> -------------------------------------------------
>
>                 Key: LUCENE-10061
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10061
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: CombinedFieldQueryTasks.wikimedium.10M.nostopwords.tasks
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> CombinedFieldQuery's Scorer doesn't implement advanceShallow/getMaxScore, 
> forcing Lucene to collect all matches in order to figure the top-k hits.



