[ https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518473#comment-17518473 ]
Feng Guo commented on LUCENE-10315:
-----------------------------------

Here is the benchmark result I got on my machine with [https://github.com/iverase/benchmark_forutil]:

{code:java}
Benchmark                                            Mode  Cnt   Score   Error  Units
ReadInts24Benchmark.readInts24ForUtil               thrpt   25   9.086 ± 0.089  ops/us
ReadInts24Benchmark.readInts24ForUtilVisitor        thrpt   25   0.764 ± 0.005  ops/us
ReadInts24Benchmark.readInts24Legacy                thrpt   25   2.877 ± 0.013  ops/us
ReadInts24Benchmark.readInts24Visitor               thrpt   25   0.778 ± 0.006  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLong1         thrpt   25   3.329 ± 0.023  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLong2         thrpt   25   3.218 ± 0.037  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLong3         thrpt   25   3.755 ± 0.017  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLong4         thrpt   25   3.862 ± 0.025  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLongVisitor1  thrpt   25   0.710 ± 0.008  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLongVisitor2  thrpt   25   0.849 ± 0.013  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLongVisitor3  thrpt   25   0.804 ± 0.006  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLongVisitor4  thrpt   25   0.768 ± 0.007  ops/us
ReadIntsBenchmark.readIntsForUtil                   thrpt   25  18.957 ± 0.194  ops/us
ReadIntsBenchmark.readIntsForUtilVisitor            thrpt   25   0.817 ± 0.004  ops/us
ReadIntsBenchmark.readIntsLegacy                    thrpt   25   2.456 ± 0.016  ops/us
ReadIntsBenchmark.readIntsLegacyVisitor             thrpt   25   0.608 ± 0.007  ops/us
{code}

In this result, {{readInts24ForUtil}} runs about 3x faster than {{readInts24Legacy}}. That speedup is attractive, so I'm trying to find a way to fix the regression that shows up when the decoded ints are consumed through a visitor. One idea I'm considering is to introduce a bulk {{visit(int[] docs, int count)}} method on {{IntersectVisitor}}. The benefits of this method:
1. It reduces the number of virtual function calls.
2. {{BufferAdder}} can directly use {{System#arraycopy}} to append doc IDs.
3. {{InverseIntersectVisitor}} can count cost faster.
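To make the idea concrete, here is a minimal sketch of what a bulk {{visit(int[] docs, int count)}} default method could look like, and how a buffering visitor (in the spirit of {{BufferAdder}}) could override it with {{System#arraycopy}}. The interface and class names below are simplified stand-ins, not the real Lucene API:

```java
import java.util.Arrays;

// Simplified stand-in for PointValues.IntersectVisitor with the proposed bulk method.
interface IntersectVisitorSketch {
    void visit(int docID);

    // Bulk entry point: one virtual call per block instead of one per doc.
    default void visit(int[] docIDs, int count) {
        for (int i = 0; i < count; i++) {
            visit(docIDs[i]);
        }
    }
}

// A buffering visitor can override the bulk method and append all ids at once.
class BufferingVisitor implements IntersectVisitorSketch {
    int[] buffer = new int[16];
    int size = 0;

    @Override
    public void visit(int docID) {
        ensureCapacity(size + 1);
        buffer[size++] = docID;
    }

    @Override
    public void visit(int[] docIDs, int count) {
        ensureCapacity(size + count);
        System.arraycopy(docIDs, 0, buffer, size, count);
        size += count;
    }

    private void ensureCapacity(int needed) {
        if (needed > buffer.length) {
            buffer = Arrays.copyOf(buffer, Math.max(needed, buffer.length * 2));
        }
    }
}

public class Main {
    public static void main(String[] args) {
        BufferingVisitor v = new BufferingVisitor();
        v.visit(new int[] {3, 5, 9, 12}, 4); // one call, arraycopy under the hood
        v.visit(42);                         // single-doc path still works
        System.out.println(v.size);          // prints: 5
        System.out.println(v.buffer[4]);     // prints: 42
    }
}
```

The default implementation keeps existing single-doc visitors working unchanged, while hot paths that already hold a decoded {{int[]}} block can hand it over in one call.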
Based on luceneutil, I reproduced the regression on my local machine with the nightly benchmark tasks and random seed = 10:

{code:java}
Task      QPS baseline    StdDev    QPS my_modified_version    StdDev    Pct diff               p-value
IntNRQ    27.43           (1.8%)    24.12                      (1.1%)    -12.1% ( -14% -  -9%)    0.000
{code}

After the optimization, I can see the speedup with the same seed:

{code:java}
Task      QPS baseline    StdDev    QPS my_modified_version    StdDev    Pct diff               p-value
IntNRQ    27.68           (1.7%)    31.89                      (2.0%)     15.2% (  11% -  19%)    0.000
{code}

I posted the draft code here: [https://github.com/apache/lucene/pull/797]. This commit [https://github.com/apache/lucene/pull/797/commits/7fb6ac3f5901a29d87e9fa427ba429d1e1749b14] shows what was changed.

> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> -------------------------------------------------------
>
>                 Key: LUCENE-10315
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10315
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Feng Guo
>            Assignee: Feng Guo
>            Priority: Major
>         Attachments: addall.svg, cpu_profile_baseline.html, cpu_profile_path.html
>
>          Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> Elasticsearch (which is based on Lucene) can automatically infer types for users with its dynamic mapping feature. When users index low-cardinality fields, such as gender / age / status, they often use numbers to represent the values, so ES infers these fields as {{long}}, and ES uses BKD as the index for {{long}} fields. When the data volume grows, building the result set of low-cardinality fields makes the CPU usage and load very high.
> This is a flame graph we obtained from the production environment: [^addall.svg]
> It can be seen that almost all CPU is used in addAll. When we reindexed {{long}} to {{keyword}}, the cluster load and search latency were greatly reduced (we spent weeks reindexing all indices...).
> I know that the ES documentation recommends using {{keyword}} for term/terms queries and {{long}} for range queries, but there are always users who don't realize this and keep their habits from SQL databases, or dynamic mapping automatically selects the type for them. All in all, users won't realize that there can be such a big performance difference between {{long}} and {{keyword}} for low-cardinality fields. So from my point of view it makes sense to make BKD work better for low/medium-cardinality fields.
> As far as I can see, for low-cardinality fields, {{keyword}} has two advantages over {{long}}:
> 1. The {{ForUtil}} used in {{keyword}} postings is much more efficient than BKD's delta VInt, because of its batch reading (readLongs) and SIMD decode.
> 2. When the query term count is less than 16, {{TermsInSetQuery}} can lazily materialize its result set, and when another small result clause intersects with this low-cardinality condition, the low-cardinality field can avoid reading all docIds into memory.
> This issue targets the first point. The basic idea is to use a 512-ints {{ForUtil}} for the BKD ids codec. I benchmarked this optimization by mocking some random {{LongPoint}} fields and querying them with {{PointInSetQuery}}.
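As a side note on the first advantage: the contrast between per-value delta VInt and fixed-width bulk decoding can be sketched with toy encoders/decoders. This is a hypothetical simplification for illustration only; Lucene's real ForUtil uses generated, SIMD-friendly code operating on longs:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class DecodeSketch {

    // Legacy style: each docID stored as a variable-length delta VInt.
    static void encodeDeltaVInt(ByteBuffer out, int[] docs) {
        int prev = 0;
        for (int doc : docs) {
            int delta = doc - prev;
            prev = doc;
            while ((delta & ~0x7F) != 0) {     // 7 payload bits + continuation bit
                out.put((byte) ((delta & 0x7F) | 0x80));
                delta >>>= 7;
            }
            out.put((byte) delta);
        }
    }

    // Legacy-style decode: branchy and data-dependent, hard to vectorize.
    static void decodeDeltaVInt(ByteBuffer in, int[] out, int count) {
        int doc = 0;
        for (int i = 0; i < count; i++) {
            int b = in.get() & 0xFF;
            int delta = b & 0x7F;
            int shift = 7;
            while ((b & 0x80) != 0) {
                b = in.get() & 0xFF;
                delta |= (b & 0x7F) << shift;
                shift += 7;
            }
            doc += delta;
            out[i] = doc;
        }
    }

    // ForUtil style: every value in the block shares one fixed width
    // (24 bits here), so the decode loop is branch-free and easy for the
    // JIT to unroll or vectorize.
    static void encode24(ByteBuffer out, int[] docs) {
        for (int doc : docs) {
            out.put((byte) (doc >>> 16)).put((byte) (doc >>> 8)).put((byte) doc);
        }
    }

    static void decode24(ByteBuffer in, int[] out, int count) {
        for (int i = 0; i < count; i++) {
            out[i] = ((in.get() & 0xFF) << 16) | ((in.get() & 0xFF) << 8) | (in.get() & 0xFF);
        }
    }

    public static void main(String[] args) {
        int[] docs = {7, 130, 16384, 900000};

        ByteBuffer vint = ByteBuffer.allocate(64);
        encodeDeltaVInt(vint, docs);
        vint.flip();
        int[] a = new int[docs.length];
        decodeDeltaVInt(vint, a, docs.length);

        ByteBuffer packed = ByteBuffer.allocate(64);
        encode24(packed, docs);
        packed.flip();
        int[] b = new int[docs.length];
        decode24(packed, b, docs.length);

        System.out.println(Arrays.equals(a, docs) && Arrays.equals(b, docs)); // prints: true
    }
}
```

The fixed-width loop trades a little space (every value pays the block's maximum width) for a predictable, branch-free decode, which is where the throughput gains in the JMH numbers above come from.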
> *Benchmark Result*
> |doc count|field cardinality|query point|baseline QPS|candidate QPS|diff percentage|
> |100000000|32|1|51.44|148.26|188.22%|
> |100000000|32|2|26.8|101.88|280.15%|
> |100000000|32|4|14.04|53.52|281.20%|
> |100000000|32|8|7.04|28.54|305.40%|
> |100000000|32|16|3.54|14.61|312.71%|
> |100000000|128|1|110.56|350.26|216.81%|
> |100000000|128|8|16.6|89.81|441.02%|
> |100000000|128|16|8.45|48.07|468.88%|
> |100000000|128|32|4.2|25.35|503.57%|
> |100000000|128|64|2.13|13.02|511.27%|
> |100000000|1024|1|536.19|843.88|57.38%|
> |100000000|1024|8|109.71|251.89|129.60%|
> |100000000|1024|32|33.24|104.11|213.21%|
> |100000000|1024|128|8.87|30.47|243.52%|
> |100000000|1024|512|2.24|8.3|270.54%|
> |100000000|8192|1|3333.33|5000|50.00%|
> |100000000|8192|32|139.47|214.59|53.86%|
> |100000000|8192|128|54.59|109.23|100.09%|
> |100000000|8192|512|15.61|36.15|131.58%|
> |100000000|8192|2048|4.11|11.14|171.05%|
> |100000000|1048576|1|2597.4|3030.3|16.67%|
> |100000000|1048576|32|314.96|371.75|18.03%|
> |100000000|1048576|128|99.7|116.28|16.63%|
> |100000000|1048576|512|30.5|37.15|21.80%|
> |100000000|1048576|2048|10.38|12.3|18.50%|
> |100000000|8388608|1|2564.1|3174.6|23.81%|
> |100000000|8388608|32|196.27|238.95|21.75%|
> |100000000|8388608|128|55.36|68.03|22.89%|
> |100000000|8388608|512|15.58|19.24|23.49%|
> |100000000|8388608|2048|4.56|5.71|25.22%|
> The index size is reduced for low-cardinality fields and flat for high-cardinality fields:
> {code:java}
> 113M index_100000000_doc_32_cardinality_baseline
> 114M index_100000000_doc_32_cardinality_candidate
> 140M index_100000000_doc_128_cardinality_baseline
> 133M index_100000000_doc_128_cardinality_candidate
> 193M index_100000000_doc_1024_cardinality_baseline
> 174M index_100000000_doc_1024_cardinality_candidate
> 241M index_100000000_doc_8192_cardinality_baseline
> 233M index_100000000_doc_8192_cardinality_candidate
> 314M index_100000000_doc_1048576_cardinality_baseline
> 315M index_100000000_doc_1048576_cardinality_candidate
> 392M index_100000000_doc_8388608_cardinality_baseline
> 391M index_100000000_doc_8388608_cardinality_candidate
> {code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)