[ https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518473#comment-17518473 ]
Feng Guo commented on LUCENE-10315:
-----------------------------------

Here is the benchmark result I got on my machine with [https://github.com/iverase/benchmark_forutil]:

{code:java}
Benchmark                                            Mode  Cnt   Score   Error  Units
ReadInts24Benchmark.readInts24ForUtil               thrpt   25   9.086 ± 0.089  ops/us
ReadInts24Benchmark.readInts24ForUtilVisitor        thrpt   25   0.764 ± 0.005  ops/us
ReadInts24Benchmark.readInts24Legacy                thrpt   25   2.877 ± 0.013  ops/us
ReadInts24Benchmark.readInts24Visitor               thrpt   25   0.778 ± 0.006  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLong1         thrpt   25   3.329 ± 0.023  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLong2         thrpt   25   3.218 ± 0.037  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLong3         thrpt   25   3.755 ± 0.017  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLong4         thrpt   25   3.862 ± 0.025  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLongVisitor1  thrpt   25   0.710 ± 0.008  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLongVisitor2  thrpt   25   0.849 ± 0.013  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLongVisitor3  thrpt   25   0.804 ± 0.006  ops/us
ReadIntsAsLongBenchmark.readIntsLegacyLongVisitor4  thrpt   25   0.768 ± 0.007  ops/us
ReadIntsBenchmark.readIntsForUtil                   thrpt   25  18.957 ± 0.194  ops/us
ReadIntsBenchmark.readIntsForUtilVisitor            thrpt   25   0.817 ± 0.004  ops/us
ReadIntsBenchmark.readIntsLegacy                    thrpt   25   2.456 ± 0.016  ops/us
ReadIntsBenchmark.readIntsLegacyVisitor             thrpt   25   0.608 ± 0.007  ops/us
{code}

In this result, {{readInts24ForUtil}} runs about 3x faster than {{readInts24Legacy}}. That speedup is attractive, so I'm trying to find a way to fix the regression that shows up when the decoded ints are consumed through a visitor. One idea I'm considering is to introduce a bulk {{visit(int[] docs, int count)}} method on {{IntersectVisitor}}. The benefits of this method:
1. It reduces the number of virtual function calls.
2. {{BufferAdder}} can directly use {{System#arraycopy}} to append doc IDs.
3. {{InverseIntersectVisitor}} can count cost faster.
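To make the idea concrete, here is a minimal sketch of what a bulk {{visit(int[] docs, int count)}} default method could look like, and how a buffering visitor (in the spirit of {{BufferAdder}}) could override it with {{System#arraycopy}}. The interface and class names below are simplified stand-ins, not the real Lucene API:

```java
import java.util.Arrays;

// Simplified stand-in for PointValues.IntersectVisitor with the proposed bulk method.
interface IntersectVisitorSketch {
    void visit(int docID);

    // Bulk entry point: one virtual call per block instead of one per doc.
    default void visit(int[] docIDs, int count) {
        for (int i = 0; i < count; i++) {
            visit(docIDs[i]);
        }
    }
}

// A buffering visitor can override the bulk method and append all ids at once.
class BufferingVisitor implements IntersectVisitorSketch {
    int[] buffer = new int[16];
    int size = 0;

    @Override
    public void visit(int docID) {
        ensureCapacity(size + 1);
        buffer[size++] = docID;
    }

    @Override
    public void visit(int[] docIDs, int count) {
        ensureCapacity(size + count);
        System.arraycopy(docIDs, 0, buffer, size, count);
        size += count;
    }

    private void ensureCapacity(int needed) {
        if (needed > buffer.length) {
            buffer = Arrays.copyOf(buffer, Math.max(needed, buffer.length * 2));
        }
    }
}

public class Main {
    public static void main(String[] args) {
        BufferingVisitor v = new BufferingVisitor();
        v.visit(new int[] {3, 5, 9, 12}, 4); // one call, arraycopy under the hood
        v.visit(42);                         // single-doc path still works
        System.out.println(v.size);          // prints: 5
        System.out.println(v.buffer[4]);     // prints: 42
    }
}
```

The default implementation keeps existing single-doc visitors working unchanged, while hot paths that already hold a decoded {{int[]}} block can hand it over in one call.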
Based on luceneutil, I reproduced the regression on my local machine with the nightly benchmark tasks and random seed = 10:

{code:java}
Task      QPS baseline    StdDev    QPS my_modified_version    StdDev    Pct diff               p-value
IntNRQ    27.43           (1.8%)    24.12                      (1.1%)    -12.1% ( -14% -  -9%)    0.000
{code}

After the optimization, I can see the speedup with the same seed:

{code:java}
Task      QPS baseline    StdDev    QPS my_modified_version    StdDev    Pct diff               p-value
IntNRQ    27.68           (1.7%)    31.89                      (2.0%)     15.2% (  11% -  19%)    0.000
{code}

I posted the draft code here: [https://github.com/apache/lucene/pull/797]. This commit [https://github.com/apache/lucene/pull/797/commits/7fb6ac3f5901a29d87e9fa427ba429d1e1749b14] shows what was changed.

> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> -------------------------------------------------------
>
>                 Key: LUCENE-10315
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10315
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Feng Guo
>            Assignee: Feng Guo
>            Priority: Major
>         Attachments: addall.svg, cpu_profile_baseline.html, cpu_profile_path.html
>
>          Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> Elasticsearch (which is based on Lucene) can automatically infer types for users with its dynamic mapping feature. When users index low-cardinality fields, such as gender / age / status, they often use numbers to represent the values, so ES infers these fields as {{long}}, and ES uses BKD as the index for {{long}} fields. When the data volume grows, building the result set of low-cardinality fields makes the CPU usage and load very high.
> This is a flame graph we obtained from the production environment: [^addall.svg]
> It can be seen that almost all CPU is used in addAll. When we reindexed {{long}} to {{keyword}}, the cluster load and search latency were greatly reduced (we spent weeks reindexing all indices...).
> I know that the ES documentation recommends using {{keyword}} for term/terms queries and {{long}} for range queries, but there are always users who don't realize this and keep their habits from SQL databases, or dynamic mapping automatically selects the type for them. All in all, users won't realize that there can be such a big performance difference between {{long}} and {{keyword}} for low-cardinality fields. So from my point of view it makes sense to make BKD work better for low/medium-cardinality fields.
> As far as I can see, for low-cardinality fields, {{keyword}} has two advantages over {{long}}:
> 1. The {{ForUtil}} used in {{keyword}} postings is much more efficient than BKD's delta VInt, because of its batch reading (readLongs) and SIMD decode.
> 2. When the query term count is less than 16, {{TermsInSetQuery}} can lazily materialize its result set, and when another small result clause intersects with this low-cardinality condition, the low-cardinality field can avoid reading all docIds into memory.
> This issue targets the first point. The basic idea is to use a 512-ints {{ForUtil}} for the BKD ids codec. I benchmarked this optimization by mocking some random {{LongPoint}} fields and querying them with {{PointInSetQuery}}.
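As a side note on the first advantage: the contrast between per-value delta VInt and fixed-width bulk decoding can be sketched with toy encoders/decoders. This is a hypothetical simplification for illustration only; Lucene's real ForUtil uses generated, SIMD-friendly code operating on longs:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class DecodeSketch {

    // Legacy style: each docID stored as a variable-length delta VInt.
    static void encodeDeltaVInt(ByteBuffer out, int[] docs) {
        int prev = 0;
        for (int doc : docs) {
            int delta = doc - prev;
            prev = doc;
            while ((delta & ~0x7F) != 0) {     // 7 payload bits + continuation bit
                out.put((byte) ((delta & 0x7F) | 0x80));
                delta >>>= 7;
            }
            out.put((byte) delta);
        }
    }

    // Legacy-style decode: branchy and data-dependent, hard to vectorize.
    static void decodeDeltaVInt(ByteBuffer in, int[] out, int count) {
        int doc = 0;
        for (int i = 0; i < count; i++) {
            int b = in.get() & 0xFF;
            int delta = b & 0x7F;
            int shift = 7;
            while ((b & 0x80) != 0) {
                b = in.get() & 0xFF;
                delta |= (b & 0x7F) << shift;
                shift += 7;
            }
            doc += delta;
            out[i] = doc;
        }
    }

    // ForUtil style: every value in the block shares one fixed width
    // (24 bits here), so the decode loop is branch-free and easy for the
    // JIT to unroll or vectorize.
    static void encode24(ByteBuffer out, int[] docs) {
        for (int doc : docs) {
            out.put((byte) (doc >>> 16)).put((byte) (doc >>> 8)).put((byte) doc);
        }
    }

    static void decode24(ByteBuffer in, int[] out, int count) {
        for (int i = 0; i < count; i++) {
            out[i] = ((in.get() & 0xFF) << 16) | ((in.get() & 0xFF) << 8) | (in.get() & 0xFF);
        }
    }

    public static void main(String[] args) {
        int[] docs = {7, 130, 16384, 900000};

        ByteBuffer vint = ByteBuffer.allocate(64);
        encodeDeltaVInt(vint, docs);
        vint.flip();
        int[] a = new int[docs.length];
        decodeDeltaVInt(vint, a, docs.length);

        ByteBuffer packed = ByteBuffer.allocate(64);
        encode24(packed, docs);
        packed.flip();
        int[] b = new int[docs.length];
        decode24(packed, b, docs.length);

        System.out.println(Arrays.equals(a, docs) && Arrays.equals(b, docs)); // prints: true
    }
}
```

The fixed-width loop trades a little space (every value pays the block's maximum width) for a predictable, branch-free decode, which is where the throughput gains in the JMH numbers above come from.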
> *Benchmark Result*
> |doc count|field cardinality|query point|baseline QPS|candidate QPS|diff percentage|
> |100000000|32|1|51.44|148.26|188.22%|
> |100000000|32|2|26.8|101.88|280.15%|
> |100000000|32|4|14.04|53.52|281.20%|
> |100000000|32|8|7.04|28.54|305.40%|
> |100000000|32|16|3.54|14.61|312.71%|
> |100000000|128|1|110.56|350.26|216.81%|
> |100000000|128|8|16.6|89.81|441.02%|
> |100000000|128|16|8.45|48.07|468.88%|
> |100000000|128|32|4.2|25.35|503.57%|
> |100000000|128|64|2.13|13.02|511.27%|
> |100000000|1024|1|536.19|843.88|57.38%|
> |100000000|1024|8|109.71|251.89|129.60%|
> |100000000|1024|32|33.24|104.11|213.21%|
> |100000000|1024|128|8.87|30.47|243.52%|
> |100000000|1024|512|2.24|8.3|270.54%|
> |100000000|8192|1|3333.33|5000|50.00%|
> |100000000|8192|32|139.47|214.59|53.86%|
> |100000000|8192|128|54.59|109.23|100.09%|
> |100000000|8192|512|15.61|36.15|131.58%|
> |100000000|8192|2048|4.11|11.14|171.05%|
> |100000000|1048576|1|2597.4|3030.3|16.67%|
> |100000000|1048576|32|314.96|371.75|18.03%|
> |100000000|1048576|128|99.7|116.28|16.63%|
> |100000000|1048576|512|30.5|37.15|21.80%|
> |100000000|1048576|2048|10.38|12.3|18.50%|
> |100000000|8388608|1|2564.1|3174.6|23.81%|
> |100000000|8388608|32|196.27|238.95|21.75%|
> |100000000|8388608|128|55.36|68.03|22.89%|
> |100000000|8388608|512|15.58|19.24|23.49%|
> |100000000|8388608|2048|4.56|5.71|25.22%|
> The index size is reduced for low-cardinality fields and flat for high-cardinality fields:
> {code:java}
> 113M index_100000000_doc_32_cardinality_baseline
> 114M index_100000000_doc_32_cardinality_candidate
> 140M index_100000000_doc_128_cardinality_baseline
> 133M index_100000000_doc_128_cardinality_candidate
> 193M index_100000000_doc_1024_cardinality_baseline
> 174M index_100000000_doc_1024_cardinality_candidate
> 241M index_100000000_doc_8192_cardinality_baseline
> 233M index_100000000_doc_8192_cardinality_candidate
> 314M index_100000000_doc_1048576_cardinality_baseline
> 315M index_100000000_doc_1048576_cardinality_candidate
> 392M index_100000000_doc_8388608_cardinality_baseline
> 391M index_100000000_doc_8388608_cardinality_candidate
> {code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)