xinyiZzz opened a new pull request, #9581: URL: https://github.com/apache/incubator-doris/pull/9581
# Proposed changes

Issue Number: close #xxx

## Problem Summary:

1. Ran a high-concurrency stress test on SSB and on a wide table, comparing performance with the vectorization engine on and off. With the vectorization engine on, most SSB queries are slower.
2. Optimized the Allocator in the vectorization engine. Allocations between 4KB and 64MB go through ChunkAllocator, allocations smaller than 4KB go through malloc, and allocations larger than 64MB go through MMAP. Most queries improve by about 10%.
3. Fixed the LRU Cache MemTracker, added the NO_MEM_TRACKER compile option, and fixed some details.
4. Optimized ChunkAllocator: raised the limit that allows chunks to be stolen from other cores' arenas, and optimized the reserved bytes conf.

## Checklist(Required)

1. Does it affect the original behavior: (Yes)
2. Has unit tests been added: (No)
5. Has document been added or modified: (Yes)
6. Does it need to update dependencies: (No)
7. Are there any changes that cannot be rolled back: (Yes)

## Further comments

Stress testing the vectorization engine.

## 1. Env and Test Set

```
Env: 1 FE, 1 BE
Test Set:
    SSB, 100G, lineorder 60003w rows (about 600 million)
    Wide table from online service, 419 columns, 1710549 rows
set global parallel_fragment_exec_instance_num=10
jmeter conf:
    <stringProp name="ThreadGroup.num_threads">100</stringProp>
    <stringProp name="ThreadGroup.ramp_time">1</stringProp>
    <boolProp name="ThreadGroup.scheduler">true</boolProp>
    <stringProp name="ThreadGroup.duration">30</stringProp>
    <stringProp name="ThreadGroup.delay">0</stringProp>
actual concurrency = parallel_fragment_exec_instance_num * ThreadGroup.num_threads
```

## 2. Test

- T0: Master, set global enable_vectorized_engine=false;
- T1: Master, set global enable_vectorized_engine=true;
- T2: Master, set global enable_vectorized_engine=true, tc_max_total_thread_cache_bytes=100G;
- T3: This PR, set global enable_vectorized_engine=true, allocations with 4K < size < 64M use ChunkAllocator;
- T4: This PR, set global enable_vectorized_engine=true, allocations with 4K < size < 64M use ChunkAllocator, compiled with NO_MEM_TRACKER=1;
- R1: (T0mid-T1mid)/T0mid, the performance difference between turning the vectorization engine off and on. Here Tnmid is the median of the three runs of Tn.
- R2: (T1mid-T3mid)/T1mid, the performance change from allocating 4K < size < 64M memory through ChunkAllocator in the vectorization engine.
- R3: (T1mid-T4mid)/T1mid, same as R2, with MemTracker disabled.

> Table notes: each "xxx,xxx,xxx" cell lists the AvgTime (ms) of three repeated runs.

| query | num_threads | T0 | T1 | T2 | T3 | T4 | R1 | R2 | R3 |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| Q1.1 | 100 | 36252,36903,36800 | 34297,35053,36087 | 35757,34483,33825 | 33314,31657,31838 | 30801,31496,30445 | 4.7% | 9.2% | 12.1% |
| Q1.2 | 100 | 24017,24338,25478 | 25273,25222,25914 | 26647,24651,25406 | 23453,23498,23604 | 23771,23084,23704 | -3.8% | 7% | 6.2% |
| Q1.3 | 100 | 24349,23780,22844 | 24073,24487,24401 | 23842,23149,24198 | 22614,22984,23050 | 22678,22466,22225 | -2.6% | 9% | 7.9% |
| Q2.1 | 20 | 89466,21528,21889 | 26300,24345,24222 | 89662,25042,24069 | 24094,24542,24197 | 20538,19651,19627 | -11.2% | 0.6% | 19.3% |
| Q2.2 | 20 | 16963,21435,18154 | 15855,16936,15047 | 16006,17251,16593 | 15072,14407,15648 | 15716,16347,15588 | 12.7% | 4.9% | 0.8% |
| Q2.3 | 20 | 15183,16194,13977 | 15551,15033,14801 | 14338,14605,14531 | 14302,14548,15301 | 14318,13601,13689 | 1% | 3.2% | 8.9% |
| Q3.1 | 20 | 32021,32176,31427 | 31037,30283,30231 | 38002,30272,30016 | 25162,23187,24411 | 27673,23147,22492 | 5.4% | 19.4% | 23.6% |
| Q3.2 | 20 | 10379,10433,9893 | 11837,11184,11223 | 11403,11481,9788 | 9296,9452,9455 | 9576,9172,9283 | -8% | 15.8% | 17.3% |
| Q3.3 | 20 | 8559,8472,8639 | 8713,9390,8992 | 8367,8153,8133 | 7998,8618,8040 | 7952,7476,7845 | -5% | 10.6% | 12.8% |
| Q4.1 | 20 | 32249,29965,29136 | 47405,40357,40443 | 41912,36981,37571 | 31230,27683,29166 | 31848,27585,27435 | -35% | 27.9% | 31.8% |
| Q4.2 | 20 | 19979,18798,17169 | 34066,32614,30849 | 34560,34460,35194 | 27117,29645,27337 | 27651,29205,27149 | -73.5% | 16.2% | 15.2% |
| Q4.3 | 20 | 19357,20762,19992 | 28256,29862,29230 | 27647,29216,29091 | 30523,30067,29260 | 29092,26644,30017 | -46.2% | -2.8% | 0.5% |
| Wide table (419 columns) | 100 | no work | 4211,4546,4710 | 4089,4479,4551 | 3679,3745,3816 | 3664,3732,3829 | 100% | 17.6% | 17.9% |

## 3. Detailed description

- T2: In theory, when the tcmalloc thread cache is large enough, the spin lock in the central free list is avoided to a great extent. In practice, the spin lock cost is still high under high-concurrency queries. I will test this in more detail later.
- T3: Because the tcmalloc thread cache cannot avoid the spin lock, introducing ChunkAllocator is equivalent to adding a layer of cache in user mode. In allocator.h, allocations between 4KB and 64MB go through ChunkAllocator, allocations smaller than 4KB go through malloc (for example, tcmalloc), and allocations larger than 64MB go through MMAP. In the actual test, ChunkAllocator is slower than malloc for allocations smaller than 4KB and slower than MMAP for allocations larger than 64MB. The 4KB threshold is an empirical value and needs to be confirmed by more detailed testing later.
- T4: Disabling MemTracker at compile time is an option during a POC. MemTracker records the consumption value through an atomic variable, and under high query concurrency the contention on that atomic variable is costly. I will optimize MemTracker after that.

## 4. CPU hotspot by pprof

`pprof --svg --seconds=250 http://be_ip:brpc_port/pprof/profile > q11.svg`

T0, T1, T2, T3, T4: (pprof SVG images not preserved)
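
For readers who want the size-based routing from the T3 description in concrete form, here is a minimal sketch. It is not the code in allocator.h: the names `AllocPath`, `kChunkThreshold`, `kMmapThreshold` and `choose_alloc_path` are hypothetical, and the handling of sizes exactly equal to 4KB or 64MB is an assumption, since the description only says "less than 4KB" and "greater than 64MB".

```cpp
#include <cstddef>

// Hypothetical names for illustration only; not the identifiers used in allocator.h.
enum class AllocPath { Malloc, ChunkAllocator, Mmap };

// Thresholds as stated in the PR description. The 4KB value is described as empirical.
const std::size_t kChunkThreshold = 4 * 1024;           // 4KB
const std::size_t kMmapThreshold = 64UL * 1024 * 1024;  // 64MB

// Pick the allocation path for a request of `size` bytes.
// Assumption: sizes exactly equal to a threshold are routed to ChunkAllocator.
inline AllocPath choose_alloc_path(std::size_t size) {
    if (size < kChunkThreshold) {
        return AllocPath::Malloc;  // small requests: malloc (for example, tcmalloc)
    }
    if (size > kMmapThreshold) {
        return AllocPath::Mmap;  // very large requests: mmap directly
    }
    return AllocPath::ChunkAllocator;  // mid-sized requests: user-mode chunk cache
}
```

Under these assumptions, a 1MB column buffer would take the ChunkAllocator path, while a 512-byte request would still go to malloc.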