xinyiZzz opened a new pull request, #9581:
URL: https://github.com/apache/incubator-doris/pull/9581

   # Proposed changes
   
   Issue Number: close #xxx
   
   ## Problem Summary:
   
   1. High concurrency stress test on SSB and wide table. Compare the 
performance of turning the vectorization engine on and off. Turning on the 
vectorization engine is slower for most SSB queries.
   2. Optimize the Allocator in the vectorization engine. Memory allocation 
between 4KB and 64MB will be through ChunkAllocator, those less than 4KB will 
be through malloc, and those greater than 64MB will be through MMAP. In most 
queries, the performance is improved by about 10%.
   3. Fix Lru Cache MemTracker, add NO_MEM_TRACKER compile option, and fix some 
details.
   4. Optimize Chunk Allocator, increase the limit that allows chunks to be 
stolen from other core's arena, and optimize reserved bytes conf.
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: (Yes)
   2. Has unit tests been added: (No)
   5. Has document been added or modified: (Yes)
   6. Does it need to update dependencies: (No)
   7. Are there any changes that cannot be rolled back: (Yes)
   
   ## Further comments
   
   Stress testing the vectorization engine.
   ## 1. Env and Test Set
   ```
   Env: 1 FE, 1 BE
   Test Set: 
        SSB, 100G, lineorder 60003w rows
        Width table from online service, 419 columns, 1710549 rows
   set global parallel_fragment_exec_instance_num=10
   
   jmeter conf:
        <stringProp name="ThreadGroup.num_threads">100</stringProp>
        <stringProp name="ThreadGroup.ramp_time">1</stringProp>
        <boolProp name="ThreadGroup.scheduler">true</boolProp>
        <stringProp name="ThreadGroup.duration">30</stringProp>
        <stringProp name="ThreadGroup.delay">0</stringProp>
   
   actual concurrency = parallel_fragment_exec_instance_num * 
ThreadGroup.num_threads
   ```
   
   ## 2. Test
   
   - TO: Master, set global enable_vectorized_engine=false;
   - T1: Master, set global enable_vectorized_engine=true;
   - T2: Master, set global enable_vectorized_engine=true, 
tc_max_total_thread_cache_bytes=100G;
   - T3: This PR, set global enable_vectorized_engine=true, allocate 4k < size 
< 64M use ChunkAllocator; 
   - T4: This PR, set global enable_vectorized_engine=true, Allocator 4k < size 
< 64M use chunkAllocator, and compile NO_MEM_TRACKER=1;
   - R1: (T0mid-T1mid)/T0mid, Compare the performance of turning the 
vectorization engine on and off.
   - R2: (T1mid-T3mid)/T1mid, Performance changes brought by allocating 4K < 
size < 64M memory through ChunkAllocator in the vectorization engine.
   - R3: (T1mid-T4mid)/T1mid, Same as above, close memtracker.
   
   > Form Notes: "xxx,xxx,xxx": Repeat 3 times, the AvgTime(ms) of each time.
   
   | query | num_threads | T0 | T1 | T2 | T3 | T4 | R1 | R2 | R3 |
   | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | 
------ |  ------ |
   | Q1.1 | 100 | 36252,36903,36800 | 34297,35053,36087 | 35757,34483,33825 | 
33314,31657,31838 | 30801,31496,30445 | 4.7% | 9.2% | 12.1% |
   | Q1.2 | 100 | 24017,24338,25478 | 25273.25222,25914 | 26647,24651,25406 | 
23453,23498,23604 | 23771,23084,23704 | -3.8% | 7% | 6.2% |
   | Q1.3 | 100 | 24349,23780,22844 | 24073,24487,24401 | 23842,23149,24198 | 
22614,22984,23050 | 22678,22466,22225 | -2.6% | 9% | 7.9 |
   | Q2.1 | 20 | 89466,21528,21889 | 26300,24345,24222 | 89662,25042,24069 | 
24094,24542,24197 | 20538,19651,19627 | -11.2% | 0.6% | 19.3% |
   | Q2.2| 20 | 16963,21435,18154 | 15855,16936,15047 | 16006,17251,16593 | 
15072,14407,15648 | 15716,16347,15588 | 12.7% | 4.9% | 0.8% |
   | Q2.3 | 20 | 15183,16194,13977 | 15551,15033,14801 | 14338,14605,14531 | 
14302,14548,15301 | 14318,13601,13689 | 1% | 3.2% | 8.9% |
   | Q3.1 | 20 | 32021,32176,31427 | 31037,30283,30231 | 38002,30272,30016 | 
25162,23187,24411 | 27673,23147,22492 | 5.4% | 19.4% | 23.6% |
   | Q3.2 | 20 | 10379,10433,9893 | 11837,11184,11223 | 11403,11481,9788 | 
9296,9452,9455 | 9576,9172,9283 | -8% | 15.8% | 17.3% |
   | Q3.3 | 20 | 8559,8472,8639 | 8713,9390,8992 | 8367,8153,8133 | 
7998,8618,8040 | 7952,7476,7845 | -5% | 10.6% | 12.8% |
   | Q4.1 | 20 | 32249,29965,29136 | 47405,40357,40443 | 41912,36981,37571 | 
31230,27683,29166 | 31848,27585,27435 | -35% | 27.9% | 31.8% |
   | Q4.2 | 20 | 19979,18798,17169 | 34066,32614,30849 | 34560,34460,35194 | 
27117,29645,27337 | 27651,29205,27149 | -73.5 | 16.2% | 15.2% |
   | Q4.3 | 20 | 19357,20762,19992 | 28256,29862,29230 | 27647,29216,29091 | 
30523,30067,29260 | 29092,26644,30017 | -46.2% | -2.8% | 0.5% |
   | Width table (419 rows) | 100 | no work | 4211,4546,4710 | 4089,4479,4551 | 
3679,3745,3816 | 3664,3732,3829 | 100% | 17.6% | 17.9% |
   
   ## 3. Detailed description
   - T2:  Theoretically, when the capacity of the tcmalloc thread cache is 
sufficient, the spin lock in the central free list will be avoided to a great 
extent, but in practice, the spin lock cost is still large in high concurrency 
queries, I will test this matter in more detail below.
   - T3:  Because tcmalloc thread cache cannot avoid spin lock, the 
introduction of ChunkAllocator is equivalent to adding a layer of cache in User 
Mode. 
        In allocator.h, Memory allocation between 4KB and 64MB will be through 
ChunkAllocator, those less than 4KB will be through malloc (for example, 
tcmalloc), and those greater than 64MB will be through MMAP.
        In the actual test, chunkallocator allocates less than 4KB of memory 
slower than malloc, and chunkallocator allocates more than 64MB of memory 
slower than MMAP, but the 4KB threshold is an empirical value, which needs to 
be determined by more detailed test later.
   - T4: Close memtracker at compile time can be selected during POC. 
Memtracker records the consumption value through an atomic variable. In high 
query concurrency, the atomic variable spin lock has a high cost. I'll optimize 
memtracker. After that
   
   ## 4. CPU hotspot by pprof
   pprof --svg --seconds=250 http://be_ip:brpc_port/pprof/profile > q11.svg
   T0:
   ![q1 
1_0515_2_no_vec](https://user-images.githubusercontent.com/13197424/168498387-95dd2f70-1795-4f62-b845-2dbeb6157fd2.svg)
   ![q1 
2_0515_2_no_vec](https://user-images.githubusercontent.com/13197424/168498392-177fc44d-62e7-4b89-8278-106b3e468075.svg)
   ![q1 
3_0515_2_no_vec](https://user-images.githubusercontent.com/13197424/168498402-349dd5c5-4833-4eab-a55e-e578eddcf359.svg)
   ![q2 
1_0515_2_no_vec](https://user-images.githubusercontent.com/13197424/168498409-c66830aa-6a37-4a6c-bed8-0f9584a080e8.svg)
   ![q2 
2_0515_2_no_vec](https://user-images.githubusercontent.com/13197424/168498412-55029770-84f5-488a-a0c8-85b3a8e4460b.svg)
   ![q2 
3_0515_2_no_vec](https://user-images.githubusercontent.com/13197424/168498415-48ef163b-0616-4f2b-a7e7-1366f5c6f398.svg)
   ![q3 
1_0515_2_no_vec](https://user-images.githubusercontent.com/13197424/168498418-2f9cea43-77ca-40c2-a7d0-d0458cd770f2.svg)
   ![q3 
2_0515_1_no_vec](https://user-images.githubusercontent.com/13197424/168498419-98085bcd-3775-409c-aec9-e66131752796.svg)
   ![q3 
3_0515_2_no_vec](https://user-images.githubusercontent.com/13197424/168498422-082b2022-f4f9-4cc2-aac5-620aa2cbf7d7.svg)
   ![q4 
1_0515_2_no_vec](https://user-images.githubusercontent.com/13197424/168498428-1ed3b578-71aa-4851-9f18-a8ef71b8ebe6.svg)
   ![q4 
2_0515_3_no_vec](https://user-images.githubusercontent.com/13197424/168498433-5502c41d-2337-4a00-8cab-a1bdf74aedf1.svg)
   ![q4 
3_0515_2_no_vec](https://user-images.githubusercontent.com/13197424/168498439-87e614d9-1923-4d86-96ab-ce6d0a14804d.svg)
   
   T1:
   
![300column_0515_2](https://user-images.githubusercontent.com/13197424/168498458-ace4f5c3-3567-4959-8f56-46dbbf642407.svg)
   ![q1 
1_0515_2](https://user-images.githubusercontent.com/13197424/168498454-2f1d0e28-73d0-4672-9fbb-207c39568704.svg)
   ![q1 
2_0515_2](https://user-images.githubusercontent.com/13197424/168498500-3fbc60b7-0ede-43d6-acc6-2c35a1545154.svg)
   ![q1 
3_0515_2](https://user-images.githubusercontent.com/13197424/168498514-6d718169-a0e0-4599-906e-d050bcd04f52.svg)
   
   
   T2:
   
![300column_0515_2_upthreadcache](https://user-images.githubusercontent.com/13197424/168498464-f0f48507-6eb4-4ef3-9651-b6cf92cc549a.svg)
   ![q1 
1_0515_2_upthreadcache](https://user-images.githubusercontent.com/13197424/168498473-3da8c24a-e653-42ee-8f91-f55b820624b7.svg)
   ![q1 
2_0515_2_upthreadcache](https://user-images.githubusercontent.com/13197424/168498495-f8b160d0-f811-4ddd-a642-b7848e4c9e35.svg)
   ![q1 
3_0515_2_upthreadcache](https://user-images.githubusercontent.com/13197424/168498512-4257f9e3-0514-4829-8ad1-e6eb3182b87a.svg)
   
   
   T3:
   
![300column_0515_2_chunk](https://user-images.githubusercontent.com/13197424/168498468-be644079-633b-4fa2-8f0d-d5db
   ![q1 
1_0515_2_chunk](https://user-images.githubusercontent.com/13197424/168498479-75be1e9a-c9b5-4e56-98b4-537a599c468d.svg)
   7f04661e.svg)
   ![q1 
2_0515_2_chunk](https://user-images.githubusercontent.com/13197424/168498490-e4efbd5a-e7d3-46e2-bc39-9504c4734a31.svg)
   ![q1 
3_0515_2_chunk](https://user-images.githubusercontent.com/13197424/168498509-6c12b183-f437-4c86-8e1d-0985b3e168fe.svg)
   
   T4:
   ![q1 
1_0515_2_chunk_no_track](https://user-images.githubusercontent.com/13197424/168498484-96ce58e3-a418-4d00-ad9a-fe362a30cf89.svg)
   ![q1 
2_0515_2_chunk_no_track](https://user-images.githubusercontent.com/13197424/168498488-d3881f71-33d1-4c9a-8b8b-3aef361afb57.svg)
   ![q1 
3_0515_2_chunk_no_track](https://user-images.githubusercontent.com/13197424/168498505-9f876b7b-947b-4a8e-8676-2a8e0e04c64e.svg)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org

Reply via email to