yangzhg opened a new pull request #4958: URL: https://github.com/apache/incubator-doris/pull/4958
changed the merge method of the unique table, merged the cumulative version data first, and then merged with the base version. For the data with only one base version, read directly without merging ## Proposed changes Optimize the read performance of AGG and UNIQUE table with too much versions the benchmar as follows ## 测试数据 本次测试数据集 为 tpcds 10G数据集中的catalog_sales 数据,数据分成两份 一份是完整数据(对应big 表)3G,共28802522(unqiue 后14401261)行,一份只只有200M的采样数据(对应test表)共1000000 行 ,只使用一个分区,其中完整数据共分10个桶。 表均为 segment V2 格式 ``` +-------+--------------+------+-------+---------+---------+ | Field | Type | Null | Key | Default | Extra | +-------+--------------+------+-------+---------+---------+ | k1 | BIGINT | No | true | NULL | | | k2 | BIGINT | No | true | NULL | | | k3 | BIGINT | Yes | true | NULL | | | k4 | BIGINT | Yes | true | NULL | | | k5 | BIGINT | Yes | true | NULL | | | k6 | BIGINT | Yes | true | NULL | | | k7 | BIGINT | Yes | true | NULL | | | k8 | BIGINT | Yes | true | NULL | | | k9 | BIGINT | Yes | true | NULL | | | k10 | BIGINT | Yes | true | NULL | | | v1 | BIGINT | Yes | false | NULL | REPLACE | | v2 | BIGINT | Yes | false | NULL | REPLACE | | v3 | BIGINT | Yes | false | NULL | REPLACE | | v4 | BIGINT | Yes | false | NULL | REPLACE | | v5 | BIGINT | Yes | false | NULL | REPLACE | | v6 | BIGINT | Yes | false | NULL | REPLACE | | v7 | BIGINT | Yes | false | NULL | REPLACE | | v8 | BIGINT | Yes | false | NULL | REPLACE | | v9 | BIGINT | Yes | false | NULL | REPLACE | | v10 | DECIMAL(7,2) | Yes | false | NULL | REPLACE | | v11 | DECIMAL(7,2) | Yes | false | NULL | REPLACE | | v12 | DECIMAL(7,2) | Yes | false | NULL | REPLACE | | v13 | DECIMAL(7,2) | Yes | false | NULL | REPLACE | | v14 | DECIMAL(7,2) | Yes | false | NULL | REPLACE | | v15 | DECIMAL(7,2) | Yes | false | NULL | REPLACE | | v16 | DECIMAL(7,2) | Yes | false | NULL | REPLACE | | v17 | DECIMAL(7,2) | Yes | false | NULL | REPLACE | | v18 | DECIMAL(7,2) | Yes | false | NULL | REPLACE | | v19 | DECIMAL(7,2) | Yes | false | NULL | REPLACE | | v20 | DECIMAL(7,2) | Yes | false | NULL | REPLACE | | v21 | DECIMAL(7,2) | Yes | false | NULL | REPLACE | | v22 | DECIMAL(7,2) | Yes | false | NULL | REPLACE | | v23 | DECIMAL(7,2) | Yes | false | NULL | REPLACE | | v24 | DECIMAL(7,2) | Yes | false | NULL | REPLACE | +-------+--------------+------+-------+---------+---------+ ``` 本次测试主要是测试读取性能,尤其是有大量为合并的小版本时的读取性能,因此本次的测试query 为 `select count(*) from (select k1,k2,k3,k4,k5,k6,k7,k8,k9,k10 from table_name) a;` ### UNIQUE_KEY 和 UNIQUE_KEY 对比 首先本次测试对比了 UNIQUE_KEY 表和 DUPLICTE_KEY 表的读取性能的差异,两个版本中第一个版本为空版本 | 表 | 数据大小 | 分区数 | 分桶数 | 行数 | 版本数 | 查询耗时(s) | | ------------- | --------- | ------ | ------ | ---------- | ------ | ------------- | | test_uniq | 93.585 MB | 1 | 1 | 1,000,000 | 2 | 9 | | test_uniq_big | 1.327 GB | 10 | 10 | 14,401,261 | 2 | 13.5 | | test_dup | 93.724 MB | 1 | 1 | 1,000,000 | 2 | 4.8 | | test_dup_big | 2.571 G | 10 | 10 | 28,802,522 | 2 | 7 | 可以看到duplicate 表读取速度是 unique 的**1倍**左右 优化后 数据 | 表 | 数据大小 | 分区数 | 分桶数 | 行数 | 版本数 | 查询耗时(s) | | ------------- | --------- | ------ | ------ | ---------- | ------ | ------------- | | test_uniq | 93.585 MB | 1 | 1 | 1,000,000 | 2 | 4.8 | | test_uniq_big | 1.327 GB | 10 | 10 | 14,401,261 | 2 | 7.4 | | test_dup | 93.724 MB | 1 | 1 | 1,000,000 | 2 | 4.7 | | test_dup_big | 2.571 G | 10 | 10 | 28,802,522 | 2 | 7 | 优化后 在base 版本数比较少的情况下 查询性能已经差距不大了 ### UNIQUE_KEY 多版本读取优化对比 由于之后多个版本导入数据为随机数据,所以非全量版本数据有点差异, 测试query 为 `select count(*) from table_name` | 表 | 数据大小 | 分区数 | 分桶数 | 行数 | 版本数 | 查询耗时(s) | | ------------------------------ | -------------- | ------ | ------ | ------------ | --------- | ------------- | | test_uniq(优化前) | 136.288 MB | 1 | 1 | 1008592 | 10000 | 11.7 | | **test_uniq(优化后)** | **136.266 MB** | **1** | **1** | **1008635** | **10000** | **7.1** | | test_uniq_big(优化前) | 1.368 GB | 1 | 10 | 14401261 | 10000 | 15.5 | | **test_uniq_big(优化后)** | **1.368 GB** | **1** | **10** | **14401261** | **10000** | **12** | | test_uniq_base(优化前) | 94.252 MB | 1 | 1 | 1000000 | 1 | 4 | | **test_uniq_base(优化后)** | **94.252 MB** | **1** | **1** | **1000000** | **1** | **3.75** | | test_uniq_big_base(优化前) | 1.327 GB | 1 | 10 | 14401261 | 1 | 6 | | **test_uniq_big_base(优化后)** | **1.327 GB** | **1** | **10** | **14401261** | **1** | **5.6** | | test_uniq(优化前) | 96.348 MB | 1 | 1 | 1008592 | 500 | 7.1 | | **test_uniq(优化后)** | **96.349 MB** | **1** | **1** | **1008635** | **500** | **4.5** | | test_uniq_big(优化前) | 1.329 GB | 1 | 10 | 14401261 | 500 | 8.8 | | **test_uniq_big(优化后)** | **1.327 GB** | **1** | **10** | **14401261** | **500** | 6.7 | | 1 | 1 | 1008592 | 500 | 7.1 | | **test_uniq (after optimization)** | **96.349 MB** | **1** | **1** | **1008635** | **500** | **4.5** | | test_uniq_big (before optimization) | 1.329 GB | 1 | 10 | 14401261 | 500 | 8.8 | | **test_uniq_big (after optimization)** | **1.327 GB** | **1** | **10** | **14401261** | **500** | 6.7 | ## Types of changes What types of changes does your code introduce to Doris? _Put an `x` in the boxes that apply_ - [] Bugfix (non-breaking change which fixes an issue) - [x] New feature (non-breaking change which adds functionality) - [] Breaking change (fix or feature that would cause existing functionality to not work as expected) - [] Documentation Update (if none of the other choices apply) - [] Code refactor (Modify the code structure, format the code, etc...) ## Checklist _Put an `x` in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code._ - [x] I have create an issue on (Fix #4957 ), and have described the bug/feature there in detail - [] Compiling and unit tests pass locally with my changes - [x] I have added tests that prove my fix is effective or that my feature works - [] If this change need a document change, I have updated the document - [] Any dependent changes have been merged ## Further comments If this is a relatively large or complex change, kick off the discussion at d...@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc... ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org