GoGoWen opened a new pull request, #14148: URL: https://github.com/apache/doris/pull/14148
# Proposed changes

Enhance broker load for Parquet and ORC files when columns are missing from the source files.

## Problem summary

The source file my_file.orc/parquet looks like this:

    +------+------+-------------+-------+------+
    | name | id   | impressions | click | cost |
    +------+------+-------------+-------+------+
    |    1 |    1 |           2 |  NULL |    2 |
    |    4 |    4 |           8 |     8 |    8 |
    |    5 |    5 |          10 |    10 |   10 |
    |    3 |    3 |           6 |     6 |    6 |
    |    2 |    2 |           4 |     4 |    4 |
    |   11 |   11 |          22 |    22 |   22 |
    +------+------+-------------+-------+------+

Case 1: create table `t1` as below:

    CREATE TABLE `t1` (
      `name` bigint(20) NOT NULL,
      `id` bigint(20) NOT NULL,
      `impressions` bigint(20) SUM NULL DEFAULT "0" COMMENT 'total user impressions',
      `click` bigint(20) SUM NULL DEFAULT "0" COMMENT 'total user clicks',
      `cost` bigint(20) SUM NULL DEFAULT "0" COMMENT 'total user cost'
    ) ENGINE=OLAP
    AGGREGATE KEY(`name`, `id`)
    COMMENT 'OLAP'
    PARTITION BY RANGE(`name`)
    (PARTITION p201901 VALUES [("1"), ("100")))
    DISTRIBUTED BY HASH(`id`) BUCKETS 16
    PROPERTIES (
      "replication_allocation" = "tag.location.default: 3",
      "in_memory" = "false",
      "storage_format" = "V2",
      "disable_auto_compaction" = "false"
    );

When we load from the source file, we will get:

    +------+------+-------------+-------+------+
    | name | id   | impressions | click | cost |
    +------+------+-------------+-------+------+
    |    1 |    1 |           2 |  NULL |    2 |
    |    4 |    4 |           8 |     8 |    8 |
    |    5 |    5 |          10 |    10 |   10 |
    |    3 |    3 |           6 |     6 |    6 |
    |    2 |    2 |           4 |     4 |    4 |
    |   11 |   11 |          22 |    22 |   22 |
    +------+------+-------------+-------+------+

Case 2: create table `t1` as below (column `id2` is missing from the source file):

    CREATE TABLE `t1` (
      `name` bigint(20) NOT NULL,
      `id` bigint(20) NOT NULL,
      `id2` bigint(20) NULL DEFAULT "0",
      `impressions` bigint(20) SUM NULL DEFAULT "0" COMMENT 'total user impressions',
      `click` bigint(20) SUM NULL DEFAULT "0" COMMENT 'total user clicks',
      `cost` bigint(20) SUM NULL DEFAULT "0" COMMENT 'total user cost'
    ) ENGINE=OLAP
    AGGREGATE KEY(`name`, `id`, `id2`)
    COMMENT 'OLAP'
    PARTITION BY RANGE(`name`)
    (PARTITION p201901 VALUES [("1"), ("100")))
    DISTRIBUTED BY HASH(`id`) BUCKETS 16
    PROPERTIES (
      "replication_allocation" = "tag.location.default: 3",
      "in_memory" = "false",
      "storage_format" = "V2",
      "disable_auto_compaction" = "false"
    );

After a broker load from my_file.orc/parquet, we will get:

    +------+------+------+-------------+-------+------+
    | name | id   | id2  | impressions | click | cost |
    +------+------+------+-------------+-------+------+
    |    5 |    5 |    0 |           5 |     5 |    5 |
    |    4 |    4 |    0 |           4 |     4 |    4 |
    |   11 |   11 |    0 |          11 |    11 |   11 |
    |    1 |    1 |    0 |           1 |  NULL |    1 |
    |    2 |    2 |    0 |           2 |     2 |    2 |
    |    3 |    3 |    0 |           3 |     3 |    3 |
    +------+------+------+-------------+-------+------+

....

Note: the case where `enable_new_load_scan_node=true` is not covered by this PR.

## Checklist(Required)

1. Does it affect the original behavior:
    - [ ] Yes
    - [ ] No
    - [ ] I don't know
2. Has unit tests been added:
    - [ ] Yes
    - [ ] No
    - [ ] No Need
3. Has document been added or modified:
    - [ ] Yes
    - [ ] No
    - [ ] No Need
4. Does it need to update dependencies:
    - [ ] Yes
    - [ ] No
5. Are there any changes that cannot be rolled back:
    - [ ] Yes (If Yes, please explain WHY)
    - [ ] No

## Further comments

If this is a relatively large or complex change, kick off the discussion at [d...@doris.apache.org](mailto:d...@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...