GoGoWen opened a new pull request, #14148:
URL: https://github.com/apache/doris/pull/14148

   
   # Proposed changes
   Enhance broker load for Parquet and ORC files when columns of the target table are missing from the source files.
   ## Problem summary
   The source file my_file.orc/parquet contains:
   +------+------+-------------+-------+------+
   | name | id   | impressions | click | cost |
   +------+------+-------------+-------+------+
   |    1 |    1 |           2 |  NULL |    2 |
   |    4 |    4 |           8 |     8 |    8 |
   |    5 |    5 |          10 |    10 |   10 |
   |    3 |    3 |           6 |     6 |    6 |
   |    2 |    2 |           4 |     4 |    4 |
   |   11 |   11 |          22 |    22 |   22 |
   +------+------+-------------+-------+------+
   Case 1:
   Create table t1 with the same columns as the source file:
   CREATE TABLE `t1` (
     `name` bigint(20) NOT NULL,
     `id` bigint(20) NOT NULL,
     `impressions` bigint(20) SUM NULL DEFAULT "0" COMMENT 'total user impressions',
     `click` bigint(20) SUM NULL DEFAULT "0" COMMENT 'total user clicks',
     `cost` bigint(20) SUM NULL DEFAULT "0" COMMENT 'total user cost'
   ) ENGINE=OLAP
   AGGREGATE KEY(`name`, `id`)
   COMMENT 'OLAP'
   PARTITION BY RANGE(`name`)
   (PARTITION p201901 VALUES [("1"), ("100")))
   DISTRIBUTED BY HASH(`id`) BUCKETS 16
   PROPERTIES (
   "replication_allocation" = "tag.location.default: 3",
   "in_memory" = "false",
   "storage_format" = "V2",
   "disable_auto_compaction" = "false"
   );
   
   When we load from the source file, we will get:
   +------+------+-------------+-------+------+
   | name | id   | impressions | click | cost |
   +------+------+-------------+-------+------+
   |    1 |    1 |           2 |  NULL |    2 |
   |    4 |    4 |           8 |     8 |    8 |
   |    5 |    5 |          10 |    10 |   10 |
   |    3 |    3 |           6 |     6 |    6 |
   |    2 |    2 |           4 |     4 |    4 |
   |   11 |   11 |          22 |    22 |   22 |
   +------+------+-------------+-------+------+
   
   
   Case 2:
   Create table t1 as below (column id2 is missing from the source file):
   CREATE TABLE `t1` (
     `name` bigint(20) NOT NULL,
     `id` bigint(20) NOT NULL,
     `id2` bigint(20) NULL DEFAULT "0",
     `impressions` bigint(20) SUM NULL DEFAULT "0" COMMENT 'total user impressions',
     `click` bigint(20) SUM NULL DEFAULT "0" COMMENT 'total user clicks',
     `cost` bigint(20) SUM NULL DEFAULT "0" COMMENT 'total user cost'
   ) ENGINE=OLAP
   AGGREGATE KEY(`name`, `id`, `id2`)
   COMMENT 'OLAP'
   PARTITION BY RANGE(`name`)
   (PARTITION p201901 VALUES [("1"), ("100")))
   DISTRIBUTED BY HASH(`id`) BUCKETS 16
   PROPERTIES (
   "replication_allocation" = "tag.location.default: 3",
   "in_memory" = "false",
   "storage_format" = "V2",
   "disable_auto_compaction" = "false"
   );
   
   After a broker load from my_file.orc/parquet, we will get (id2 is filled with its default value 0):
   +------+------+------+-------------+-------+------+
   | name | id   | id2  | impressions | click | cost |
   +------+------+------+-------------+-------+------+
   |    5 |    5 |    0 |           5 |     5 |    5 |
   |    4 |    4 |    0 |           4 |     4 |    4 |
   |   11 |   11 |    0 |          11 |    11 |   11 |
   |    1 |    1 |    0 |           1 |  NULL |    1 |
   |    2 |    2 |    0 |           2 |     2 |    2 |
   |    3 |    3 |    0 |           3 |     3 |    3 |
   +------+------+------+-------------+-------+------+
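
   For reference, a load of this kind might be submitted roughly as follows. This is a minimal sketch, not taken from this PR: the database `example_db`, the label `label_t1_orc`, the HDFS path, the broker name `my_broker`, and the credentials are all placeholders to be replaced with real values.

   ```sql
   -- Hypothetical broker load of the ORC source file into t1.
   -- No column list is given, so columns are matched by name;
   -- id2 does not exist in the file and is expected to take its default.
   LOAD LABEL example_db.label_t1_orc
   (
       DATA INFILE("hdfs://namenode_host:8020/path/to/my_file.orc")
       INTO TABLE `t1`
       FORMAT AS "orc"
   )
   WITH BROKER "my_broker"
   (
       "username" = "hdfs_user",
       "password" = "hdfs_password"
   )
   PROPERTIES
   (
       "timeout" = "3600"
   );

   -- Broker load is asynchronous; check the job state before querying.
   SHOW LOAD WHERE LABEL = "label_t1_orc";
   ```

   For the Parquet variant of the file, `FORMAT AS "parquet"` would be used instead.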
   
   ....
   
   Note:
   The case where enable_new_load_scan_node=true is not covered by this PR.
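
   To see which code path a cluster is on, the current value can be inspected. The statement below is a sketch that assumes `enable_new_load_scan_node` is exposed as a frontend (FE) config item in the version in use:

   ```sql
   -- Assumption: enable_new_load_scan_node is an FE config item.
   -- Shows its current value; this PR covers only the "false" case.
   ADMIN SHOW FRONTEND CONFIG LIKE "enable_new_load_scan_node";
   ```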
   
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: 
       - [ ] Yes
       - [ ] No
       - [ ] I don't know
   2. Has unit tests been added:
       - [ ] Yes
       - [ ] No
       - [ ] No Need
   3. Has document been added or modified:
       - [ ] Yes
       - [ ] No
       - [ ] No Need
   4. Does it need to update dependencies:
       - [ ] Yes
       - [ ] No
   5. Are there any changes that cannot be rolled back:
       - [ ] Yes (If Yes, please explain WHY)
       - [ ] No
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at 
[d...@doris.apache.org](mailto:d...@doris.apache.org) by explaining why you 
chose the solution you did and what alternatives you considered, etc...
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

