[GitHub] [incubator-doris] worker24h opened a new issue #3509: Doris流式导入json数据，相关接口定义

GitBox Thu, 07 May 2020 05:17:01 -0700


worker24h opened a new issue #3509:
URL: https://github.com/apache/incubator-doris/issues/3509



   此issue主要讨论，导入json数据接口定义， 欢迎大家进行讨论
   
   前提： 表1有 category,price,author三列， 以routine load导入举例说明
   分类： 由于json支持嵌套，所以json数据可能很复杂，因为这里设计默认导入和精准导入
   一、默认导入（不提供jsonpath）
       默认导入，表示表列的顺序与json中key/value一一对应, **即表列名子等于json key的名字**
       1.1 创建routine load语句， 该语句同事支持 场景1.1 和 场景1.2
           CREATE ROUTINE LOAD example_db.test_json_label_1 ON table1
           COLUMNS(category,price,author)
           PROPERTIES
           (
               "desired_concurrent_number"="3",
               "max_batch_interval" = "20",
               "max_batch_rows" = "300000",
               "max_batch_size" = "209715200",
               "strict_mode" = "false",
               "format" = "json"
           )
           FROM KAFKA
           (
               "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
               "kafka_topic" = "my_topic",
               "kafka_partitions" = "0,1,2",
               "kafka_offsets" = "0,0,0"
           );
           format: 目前支持csv, json两种，默认是csv
   
       1.2 单object数据
           kafka中一条消息内容 => {"category":"a1", "price":10, "author":"c1"}
           导入后表数据：
           category      price      author
           -------------------------------
           a1            10         c1
   
       1.3 多object数据
           为了提升吞吐，支持一次性导入多条数据，**其中RECORDS是关键字且为数组**
           kafka中一条消息内容 => {
                               "RECORDS":[
                                   {"category":"a1", "price":10, "author":"c1"},
                                   {"author":"c2", "category":"a2"}
                                   {"category":"a1", "price":30, "author":"c3"}
                               ]
                           }
           导入后表数据：
           category      price      author
           -------------------------------
           a1            10          c1
           a2            null        c2
           a3            30          c3
           【特别说明】 如果没有匹配到数据，默认是null
   
   
   二、精准导入（提供jsonpaths）
       2.1 json数据格式
       {
           "store": {
               "book": [
                   {"category": "a1", "author": "c1", "price": 10},
                   {"category": "a2", "author": "c2"},
                   {"category": "a3", "author": "c3", "price": 30}
               ]
       }
       2.2 创建routine load语句
       CREATE ROUTINE LOAD example_db.test_json_label_1 ON table1
       COLUMNS(category, price, author)
       PROPERTIES
       (
           "desired_concurrent_number"="3",
           "max_batch_interval" = "20",
           "max_batch_rows" = "300000",
           "max_batch_size" = "209715200",
           "strict_mode" = "false",
           "format" = "json",
           "jsonpaths" = 
"[\"$.store.book.category\",\"$.store.book.price\",\"$.store.book.author\"]"
       )
       FROM KAFKA
       (
           "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
           "kafka_topic" = "my_topic",
           "kafka_partitions" = "0,1,2",
           "kafka_offsets" = "0,0,0"
       );
       导入后表数据：
       category      price      author
       -------------------------------
       a1            10          c1
       a2            null        c2
       a3            30          c3
       【特别说明】
       1）如果jsonpath未匹配到数据，则默认是null
       
2）**doris内部使用rapidJson第三方库进行解析json，该库只支持双引号("")，表示字符串，因此在定义jsonpath时我们需要对其进行转义**
   
       2.3 jsonpaths_file
       当我们列比较多时可以将jsonpaths放到一个固定文件中且将文件上传到doris中
       CREATE ROUTINE LOAD example_db.test_json_label_1 ON table1
       COLUMNS(category, price, author)
       PROPERTIES
       (
           "desired_concurrent_number"="3",
           "max_batch_interval" = "20",
           "max_batch_rows" = "300000",
           "max_batch_size" = "209715200",
           "strict_mode" = "false",
           "format" = "json",
           "jsonpaths_file" = "DEFAULT:filename.json"
       )
       FROM KAFKA
       (
           "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
           "kafka_topic" = "my_topic",
           "kafka_partitions" = "0,1,2",
           "kafka_offsets" = "0,0,0"
       );
       导入后表数据，与2.2一致。
       【特别说明】 当jsonpaths 和 jsonpaths_file同时出现时， 将使用jsonpaths作为查询依据
   
   三、备注
       由于jsonpath语法比较复杂，目前doris只实现简单语法格式
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org

[GitHub] [incubator-doris] worker24h opened a new issue #3509: Doris流式导入json数据，相关接口定义

Reply via email to