wyb commented on issue #3010: URL: https://github.com/apache/incubator-doris/issues/3010#issuecomment-626421411
1. User interface

1.1 Spark resource management

Spark is used as an external computing resource in Doris to do ETL work. In the future, other external resources may also be used in Doris; for example, MapReduce for ETL, Spark/GPU for queries, HDFS/S3 for external storage. We introduce resource management to manage these external resources used by Doris.

```sql
-- create spark resource
CREATE EXTERNAL RESOURCE resource_name
[FOR user_name]
PROPERTIES
(
  type = spark,
  spark_conf_key = spark_conf_value,
  working_dir = path,
  broker = broker_name
)

-- drop spark resource
DROP EXTERNAL RESOURCE resource_name

-- show resources
SHOW EXTERNAL RESOURCES
SHOW PROC "/external_resources"
```

- CREATE EXTERNAL RESOURCE: `FOR user_name` is optional. If present, the external resource belongs to that user; otherwise, the external resource belongs to the system and is available to all users.
  PROPERTIES:
  1. type: resource type. Only spark is supported now.
  2. spark configuration: follows the standard format of Spark configurations, refer to: https://spark.apache.org/docs/latest/configuration.html.
  3. working_dir: optional, used to store intermediate results of Spark ETL.
  4. broker: optional, used in Spark ETL. The intermediate ETL results need to be read through the broker when they are pushed into BE.

  Example:

  ```sql
  CREATE EXTERNAL RESOURCE "spark0"
  FOR "user0"
  PROPERTIES
  (
    "type" = "spark",
    "spark.master" = "yarn",
    "spark.submit.deployMode" = "cluster",
    "spark.jars" = "xxx.jar,yyy.jar",
    "spark.files" = "/tmp/aaa,/tmp/bbb",
    "spark.executor.memory" = "1g",
    "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:9999",
    "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000",
    "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris",
    "broker" = "broker0"
  )
  ```

- SHOW EXTERNAL RESOURCES: General users can only see their own resources. Admin and root users can see all resources.
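The PROPERTIES map above mixes Spark configurations with keys that Doris itself consumes (`type`, `working_dir`, `broker`). A minimal Python sketch of how such a map could be split — hypothetical illustration only, not Doris source code; the function name and error handling are assumptions:

```python
# Resource-level keys consumed by Doris itself (per the proposal);
# everything prefixed "spark." is passed through as Spark configuration.
RESOURCE_KEYS = {"type", "working_dir", "broker"}

def split_resource_properties(props):
    """Separate resource-level keys from Spark configuration entries."""
    resource_conf = {}
    spark_conf = {}
    for key, value in props.items():
        if key in RESOURCE_KEYS:
            resource_conf[key] = value
        elif key.startswith("spark."):
            spark_conf[key] = value
        else:
            raise ValueError(f"unknown property: {key}")
    # Only the spark resource type exists so far.
    if resource_conf.get("type") != "spark":
        raise ValueError("only type = spark is supported now")
    return resource_conf, spark_conf

resource_conf, spark_conf = split_resource_properties({
    "type": "spark",
    "spark.master": "yarn",
    "spark.executor.memory": "1g",
    "working_dir": "hdfs://127.0.0.1:10000/tmp/doris",
    "broker": "broker0",
})
```

Keeping the two groups separate means the Spark entries can be handed to spark-submit unmodified, while Doris only validates the keys it owns.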
1.2 Create spark load job

```sql
LOAD LABEL db_name.label_name
(
  DATA INFILE ("/tmp/file1") INTO TABLE table_name,
  ...
)
WITH RESOURCE resource_name
[(key1 = value1, ...)]
[PROPERTIES (key2 = value2, ...)]
```

Example:

```sql
LOAD LABEL example_db.test_label
(
  DATA INFILE ("hdfs://127.0.0.1:10000/tmp/file1")
  INTO TABLE example_table
)
WITH RESOURCE "spark0"
(
  "spark.executor.memory" = "1g",
  "spark.files" = "/tmp/aaa,/tmp/bbb"
)
PROPERTIES ("timeout" = "3600")
```

The Spark configurations in the load statement can override the existing configuration in the resource for temporary use.
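The override semantics described above can be sketched in a few lines of Python — a hypothetical illustration of the merge rule, not Doris source code; the function name is an assumption:

```python
def effective_spark_conf(resource_conf, job_conf):
    """Resource configuration with per-job entries layered on top.

    The override is temporary: the stored resource is not mutated,
    so later jobs still see the original resource configuration.
    """
    merged = dict(resource_conf)  # copy, so the resource stays untouched
    merged.update(job_conf)       # entries from the load statement win
    return merged

resource_conf = {"spark.executor.memory": "2g", "spark.master": "yarn"}
job_conf = {"spark.executor.memory": "1g", "spark.files": "/tmp/aaa,/tmp/bbb"}
conf = effective_spark_conf(resource_conf, job_conf)
# conf["spark.executor.memory"] is "1g" for this job only;
# resource_conf still holds "2g".
```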