[GitHub] [incubator-doris] xy720 opened a new issue #4101: [Proposal]Create a jar package's repository for Spark Load

GitBox Wed, 15 Jul 2020 01:24:20 -0700


xy720 opened a new issue #4101:
URL: https://github.com/apache/incubator-doris/issues/4101



   **Motivation**
   We recently introduced Spark Load, which currently has to upload many jar 
packages to the Yarn cluster before each load. These packages include 
`$DORIS_HOME/lib/palo-fe.jar` (the Dpp runtime dependency) and all jars in the 
`$SPARK_HOME/jars` folder (the Spark dependencies). Uploading them usually 
takes 2~3 minutes.
   
   Currently, these jars are uploaded to temporary directories in HDFS. 
`palo-fe.jar` is uploaded to `{working_dir}/jobs/DB_ID/LABEL/JOB_ID/configs`. 
The other jars are packaged as a zip file and uploaded to 
`{stage_dir}/APPLICATION_ID/__spark_lib__.zip`.
   
   In most cases, the jar packages uploaded by two different loads are 
exactly the same, so we don't have to upload them every time. In addition, 
the jar packages should be stored in one directory so that we can manage them 
easily. Moreover, we can put all jars into a zip file in the compile phase and 
upload it to a specified remote repository before the load.
   
   Therefore, I propose creating a repository in HDFS for all dependencies of 
Spark Load.
   
   **The repository structure**
   
   ```
   Repository/
   |-lib_{version}.zip
   |     {All spark dependencies}
   |     |-roaringbitmap.jar
   |     |-activation-1.1.1.jar
   |     |-aircompressor-0.10.jar
   |     |-...
   |     {All dpp dependencies}
   |     |-spark-dpp.jar
   |-lib_{version}.zip
   |-lib_{version}.zip
   |-...
   ```
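   An archive with this layout could be assembled at build time roughly as 
follows. This is only a sketch: the function name `build_lib_archive`, its 
parameters, and the flat (no sub-directory) layout inside the zip are 
assumptions for illustration, not part of the proposal or of the Doris build.

```python
import zipfile
from pathlib import Path


def build_lib_archive(jar_dirs, version, out_dir):
    """Pack every *.jar found in jar_dirs into out_dir/lib_{version}.zip.

    Hypothetical helper: names and layout are assumptions for illustration.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    archive = out_dir / f"lib_{version}.zip"
    # Jars are already compressed, so store them without recompressing.
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_STORED) as zf:
        for jar_dir in jar_dirs:
            for jar in sorted(Path(jar_dir).glob("*.jar")):
                # Flat layout: every jar sits at the root of the archive.
                zf.write(jar, arcname=jar.name)
    return archive
```

Here `jar_dirs` would be something like the `$SPARK_HOME/jars` folder plus the 
directory that holds `spark-dpp.jar`. If two directories contain a jar with the 
same file name, this sketch would write both entries, so a real build step 
would need to decide which one wins.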
   
   The Repository/ directory is the parent directory of all zip files. When we 
submit a Spark load, FE compares the version of the local zip file with the 
versions of the remote zip files, and uploads the local file only when no 
remote file with the matching version is found.
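   The version check could look roughly like the sketch below. It assumes that 
the version is encoded in the file name as `lib_{version}.zip`, as in the tree 
above, and that the caller has already listed the file names under Repository/ 
(the HDFS listing itself is not shown). The function names are hypothetical, 
and the proposal does not say whether the version is a release string, a 
checksum, or something else.

```python
import re

# Matches archive names of the form lib_{version}.zip (assumed naming scheme).
ARCHIVE_PATTERN = re.compile(r"^lib_(?P<version>.+)\.zip$")


def remote_versions(remote_file_names):
    """Collect the versions of all lib_{version}.zip files in a listing."""
    versions = set()
    for name in remote_file_names:
        match = ARCHIVE_PATTERN.match(name)
        if match:  # ignore anything that is not a lib archive
            versions.add(match.group("version"))
    return versions


def needs_upload(local_version, remote_file_names):
    """Return True when no remote archive has the local version."""
    return local_version not in remote_versions(remote_file_names)
```

With a listing of `["lib_0.12.zip", "lib_0.13.zip"]`, a local version of `0.13` 
would skip the upload, while `0.14` would trigger it.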
   
   Note that `spark-dpp.jar` is built by the spark-dpp sub-module. The 
difference between `palo-fe.jar` and `spark-dpp.jar` is that `spark-dpp.jar` 
contains the other third-party libraries that `palo-fe.jar` depends on. For 
details about the multi-module layout of FE, see issue #4098.
   
   Meanwhile, we can set the `AppResourceHdfsPath` argument of spark-submit to 
the lib.zip file. Spark will analyze it and find the entry point of the main 
class.
   


