xy720 opened a new issue #4101: URL: https://github.com/apache/incubator-doris/issues/4101
**Motivation**

Recently, we introduced Spark Load, which currently has to upload many jar packages to the Yarn cluster before each load. These jar packages include `$DORIS_HOME/lib/palo-fe.jar` (the Dpp runtime dependency) and all jars in the `$SPARK_HOME/jars` folder (the Spark dependencies). Uploading them usually takes 2~3 minutes.

Currently, these jars are uploaded to temporary directories in HDFS. `palo-fe.jar` is uploaded to `{working_dir}/jobs/DB_ID/LABEL/JOB_ID/configs`. The other jars are packaged as a zip file and uploaded to `{stage_dir}/APPLICATION_ID/__spark_lib__.zip`.

In most cases, the jar packages uploaded by two different loads are exactly the same, which means we don't have to upload them every time. Secondly, the jar packages should be stored in one directory so that we can manage them easily. Moreover, we can put all jars into a zip file in the compile phase and upload it to a specified remote repository before load.

Therefore, I propose creating a repository in HDFS for all dependencies of Spark Load.

**The repository structure**

```
Repository/
|-lib_{version}.zip
|  {All spark dependencies}
|  |-roaringbitmap.jar
|  |-activation-1.1.1.jar
|  |-aircompressor-0.10.jar
|  |-...
|  {All dpp dependencies}
|  |-spark-dpp.jar
|-lib_{version}.zip
|-lib_{version}.zip
|-...
```

The `Repository/` directory is the parent directory of all zip files. When we submit a Spark Load, the FE compares the version of the local zip file with the versions of the remote zip files, and uploads the local zip only when no remote zip with the matching version is found.

Note that `spark-dpp.jar` is built by the spark-dpp sub-module. The difference between `palo-fe.jar` and `spark-dpp.jar` is that `spark-dpp.jar` contains the other third-party libraries that `palo-fe.jar` depends on. You can see the details about the multi-module layout of the FE in issue #4098.

Meanwhile, we can set the `AppResourceHdfsPath` argument of spark-submit to the lib.zip file. Spark will analyze it and find the entry point of the MainClass.
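
The version check described above could be sketched roughly as follows. This is only an illustration, not the FE implementation: the function names (`archive_name`, `remote_versions`, `plan_upload`), the example version strings, and the `hdfs://nn/repo` path are all hypothetical, and the listing of remote files is passed in as plain strings rather than read from HDFS. The only thing taken from the proposal is the `lib_{version}.zip` naming.

```python
import re
from typing import Iterable, Set, Tuple

# Naming convention from the proposed repository layout: lib_{version}.zip
_ARCHIVE_RE = re.compile(r"^lib_(?P<version>.+)\.zip$")


def archive_name(version: str) -> str:
    """Build the archive file name for a given version (hypothetical helper)."""
    return f"lib_{version}.zip"


def remote_versions(remote_files: Iterable[str]) -> Set[str]:
    """Extract the versions of all lib_{version}.zip files in a listing."""
    versions = set()
    for path in remote_files:
        base = path.rsplit("/", 1)[-1]  # keep only the file name
        match = _ARCHIVE_RE.match(base)
        if match:
            versions.add(match.group("version"))
    return versions


def plan_upload(local_version: str,
                remote_files: Iterable[str],
                repository: str) -> Tuple[str, bool]:
    """Return (remote_path, needs_upload) for the local archive.

    needs_upload is True only when the repository has no archive
    whose version matches the local one.
    """
    remote_path = f"{repository.rstrip('/')}/{archive_name(local_version)}"
    needs_upload = local_version not in remote_versions(remote_files)
    return remote_path, needs_upload


if __name__ == "__main__":
    # Hypothetical repository listing and versions, for illustration only.
    listing = ["hdfs://nn/repo/lib_0.12.0.zip", "hdfs://nn/repo/lib_0.13.0.zip"]
    print(plan_upload("0.13.0", listing, "hdfs://nn/repo"))  # already present
    print(plan_upload("0.14.0", listing, "hdfs://nn/repo"))  # must be uploaded
```

Keeping the decision separate from the actual HDFS calls means the same check can be reused whether the listing comes from a broker, the Hadoop client, or a cached result.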