kangpinghuang opened a new issue #3295: spark data preparation process
URL: https://github.com/apache/incubator-doris/issues/3295

To solve #2855, we intend to do ETL using a Spark cluster. PR #3010 has resolved Spark job submission, and issue #2940 has resolved the global dict build process in Spark load. This issue tracks the Spark DPP job, which will accomplish the following tasks:

1. Read from the data source and run the ETL job. Many jobs should be done in this step, including:
   - schema check
   - type cast
   - data validation
   - null value/default value
   - strict mode support
   - UDF support
2. Repartition and bucket the data according to the Doris data model
3. Rollup build/aggregation/sort
4. Write the data to Parquet (phase 1) or Doris segment file (phase 2)
5. Write the DPP job statistics for FE to parse
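To make tasks 1 and 2 concrete, here is a minimal sketch in plain Python of per-row type casting, null/default handling, strict mode, and bucket assignment. It is not the Spark DPP implementation. The names `Column`, `prepare_row`, and `bucket_of` are hypothetical, the strict-mode behavior (a failed cast drops the row in strict mode and becomes NULL otherwise) is an assumption, and CRC32 is only a stand-in for whatever hash Doris uses for bucketing.

```python
import zlib
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Column:
    """Hypothetical column description used only for this sketch."""
    name: str
    cast: Callable[[str], Any]      # type cast applied to the raw string value
    nullable: bool = True
    default: Optional[Any] = None


def prepare_row(raw: Dict[str, str], schema: List[Column],
                strict: bool) -> Optional[Dict[str, Any]]:
    """Return the cleaned row, or None if the row must be filtered out."""
    out: Dict[str, Any] = {}
    for col in schema:
        value = raw.get(col.name)
        if value is None:
            # Null value / default value handling.
            if col.default is not None:
                out[col.name] = col.default
            elif col.nullable:
                out[col.name] = None
            else:
                return None  # missing value for a NOT NULL column
            continue
        try:
            out[col.name] = col.cast(value)
        except (TypeError, ValueError):
            # Assumed strict-mode semantics: a failed cast filters the row
            # in strict mode, otherwise it becomes NULL when allowed.
            if strict or not col.nullable:
                return None
            out[col.name] = None
    return out


def bucket_of(row: Dict[str, Any], key_columns: List[str],
              num_buckets: int) -> int:
    """Map a row to a bucket by hashing its distribution columns.

    CRC32 is an illustrative choice, not necessarily the Doris hash.
    """
    key = "\x00".join(str(row[c]) for c in key_columns)
    return zlib.crc32(key.encode("utf-8")) % num_buckets
```

In a Spark job the same logic would run per partition, with filtered rows counted so they can be reported in the job statistics mentioned in task 5.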