[GitHub] [incubator-doris] kangpinghuang opened a new issue #3295: spark data preparation process

GitBox Fri, 10 Apr 2020 05:22:45 -0700

kangpinghuang opened a new issue #3295: spark data preparation process
URL: https://github.com/apache/incubator-doris/issues/3295
 
 
   To solve #2855, we intend to do ETL using a Spark cluster.
   
   PR #3010 has resolved Spark job submission.
   Issue #2940 has resolved the global dict build process in Spark load.
   This issue tracks the Spark DPP job, which will accomplish the following tasks:
   
   1. Read from the data source and run the ETL job.
   Several tasks need to be done in this step, including:
   
   - schema check
   
   - type cast
   
   - data validation
   
   - null value/default value
   
   - strict mode support
   
   - UDF support
   
   2. Repartition and bucket the data according to the Doris data model
   3. Build rollups, aggregate, and sort
   4. Write the resulting data to Parquet (phase 1) or Doris segment files (phase 2)
   5. Write the DPP job statistics for the FE to parse
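
As a rough illustration of how the steps above fit together, here is a minimal single-process sketch in plain Python. It is not the Spark implementation and does not use any Doris or Spark API. The schema, the column names, `NUM_BUCKETS`, the CRC32-based bucketing, the SUM aggregation, and the statistics field names are all assumptions made up for this example. Strict mode is modeled here as "a failed type cast rejects the row", while non-strict mode turns the failed value into NULL; the real semantics may differ.

```python
import zlib
from collections import defaultdict

# Hypothetical target schema: column -> (type, nullable, default value).
SCHEMA = {
    "user_id": (int, False, None),
    "city": (str, True, "unknown"),
    "cost": (int, True, 0),
}
KEY_COLUMNS = ("user_id", "city")  # assumed key / distribution columns
NUM_BUCKETS = 4                    # assumed bucket count


def convert_row(raw, strict_mode):
    """Step 1: schema check, type cast, default/null handling.

    Returns the typed row, or None if the row must be filtered out.
    """
    row = {}
    for name, (col_type, nullable, default) in SCHEMA.items():
        value = raw.get(name)
        if value is None:
            value = default              # missing value: use the default
        else:
            try:
                value = col_type(value)  # type cast
            except (TypeError, ValueError):
                if strict_mode:
                    return None          # strict mode: reject the row
                value = None             # non-strict: cast failure -> NULL
        if value is None and not nullable:
            return None                  # data validation: NOT NULL violated
        row[name] = value
    return row


def run_dpp(raw_rows, strict_mode=True):
    """Steps 2, 3 and 5: bucket, aggregate and sort, collect statistics."""
    buckets = defaultdict(lambda: defaultdict(int))
    stats = {"scanned_rows": 0, "normal_rows": 0, "abnormal_rows": 0}
    for raw in raw_rows:
        stats["scanned_rows"] += 1
        row = convert_row(raw, strict_mode)
        if row is None:
            stats["abnormal_rows"] += 1
            continue
        stats["normal_rows"] += 1
        key = tuple(row[c] for c in KEY_COLUMNS)
        # Stand-in hash; only meant to show rows with equal keys share a bucket.
        bucket_id = zlib.crc32(repr(key).encode("utf-8")) % NUM_BUCKETS
        # SUM aggregation per key; a NULL cost contributes nothing.
        buckets[bucket_id][key] += row["cost"] if row["cost"] is not None else 0
    # Sort each bucket by key, as a writer would before emitting a file.
    output = {b: sorted(agg.items()) for b, agg in buckets.items()}
    return output, stats


if __name__ == "__main__":
    sample = [
        {"user_id": "1", "city": "bj", "cost": "10"},
        {"user_id": "1", "city": "bj", "cost": "5"},
        {"user_id": "2", "cost": "abc"},
    ]
    print(run_dpp(sample, strict_mode=True))
```

In an actual Spark job, each bucket's sorted output would be written as a Parquet file (step 4), and the statistics dictionary would be serialized to a location the FE can read. Both are left out here to keep the sketch self-contained.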


