compasses opened a new pull request, #15839:
URL: https://github.com/apache/doris/pull/15839

   # Proposed changes
   
   Issue Number: just one part of #11640
   
   ## Problem summary
   
   Describe your changes.
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: 
       - [ ] Yes
       - [ ✓] No
       - [ ] I don't know
   2. Has unit tests been added:
       - [ ] Yes
       - [ ] No
       - [ ✓] No Need
   3. Has document been added or modified:
       - [ ] Yes
       - [ ✓] No
       - [ ] No Need
   4. Does it need to update dependencies:
       - [ ] Yes
       - [ ✓] No
   5. Are there any changes that cannot be rolled back:
       - [ ] Yes (If Yes, please explain WHY)
       - [ ✓] No
   
   ## Further comments
   This PR is one part of our bulk load implementation. It provides a tool that builds the segment files of a tablet externally.
   It supports building from both local files and HDFS. You need to provide the meta file and the data files, like this:
   
   ```
   ./segment_builder --meta_file=/path/to/hdr/88409.hdr --data_path=/path/to/data/file --format=parquet --is_remote=false
   
   ll /path/to/data/file
   xxx1..gz.parquet
   xxx2..gz.parquet
   ...
   ```
   If all the files are on HDFS, the paths should be HDFS paths. Currently only the parquet format is supported.
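
   For the local case (`--is_remote=false`), the inputs can be checked before the tool is invoked. The following is only a sketch: the function name `build_segment_cmd` and the checks are hypothetical and not part of this PR; only the four flags come from the example above. It prints the command instead of running it.

   ```shell
   # Hypothetical pre-flight helper (not part of the PR). Local paths only.
   build_segment_cmd() {
       meta_file=$1
       data_path=$2
       # The meta (.hdr) file must exist.
       if [ ! -f "$meta_file" ]; then
           echo "missing meta file: $meta_file" >&2
           return 1
       fi
       # The data path must hold at least one parquet file,
       # since parquet is the only supported format.
       found=0
       for f in "$data_path"/*.parquet; do
           if [ -e "$f" ]; then
               found=1
               break
           fi
       done
       if [ "$found" -ne 1 ]; then
           echo "no parquet files in: $data_path" >&2
           return 1
       fi
       # Print the command so it can be reviewed before it is run.
       echo "./segment_builder --meta_file=$meta_file --data_path=$data_path --format=parquet --is_remote=false"
   }
   ```

   Calling `build_segment_cmd /path/to/hdr/88409.hdr /path/to/data/file` prints the command shown earlier if both inputs are present, and returns a non-zero status otherwise.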
   
   Because we use a privately owned HDFS library internally, ***the HDFS-related code in this PR may not work***. I don't have an open source HDFS environment to test it with.
   
   
   
![image](https://user-images.githubusercontent.com/10161171/211814188-f70afdd1-b973-48cc-be7e-db39dc5a036a.png)
   
   The picture above shows the final workflow:
   1. Read the hdr file from the meta path, then do some validation and system initialization.
   2. Build the HDFS scanner, read the parquet files directly from HDFS, and generate the segment files on local disk.
   3. Finally, upload the segment files to HDFS, to the same path as the hdr file. All these files will be used by the load segment statement.
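
   To illustrate where step 3 puts its output, here is a small sketch. The helper `plan_segment_upload` is hypothetical, and the tool itself uploads through its HDFS library rather than the `hdfs dfs -put` command of the open source Hadoop CLI, which is used here only as a familiar equivalent. The helper prints the uploads without running them.

   ```shell
   # Hypothetical helper (not part of the PR): print the uploads that
   # step 3 amounts to, one per local segment file.
   plan_segment_upload() {
       meta_file=$1
       shift
       # Segment files go to the same directory as the hdr file.
       dest_dir=$(dirname -- "$meta_file")
       for segment in "$@"; do
           echo "hdfs dfs -put $segment $dest_dir/"
       done
   }
   ```

   For example, `plan_segment_upload /path/to/hdr/88409.hdr <local segment files>` prints one `hdfs dfs -put` line per file, each targeting `/path/to/hdr/`.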
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

