[GitHub] [doris] morningman opened a new pull request, #18074: [enhance](parquet-reader) cache file meta of parquet to speed up query

via GitHub Thu, 23 Mar 2023 09:22:38 -0700


morningman opened a new pull request, #18074:
URL: https://github.com/apache/doris/pull/18074


   # Proposed changes
   
   Issue Number: close #xxx
   
   ## Problem summary
   
   Problem:
   1. FE divides each parquet file into splits, so a single file can have several splits.
   2. BE scans each split and reads the footer of the parquet file for each one.
   3. If two splits belong to the same parquet file, the footer of that file is read twice.
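
For context, a Parquet file ends with the serialized footer, then a 4-byte little-endian footer length, then the magic bytes `PAR1`, so a reader has to visit the file tail no matter which split it is scanning. The sketch below only illustrates that layout in Python. `read_footer_bytes` is a hypothetical helper, not Doris code, and thrift decoding of the footer is left out.

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"

def read_footer_bytes(f):
    """Return the raw (still thrift-encoded) footer bytes of a Parquet file.

    Hypothetical helper for illustration; assumes the standard PAR1 trailer.
    """
    f.seek(-8, io.SEEK_END)              # last 8 bytes: footer length + magic
    tail = f.read(8)
    if len(tail) != 8 or tail[4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet file: bad trailing magic")
    (footer_len,) = struct.unpack("<I", tail[:4])
    f.seek(-(8 + footer_len), io.SEEK_END)
    return f.read(footer_len)

# A tiny in-memory stand-in for a Parquet file (the contents are made up).
fake_footer = b"fake-thrift-footer"
fake_file = io.BytesIO(
    PARQUET_MAGIC + b"row-group-data" + fake_footer
    + struct.pack("<I", len(fake_footer)) + PARQUET_MAGIC
)
footer = read_footer_bytes(fake_file)
```

Every split of the same file would repeat this read, plus the thrift parse, unless the result is shared.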
   
   This PR mainly changes:
   1. Use a kv cache to cache the footer of each parquet file.
   2. The kv cache belongs to a scan node, so all parquet readers belonging to this scan node share the same kv cache.
   3. In the cache, the key is "meta_file_path" and the value is the parsed thrift footer.
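
The idea can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the actual BE implementation: `FooterCache`, `parse_footer`, and `scan_split` are hypothetical names, the parsed footer is a plain dict standing in for the thrift structure, and the key format (a `meta_` prefix plus the file path) is one reading of the description above.

```python
import threading

class FooterCache:
    """Hypothetical per-scan-node kv cache: key -> parsed footer."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}

    def get_or_load(self, key, loader):
        # Loading under the lock keeps the sketch simple and ensures
        # each key is parsed at most once, even with concurrent readers.
        with self._lock:
            if key not in self._entries:
                self._entries[key] = loader()
            return self._entries[key]

parse_calls = []  # records every real footer parse, for demonstration

def parse_footer(path):
    # Stand-in for reading and thrift-decoding the footer (assumption).
    parse_calls.append(path)
    return {"path": path}

def scan_split(cache, path, split_offset):
    # Each split's reader asks the shared cache before parsing the footer.
    footer = cache.get_or_load("meta_" + path, lambda: parse_footer(path))
    return (split_offset, footer["path"])

cache = FooterCache()  # one cache shared by all readers of a scan node
splits = [("a.parquet", 0), ("a.parquet", 128), ("b.parquet", 0)]
results = [scan_split(cache, path, offset) for path, offset in splits]
```

With three splits over two files, `parse_calls` ends up holding two entries rather than three, because the second split of `a.parquet` reuses the cached footer.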
   
   In my test, a query with 26 splits reduced its footer parse time from 3s to 1s.
   
   ## Checklist(Required)
   
   * [ ] Does it affect the original behavior
   * [ ] Has unit tests been added
   * [ ] Has documentation been added or modified
   * [ ] Does it need to update dependencies
   * [ ] Does this PR support rollback (If NO, please explain WHY)
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at 
[d...@doris.apache.org](mailto:d...@doris.apache.org) by explaining why you 
chose the solution you did and what alternatives you considered, etc...
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


