[GitHub] [iceberg] rbalamohan opened a new issue, #6364: Optimise POS reads

GitBox Mon, 05 Dec 2022 19:00:40 -0800


rbalamohan opened a new issue, #6364:
URL: https://github.com/apache/iceberg/issues/6364


   ### Apache Iceberg version
   
   0.14.1
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   Currently, a combinedFileTask can contain more than one file. Depending on
   the nature of the workload, a single split can even hold 30-50+ files. When
   there are 4+ POS (positional delete) files, "select" queries take much
   longer to process. This is because every data file has to process the POS
   files, which leads to read amplification.
   
   The request is to optimise how POS files are read.
   - Optimise the Parquet reader by caching the file status and footer
     - Optimise within a combinedFileTask, i.e. within a single task on a
       single executor. A single split can contain more than one file,
       typically 10-50+ depending on the size of the files.
     - For simplicity, let us start with 1 POS file. This POS file can hold
       delete information for all 50+ files in the combined task.
       - Currently, every data file that is opened needs its "delete row
         positions", so it invokes "DeleteFilter::deletedRowPositions". This
         opens the POS file, reads the footer, and reads the portion for that
         specific file path.
       - The above step happens for all 50+ files, in sequential order.
       - Internally, the POS file is opened and its footer is read 50+ times,
         which is unnecessary.
       - We need a lightweight Parquet reader that accepts readerConfs etc.
         and takes the footer information as an argument. Basically, cache
         the footer and file status details to reduce round trips to object
         stores.
         - Alternatively, pass the POS reader along during data reading, so
           that it doesn't need to reopen the file and read the footer again.
    - Optimise POS reading
      - Although the path is dictionary encoded, the reader ends up
        materializing the path again and again. We need a way to optimise
        this to reduce CPU burn when reading POS files.
      - Covered in https://github.com/apache/iceberg/issues/5863
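
To illustrate the caching idea above, here is a minimal Java sketch. It does not use any Iceberg or Parquet API: `PosDeleteCacheSketch`, `TaskScopedCache`, `PosDelete` and the `posFileReader` function are all hypothetical names, and the reader function merely stands in for "open the POS file, read the footer, scan the rows". The sketch only shows the shape of the optimisation: within one combined task, each POS file is opened once and its rows are grouped by data file path, so the 50+ data files share a single open.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; none of these types are Iceberg or Parquet APIs.
final class PosDeleteCacheSketch {

  // One row of a POS file: a data file path and a deleted row position.
  static final class PosDelete {
    final String dataFilePath;
    final long position;

    PosDelete(String dataFilePath, long position) {
      this.dataFilePath = dataFilePath;
      this.position = position;
    }
  }

  // Cache scoped to one combined task: each POS file is read at most once.
  static final class TaskScopedCache {
    // Stand-in for "open POS file, read footer, scan rows" (assumption).
    private final Function<String, List<PosDelete>> posFileReader;
    // POS file path -> (data file path -> deleted positions).
    private final Map<String, Map<String, List<Long>>> index = new HashMap<>();
    private int opens = 0;

    TaskScopedCache(Function<String, List<PosDelete>> posFileReader) {
      this.posFileReader = posFileReader;
    }

    List<Long> deletedPositions(String posFile, String dataFile) {
      Map<String, List<Long>> byDataFile = index.get(posFile);
      if (byDataFile == null) {
        // First request for this POS file: read it once and group by path.
        opens++;
        byDataFile = new HashMap<>();
        for (PosDelete d : posFileReader.apply(posFile)) {
          byDataFile.computeIfAbsent(d.dataFilePath, k -> new ArrayList<>())
              .add(d.position);
        }
        index.put(posFile, byDataFile);
      }
      return byDataFile.getOrDefault(dataFile, List.of());
    }

    int opens() {
      return opens;
    }
  }

  public static void main(String[] args) {
    // One fake POS file holding one delete for each of 50 data files.
    List<PosDelete> rows = new ArrayList<>();
    for (int i = 0; i < 50; i++) {
      rows.add(new PosDelete("data-" + i + ".parquet", i));
    }
    TaskScopedCache cache = new TaskScopedCache(path -> rows);
    for (int i = 0; i < 50; i++) {
      cache.deletedPositions("pos-1.parquet", "data-" + i + ".parquet");
    }
    System.out.println("POS file opens: " + cache.opens()); // prints "POS file opens: 1"
  }
}
```

The sketch keeps all positions of a POS file in memory for the lifetime of the task, which trades memory for fewer object store round trips. A real change would probably cache only the footer and file status, or bound the cache size.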
           
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


