rbalamohan opened a new issue, #6364: URL: https://github.com/apache/iceberg/issues/6364
### Apache Iceberg version

0.14.1

### Query engine

Spark

### Please describe the bug 🐞

Currently a combinedFileTask can contain more than one file. Depending on the nature of the workload, it can even have 30-50+ files in a single split. When there are 4+ position delete (POS) files, "select" queries take much longer to process. This is because the POS file has to be processed again for every data file, which leads to read amplification.

The request is to optimise the way POS files are read.

- Optimise the parquet reader with cached filestatus and footer
  - Optimise within a combinedFileTask, which runs as a single task in a single executor. It can have more than one file in a single split. Typically there can be 10-50+ files, depending on the size of the files.
  - For simplicity, let us start with 1 POS file. This POS file can hold delete information for all 50+ files in the combined task.
  - Currently, every data file that is opened needs its "delete row positions", so it invokes "DeleteFilter::deletedRowPositions". This opens the POS file, reads the footer, and reads the snippet for the specific file path.
  - The above step happens for all 50+ files, in sequential order.
  - Internally, it opens the POS file and reads the footer information 50+ times, which is not needed.
  - Need a lightweight parquet reader that accepts readerConfs etc. and takes the footer information as an argument. Basically, cache the footer details and file status details to reduce round trips to object stores.
  - Otherwise, pass the POS reader along during data reading, so that it doesn't need to reopen the file and read the footers again.
- Optimise reading of POS files
  - Though the path is dictionary encoded, it ends up being materialized again and again. Need a way to optimise this to reduce CPU burn when reading POS files.
  - Covered in https://github.com/apache/iceberg/issues/5863
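To illustrate the caching idea above, here is a minimal sketch in plain Java. It does not use Iceberg's or Parquet's actual API: `PosDeleteCache`, the `loader` function, and the file names are all hypothetical, and the loader simply stands in for "open the POS file, read the footer, read the delete rows". The point is only that a cache scoped to one combined task lets 50 data files share a single open of the POS file.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical per-task cache: each POS file is loaded at most once. */
final class PosDeleteCache {
  // POS file path -> (data file path -> deleted row positions)
  private final Map<String, Map<String, List<Long>>> byDeleteFile = new HashMap<>();
  private final Function<String, Map<String, List<Long>>> loader;

  PosDeleteCache(Function<String, Map<String, List<Long>>> loader) {
    this.loader = loader;
  }

  /** Returns the deleted positions for one data file; the loader runs only on a cache miss. */
  List<Long> deletedPositions(String deleteFile, String dataFile) {
    return byDeleteFile
        .computeIfAbsent(deleteFile, loader)
        .getOrDefault(dataFile, List.of());
  }
}

final class PosDeleteCacheDemo {
  /** Simulates one combined task whose data files share a single POS file; returns the number of POS file loads. */
  static int opensForTask(int dataFiles) {
    AtomicInteger opens = new AtomicInteger();
    Function<String, Map<String, List<Long>>> loader = path -> {
      opens.incrementAndGet(); // assumption: stands in for open + footer read + scan
      Map<String, List<Long>> index = new HashMap<>();
      for (int i = 0; i < dataFiles; i++) {
        index.put("data-" + i + ".parquet", List.of((long) i, (long) i + 100));
      }
      return index;
    };
    PosDeleteCache cache = new PosDeleteCache(loader);
    for (int i = 0; i < dataFiles; i++) {
      // Sequential lookups, one per data file, as in the combined task described above.
      cache.deletedPositions("pos-delete-0.parquet", "data-" + i + ".parquet");
    }
    return opens.get();
  }

  public static void main(String[] args) {
    System.out.println("POS file opens for 50 data files: " + opensForTask(50));
  }
}
```

Without the cache, the same loop would call the loader 50 times. The trade-off in this sketch is memory: it keeps the whole per-file position index for the lifetime of the task, so a real implementation would probably need to bound it or cache only the footer and file status.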