Re: [I] Read Parquet data file with projection [iceberg-rust]

via GitHub Tue, 26 Mar 2024 01:55:22 -0700


liurenjie1024 commented on issue #244:
URL: https://github.com/apache/iceberg-rust/issues/244#issuecomment-2019857203


   # Problem Statement
   
   When converting a parquet file to arrow in iceberg, there are several problems 
to take into consideration:
   
   1. Field id mapping. Iceberg stores the field id in the arrow schema, which is 
then converted to the parquet schema's field id. When doing projection, we should 
map fields by id rather than by name.
   2. Type promotion. Iceberg supports lazy schema evolution, e.g. int -> long. 
So when reading from parquet, we need to promote an int array to a long array.
   3. Default value. Due to iceberg's lazy schema evolution, a field may be 
missing from the parquet file when reading. In this case we can fill in null 
values, or ideally the default values defined in the schema.
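
Problems 2 and 3 can be illustrated with a minimal, std-only sketch. The `Column` and `Type` enums and the `promote`, `null_column` and `resolve` helpers are hypothetical stand-ins invented for this illustration; they are not part of iceberg-rust or arrow-rs, and a real implementation would presumably operate on arrow arrays instead.

```rust
/// Hypothetical stand-in for an arrow array; `None` models a null slot.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Int(Vec<Option<i32>>),
    Long(Vec<Option<i64>>),
    Str(Vec<Option<String>>),
}

/// Hypothetical logical type requested by the table schema.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Type {
    Int,
    Long,
    Str,
}

/// Type promotion: widen an int column to long, keeping nulls as nulls.
/// Columns that already have the requested type pass through unchanged;
/// every other combination is rejected.
fn promote(column: Column, target: Type) -> Result<Column, String> {
    match (column, target) {
        (Column::Int(values), Type::Long) => Ok(Column::Long(
            values.into_iter().map(|v| v.map(i64::from)).collect(),
        )),
        (c @ Column::Int(_), Type::Int)
        | (c @ Column::Long(_), Type::Long)
        | (c @ Column::Str(_), Type::Str) => Ok(c),
        (c, t) => Err(format!("cannot read {:?} as {:?}", c, t)),
    }
}

/// Default value: a field missing from the file becomes an all-null column
/// of the requested type (this sketch does not model schema defaults).
fn null_column(target: Type, num_rows: usize) -> Column {
    match target {
        Type::Int => Column::Int(vec![None; num_rows]),
        Type::Long => Column::Long(vec![None; num_rows]),
        Type::Str => Column::Str(vec![None; num_rows]),
    }
}

/// Resolve one projected field: promote it if the file has it,
/// otherwise fill in nulls.
fn resolve(
    file_column: Option<Column>,
    target: Type,
    num_rows: usize,
) -> Result<Column, String> {
    match file_column {
        Some(column) => promote(column, target),
        None => Ok(null_column(target, num_rows)),
    }
}
```

With these helpers, `resolve(Some(Column::Int(..)), Type::Long, n)` widens an int column, while `resolve(None, Type::Str, n)` yields `n` nulls for a field the file does not contain.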
   
   # Example
   Let's use an example to illustrate these problems. Suppose the current 
iceberg table schema is the following:
   ```protobuf
   schema {
     struct person [id = 1] {
        struct address [id = 2] {
           string city [id= 3]
           string street [id = 4]
        }
        string name [id = 5]
     }
     struct hometown [id=6] {
        string city [id = 7]
        string state [id=8]
     }
     long age [id=9]
   }
   ```
   
   And a parquet file with the following schema: 
   
   ```protobuf
   schema {
     struct person [id = 1] {
        struct address [id = 2] {
           string city [id= 3]
           string street [id = 4]
        }
        string name [id = 5]
     }
     struct hometown [id=6] {
        string city [id = 7]
     }
     int age [id=9]
   }
   ```
   
   Now we want to do the following projection: ("person.address", "person.name", 
"hometown.state", "age")
   
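   The result schema is supposed to be the following, where `state` is missing 
from the parquet file and has to be filled in, and `age` has to be promoted from 
int to long: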
   ```
   schema {
     struct address [id = 2] {
        string city [id= 3]
        string street [id = 4]
     }
     string name [id = 5]
     string state [id=8]
     long age [id=9]
   }
   ```
   
   # Solution
   
   After #251 and #252, we have finished the necessary building blocks for 
projection. Here is a proposed algorithm:
   1. Collect leaf column ids after schema pruning, and translate them into a 
`ProjectionMask` to do column pruning when reading the parquet file.
   2. Implement something like [ArrowProjectionVisitor in 
python](https://github.com/apache/iceberg-python/blob/afdfa351119090f09d38ef72857d6303e691f5ad/pyiceberg/io/pyarrow.py#L1135)
 to translate the record batch read from parquet into a RecordBatch matching 
iceberg's schema.
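
Step 1 could look roughly like the following std-only sketch, applied to the example schemas above. The `Field` type and the `leaf`, `nested`, `leaf_ids` and `plan_projection` helpers are hypothetical names made up for illustration, not iceberg-rust APIs. The sketch assumes the pruned schema keeps its struct wrappers and that parquet numbers leaf columns in depth-first schema order; the resulting indices are the kind of input that something like the parquet crate's `ProjectionMask::leaves` would take.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified schema node: only the field id and the
/// children matter for this sketch.
#[derive(Debug, Clone)]
struct Field {
    id: i32,
    children: Vec<Field>, // empty for primitive (leaf) fields
}

fn leaf(id: i32) -> Field {
    Field { id, children: vec![] }
}

fn nested(id: i32, children: Vec<Field>) -> Field {
    Field { id, children }
}

/// Collect the ids of all leaf fields in depth-first order.
fn leaf_ids(fields: &[Field], out: &mut Vec<i32>) {
    for field in fields {
        if field.children.is_empty() {
            out.push(field.id);
        } else {
            leaf_ids(&field.children, out);
        }
    }
}

/// Match projected leaves to file leaves by field id, never by name.
/// Returns the sorted leaf column indices to read from the file, and the
/// field ids that are missing from the file and must be filled in later.
fn plan_projection(projected: &[Field], file: &[Field]) -> (Vec<usize>, Vec<i32>) {
    let mut file_leaves = Vec::new();
    leaf_ids(file, &mut file_leaves);
    let index_by_id: HashMap<i32, usize> = file_leaves
        .iter()
        .enumerate()
        .map(|(index, id)| (*id, index))
        .collect();

    let mut wanted = Vec::new();
    leaf_ids(projected, &mut wanted);

    let mut indices = Vec::new();
    let mut missing = Vec::new();
    for id in wanted {
        match index_by_id.get(&id) {
            Some(&index) => indices.push(index),
            None => missing.push(id),
        }
    }
    indices.sort_unstable();
    (indices, missing)
}

/// Pruned table schema for the projection
/// ("person.address", "person.name", "hometown.state", "age").
fn example_projected() -> Vec<Field> {
    vec![
        nested(1, vec![nested(2, vec![leaf(3), leaf(4)]), leaf(5)]),
        nested(6, vec![leaf(8)]),
        leaf(9),
    ]
}

/// Parquet file schema from the example; it has no `state` field (id 8).
fn example_file() -> Vec<Field> {
    vec![
        nested(1, vec![nested(2, vec![leaf(3), leaf(4)]), leaf(5)]),
        nested(6, vec![leaf(7)]),
        leaf(9),
    ]
}
```

For the example, `plan_projection(&example_projected(), &example_file())` selects leaf columns 0, 1, 2 and 4 of the file and reports field id 8 (`hometown.state`) as missing, which is the column step 2 would have to fill in.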
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


