liurenjie1024 commented on issue #244: URL: https://github.com/apache/iceberg-rust/issues/244#issuecomment-2019857203
# Problem Statement

When converting a parquet file to arrow in iceberg, there are several problems to take into consideration:

1. Field id mapping. Iceberg stores field ids in the arrow schema, and these are then converted to field ids in the parquet schema. When doing projection, we should map fields by id rather than by name.
2. Type promotion. Iceberg supports lazy schema evolution, e.g. int -> long. So when reading from parquet, we need to promote an int array to a long array.
3. Default value. Due to iceberg's lazy schema evolution, a field in the table schema may be missing from the parquet file. In this case we can fill in null values, or ideally the default values defined in the schema.

# Example

Let's use an example to illustrate these problems. Let's say the current iceberg table schema is the following:

```protobuf
schema {
  struct person [id = 1] {
    struct address [id = 2] {
      string city [id = 3]
      string street [id = 4]
    }
    string name [id = 5]
  }
  struct hometown [id = 6] {
    string city [id = 7]
    string state [id = 8]
  }
  long age [id = 9]
}
```

And a parquet file with the following schema:

```protobuf
schema {
  struct person [id = 1] {
    struct address [id = 2] {
      string city [id = 3]
      string street [id = 4]
    }
    string name [id = 5]
  }
  struct hometown [id = 6] {
    string city [id = 7]
  }
  int age [id = 9]
}
```

Now we want to do the following projection: ("person.address", "person.name", "hometown.state", "age")

The result schema is supposed to be the following:

```protobuf
schema {
  struct address [id = 2] {
    string city [id = 3]
    string street [id = 4]
  }
  string name [id = 5]
  string state [id = 8]
  long age [id = 9]
}
```

Here `hometown.state` (id 8) is missing from the parquet file, so it has to be filled in, and `age` (id 9) is stored as int, so it has to be promoted to long.

# Solution

After #251 and #252, we have finished the necessary building blocks for projection. Here is a proposed algorithm for this:

1. Collect the leaf column ids after schema pruning, and translate them into a `ProjectionMask` to do column pruning when reading the parquet file.
2. Implement something like [ArrowProjectionVisitor in python](https://github.com/apache/iceberg-python/blob/afdfa351119090f09d38ef72857d6303e691f5ad/pyiceberg/io/pyarrow.py#L1135) to translate each record batch read from parquet into an actual `RecordBatch` matching iceberg's schema.
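To make the two steps concrete, here is a rough sketch in plain Python, using the example schemas from this comment. It is only an illustration of the idea, not the iceberg-rust or pyiceberg implementation: `Field`, `leaves`, `project` and `ALLOWED_PROMOTIONS` are hypothetical names, columns are modelled as plain lists keyed by field id, and the sorted id list merely stands in for whatever a real `ProjectionMask` would be built from.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    """Hypothetical stand-in for an iceberg field: id, name, type, children."""
    id: int
    name: str
    type: str  # "struct", "string", "int" or "long"
    children: list["Field"] = field(default_factory=list)


# Assumption: only int -> long is modelled here; the real rules are wider.
ALLOWED_PROMOTIONS = {("int", "long")}


def leaves(f):
    """Yield the leaf (non-struct) fields under f, depth first."""
    if f.type != "struct":
        yield f
    for child in f.children:
        yield from leaves(child)


def project(projected_fields, file_types, file_columns, num_rows):
    """Match projected leaf fields to file columns by field id.

    file_types:   {field_id: type} for the leaf columns in the file
    file_columns: {field_id: list of values} as read from the file
    Returns (mask, columns), columns being {field_id: (type, values)}.
    """
    wanted = [leaf for f in projected_fields for leaf in leaves(f)]

    # Step 1: ids that exist in the file; this plays the role of the mask.
    mask = sorted(leaf.id for leaf in wanted if leaf.id in file_types)

    # Step 2: build output columns that match the table schema.
    columns = {}
    for leaf in wanted:
        if leaf.id not in file_types:
            # Field added after the file was written: fill with nulls.
            columns[leaf.id] = (leaf.type, [None] * num_rows)
            continue
        file_type = file_types[leaf.id]
        if file_type != leaf.type and (file_type, leaf.type) not in ALLOWED_PROMOTIONS:
            raise TypeError(f"cannot promote {file_type} to {leaf.type}")
        # Values are reused as-is; only the declared type is widened.
        columns[leaf.id] = (leaf.type, list(file_columns[leaf.id]))
    return mask, columns


# Pruned table schema for ("person.address", "person.name", "hometown.state", "age").
projected = [
    Field(2, "address", "struct", [Field(3, "city", "string"), Field(4, "street", "string")]),
    Field(5, "name", "string"),
    Field(8, "state", "string"),
    Field(9, "age", "long"),
]
# Leaf columns of the example parquet file: no id 8, and age is still int.
file_types = {3: "string", 4: "string", 5: "string", 7: "string", 9: "int"}
file_columns = {
    3: ["Paris", "Oslo"],
    4: ["Rue A", "Gate B"],
    5: ["Ann", "Bo"],
    7: ["Lyon", "Bergen"],
    9: [30, 41],
}

mask, columns = project(projected, file_types, file_columns, num_rows=2)
```

With the example data, `mask` contains ids 3, 4, 5 and 9, the column for id 8 is all nulls, and the column for id 9 is tagged as long. Matching is done purely on field id, so renaming a column in the table schema would not affect the result.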