ZENOTME commented on issue #1047:
URL: https://github.com/apache/iceberg-rust/issues/1047#issuecomment-2722967864

> Thanks [@ZENOTME](https://github.com/ZENOTME) for raising this. I think what's missing is a `FileReader` which accepts the following arguments:
>
> 1. File path
> 2. File range
> 3. Expected schema
> 4. Arrow batch size
>
> This reader needs to convert files (Parquet, ORC, Avro) into Arrow record batches, handling things like missing columns and type promotion, which are caused by schema evolution.
>
> With this API, it would be easy to implement the `read_data`, `read_pos_delete`, and `read_eq_delete` you mentioned. But I'm not sure we actually need to provide these APIs. I think `FileReader` + `FileScanTask` provides enough flexibility for compute engines. For example, an engine can choose to join data files with pos deletes and eq deletes in the logical plan, or it can implement its own file scan operator.

In this design, does `ArrowReader` reuse `FileReader`?

- If so, I think we may need to refactor some logic of `ArrowReader`.
- Otherwise, `FileReader` is an independent component, which may be more convenient to maintain.

And for delete files (pos delete, equality delete), do we need to handle things like missing columns and type promotion? 🤔 It seems that for pos deletes and eq deletes without values, we can't fill in a value when it is missing. So we may still need `read_data`, `read_pos_delete`, and `read_eq_delete` to separate the handling for each file kind (see the sketch below).
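To make the shape of this concrete, here is a minimal Rust sketch of what such a `FileReader` trait could look like. Everything here is hypothetical: the trait name comes from the discussion above, but the method signature, error alias, and crate choices are illustrative assumptions, not part of the current iceberg-rust API.

```rust
use std::ops::Range;

use arrow_array::RecordBatch;
use arrow_schema::SchemaRef;
use async_trait::async_trait;
use futures::stream::BoxStream;

/// Hypothetical error alias for brevity; the real crate would use its own
/// `Result`/`Error` types.
type FileReadResult<T> = Result<T, Box<dyn std::error::Error + Send + Sync>>;

/// Sketch of the proposed `FileReader`: a single entry point that turns a
/// file (Parquet, ORC, or Avro) into a stream of Arrow record batches,
/// projecting onto `expected_schema` so that schema-evolution concerns
/// (missing columns, type promotion) are handled inside the reader.
#[async_trait]
pub trait FileReader: Send + Sync {
    async fn read(
        &self,
        file_path: &str,
        file_range: Range<u64>,
        expected_schema: SchemaRef,
        batch_size: usize,
    ) -> FileReadResult<BoxStream<'static, FileReadResult<RecordBatch>>>;
}
```

Under this shape, a hypothetical `read_pos_delete` wrapper could call `read` with the fixed positional-delete schema (`file_path`, `pos`) and skip evolution handling entirely, while `read_data` projects onto the task's expected schema. That would be one way to realize the "separate handling" per file kind raised above while keeping a single reader underneath.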
> Thanks [@ZENOTME](https://github.com/ZENOTME) for raising this. I think what's missing is a `FileReader` which accepts following arguements: > > 1. File path > 2. File range > 3. Expected schema > 4. Arrow batch size > > This reader need to convert files(parquet, orc, avro) into arrow record batch, which handles things like missing column, type promotion, etc, which are caused by schema evolution. > > With this api, it would be easy to implement the `read_data`, `read_pos_delete`, `read_eq_delete` you mentioned. But I'm not sure if we acutally need to provided these apis. I think the `FileReader` + `FileScanTask` has provided enough flexibility for compute engines. For example, it can choose to join data file with pos deletions and eq deletions in logical plan, or they could choose to implement their own file scan operator. In this design, does `ArrowReader` reuse `FileReader`? - If so, I think we may need to refactor some logic of `ArrowReader` - Otherwise, `FileReader` is an independent component and it may be more convenient to maintain. And for delete file(pos delete, equality delete), do we need to handle things like missing column, type promotion? 🤔 Seems for pos delete and eq delete without value, we can't fulfill the value if they miss. So in here we may need the `read_data`, `read_pos_delete`, `read_eq_delete` to separate the handle way. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org