pvary commented on PR #12298: URL: https://github.com/apache/iceberg/pull/12298#issuecomment-2736976799
Here are a few important open questions:

1. We should decide on the expected filtering behavior. Currently the file format readers apply filters on a best-effort basis. We might decide on stricter behavior and require the file formats to apply all filters when they are provided. I would suggest doing that in another PR, even if we choose to change the current behavior.
2. Batch sizes are currently parameters, which means they can be set for non-vectorized readers too. We could instead pass the batch size as a reader property and have the readers parse the reader properties when a batch read happens. I would prefer the current solution, because the expectation for the readers is self-documenting.
3. Parquet/ORC configuration. Currently the Spark batch reader uses different configuration objects for Parquet and ORC, as requested by @aokolnychyi. @rdblue suggested using a common configuration instead. I'm still learning the Spark code, so I don't have a strong opinion here.
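
To make the trade-off in question 2 concrete, here is a minimal sketch of the two options. Everything in it is hypothetical and for illustration only: `ExplicitReadBuilder`, `PropertyReadBuilder`, the `read.batch-size` key, and the default of 5000 are assumptions, not the classes, property names, or defaults used in the PR or in Iceberg.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: none of these names are Iceberg APIs.
class BatchSizeSketch {
  // Hypothetical property key and default, chosen for this example.
  static final String BATCH_SIZE_PROP = "read.batch-size";
  static final int DEFAULT_BATCH_SIZE = 5000;

  // Option A: the batch size is an explicit, typed builder parameter.
  // The method signature itself documents what the reader expects.
  static final class ExplicitReadBuilder {
    private int recordsPerBatch = DEFAULT_BATCH_SIZE;

    ExplicitReadBuilder recordsPerBatch(int numRecords) {
      if (numRecords <= 0) {
        throw new IllegalArgumentException("Batch size must be positive: " + numRecords);
      }
      this.recordsPerBatch = numRecords;
      return this;
    }

    int batchSize() {
      return recordsPerBatch;
    }
  }

  // Option B: the batch size travels as a generic string property and is
  // parsed only when a batch read is actually built.
  static final class PropertyReadBuilder {
    private final Map<String, String> properties = new HashMap<>();

    PropertyReadBuilder set(String key, String value) {
      properties.put(key, value);
      return this;
    }

    int batchSize() {
      String raw = properties.get(BATCH_SIZE_PROP);
      // A malformed value would only fail here, at parse time.
      return raw == null ? DEFAULT_BATCH_SIZE : Integer.parseInt(raw);
    }
  }

  public static void main(String[] args) {
    int explicit = new ExplicitReadBuilder().recordsPerBatch(1024).batchSize();
    int fromProps = new PropertyReadBuilder().set(BATCH_SIZE_PROP, "1024").batchSize();
    System.out.println(explicit + " " + fromProps);
  }
}
```

In this sketch, option A rejects a bad value at the call site and is discoverable from the builder's method list, while option B keeps the builder surface smaller at the cost of string keys that each reader has to know about and parse.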