Re: [PR] WIP: Interface based DataFile reader and writer API [iceberg]

via GitHub Wed, 19 Mar 2025 08:03:00 -0700


pvary commented on PR #12298:
URL: https://github.com/apache/iceberg/pull/12298#issuecomment-2736976799


   Here are a few important open questions:
   1. We should decide on the expected filtering behavior. Currently the 
file format readers apply filters on a best-effort basis. We might decide 
on stricter behavior and require the file formats to apply all filters when 
they are provided. I would suggest doing that in another PR, even if we 
choose to change the current state.
   2. Batch sizes are currently parameters which can be set for 
non-vectorized readers too. We could instead make the batch size a reader 
property, and have the readers parse the reader properties when a batch read 
happens. I would prefer the current solution, as it makes the expectation for 
the readers self-documenting.
   3. Parquet/ORC configuration. Currently the Spark batch reader uses 
different configuration objects for Parquet and ORC, as requested by 
@aokolnychyi. @rdblue suggested using a common configuration instead. I'm 
still learning the Spark code, so I don't have a strong opinion here.
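
To make the difference in question 1 concrete, here is a minimal sketch of best-effort versus strict filtering. Everything in it is hypothetical and for illustration only: `FormatReader`, `BEST_EFFORT` and `strict` are not part of this PR or of the Iceberg API, and the "row group" pruning only stands in for the kind of statistics-based skipping a file format might do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class FilterSemanticsSketch {

  /** Hypothetical stand-in for a file format reader; not an Iceberg interface. */
  interface FormatReader {
    List<Integer> read(List<List<Integer>> rowGroups, Predicate<Integer> filter);
  }

  // Best effort: skip a row group only when no row in it can match, but
  // return the surviving groups whole, so non-matching rows may leak through.
  // (A real reader would consult min/max statistics instead of scanning.)
  static final FormatReader BEST_EFFORT = (rowGroups, filter) -> {
    List<Integer> out = new ArrayList<>();
    for (List<Integer> group : rowGroups) {
      if (group.stream().anyMatch(filter)) {
        out.addAll(group);
      }
    }
    return out;
  };

  // Strict: wrap any reader and re-apply the filter to every returned row,
  // so the caller never sees a row that fails the filter.
  static FormatReader strict(FormatReader delegate) {
    return (rowGroups, filter) -> {
      List<Integer> out = new ArrayList<>();
      for (Integer row : delegate.read(rowGroups, filter)) {
        if (filter.test(row)) {
          out.add(row);
        }
      }
      return out;
    };
  }

  public static void main(String[] args) {
    List<List<Integer>> rowGroups = List.of(List.of(1, 2, 3), List.of(10, 20, 30));
    Predicate<Integer> filter = v -> v >= 20;

    System.out.println(BEST_EFFORT.read(rowGroups, filter));         // [10, 20, 30]
    System.out.println(strict(BEST_EFFORT).read(rowGroups, filter)); // [20, 30]
  }
}
```

With best-effort semantics the caller still has to apply the filter itself to drop the leaked `10`. With strict semantics that responsibility moves into the reader, which is the trade-off to settle.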


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


