bigluck commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1962855117
Hey @kevinjqliu, we're currently debugging the issue on Slack, but I thought it would be helpful to report our findings here as well.

In my tests, the pyarrow table is generated with the following code:

```python
return pa.Table.from_pylist([row for batch in results for row in batch])
```

I've also cached the table on disk to save time, and it's read back with:

```python
with pa.memory_map('table_10000000.arrow', 'r') as source:
    pa_table = pa.ipc.open_file(source).read_all()
```

Although I know that reading record batches would be the right way to consume the file, I'm deliberately using `read_all()` because I can't control what table users' code will produce. This is to evaluate the edge case where the table has to be read without using streaming functions. After importing, `bin_pack_arrow_table` returns only one chunk:

```
total=6.69 GB, chunks=['6.69 GB']
```

A scaled-down sketch of this repro is appended below. Let me know if you have any questions, and thanks for your time!
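For completeness, here is a minimal, self-contained sketch of the caching/reading pattern described above, scaled down so it runs quickly. The schema, row counts, and the `table_cache.arrow` file name are illustrative stand-ins for the real user-generated data; the final `print` mirrors the `total=..., chunks=[...]` debug output quoted above, showing the chunk layout that bin packing later operates on.

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# Hypothetical stand-in for the user-generated batches; in the real test
# `results` comes from user code and is roughly 6.7 GB.
results = [
    [{"id": i, "payload": "x" * 64} for i in range(j * 1000, (j + 1) * 1000)]
    for j in range(10)
]

# Build the table the same way as in the test above.
table = pa.Table.from_pylist([row for batch in results for row in batch])

# Cache the table to disk as an Arrow IPC file.
with pa.OSFile("table_cache.arrow", "wb") as sink:
    with ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Read it back the same way as in the test: memory-map the file, then read_all().
with pa.memory_map("table_cache.arrow", "r") as source:
    pa_table = ipc.open_file(source).read_all()

# Mirror the debug output quoted above: total size plus per-chunk sizes of the
# first column. With this read path each column typically ends up as a single
# large chunk, which is why only one chunk is reported.
col = pa_table.column(0)
print(
    f"total={pa_table.nbytes / 1024**2:.2f} MB, "
    f"chunks={[f'{c.nbytes / 1024**2:.2f} MB' for c in col.chunks]}"
)
```

At the scale in this sketch everything fits in one chunk by construction; the point is only to show the same code path so the single-chunk layout seen with the 6.69 GB table can be reproduced and inspected.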