kevinjqliu commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1951390652

I took the above code and did some investigation. Here's a notebook to see it in action: https://colab.research.google.com/drive/12O4ARckCwJqP2U6L4WxREZbyPv_AmWC2?usp=sharing

To summarize: I used the above functions to generate 1 million records and save them to a feather file. The file on disk is 447 MB. Reading the feather file back into Python, the resulting `pyarrow.Table` is `685.46 MB` in memory.

We can use the `to_batches` function to chop the table into `RecordBatch`es; with the default settings, this turns the table into 16 batches of roughly 40 MB each. Given this, it seems plausible to chunk the table into batches and use the `write_batch` function. We could set the default chunk size to 516 MB or some other recommended setting.

Note that [`to_batches`](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_batches) is a "zero-copy" method, so there shouldn't be a performance impact.
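For reference, a minimal sketch of the batching idea, assuming `write_batch` refers to `pyarrow.parquet.ParquetWriter.write_batch` and using hypothetical file paths:

```python
import pyarrow.feather as feather
import pyarrow.parquet as pq

# Read the feather file back into an Arrow table (path is hypothetical).
table = feather.read_table("records.feather")

# Zero-copy split into RecordBatches; max_chunksize caps rows per batch,
# so the byte size of each batch depends on the schema and data.
batches = table.to_batches()
print(len(batches), [b.nbytes for b in batches])

# Write batch by batch instead of materializing one large buffer per file.
with pq.ParquetWriter("records.parquet", table.schema) as writer:
    for batch in batches:
        writer.write_batch(batch)
```

This is only an illustration of the approach, not the pyiceberg write path itself; the actual chunk-size default would be applied where the batches are produced.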
I took the above code and did some investigation. Here's the notebook to see it in action https://colab.research.google.com/drive/12O4ARckCwJqP2U6L4WxREZbyPv_AmWC2?usp=sharing I'll summarize. I used the above functions to generate 1 million records and save them in a feather file. The file on disk is 447M in size. Reading the feather file back to Python, the pyarrow.Table is `685.46 MB`` in size We can use the `to_batches` function to chop the table into `RecordBatch`s, using the default setting. This turns the table into 16 batches with around 40MB each. Given this, it seems plausible to chunk the table into batches and use the `write_batch` function. We can set the default chunk to size 516MB or some recommended setting. Note, [`to_batches`](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_batches) is a "zero-copy" method, so there shouldn't be performance impact -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org