kevinjqliu commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1951390652

I took the above code and did some investigation. Here's a notebook to see it in action: https://colab.research.google.com/drive/12O4ARckCwJqP2U6L4WxREZbyPv_AmWC2?usp=sharing

To summarize: I used the above functions to generate 1 million records and save them to a feather file. The file on disk is 447 MB. Reading the feather file back into Python, the resulting `pyarrow.Table` is `685.46 MB` in memory.

We can use the `to_batches` function to chop the table into `RecordBatch`es; with the default settings, this turns the table into 16 batches of roughly 40 MB each. Given this, it seems plausible to chunk the table into batches and use the `write_batch` function. We could set the default chunk size to 516 MB or some other recommended setting.

Note that [`to_batches`](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_batches) is a "zero-copy" method, so there shouldn't be a performance impact.
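For reference, a minimal sketch of the batching idea, assuming `write_batch` refers to `pyarrow.parquet.ParquetWriter.write_batch` and using hypothetical file paths:

```python
import pyarrow.feather as feather
import pyarrow.parquet as pq

# Read the feather file back into an Arrow table (path is hypothetical).
table = feather.read_table("records.feather")

# Zero-copy split into RecordBatches; max_chunksize caps rows per batch,
# so the byte size of each batch depends on the schema and data.
batches = table.to_batches()
print(len(batches), [b.nbytes for b in batches])

# Write batch by batch instead of materializing one large buffer per file.
with pq.ParquetWriter("records.parquet", table.schema) as writer:
    for batch in batches:
        writer.write_batch(batch)
```

This is only an illustration of the approach, not the pyiceberg write path itself; the actual chunk-size default would be applied where the batches are produced.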
I took the above code and did some investigation. Here's the notebook to see it in action https://colab.research.google.com/drive/12O4ARckCwJqP2U6L4WxREZbyPv_AmWC2?usp=sharing I'll summarize. I used the above functions to generate 1 million records and save them in a feather file. The file on disk is 447M in size. Reading the feather file back to Python, the pyarrow.Table is `685.46 MB`` in size We can use the `to_batches` function to chop the table into `RecordBatch`s, using the default setting. This turns the table into 16 batches with around 40MB each. Given this, it seems plausible to chunk the table into batches and use the `write_batch` function. We can set the default chunk to size 516MB or some recommended setting. Note, [`to_batches`](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_batches) is a "zero-copy" method, so there shouldn't be performance impact -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org