bigluck commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1962855117

   Hey @kevinjqliu, we're currently debugging the issue on Slack, but I thought it would be helpful to report our findings here as well. In my tests, the PyArrow table is generated with the following code:
   
   ```
   # Flatten the batches of row dicts into a single in-memory table.
   return pa.Table.from_pylist([row for batch in results for row in batch])
   ```
   
   I've also cached the table on disk to save time, and it's read back with the following code:
   
   ```
   import pyarrow as pa

   # Memory-map the cached IPC file and materialize the whole table at once.
   with pa.memory_map('table_10000000.arrow', 'r') as source:
       pa_table = pa.ipc.open_file(source).read_all()
   ```
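
   For context, the cache file is in Arrow IPC format. The write side isn't shown above, but it presumably looks something like this sketch (only the file name matches; everything else is an assumption). If I understand the writer correctly, `write_table()` preserves the table's chunk layout as record batches, so a single-chunk table produces a single-batch file:

   ```
   import pyarrow as pa

   # Hypothetical write side of the cache (assumption, not shown above):
   # persist the table as an Arrow IPC file so it can be memory-mapped later.
   with pa.OSFile('table_10000000.arrow', 'wb') as sink:
       with pa.ipc.new_file(sink, pa_table.schema) as writer:
           writer.write_table(pa_table)
   ```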
   
   Although I know that reading the file as record batches would be the right approach, I'm explicitly using `read_all()` because I can't control what table users' code will generate. This is to exercise the edge case where the table has to be read without any streaming (a sketch of the record-batch route follows below).
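
   For reference, here's a minimal sketch of that record-batch route, assuming the same cached file. Keep in mind it only helps if the file actually contains multiple batches; a file written from a single-chunk table still comes back as one large batch:

   ```
   import pyarrow as pa

   # Stream the cached file batch by batch instead of calling read_all().
   with pa.memory_map('table_10000000.arrow', 'r') as source:
       reader = pa.ipc.open_file(source)
       total_rows = 0
       for i in range(reader.num_record_batches):
           batch = reader.get_batch(i)  # zero-copy view into the memory map
           total_rows += batch.num_rows
       print(f'{reader.num_record_batches} batches, {total_rows} rows')
   ```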
   
   After loading the file this way, `bin_pack_arrow_table` returns only one chunk:
   
   ```
   total=6.69 GB, chunks=['6.69 GB']
   ```
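
   The chunk layout is easy to inspect directly, and `Table.to_batches(max_chunksize=...)` can re-slice a single giant chunk into bounded pieces. This is just a sketch of a possible caller-side workaround, not a claim about how `bin_pack_arrow_table` should behave:

   ```
   import pyarrow as pa

   # Print the size of each chunk; after read_all() on this file there is one.
   for batch in pa_table.to_batches():
       print(f'{batch.nbytes / 2**30:.2f} GiB')

   # Re-slice the single chunk into batches of at most 1M rows (zero-copy).
   rebatched = pa.Table.from_batches(pa_table.to_batches(max_chunksize=1_000_000))
   ```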
   
   Let me know if you have any questions, and thanks for your time!

