Re: [I] Parallel Table.append [iceberg-python]

via GitHub Sat, 24 Feb 2024 10:20:05 -0800


kevinjqliu commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1962460623


   Hm, something looks off if the resulting Parquet file is 1.6 GB. Each 
Parquet file should be at most 512 MB, if not smaller. See the 
[bin packing 
logic](https://github.com/apache/iceberg-python/pull/444/files#diff-8d5e63f2a87ead8cebe2fd8ac5dcf2198d229f01e16bb9e06e21f7277c328abdR1767-R1771).
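
To make the 512 MB expectation concrete, here is a minimal sketch of greedy bin packing over chunk sizes. It is not the pyiceberg implementation linked above; `greedy_bin_pack`, `TARGET_FILE_SIZE`, and the sample chunk sizes are hypothetical names and values chosen for illustration only.

```python
from typing import Iterable, List

MB = 1024 * 1024
# Assumption: 512 MB target, matching the cap mentioned in the comment.
TARGET_FILE_SIZE = 512 * MB


def greedy_bin_pack(sizes: Iterable[int], target: int = TARGET_FILE_SIZE) -> List[List[int]]:
    """Group chunk sizes, in order, into bins whose totals stay at or under target.

    A single chunk larger than target cannot be split here, so it ends up
    alone in an oversized bin.
    """
    bins: List[List[int]] = []
    current: List[int] = []
    current_total = 0
    for size in sizes:
        # Close the current bin if adding this chunk would exceed the target.
        if current and current_total + size > target:
            bins.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        bins.append(current)
    return bins


# Made-up chunk sizes, purely for illustration.
chunks = [200 * MB, 200 * MB, 200 * MB, 100 * MB, 600 * MB]
packed = greedy_bin_pack(chunks)
for b in packed:
    print(f"total={sum(b) // MB} MB, chunks={[s // MB for s in b]}")
```

In this simplified version, bins built from small chunks stay under the target, while a chunk that is already larger than the target still lands in a bin by itself. The diagnostic script further down prints the same kind of per-bin totals for the real table.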
 
   
   Here's something we can run for diagnostics:
   ```
   # https://stackoverflow.com/questions/12523586/python-format-size-application-converting-b-to-kb-mb-gb-tb
   def humanbytes(B):
       """Return the given bytes as a human friendly KB, MB, GB, or TB string."""
       B = float(B)
       KB = float(1024)
       MB = float(KB ** 2) # 1,048,576
       GB = float(KB ** 3) # 1,073,741,824
       TB = float(KB ** 4) # 1,099,511,627,776
   
       if B < KB:
           # Singular only for exactly one byte; the original chained comparison was always False
           return '{0} {1}'.format(B, 'Byte' if B == 1 else 'Bytes')
       elif KB <= B < MB:
           return '{0:.2f} KB'.format(B / KB)
       elif MB <= B < GB:
           return '{0:.2f} MB'.format(B / MB)
       elif GB <= B < TB:
           return '{0:.2f} GB'.format(B / GB)
       elif TB <= B:
           return '{0:.2f} TB'.format(B / TB)
   
   import pyarrow as pa
   from pyarrow import feather
   
   arrow_tbl = feather.read_table('fake_data.feather')
   print(f"Table has {len(arrow_tbl)} records")
   
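   from pyiceberg.catalog import load_catalog
   from pyiceberg.exceptions import NoSuchTableError
   catalog = load_catalog()
   try:
       # drop_table returns nothing, so there is no result to keep
       catalog.drop_table("fake.fake_data")
   except NoSuchTableError:
       pass  # the table did not exist yet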
   iceberg_tbl = catalog.create_table("fake.fake_data", arrow_tbl.schema)
   
   from pyiceberg.io.pyarrow import bin_pack_arrow_table
   bins = bin_pack_arrow_table(arrow_tbl,iceberg_tbl.properties)
   for bin in bins:
       print(f"total={humanbytes(sum(batch.nbytes for batch in bin))}, chunks={[humanbytes(batch.nbytes) for batch in bin]}")
   ```
   
   You might have to change `arrow_tbl` and `iceberg_tbl` to match your setup. 
   This will show us how the Arrow table is bin-packed during writes. 
   
   
   For write operations (append/overwrite), parallelism only kicks in during 
the actual writing. To take advantage of it, set the `PYICEBERG_MAX_WORKERS` 
environment variable to the number of CPUs. 
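
One way to set the variable from inside a script, as a rough sketch. It assumes the value is picked up when the write starts, so it is set before importing pyiceberg to be safe; `max_workers` is just a local name used for the printout.

```python
import os

# Assumption: set this before importing pyiceberg or starting any write,
# so the worker pool sees the value when it is created.
# os.cpu_count() can return None, so fall back to a single worker.
os.environ["PYICEBERG_MAX_WORKERS"] = str(os.cpu_count() or 1)

max_workers = int(os.environ["PYICEBERG_MAX_WORKERS"])
print(f"PYICEBERG_MAX_WORKERS={max_workers}")
```

Exporting the variable in the shell before launching Python should have the same effect.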
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


