kevinjqliu commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1962460623
Hm, something odd is going on if the resulting Parquet file is 1.6 GB. Each Parquet file should be at most 512 MB. See the [bin packing logic](https://github.com/apache/iceberg-python/pull/444/files#diff-8d5e63f2a87ead8cebe2fd8ac5dcf2198d229f01e16bb9e06e21f7277c328abdR1767-R1771).

Here's something we can run for diagnostics:

```
# https://stackoverflow.com/questions/12523586/python-format-size-application-converting-b-to-kb-mb-gb-tb
def humanbytes(B):
    """Return the given bytes as a human friendly KB, MB, GB, or TB string."""
    B = float(B)
    KB = float(1024)
    MB = float(KB ** 2)  # 1,048,576
    GB = float(KB ** 3)  # 1,073,741,824
    TB = float(KB ** 4)  # 1,099,511,627,776

    if B < KB:
        # Use the singular "Byte" only for exactly one byte
        return '{0} {1}'.format(B, 'Bytes' if B != 1 else 'Byte')
    elif KB <= B < MB:
        return '{0:.2f} KB'.format(B / KB)
    elif MB <= B < GB:
        return '{0:.2f} MB'.format(B / MB)
    elif GB <= B < TB:
        return '{0:.2f} GB'.format(B / GB)
    elif TB <= B:
        return '{0:.2f} TB'.format(B / TB)

import pyarrow as pa
from pyarrow import feather

arrow_tbl = feather.read_table('fake_data.feather')
print(f"Table has {len(arrow_tbl)} records")

from pyiceberg.catalog import load_catalog

catalog = load_catalog()
# Drop any leftover table from a previous run; ignore the error if it does not exist
try:
    catalog.drop_table("fake.fake_data")
except Exception:
    pass
iceberg_tbl = catalog.create_table("fake.fake_data", arrow_tbl.schema)

from pyiceberg.io.pyarrow import bin_pack_arrow_table

bins = bin_pack_arrow_table(arrow_tbl, iceberg_tbl.properties)
for bin in bins:
    print(f"total={humanbytes(sum(map(lambda x: x.nbytes, bin)))}, chunks={[humanbytes(batch.nbytes) for batch in bin]}")
```

You might have to adjust `arrow_tbl` and `iceberg_tbl` for your setup. This will show us how the Arrow table is bin-packed during writes.

For write operations (append/overwrite), parallelism only kicks in during the actual writing. To take advantage of it, you'd have to set the `PYICEBERG_MAX_WORKERS` environment variable to the number of CPUs.
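As a minimal sketch, the variable can be set from Python instead of the shell. This assumes the value has to be in the environment before `pyiceberg` is imported and starts writing; I haven't verified at which point it is read, so setting it first is the safe order. `num_workers` is just an illustrative name.

```python
import os

# Assumption: set this before importing pyiceberg so the setting is
# visible by the time the write path spins up its workers.
num_workers = os.cpu_count() or 1  # os.cpu_count() can return None
os.environ["PYICEBERG_MAX_WORKERS"] = str(num_workers)

print(f"PYICEBERG_MAX_WORKERS={os.environ['PYICEBERG_MAX_WORKERS']}")
```

Exporting the variable in the shell before starting the Python process should work just as well.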