andormarkus opened a new issue, #1751: URL: https://github.com/apache/iceberg-python/issues/1751
### Feature Request / Improvement

## Problem Statement

A key problem in distributed Iceberg systems is that commit processes can block each other when multiple workers try to update table metadata simultaneously. This blocking creates a severe performance bottleneck that limits throughput, particularly in high-volume ingestion scenarios.

## Use Case

In our distributed architecture:

1. Process A writes Parquet files in an Iceberg-compatible format.
2. Simple string identifiers (file paths) need to be passed between systems.
3. Process B takes these strings and commits the files to make them visible in queries.

This pattern is especially useful for high-concurrency ingestion scenarios where multiple writers could be writing data to an Iceberg table simultaneously, but we want to centralize and coordinate the commit process. This approach is critical because in distributed environments, commit processes can block each other, creating a significant bottleneck in high-throughput scenarios.

### Detailed Workflow

Our workflow involves:

```python
import uuid

# note: this is a private PyIceberg API
from pyiceberg.io.pyarrow import _dataframe_to_data_files

# Process A: write data but don't commit
table = catalog.load_table(identifier="iceberg.table")
data_files = list(
    _dataframe_to_data_files(
        table_metadata=table.metadata,
        write_uuid=uuid.uuid4(),
        df=pa_df,
        io=table.io,
    )
)
queue.send(data_files)  # send serialized data_files to the queue system

# Process B: commit processor (runs separately)
data_files = queue.receive()
with table.transaction() as trx:
    with trx.update_snapshot().fast_append() as update_snapshot:
        for data_file in data_files:
            update_snapshot.append_data_file(data_file)
```

This separation of write and commit operations provides several advantages:

- Improved throughput by parallelizing write operations across multiple workers
- Reduced lock contention, since metadata commits (which require locks) are centralized
- Better failure handling: failed writes don't impact the table state
- Controlled transaction timing: commits can be batched or scheduled optimally
- Elimination of commit-process blocking: by centralizing commits, we prevent distributed writers from blocking each other during metadata updates, which is a major performance bottleneck

## Current Limitations

- Serializing `DataFile` objects between processes is challenging.
- We've attempted custom serialization with compression (gzip, zlib), which works but requires long, complex code.
- Using `jsonpickle` also presented significant problems.

## Proposed Solution

We're seeking a robust way to handle distributed writes, potentially via:

1. Adding serialization/deserialization methods to the `DataFile` class
2. Supporting Avro for efficient serialization of `DataFile` objects (potentially smaller than other approaches)
3. Better integration with the `append_data_file` API
4. OR a more accessible way to use the `ManifestFile` functionality that's already implemented in PyIceberg

Ideally, the solution would:

- Handle schema evolution gracefully (unlike the current `add_files` approach, which has issues when the schema changes)
- Work efficiently with minimal overhead for large-scale concurrent processing
- Provide simple primitives that can be used in distributed systems without requiring complex serialization
- Follow patterns similar to those used in the Java implementation where appropriate

## Alternative Approaches Tried

- We've implemented a custom serialization/deserialization function with compression.
- We explored the approach in #1678, but found it created too many commits and became a performance bottleneck.

## Related PRs/Issues

- PR #1742 (closed): original attempt at a write_parquet API
- Issue #1737: feature request for Table-Compatible Parquet Files
- Issue #1678: related implementation suggestion

We're looking for guidance on the best approach to solve this distributed writing pattern while maintaining performance and schema compatibility.
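For context, the compression-based workaround mentioned above looks roughly like the following minimal sketch. The helper names `encode_payload`/`decode_payload` are illustrative, not part of PyIceberg, and whether a `DataFile` instance survives this pickle round-trip cleanly across processes is exactly the fragile part this issue asks to solve:

```python
import base64
import pickle
import zlib


def encode_payload(obj) -> str:
    """Pickle, compress, and base64-encode an object so it can travel
    through a string-only queue (e.g. SQS)."""
    return base64.b64encode(zlib.compress(pickle.dumps(obj))).decode("ascii")


def decode_payload(payload: str):
    """Inverse of encode_payload: base64-decode, decompress, unpickle."""
    return pickle.loads(zlib.decompress(base64.b64decode(payload)))


# Shown here with a plain dict; for DataFile objects, pickle compatibility
# across PyIceberg versions is the open problem described above.
msg = encode_payload({"path": "s3://bucket/data/file.parquet"})
restored = decode_payload(msg)
```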
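To illustrate the "commits can be batched or scheduled" point: one way Process B could avoid the too-many-commits problem seen with the #1678 approach is to drain the queue in batches and issue one `fast_append` per batch rather than one commit per file. A minimal sketch, where the `batches` helper and the queue/commit wiring are hypothetical:

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from `items`."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


# Hypothetical commit loop for Process B: one transaction per batch,
# not one per DataFile, keeping the total commit count low.
#
# for batch in batches(queue.drain(), 100):
#     with table.transaction() as trx:
#         with trx.update_snapshot().fast_append() as update_snapshot:
#             for data_file in batch:
#                 update_snapshot.append_data_file(data_file)
```

The batch size trades commit frequency against data visibility latency; the right value depends on ingestion volume.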
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org