andormarkus opened a new issue, #1751:
URL: https://github.com/apache/iceberg-python/issues/1751

   ### Feature Request / Improvement
   
   ## Problem Statement
   A key problem in distributed Iceberg systems is that commit processes can 
block each other when multiple workers try to update table metadata 
simultaneously. This blocking creates a severe performance bottleneck that 
limits throughput, particularly in high-volume ingestion scenarios.
   
   ## Use Case
   In our distributed architecture:
   1. Process A writes Parquet files in an Iceberg-compatible format
   2. Simple string identifiers (file paths) need to be passed between systems
   3. Process B takes these strings and commits the files to make them visible 
in queries
   
   This pattern is especially useful for high-concurrency ingestion, where many writers append data to the same Iceberg table simultaneously but we want to centralize and coordinate the commit step. Centralizing commits matters because, as noted above, distributed commit processes block each other during metadata updates, which becomes a significant bottleneck at high throughput.
   
   ### Detailed Workflow
   Our workflow involves:
   
   ```python
   # Process A: write Parquet data files but do not commit them
   import uuid

   from pyiceberg.io.pyarrow import _dataframe_to_data_files

   table = catalog.load_table(identifier="iceberg.table")
   data_files = list(
       _dataframe_to_data_files(
           table_metadata=table.metadata,
           write_uuid=uuid.uuid4(),
           df=pa_df,  # a pyarrow.Table holding the rows to write
           io=table.io,
       )
   )

   queue.send(data_files)  # hand the DataFile objects to a queue system (placeholder for any broker)

   # Process B: commit processor (runs separately)
   data_files = queue.receive()
   table = catalog.load_table(identifier="iceberg.table")
   with table.transaction() as trx:
       with trx.update_snapshot().fast_append() as update_snapshot:
           for data_file in data_files:
               update_snapshot.append_data_file(data_file)
   ```
   
   This separation of write and commit operations provides several advantages:
   - Improved throughput by parallelizing write operations across multiple 
workers
   - Reduced lock contention since metadata commits (which require locks) are 
centralized
   - Better failure handling - failed writes don't impact the table state
   - Controlled transaction timing - commits can be batched or scheduled 
optimally
   - Elimination of commit blocking - centralizing commits prevents distributed writers from blocking each other during metadata updates, which is otherwise a major performance bottleneck
   
   ## Current Limitations
   - Serializing `DataFile` objects between processes is challenging
   - We've attempted custom serialization with compression (gzip, zlib); it works, but it required long, complex code (a sketch follows this list)
   - Using `jsonpickle` also presented significant problems
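
   For reference, here is a minimal sketch of the shape of our custom helper, assuming a `DataFile` survives a pickle round-trip; in practice it needed extra field-by-field handling, which is what made the real code long:

   ```python
   import pickle
   import zlib

   from pyiceberg.manifest import DataFile


   def serialize_data_file(data_file: DataFile) -> bytes:
       # Assumption: DataFile pickles cleanly. Our real implementation
       # needed custom handling for several fields.
       return zlib.compress(pickle.dumps(data_file))


   def deserialize_data_file(payload: bytes) -> DataFile:
       return pickle.loads(zlib.decompress(payload))
   ```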
   
   
   ## Proposed Solution
   We're seeking a robust way to handle distributed writes. Potential directions:
   
   1. Add serialization/deserialization methods to the `DataFile` class (a hypothetical sketch follows this list)
   2. Support Avro for efficient serialization of `DataFile` objects (potentially smaller than other approaches)
   3. Improve integration with the `append_data_file` API
   4. Alternatively, expose a more accessible way to use the `ManifestFile` functionality that's already implemented in PyIceberg
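
   As a strawman for items 1 and 2, usage might look like the following; the method names `to_avro_bytes`/`from_avro_bytes` are purely illustrative and do not exist in PyIceberg today:

   ```python
   # Hypothetical API, for illustration only -- none of these methods exist yet.
   from pyiceberg.manifest import DataFile

   # Process A: serialize each DataFile to a compact Avro payload
   payloads = [data_file.to_avro_bytes() for data_file in data_files]
   queue.send(payloads)

   # Process B: reconstruct the DataFile objects and commit them
   for payload in queue.receive():
       update_snapshot.append_data_file(DataFile.from_avro_bytes(payload))
   ```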
   
   Ideally, the solution would:
   - Handle schema evolution gracefully (unlike the current `add_files` approach, which has issues when the schema changes)
   - Work efficiently with minimal overhead for large-scale concurrent 
processing
   - Provide simple primitives that can be used in distributed systems without 
requiring complex serialization
   - Follow patterns similar to those used in the Java implementation where 
appropriate
   
   ## Alternative Approaches Tried
   - We've implemented a custom serialization/deserialization function with 
compression
   - We explored the approach in #1678, but found it created too many commits 
and became a performance bottleneck
   
   ## Related PRs/Issues
   - PR #1742 (closed): Original attempt at write_parquet API
   - Issue #1737: Feature request for Table-Compatible Parquet Files 
   - Issue #1678: Related implementation suggestion
   
   We're looking for guidance on the best approach to solve this distributed 
writing pattern while maintaining performance and schema compatibility.

