[PR] Add `write_parquet` API for writing Parquet files without committing [iceberg-python]

via GitHub Fri, 28 Feb 2025 00:33:32 -0800


andormarkus opened a new pull request, #1742:
URL: https://github.com/apache/iceberg-python/pull/1742


   This PR adds a new API method `write_parquet()` to the `Table` class, which 
allows writing a PyArrow table to Parquet files in Iceberg-compatible format 
without committing them to the table metadata. This decouples the write step 
from the commit step, which is particularly useful in high-concurrency 
scenarios.
   
   ## Key features
    - `write_parquet(df)` writes Parquet files compatible with the Iceberg 
table format
    - Returns a list of file paths to the written files
    - Files can later be committed using the existing `add_files()` API
    - Helps manage concurrency by separating write operations from metadata 
commits
   
   ## Use case
   This is especially useful for high-concurrency ingestion scenarios where 
multiple writers could be writing data to an Iceberg table simultaneously. By 
separating the write and commit phases, applications can implement a queue 
system where the commit process (which requires a lock) is handled separately 
from the data writing phase:
   
   ```python
   # Write data but don't commit
   file_paths = table.write_parquet(df)
   
   # Later, commit the files to make them visible in queries
   table.add_files(file_paths=file_paths)
   ```
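
The queue system described above could be sketched roughly as follows. This is a minimal illustration, not part of the PR: `write_worker`, `commit_worker`, `ingest`, and `StubTable` are hypothetical names, and `StubTable` is an in-memory stand-in for a real `Table` so the example runs without a catalog. It assumes `write_parquet()` behaves as proposed here (returns a list of file paths) and that `add_files()` accepts `file_paths`.

```python
import queue
import threading
from typing import Any, Iterable, List


class StubTable:
    """Hypothetical in-memory stand-in for a PyIceberg Table (illustration only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counter = 0
        self.committed: List[str] = []

    def write_parquet(self, df: Any) -> List[str]:
        # Pretend to write one data file per call and return its path.
        with self._lock:
            self._counter += 1
            return [f"data/file-{self._counter}.parquet"]

    def add_files(self, file_paths: List[str]) -> None:
        # Pretend to commit the files to the table metadata.
        self.committed.extend(file_paths)


def write_worker(table: Any, batches: Iterable[Any], pending: queue.Queue) -> None:
    """Write each batch to Parquet and queue the resulting paths for commit."""
    for batch in batches:
        # No commit happens here; only data files are produced.
        pending.put(table.write_parquet(batch))


def commit_worker(table: Any, pending: queue.Queue) -> None:
    """Commit queued file paths from a single thread, one group at a time."""
    while True:
        file_paths = pending.get()
        if file_paths is None:  # sentinel: all writers have finished
            break
        table.add_files(file_paths=file_paths)


def ingest(table: Any, batches_per_writer: List[List[Any]]) -> None:
    """Run one writer thread per batch list and a single committer thread."""
    pending: queue.Queue = queue.Queue()
    committer = threading.Thread(target=commit_worker, args=(table, pending))
    writers = [
        threading.Thread(target=write_worker, args=(table, batches, pending))
        for batches in batches_per_writer
    ]
    committer.start()
    for w in writers:
        w.start()
    for w in writers:
        w.join()
    pending.put(None)  # tell the committer to stop once the queue drains
    committer.join()
```

Because only the committer thread calls `add_files()`, writers never contend for the commit lock. In a multi-process deployment the in-process `queue.Queue` would presumably be replaced by an external queue.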
   
   
   ## Documentation
   Added comprehensive documentation to the API docs, including explanations 
and examples of how to use the new method alongside the existing `add_files()` 
API.
   
   
   ## Seeking guidance
   I would appreciate guidance from project maintainers on:
    1. Which test cases would be most appropriate for this new API?
    2. Is there a preferred location or approach for testing this functionality?
    3. Should we add tests that specifically verify the interaction between 
`write_parquet()` and `add_files()`?
    4. Are there any performance considerations or edge cases that should be 
covered in testing?
    5. Are any further documentation or API changes needed before this is ready 
for review?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

