paultmathew commented on issue #2152:
URL: 
https://github.com/apache/iceberg-python/issues/2152#issuecomment-4393959745

   PR up at #3335. Plan to land streaming in three reviewable PRs to keep diffs 
scoped:
   
   1. **PR1 — #3335 (this thread).** `Transaction.append/overwrite` accept 
`pa.RecordBatchReader`. Unpartitioned only. Microbatched by 
`write.target-file-size-bytes` via the new `bin_pack_record_batches` helper, 
files committed in one snapshot via `fast_append`. Memory bound: `N_workers × 
target_file_size`. Two semantics caveats called out in docstrings: (a) 
`target_file_size` is currently uncompressed in-memory Arrow bytes (matches 
existing `bin_pack_arrow_table`), and (b) `RecordBatchReader` is single-pass so 
retry is the caller's responsibility.
   
   2. **PR2 — follow-up.** Switch the streaming internals to a rolling 
`pq.ParquetWriter` + `OutputStream.tell()` (now possible thanks to #2998). 
Drops peak memory from `N_workers × target_file_size` to roughly one batch per 
worker, and makes `write.target-file-size-bytes` reflect actual on-disk 
compressed bytes (matches Java/Spark/Flink). No public API change.
   
   3. **PR3 — partitioned streaming.** Genuinely the harder case. Open design 
questions I'd love input on **before** I start coding:
      - **Partition cardinality**: a streaming reader with high-cardinality 
partition columns implies many concurrent writers (one per partition value). 
Bound it via `max_open_files`-style spill, or sort-then-stream, or pushdown to 
caller? iceberg-go #369 punted on this and partitioned streaming hasn't 
followed there yet either.
      - **Crash/retry idempotency**: rolling writes per partition value commit 
metadata only at end-of-stream; partial crashes leave orphans. Today's pa.Table 
path is naturally transactional because all files are written before commit. 
The streaming path inverts that — worth being explicit about whether we accept 
orphan-data risk in exchange for streaming, or build cleanup into the write 
path.
   
   This staging mirrors iceberg-go #369's — they shipped unpartitioned first 
and partitioned hasn't followed yet for the same design reasons. Happy to 
reorder if maintainers prefer otherwise.
   
   Also pulled out a small companion test-state-isolation fix in #3334 (noticed 
while running the integration suite repeatedly during this work) — independent 
of this PR but worth landing first to clean up `make test-integration-exec` for 
everyone.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to