corleyma commented on issue #402:
URL: https://github.com/apache/iceberg-python/issues/402#issuecomment-2000597425

Note from Slack: to work in the larger-file use cases where folks are using PySpark/Spark, I think this would need to play well with pyarrow streaming read/write functionality, so that one could do atomic upserts of batches without having to read all the data into memory at once. I call this out because the current write functionality works with pyarrow Tables, which are fully materialized in memory. Working with larger data might include making the pyiceberg write APIs accept `Iterator[RecordBatch]` and friends (as returned by pyarrow Datasets/Scanner) in addition to pyarrow Tables.
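To make the shape of the proposal concrete, here is a minimal sketch. The pyarrow calls below are real API; the `append_batches` method on the Iceberg table is hypothetical, illustrating the `Iterator[RecordBatch]` signature this comment asks for, and the dataset path is illustrative.

```python
import pyarrow.dataset as ds

# pyarrow Datasets already expose larger-than-memory data as a lazy
# iterator of RecordBatch objects instead of one materialized Table.
dataset = ds.dataset("s3://bucket/staging/", format="parquet")  # illustrative path
batches = dataset.scanner(batch_size=64_000).to_batches()  # Iterator[pa.RecordBatch]

# Today's write path forces full materialization first:
#   iceberg_table.append(dataset.to_table())  # whole dataset held in memory
#
# The proposal is for the write APIs to also accept the iterator directly,
# e.g. something like (HYPOTHETICAL signature, not in pyiceberg today):
#   iceberg_table.append_batches(batches)
# so each RecordBatch is written out incrementally and the result committed
# atomically, without ever holding the full table in memory.
```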
Note from Slack: to work in the larger-file usecases where folks are using PySpark/Spark, I think this would need to play well with pyarrow streaming read/write functionality, so that one could do atomic upsert of batches without having to read all the data into memory at once. I call this out because current write functionality works with pyarrow Tables, which are fully materialized in memory. Working with larger data might include making the pyiceberg write APIs work with `Iterator[RecordBatch]` and friends (as returned by pyarrow Datasets/Scanner) in addition to pyarrow Tables. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org