ldacey commented on issue #2159:
URL: https://github.com/apache/iceberg-python/issues/2159#issuecomment-3356870773

Posted this on Slack as well, but here are the results of my initial test with one month of data.
   
I read Parquet source files in batches of roughly 128 MB (grouped by file size) and used `.upsert()` to write to Iceberg. I tried the join key both as a single column (an xxhash 128 hash) and as multiple columns (the actual business columns).
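
Roughly, the loop looked like the sketch below (a minimal sketch, assuming PyIceberg >= 0.9 for `Table.upsert` and PyArrow for reading; the catalog name, table identifier, source path, and join column are placeholders, not my real values):

```python
# Sketch: group Parquet files into ~128 MB batches by file size, then upsert each batch.
import pyarrow.parquet as pq
from pyarrow import fs
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")                # hypothetical catalog name
table = catalog.load_table("db.events")          # hypothetical table identifier

local = fs.LocalFileSystem()
infos = local.get_file_info(fs.FileSelector("/data/source", recursive=True))
files = [f for f in infos if f.is_file and f.path.endswith(".parquet")]

TARGET = 128 * 1024 * 1024                       # ~128 MB per batch, based on file size
batches, batch, size = [], [], 0
for f in files:
    batch.append(f.path)
    size += f.size
    if size >= TARGET:
        batches.append(batch)
        batch, size = [], 0
if batch:
    batches.append(batch)

for paths in batches:
    df = pq.ParquetDataset(paths).read()         # one ~128 MB Arrow table
    # join_cols was either the single xxhash column or the business columns
    table.upsert(df, join_cols=["row_hash"])     # "row_hash" is a placeholder
```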
   
Upsert timings for the one-month test:

- initial upsert (on empty table): 962,000 rows in 75 seconds
- batch 2 upsert: 967,000 rows in 256 seconds
- batch 3 upsert: 728,000 rows in 328 seconds
   
Then I tried to process one day of data and ran into an OOM error (with delta-rs the same test took 3 seconds to complete successfully).
   
Each batch took longer than the previous one, and the overall runtime was far too long. Next I am going to test whether partial overwrites with a more specific predicate/expression work better, along the lines of the sketch below.
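
This is a hedged sketch of what I mean by "partial overwrite": instead of `.upsert()`, replace only the rows matching a narrow filter and write the new batch in one call. The column name and date values are placeholders; `Table.overwrite` accepts an `overwrite_filter` expression.

```python
# Sketch: overwrite only the slice of the table covered by the incoming batch.
from pyiceberg.expressions import And, GreaterThanOrEqual, LessThan

# Hypothetical predicate bounding one day of data on a placeholder column.
day_filter = And(
    GreaterThanOrEqual("event_date", "2025-01-01"),
    LessThan("event_date", "2025-01-02"),
)

# Deletes rows matching day_filter and appends the new Arrow table `df`.
table.overwrite(df, overwrite_filter=day_filter)
```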

