ldacey commented on issue #2159: URL: https://github.com/apache/iceberg-python/issues/2159#issuecomment-3356870773
Posted this on Slack as well, but here are the results of my initial test with one month of data. I read the Parquet source files in batches of 128 MB (based on file size) and used `.upsert()` to write to Iceberg. I tried the table identifier both as a single column (an xxhash-128 surrogate key) and as multiple columns (the actual business columns).

- initial upsert (on empty table): 962,000 rows in 75 seconds
- batch 2 upsert: 967,000 rows in 256 seconds
- batch 3 upsert: 728,000 rows in 328 seconds

Then I tried to process 1 day of data and ran into an OOM error (with delta-rs the same test took 3 seconds to complete successfully). Each batch took longer than the previous one and the whole run took far too long. I am going to test whether I can use partial overwrites to build a more specific predicate/expression; see the sketch below.
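A minimal sketch of the two approaches described above, assuming PyIceberg >= 0.9 (which added `Table.upsert`); the catalog name, table name, source path, key columns, and date bounds are all hypothetical placeholders, and the 128 MB batching logic is omitted:

```python
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import And, GreaterThanOrEqual, LessThan

catalog = load_catalog("default")                # assumed catalog config
table = catalog.load_table("db.events")          # hypothetical table name

# Hypothetical source batch; in the real test this came from Parquet files
# read in ~128 MB chunks.
source = pq.read_table("s3://bucket/source/2024-01-01.parquet")

# Approach 1: batched upserts keyed on the business columns (or a single
# xxhash-128 surrogate key). This is the pattern that got slower with each
# batch and eventually hit the OOM error described above.
table.upsert(source, join_cols=["account_id", "event_date"])  # assumed keys

# Approach 2: partial overwrite with a narrower predicate, so only the
# affected slice of the table is replaced instead of merging row by row.
table.overwrite(
    source,
    overwrite_filter=And(
        GreaterThanOrEqual("event_date", "2024-01-01"),
        LessThan("event_date", "2024-01-02"),
    ),
)
```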
