mariotaddeucci commented on issue #1202: URL: https://github.com/apache/iceberg-python/issues/1202#issuecomment-2412604690
Hey, I'm actually using it in production for small datasets, in combination with DuckDB, mainly to avoid the small files that web scraping produces.

For ingestion, I read many raw files (JSON, CSV, and Parquet), all of them with a ULID key (a sortable ID is necessary), and call overwrite with this key as the overwrite filter. DuckDB generates a record_batch_reader, which makes it possible to derive the table and schema without loading everything into memory. After creating the table, it's necessary to convert the data into an Arrow table to write the final Iceberg table. Because the ID is sortable, the filter predicate can overwrite just the data between the lower and upper bounds of the dataset being ingested.

Table maintenance still uses Spark for expiring snapshots. To avoid small files, after a certain period I reload the entire dataset using DuckDB's native Iceberg reader and overwrite it fully (a workaround for the rewrite files procedure).

I would love to expand this to more scenarios, but some features are needed first:

- Allow writing from a record_batch_reader, so there is no need to load a full Arrow table in memory.
- Expire snapshots from pyiceberg; that would make maintenance easier, with no external engine or tool.
- Maybe a simple optimization like binpack; it's not the best, but it's better than reading everything and overwriting it.
- Maybe an integration with DuckDB that just takes the latest metadata location and creates a view on it using DuckDB's native Iceberg reader.
- A true merge operation, to avoid errors when doing upserts and to remove the need to use the upper and lower bounds of the DataFrame key as the overwrite filter.

These pipelines are moving off the Spark server and running in isolated containers.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
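For readers who want to try the bounded overwrite described in the comment, here is a minimal sketch, not the commenter's actual code. It assumes the key column is called `id` (a hypothetical name), that the keys are ULID strings (which sort lexicographically and contain no quote characters), and that the filter is passed as a string to pyiceberg's `Table.overwrite(..., overwrite_filter=...)`. The helper names `key_bounds`, `overwrite_filter_for`, and `ingest` are made up for illustration.

```python
from typing import Iterable, Tuple


def key_bounds(keys: Iterable[str]) -> Tuple[str, str]:
    """Return the (lowest, highest) key of a batch.

    ULIDs are lexicographically sortable, so plain string min/max is enough.
    """
    keys = list(keys)
    if not keys:
        raise ValueError("cannot compute bounds of an empty batch")
    return min(keys), max(keys)


def overwrite_filter_for(keys: Iterable[str], column: str = "id") -> str:
    """Build a row-filter string covering the key range of the incoming batch."""
    lower, upper = key_bounds(keys)
    return f"{column} >= '{lower}' and {column} <= '{upper}'"


def ingest(table, arrow_table, column: str = "id") -> None:
    """Replace only the rows whose key falls inside the incoming batch's range.

    Assumptions: `table` is a pyiceberg Table and `arrow_table` is a
    pyarrow.Table that has already been materialized (for example from a
    DuckDB query result).
    """
    keys = arrow_table.column(column).to_pylist()
    table.overwrite(
        arrow_table,
        overwrite_filter=overwrite_filter_for(keys, column),
    )
```

Note that the filter covers the whole key range of the batch, so any existing row whose key falls between the bounds is replaced even if it is not part of the incoming data. That is why the approach depends on batches occupying non-overlapping ranges of a sortable key.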
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org