Re: [I] PyIceberg Production Use case survey [iceberg-python]

via GitHub Mon, 14 Oct 2024 17:55:22 -0700


mariotaddeucci commented on issue #1202:
URL: 
https://github.com/apache/iceberg-python/issues/1202#issuecomment-2412604690


   Hey, I'm actually using it in production for small datasets, in combination 
with DuckDB, mainly to avoid the small files that web scraping produces.
   
   For ingestion, I read many raw files (JSON, CSV, and Parquet), all of them 
with a key based on a ULID (a sortable ID is necessary), and write them with an 
overwrite that uses this key in the overwrite filter.
   DuckDB generates a record_batch_reader, which makes it possible to derive 
the table and schema without loading everything into memory. After creating the 
table, the data still has to be converted into an Arrow table to write the 
final Iceberg table.
   
   Because the ID is sortable, it's possible to use a filter predicate that 
overwrites the data between the lower and upper bounds of the data set being 
ingested.
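
   The ingestion flow above could be sketched roughly as follows. This is a 
minimal sketch, not the exact pipeline: it assumes a recent PyIceberg where 
`Table.overwrite` accepts an `overwrite_filter`, it reads only Parquet for 
brevity, and the names `key_bounds`, `ingest`, `raw_glob`, and the `id` key 
column are hypothetical.

```python
from typing import Iterable, Tuple


def key_bounds(keys: Iterable[str]) -> Tuple[str, str]:
    """Return the (lower, upper) bound of a batch of sortable string keys.

    ULIDs sort lexicographically, so plain string min/max is enough.
    """
    keys = list(keys)
    if not keys:
        raise ValueError("cannot compute bounds of an empty batch")
    return min(keys), max(keys)


def ingest(table, raw_glob: str, key_column: str = "id") -> None:
    """Overwrite the key range covered by the raw files with their contents.

    `table` is assumed to be a pyiceberg Table that already exists.
    """
    # Imported lazily so key_bounds() stays usable without these packages.
    import duckdb
    from pyiceberg.expressions import And, GreaterThanOrEqual, LessThanOrEqual

    con = duckdb.connect()
    # DuckDB exposes the result as a streaming pyarrow RecordBatchReader.
    reader = con.read_parquet(raw_glob).fetch_arrow_reader()
    # overwrite() takes a pyarrow Table, so the stream is materialized here.
    batch = reader.read_all()
    if batch.num_rows == 0:
        return

    lower, upper = key_bounds(batch.column(key_column).to_pylist())
    # Replace every existing row whose key falls inside the batch's range.
    table.overwrite(
        batch,
        overwrite_filter=And(
            GreaterThanOrEqual(key_column, lower),
            LessThanOrEqual(key_column, upper),
        ),
    )
```

   Note that the filter removes every existing row inside the range, so this 
only behaves like an upsert when each batch fully owns its key range.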
   
   Table maintenance still uses Spark for expiring snapshots.
   
   To avoid small files, after a certain period I reload the entire dataset 
using DuckDB's native Iceberg reader and overwrite it fully (a workaround for 
the rewrite files procedure).
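
   A rough sketch of that full-rewrite workaround, under stated assumptions: 
`table` is a pyiceberg Table, DuckDB's `iceberg` extension can be installed, 
and any object-store credentials are already configured in DuckDB. The helper 
names `read_with_duckdb` and `compact_by_full_rewrite` are hypothetical.

```python
from typing import Any, Callable


def read_with_duckdb(metadata_location: str) -> Any:
    """Load the whole table as a pyarrow Table via DuckDB's iceberg_scan."""
    import duckdb  # imported lazily; only needed for the real reader

    con = duckdb.connect()
    con.execute("INSTALL iceberg")
    con.execute("LOAD iceberg")
    # Escape single quotes so the path is a valid SQL string literal.
    escaped = metadata_location.replace("'", "''")
    return con.execute(
        f"SELECT * FROM iceberg_scan('{escaped}')"
    ).fetch_arrow_table()


def compact_by_full_rewrite(
    table: Any, reader: Callable[[str], Any] = read_with_duckdb
) -> int:
    """Read every row, then overwrite the table in a single commit.

    Returns the number of rows rewritten.
    """
    full = reader(table.metadata_location)
    # With no overwrite_filter, overwrite() replaces the whole table.
    table.overwrite(full)
    return full.num_rows
```

   The whole dataset is held in memory during the rewrite, so this only 
suits small tables, and the Arrow schema returned by DuckDB may need casting 
to match the Iceberg schema.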
   
   I would love to expand it to more scenarios, but some features are 
necessary, like:
   
   - Allow writing from a record_batch_reader, so there is no need to load a 
full Arrow table into memory.
   - Clear (expire) snapshots from PyIceberg itself, which would make 
maintenance easier: no external engine or tool.
   - Maybe a simple optimization like binpack. It's not the best, but it's 
better than reading everything and overwriting it.
   - Maybe an integration with DuckDB, just taking the latest metadata location 
and creating a view on it using its native Iceberg reader.
   - A true merge operation, to avoid errors when doing upserts and to make it 
unnecessary to use the upper and lower bounds of the DataFrame key as the 
overwrite filter.
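
   Until such an integration exists, the DuckDB view idea from the list above 
can be approximated by hand. A minimal sketch, assuming `table` is a pyiceberg 
Table (which exposes `metadata_location`) and `con` is a DuckDB connection; 
the helper names `iceberg_view_sql` and `register_view` are hypothetical.

```python
from typing import Any


def iceberg_view_sql(view_name: str, metadata_location: str) -> str:
    """Build a CREATE VIEW statement over one Iceberg metadata file."""
    # Quote the identifier and the path so odd characters stay valid SQL.
    quoted_view = '"' + view_name.replace('"', '""') + '"'
    quoted_path = "'" + metadata_location.replace("'", "''") + "'"
    return (
        f"CREATE OR REPLACE VIEW {quoted_view} AS "
        f"SELECT * FROM iceberg_scan({quoted_path})"
    )


def register_view(con: Any, table: Any, view_name: str) -> None:
    """Point a DuckDB view at the table's current metadata file."""
    con.execute("INSTALL iceberg")
    con.execute("LOAD iceberg")
    con.execute(iceberg_view_sql(view_name, table.metadata_location))
```

   The view is pinned to one metadata file, so it has to be recreated after 
every commit to see new snapshots.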
   
   These pipelines are moving off the Spark server and running in isolated 
containers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

