Re: [I] Minimum Requirement of Data File Name in Apache Iceberg? [iceberg-python]

via GitHub Fri, 16 Feb 2024 04:35:54 -0800


Fokko commented on issue #429:
URL: https://github.com/apache/iceberg-python/issues/429#issuecomment-1948312519


   Thanks @syun64 for raising this!
   
   > The partition 'path' undeniably has a very important significance in 
Iceberg. However, the data file paths/names are all stored in the manifest 
files, and hence one could argue that the specific logic used to generate the 
file name is more a convention than a requirement. If the only requirement is 
that the data file name is unique, it could open up more options in leveraging 
external engines in parallelizing writes, and then using the written files to 
write the manifests, and upwards.
   
   This is true. In Iceberg, unlike classical Hive partitioning, the partition 
path is only a convenience for the user; it is not a requirement, as you 
already explained.
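
   To make that concrete, here is a minimal sketch, assuming a simplified 
stand-in for a manifest entry (`ManifestEntry`, `files_for_partition`, and the 
paths are hypothetical, not PyIceberg's API). The partition values live in the 
metadata, so a file whose path carries no partition directory is still found:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    """Hypothetical, simplified stand-in for one data file entry in a manifest."""
    file_path: str
    partition: dict


# Illustrative entries: the second file has no partition directory in its path.
entries = [
    ManifestEntry("s3://bucket/tbl/data/dt=2024-02-16/a.parquet", {"dt": "2024-02-16"}),
    ManifestEntry("s3://bucket/tbl/data/7f3c-flat-name.parquet", {"dt": "2024-02-16"}),
    ManifestEntry("s3://bucket/tbl/data/dt=2024-02-15/b.parquet", {"dt": "2024-02-15"}),
]


def files_for_partition(entries, **wanted):
    """Select files by the partition values recorded in metadata, never by path."""
    return [
        e.file_path
        for e in entries
        if all(e.partition.get(k) == v for k, v in wanted.items())
    ]


matched = files_for_partition(entries, dt="2024-02-16")
```

Both files for `dt=2024-02-16` are returned, including the one with the flat 
name, because the lookup never parses the path.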
   
   The requirement is that we can trace where the data belongs. For example, 
the partition information in the path is not required, but it makes it 
possible to link a Parquet file to a partition without looking at the 
metadata. For the filename, it is more about seeing which write operation 
created the file.
   
   Looking at the current code, I think it would be good to re-use the 
commit-uuid as the write-uuid: https://github.com/apache/iceberg-python/pull/437
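
   As a rough sketch of the idea (the `data_file_name` helper and the name 
layout are assumptions for illustration, not necessarily the scheme PyIceberg 
uses): if every file written in one commit embeds the same UUID, the names 
stay unique per task while still showing which write operation produced them.

```python
import uuid


def data_file_name(write_uuid: uuid.UUID, task_id: int) -> str:
    """Hypothetical naming helper: zero-padded task id plus the shared write UUID."""
    return f"{task_id:05d}-{write_uuid}.parquet"


# One UUID per commit, reused as the write UUID for every file in that commit.
commit_uuid = uuid.uuid4()

# Three parallel write tasks, each producing one file.
names = [data_file_name(commit_uuid, task_id) for task_id in range(3)]
```

The task id keeps the names distinct within the commit, and the shared UUID 
lets you group files by the write that created them.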
   
   Does this help?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

