Fokko commented on issue #429: URL: https://github.com/apache/iceberg-python/issues/429#issuecomment-1948312519
Thanks @syun64 for raising this!

> The partition 'path' undeniably has a very important significance in Iceberg. However, the data file paths/names are all stored in the manifest files, and hence one could argue that the specific logic used to generate the file name is more a convention than a requirement. If the only requirement is that the data file name is unique, it could open up more options in leveraging external engines in parallelizing writes, and then using the written files to write the manifests, and upwards.

This is true with Iceberg. Contrary to classical Hive partitions, the partition path is only there to help the user; it is not a requirement, as you already explained. The requirement is that we can trace where the data belongs. For example, the partition information in the path is not required, but it makes it possible to link a Parquet file to a partition without looking at the metadata. The filename is more about seeing which write operation created the file. Looking at the current code, I think it would be good to re-use the commit-uuid as the write-uuid: https://github.com/apache/iceberg-python/pull/437

Does this help?
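To make the naming point concrete, here is a minimal sketch of a file name that is unique within a write and can be traced back to the operation that produced it. The helpers `data_file_name` and `write_uuid_of` and the `<task>-<counter>-<write-uuid>.parquet` layout are hypothetical, chosen only for illustration; they are not taken from the PyIceberg code or from the linked PR.

```python
import uuid


def data_file_name(write_uuid: uuid.UUID, task_id: int, counter: int) -> str:
    """Build a data file name that is unique within one write operation.

    Hypothetical layout, for illustration only:
    <task>-<counter>-<write-uuid>.parquet
    """
    return f"{task_id:05d}-{counter}-{write_uuid}.parquet"


def write_uuid_of(file_name: str) -> uuid.UUID:
    """Recover the write-uuid from a name produced by data_file_name."""
    stem = file_name.rsplit(".", 1)[0]
    # The UUID itself contains hyphens, so split off only the two leading fields.
    _task, _counter, raw_uuid = stem.split("-", 2)
    return uuid.UUID(raw_uuid)


# Assumption: the commit's UUID is reused as the write-uuid, as suggested above.
commit_uuid = uuid.uuid4()
names = [data_file_name(commit_uuid, task_id, 0) for task_id in range(3)]
```

Because every file written by the same commit carries the same UUID, listing the files under a table location is enough to group them by write operation, without opening the manifests.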