syun64 opened a new issue, #429: URL: https://github.com/apache/iceberg-python/issues/429
### Question

Hi folks,

I was chatting with @jaychia the other day and we were both wondering about the importance of the file naming convention in Apache Iceberg. Currently, the PyIceberg and Java implementations seem to use slightly different logic to generate the unique data file name.

**(UriScheme)://(TableDataLocationPrefix)/(PartitionPath)/(FileName).(Extension)**

e.g. **s3a://warehouse-dev/table/data/DATE=2024-01-20/00008-6616-9bc4befb-04af-4fb3-b3ac-1b32de91349a-00001.parquet**

1. PyIceberg: https://github.com/apache/iceberg-python/blob/3a158957b692c7a962bee46905ea1b64c5bffd5e/pyiceberg/table/__init__.py#L2321
2. Java Iceberg: https://github.com/apache/iceberg/blob/a582968975dd30ff4917fbbe999f1be903efac02/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java#L92-L101

The partition 'path' undeniably carries real significance in Iceberg. However, the data file paths/names are all stored in the manifest files, so one could argue that the specific logic used to generate the file name is more a convention than a requirement.

If the only requirement is that the data file name is unique, that would open up more options for using external engines to parallelize writes, and then building the manifests (and the metadata above them) from the files those engines have written.

So, what is the requirement for the data **file name** in Apache Iceberg? Is it simply that it is unique?
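
To make the convention above concrete, here is a minimal Python sketch that assembles a path of that shape. It is not the code from either linked implementation. The function name `build_data_file_path` and the reading of the file-name fields (a zero-padded partition id, a task id, a per-write UUID, and a zero-padded file counter) are assumptions inferred from the example path, and the URI scheme is simply folded into the table data location.

```python
import uuid
from typing import Optional


def build_data_file_path(
    table_data_location: str,
    partition_path: str,
    partition_id: int,
    task_id: int,
    file_count: int,
    operation_id: Optional[str] = None,
    extension: str = "parquet",
) -> str:
    """Assemble (TableDataLocation)/(PartitionPath)/(FileName).(Extension).

    Hypothetical helper: the field layout mirrors the example path in this
    issue and is not taken from PyIceberg or the Java library.
    """
    # A fresh UUID per write operation is what makes the name unique
    # across independent writers in this sketch.
    operation_id = operation_id or str(uuid.uuid4())

    # Assumed layout: <partition id, 5 digits>-<task id>-<uuid>-<counter, 5 digits>
    file_name = f"{partition_id:05d}-{task_id}-{operation_id}-{file_count:05d}"

    parts = [
        table_data_location.rstrip("/"),
        partition_path.strip("/"),  # empty for an unpartitioned table
        f"{file_name}.{extension}",
    ]
    # Drop empty segments so an unpartitioned table does not get "//".
    return "/".join(part for part in parts if part)


path = build_data_file_path(
    table_data_location="s3a://warehouse-dev/table/data",
    partition_path="DATE=2024-01-20",
    partition_id=8,
    task_id=6616,
    file_count=1,
    operation_id="9bc4befb-04af-4fb3-b3ac-1b32de91349a",
)
print(path)
```

In this sketch, uniqueness comes entirely from the UUID. The other fields only make the name easier to trace back to a writer, which is the distinction the question is asking about.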