syun64 opened a new issue, #429:
URL: https://github.com/apache/iceberg-python/issues/429

   ### Question
   
   Hi folks, I was chatting with @jaychia the other day, and we were both 
wondering how much the file naming convention matters in Apache Iceberg. 
Currently, PyIceberg and the Java implementation seem to use slightly different 
logic to generate unique data file names. Data file paths follow this layout:
   
   
**(UriScheme)://(TableDataLocationPrefix)/(PartitionPath)/(FileName).(Extension)**
   
   e.g.
   
   
**s3a://warehouse-dev/table/data/DATE=2024-01-20/00008-6616-9bc4befb-04af-4fb3-b3ac-1b32de91349a-00001.parquet**
   
   1. PyIceberg: 
https://github.com/apache/iceberg-python/blob/3a158957b692c7a962bee46905ea1b64c5bffd5e/pyiceberg/table/__init__.py#L2321
 
   2. Java Iceberg: 
https://github.com/apache/iceberg/blob/a582968975dd30ff4917fbbe999f1be903efac02/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java#L92-L101
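
To make the layout above concrete, here is a minimal sketch that splits the example path into its four parts. The function name `split_data_file_path` and its `table_data_location` argument are hypothetical and only for illustration; this is not how PyIceberg or the Java library handles paths.

```python
import posixpath
from urllib.parse import urlparse


def split_data_file_path(path: str, table_data_location: str) -> dict:
    """Split a data file path into the parts named in the layout.

    Hypothetical helper: `table_data_location` is the full data location
    of the table, including the URI scheme.
    """
    location = table_data_location.rstrip("/")
    if not path.startswith(location + "/"):
        raise ValueError("path is not under the table data location")

    scheme = urlparse(path).scheme
    # Everything after the data location: partition path plus file name.
    relative = path[len(location) + 1:]
    partition_path, file_name_with_ext = posixpath.split(relative)
    file_name, extension = posixpath.splitext(file_name_with_ext)

    return {
        "uri_scheme": scheme,
        # Location without the leading "<scheme>://".
        "table_data_location_prefix": location[len(scheme) + 3:],
        "partition_path": partition_path,
        "file_name": file_name,
        "extension": extension.lstrip("."),
    }


parts = split_data_file_path(
    "s3a://warehouse-dev/table/data/DATE=2024-01-20/"
    "00008-6616-9bc4befb-04af-4fb3-b3ac-1b32de91349a-00001.parquet",
    "s3a://warehouse-dev/table/data",
)
```

Only `partition_path` carries meaning derived from the table's partitioning; `file_name` is the part this question is about.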
   
   The partition path clearly matters in Iceberg. However, the data file 
paths and names are all stored in the manifest files, so one could argue that 
the specific logic used to generate the file name is a convention rather than a 
requirement. If the only requirement is that each data file name is unique, 
that would open up more options for using external engines to parallelize 
writes, and then building the manifests, and the metadata above them, from the 
written files.
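
As a rough sketch of what "unique is enough" could look like for an external writer, the snippet below names each file with a random UUID. It assumes that uniqueness really is the only requirement, which is exactly the open question here. The function `new_data_file_path` is hypothetical, and the naming scheme deliberately differs from what PyIceberg and Java generate.

```python
import uuid


def new_data_file_path(
    table_data_location: str,
    partition_path: str,
    extension: str = "parquet",
) -> str:
    """Build a data file path whose file name is just a random UUID.

    Hypothetical helper: assumes the file name only needs to be unique.
    """
    file_name = str(uuid.uuid4())
    parts = [table_data_location.rstrip("/")]
    if partition_path:
        # Unpartitioned tables have no partition path segment.
        parts.append(partition_path.strip("/"))
    parts.append(f"{file_name}.{extension}")
    return "/".join(parts)


# Many independent writers can generate names without coordinating.
paths = [
    new_data_file_path("s3a://warehouse-dev/table/data", "DATE=2024-01-20")
    for _ in range(1000)
]
```

The written paths would then be handed to whatever process writes the manifests.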
   
   So, what are the requirements for the data **file name** in Apache Iceberg? 
Is it simply that each name is unique?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


