[I] Minimum Requirement of Data File Name in Apache Iceberg? [iceberg-python]

via GitHub Wed, 14 Feb 2024 06:31:57 -0800


syun64 opened a new issue, #429:
URL: https://github.com/apache/iceberg-python/issues/429

### Question

Hi folks, I was chatting with @jaychia the other day, and we were both
wondering how important the file naming convention is in Apache Iceberg.
Currently, PyIceberg and the Java implementation seem to use slightly different
logic to generate the unique data file name.

**(UriScheme)://(TableDataLocationPrefix)/(PartitionPath)/(FileName).(Extension)**

e.g.

**s3a://warehouse-dev/table/data/DATE=2024-01-20/00008-6616-9bc4befb-04af-4fb3-b3ac-1b32de91349a-00001.parquet**
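
To make the pattern concrete, here is a minimal sketch that splits the example path into those components. `split_data_file_path` is a hypothetical helper, not part of PyIceberg or the Java library, and it assumes a single-level partition path.

```python
import posixpath


def split_data_file_path(path: str) -> dict:
    """Split a data file path into the components of the pattern above.

    Hypothetical helper for illustration only. Assumes the partition path
    is exactly one directory level (e.g. DATE=2024-01-20).
    """
    scheme, rest = path.split("://", 1)
    directory, base_name = posixpath.split(rest)
    file_name, extension = posixpath.splitext(base_name)
    prefix, partition_path = posixpath.split(directory)
    return {
        "scheme": scheme,
        "prefix": prefix,
        "partition_path": partition_path,
        "file_name": file_name,
        "extension": extension.lstrip("."),
    }


parts = split_data_file_path(
    "s3a://warehouse-dev/table/data/DATE=2024-01-20/"
    "00008-6616-9bc4befb-04af-4fb3-b3ac-1b32de91349a-00001.parquet"
)
```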

1. PyIceberg:
https://github.com/apache/iceberg-python/blob/3a158957b692c7a962bee46905ea1b64c5bffd5e/pyiceberg/table/__init__.py#L2321

2. Java Iceberg:
https://github.com/apache/iceberg/blob/a582968975dd30ff4917fbbe999f1be903efac02/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java#L92-L101

The partition 'path' is undeniably significant in Iceberg. However, the data
file paths/names are all stored in the manifest files, and hence one could
argue that the specific logic used to generate the file name is more a
convention than a requirement. If the only requirement is that the data file
name is unique, it could open up more options for leveraging external engines
to parallelize writes, and then using the written files to write the
manifests and the metadata above them.
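
As a rough sketch of what "unique is the only requirement" could look like, assuming that really is the only constraint (which is exactly the open question here): an external writer could name files with a random UUID and nothing else. `make_data_file_path` is a hypothetical function, not the logic used by PyIceberg or Java Iceberg.

```python
import uuid


def make_data_file_path(
    table_data_location: str, partition_path: str, extension: str = "parquet"
) -> str:
    """Build a data file path whose file name is just a random UUID.

    Hypothetical sketch: it assumes uniqueness is the only requirement on
    the file name, which this issue is asking about.
    """
    file_name = uuid.uuid4()
    location = table_data_location.rstrip("/")
    partition = partition_path.strip("/")
    return f"{location}/{partition}/{file_name}.{extension}"


# Independent writers can call this without coordinating with each other.
paths = {
    make_data_file_path("s3a://warehouse-dev/table/data", "DATE=2024-01-20")
    for _ in range(1000)
}
```

The resulting paths would then be recorded in the manifests after the files are written, which is the step that would need the paths to be valid for Iceberg.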

So, what is the requirement for the data **file name** in Apache Iceberg? Is
it simply that each name is unique?

