[I] When write.object-storage.enabled=true, it is difficult to gather information for individual partition of partitioned tables [iceberg]

via GitHub Thu, 07 Nov 2024 23:17:21 -0800


borderlayout opened a new issue, #11488:
URL: https://github.com/apache/iceberg/issues/11488


   ### Feature Request / Improvement
   
   Hi all,
       When Iceberg writes to Amazon S3 object storage, requests that target 
the same path can be throttled. Setting the parameter 
write.object-storage.enabled=true adds a hash component to each data file 
path, so files that would otherwise share one path are spread across different 
paths, which avoids the S3 throttling issue.
   
   (see: https://iceberg.apache.org/docs/nightly/docs/configuration/?h=write.object+storage.enabled#write-properties)
   
   However, I encountered a problem: for partitioned tables, the hash value is 
inserted into the path before the partition directories. This makes it 
difficult to gather information about an individual partition, such as the 
number of files or the total file size in that partition.
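
   To illustrate the kind of per-partition summary I mean, here is a minimal 
Python sketch. It assumes the partition directories are the only path segments 
containing `=` (as in the examples in this issue) and that a listing of (path, 
size) pairs is already available from some source. The helper names 
`partition_of` and `summarize` and the sizes are made up for illustration; 
this is not an Iceberg API.

```python
from collections import defaultdict


def partition_of(path):
    """Return the partition portion of a data file path, e.g. 'parCol=2024-01-10'."""
    directories = path.split("/")[:-1]  # drop the file name
    # Assumption: partition directories are the only ones containing '='.
    return "/".join(d for d in directories if "=" in d)


def summarize(files):
    """Map each partition to (file_count, total_bytes).

    `files` is an iterable of (path, size_in_bytes) pairs.
    """
    stats = defaultdict(lambda: [0, 0])
    for path, size in files:
        entry = stats[partition_of(path)]
        entry[0] += 1
        entry[1] += size
    return {partition: tuple(entry) for partition, entry in stats.items()}


# Paths are the one-partition-column examples from this issue;
# the sizes are placeholders, not real measurements.
example_files = [
    ("bucket/iceberg_test1/data/_44Xmw/parCol=2024-01-10/"
     "00295-2798-63356e4e-b4ec-4a80-ae3f-6888f2f7eac9-0-00003.parquet", 100),
    ("bucket/iceberg_test1/data/_5l5dQ/parCol=2024-01-09/"
     "00063-2566-63356e4e-b4ec-4a80-ae3f-6888f2f7eac9-0-00006.parquet", 200),
]
print(summarize(example_files))
```

   With the current layout this requires scanning every hash directory under 
data/, because the files of one partition do not share a common prefix.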
   
   Is there a reason for designing it this way? Would it be a better approach 
to put the hash value after the partition fields?
   
   - one partition column (parCol):
   
   
bucket/iceberg_test1/data/_44Xmw/parCol=2024-01-10/00295-2798-63356e4e-b4ec-4a80-ae3f-6888f2f7eac9-0-00003.parquet
   
bucket/iceberg_test1/data/_5l5dQ/parCol=2024-01-09/00063-2566-63356e4e-b4ec-4a80-ae3f-6888f2f7eac9-0-00006.parquet
   
   ==changed==> 
   
bucket/iceberg_test1/data/parCol=2024-01-10/_44Xmw/00295-2798-63356e4e-b4ec-4a80-ae3f-6888f2f7eac9-0-00003.parquet
   
bucket/iceberg_test1/data/parCol=2024-01-09/_5l5dQ/00063-2566-63356e4e-b4ec-4a80-ae3f-6888f2f7eac9-0-00006.parquet
   
   - two partition columns (parCol, gender):
   
   
bucket/iceberg_test3/data/APigWw/parCol=2024-01-01/gender=male/00001-7234-7e44c302-a716-4da8-9ea0-0c44caf9a249-0-00003.parquet
   
bucket/iceberg_test3/data/4Z-_sw/parCol=2024-01-01/gender=male/00001-7234-7e44c302-a716-4da8-9ea0-0c44caf9a249-0-00001.parquet
   
   ==changed==> 
   
bucket/iceberg_test3/data/parCol=2024-01-01/gender=male/APigWw/00001-7234-7e44c302-a716-4da8-9ea0-0c44caf9a249-0-00003.parquet
   
bucket/iceberg_test3/data/parCol=2024-01-01/gender=male/4Z-_sw/00001-7234-7e44c302-a716-4da8-9ea0-0c44caf9a249-0-00001.parquet
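
   The change proposed above can be expressed as a small path transformation. 
The Python sketch below is only an illustration of the requested layout, not 
how Iceberg generates paths. It assumes the hash is the single segment 
directly after `data` and that partition directories contain `=`; the function 
name `move_hash_after_partitions` is hypothetical.

```python
def move_hash_after_partitions(path):
    """Move the hash segment from before the partition directories to after them."""
    parts = path.split("/")
    data_index = parts.index("data")   # assumption: the first 'data' segment is the data dir
    hash_segment = parts[data_index + 1]
    rest = parts[data_index + 2:]      # partition directories followed by the file name

    # Count the leading partition directories (key=value), never the file name itself.
    n = 0
    while n < len(rest) - 1 and "=" in rest[n]:
        n += 1

    return "/".join(parts[:data_index + 1] + rest[:n] + [hash_segment] + rest[n:])


# Example path taken from the two-partition-column case in this issue.
current = ("bucket/iceberg_test3/data/APigWw/parCol=2024-01-01/gender=male/"
           "00001-7234-7e44c302-a716-4da8-9ea0-0c44caf9a249-0-00003.parquet")
print(move_hash_after_partitions(current))
```

   With this layout, all files of one partition share the prefix up to the 
last partition directory, so a prefix listing is enough to count them.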
   
   ### Query engine
   
   Spark
   
   ### Willingness to contribute
   
   - [ ] I can contribute this improvement/feature independently
   - [ ] I would be willing to contribute this improvement/feature with 
guidance from the Iceberg community
   - [ ] I cannot contribute this improvement/feature at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


