smaheshwar-pltr commented on code in PR #1537:
URL: https://github.com/apache/iceberg-python/pull/1537#discussion_r1921103030
##########
mkdocs/docs/configuration.md:
##########
@@ -195,6 +198,85 @@ PyIceberg uses [S3FileSystem](https://arrow.apache.org/docs/python/generated/pya
 <!-- markdown-link-check-enable-->
+## Location Providers
+
+Iceberg works with the concept of a LocationProvider that determines the file paths for a table's data. PyIceberg
+introduces a pluggable LocationProvider module; the LocationProvider used may be specified on a per-table basis via
+table properties. PyIceberg defaults to the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider),
+which generates file paths that are optimised for object storage.
+
+### SimpleLocationProvider
+
+The SimpleLocationProvider places file names underneath a `data` directory in the table's storage location. For example,
+a non-partitioned table might have a data file with location:
+
+```txt
+s3://bucket/ns/table/data/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+When data is partitioned, the files under a given partition are grouped into a subdirectory, with that partition key
+and value as the directory name. For example, a table partitioned over a string column `category` might have a data file
+with location:
+
+```txt
+s3://bucket/ns/table/data/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+The SimpleLocationProvider is enabled for a table by explicitly setting its `write.object-storage.enabled` table property to `false`.
+
+### ObjectStoreLocationProvider
+
+When several files are stored under the same prefix, cloud object stores such as S3 often [throttling requests on prefixes](https://repost.aws/knowledge-center/http-5xx-errors-s3),
+resulting in slowdowns.
+
+The ObjectStoreLocationProvider counteracts this by injecting deterministic hashes, in the form of binary directories,
+into file paths, to distribute files across a larger number of object store prefixes.
+
+Paths contain partitions just before the file name, and a `data` directory beneath the table's location, in a similar

Review Comment:
   See https://github.com/apache/iceberg-python/pull/1537#discussion_r1921100230 re `data` here.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org