smaheshwar-pltr commented on code in PR #1537:
URL: https://github.com/apache/iceberg-python/pull/1537#discussion_r1921102006


##########
mkdocs/docs/configuration.md:
##########
@@ -195,6 +198,85 @@ PyIceberg uses 
[S3FileSystem](https://arrow.apache.org/docs/python/generated/pya
 
 <!-- markdown-link-check-enable-->
 
+## Location Providers
+
+Iceberg works with the concept of a LocationProvider that determines the file 
paths for a table's data. PyIceberg
+introduces a pluggable LocationProvider module; the LocationProvider used may 
be specified on a per-table basis via
+table properties. PyIceberg defaults to the 
[ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider),
+which generates file paths that are optimised for object storage.
+
+### SimpleLocationProvider
+
+The SimpleLocationProvider places file names underneath a `data` directory in 
the table's storage location. For example,
+a non-partitioned table might have a data file with location:
+
+```txt
+s3://bucket/ns/table/data/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+When data is partitioned, the files under a given partition are grouped into a 
subdirectory, with that partition key
+and value as the directory name. For example, a table partitioned over a 
string column `category` might have a data file
+with location:
+
+```txt
+s3://bucket/ns/table/data/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+The SimpleLocationProvider is enabled for a table by explicitly setting its 
`write.object-storage.enabled` table property to `false`.
+
+### ObjectStoreLocationProvider
+
+When several files are stored under the same prefix, cloud object stores such 
as S3 often [throttle requests on 
prefixes](https://repost.aws/knowledge-center/http-5xx-errors-s3),
+resulting in slowdowns.
+
+The ObjectStoreLocationProvider counteracts this by injecting deterministic 
hashes, in the form of binary directories,
+into file paths, to distribute files across a larger number of object store 
prefixes.
+
+Paths contain partitions just before the file name, and a `data` directory 
beneath the table's location, in a similar
+manner to the 
[SimpleLocationProvider](configuration.md#simplelocationprovider). For example, 
a table partitioned over a string
+column `category` might have a data file with location: (note the additional 
binary directories)
+
+```txt
+s3://bucket/ns/table/data/0101/0110/1001/10110010/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+The `write.object-storage.enabled` table property determines whether the 
ObjectStoreLocationProvider is enabled for a
+table. It is used by default.
+
+#### Partition Exclusion
+
+When the ObjectStoreLocationProvider is used, the table property 
`write.object-storage.partitioned-paths`, which
+defaults to `true`, can be set to `false` as an additional optimisation for 
object stores. This omits partition keys and values from data
+file paths *entirely* to further reduce key size. With it disabled, the same 
data file above would instead be written
+to: (note the absence of `category=orders`)
+
+```txt
+s3://bucket/ns/table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+### Loading a Custom LocationProvider
+
+Similar to FileIO, a custom LocationProvider may be provided for a table by 
concretely subclassing the abstract base
+class 
[LocationProvider](../reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider).
 The

Review Comment:
   I wanted to link to 
[this](https://github.com/apache/iceberg-python/pull/1537/files#r1921101125). 
The link works for me locally, but serving the docs produces the following 
warning:
   
   ```
   INFO - Doc file 'configuration.md' contains an unrecognized relative link '../reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider', it was left as is. Did you mean 'reference/pyiceberg/table/locations.md#pyiceberg.table.locations.LocationProvider'?
   ```
   
   But I get a similar warning elsewhere: `Doc file 'SUMMARY.md' contains an 
unrecognized relative link 'reference/', it was left as is.` So maybe this is 
fine.
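
For readers following the docs text quoted above, here is a rough, self-contained sketch of the path layout it describes: binary hash directories under `data`, then the optional partition directory, then the file name. It does not use PyIceberg. The function names are hypothetical, and `hashlib.sha256` is only a stand-in for whatever hash the real ObjectStoreLocationProvider uses, so the directories it produces will not match the example paths in the diff.

```python
import hashlib
from typing import Optional


def hash_directories(file_name: str) -> str:
    """Return a deterministic prefix shaped like 0101/0110/1001/10110010.

    Assumption: sha256 is a stand-in hash chosen for illustration only.
    """
    digest = hashlib.sha256(file_name.encode("utf-8")).digest()
    # Keep the top 20 bits of the first three digest bytes.
    top_bits = int.from_bytes(digest[:3], "big") >> 4
    binary = format(top_bits, "020b")
    # Three 4-bit directories followed by one 8-bit directory.
    return "/".join([binary[0:4], binary[4:8], binary[8:12], binary[12:20]])


def object_store_location(
    table_location: str,
    file_name: str,
    partition_path: Optional[str] = None,
    include_partition: bool = True,
) -> str:
    """Build a data file path following the layout shown in the docs examples."""
    prefix = f"{table_location.rstrip('/')}/data/{hash_directories(file_name)}"
    if not include_partition:
        # Partition exclusion: the hash is joined to the file name with a dash.
        return f"{prefix}-{file_name}"
    if partition_path:
        return f"{prefix}/{partition_path}/{file_name}"
    return f"{prefix}/{file_name}"


if __name__ == "__main__":
    name = "0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet"
    print(object_store_location("s3://bucket/ns/table", name, "category=orders"))
    print(object_store_location("s3://bucket/ns/table", name, "category=orders", include_partition=False))
```

The `include_partition` flag plays the role of `write.object-storage.partitioned-paths` in this sketch: setting it to `False` drops `category=orders` from the path entirely.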



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

