kevinjqliu commented on code in PR #1537:
URL: https://github.com/apache/iceberg-python/pull/1537#discussion_r1921620305


##########
mkdocs/docs/configuration.md:
##########
@@ -195,6 +198,86 @@ PyIceberg uses [S3FileSystem](https://arrow.apache.org/docs/python/generated/pya
 
 <!-- markdown-link-check-enable-->
 
+## Location Providers
+
+Iceberg works with the concept of a LocationProvider that determines file paths for a table's data. PyIceberg

Review Comment:
   ```suggestion
   Apache Iceberg uses the concept of a `LocationProvider` to manage file paths for a table's data. In PyIceberg, the `LocationProvider` module is designed to be pluggable, allowing customization for specific use cases. The `LocationProvider` for a table can be specified through table properties.
   
   PyIceberg defaults to the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider),
   which generates file paths that are optimized for object storage
   ```



##########
mkdocs/docs/configuration.md:
##########
@@ -54,15 +54,18 @@ Iceberg tables support table properties to configure table behavior.
 
 ### Write options
 
-| Key                                    | Options                           | Default | Description                                                                                 |
-| -------------------------------------- | --------------------------------- | ------- | ------------------------------------------------------------------------------------------- |
-| `write.parquet.compression-codec`      | `{uncompressed,zstd,gzip,snappy}` | zstd    | Sets the Parquet compression coddec.                                                        |
-| `write.parquet.compression-level`      | Integer                           | null    | Parquet compression level for the codec. If not set, it is up to PyIceberg                  |
-| `write.parquet.row-group-limit`        | Number of rows                    | 1048576 | The upper bound of the number of entries within a single row group                          |
-| `write.parquet.page-size-bytes`        | Size in bytes                     | 1MB     | Set a target threshold for the approximate encoded size of data pages within a column chunk |
-| `write.parquet.page-row-limit`         | Number of rows                    | 20000   | Set a target threshold for the maximum number of rows within a column chunk                 |
-| `write.parquet.dict-size-bytes`        | Size in bytes                     | 2MB     | Set the dictionary page size limit per row group                                            |
-| `write.metadata.previous-versions-max` | Integer                           | 100     | The max number of previous version metadata files to keep before deleting after commit.     |
+| Key                                      | Options                           | Default | Description                                                                                                                         |
+|------------------------------------------|-----------------------------------|---------|-------------------------------------------------------------------------------------------------------------------------------------|
+| `write.parquet.compression-codec`        | `{uncompressed,zstd,gzip,snappy}` | zstd    | Sets the Parquet compression coddec.                                                                                                |
+| `write.parquet.compression-level`        | Integer                           | null    | Parquet compression level for the codec. If not set, it is up to PyIceberg                                                          |
+| `write.parquet.row-group-limit`          | Number of rows                    | 1048576 | The upper bound of the number of entries within a single row group                                                                  |
+| `write.parquet.page-size-bytes`          | Size in bytes                     | 1MB     | Set a target threshold for the approximate encoded size of data pages within a column chunk                                         |
+| `write.parquet.page-row-limit`           | Number of rows                    | 20000   | Set a target threshold for the maximum number of rows within a column chunk                                                         |
+| `write.parquet.dict-size-bytes`          | Size in bytes                     | 2MB     | Set the dictionary page size limit per row group                                                                                    |
+| `write.metadata.previous-versions-max`   | Integer                           | 100     | The max number of previous version metadata files to keep before deleting after commit.                                             |
+| `write.object-storage.enabled`           | Boolean                           | True    | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths    |
+| `write.object-storage.partitioned-paths` | Boolean                           | True    | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled |

Review Comment:
   hyperlinks are weird sometimes, can you make sure that these work as intended
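   As an aside on these write options: table properties travel as string values, so client code has to coerce them before use. A minimal sketch of that coercion (the helper below is hypothetical, not PyIceberg's actual API):

   ```python
   # Hypothetical helper: table property values are strings, so booleans like
   # "write.object-storage.enabled" must be parsed before use.
   def property_as_bool(properties: dict, key: str, default: bool) -> bool:
       value = properties.get(key)
       if value is None:
           return default
       return value.strip().lower() == "true"


   props = {"write.object-storage.enabled": "false"}
   print(property_as_bool(props, "write.object-storage.enabled", True))            # False
   print(property_as_bool(props, "write.object-storage.partitioned-paths", True))  # True
   ```

   Note how the second lookup falls back to the documented default when the property is absent, which is why the defaults in the table above matter.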



##########
mkdocs/docs/configuration.md:
##########
@@ -195,6 +198,85 @@ PyIceberg uses [S3FileSystem](https://arrow.apache.org/docs/python/generated/pya
 
 <!-- markdown-link-check-enable-->
 
+## Location Providers
+
+Iceberg works with the concept of a LocationProvider that determines the file paths for a table's data. PyIceberg
+introduces a pluggable LocationProvider module; the LocationProvider used may be specified on a per-table basis via
+table properties. PyIceberg defaults to the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider),
+which generates file paths that are optimised for object storage.
+
+### SimpleLocationProvider
+
+The SimpleLocationProvider places file names underneath a `data` directory in the table's storage location. For example,
+a non-partitioned table might have a data file with location:
+
+```txt
+s3://bucket/ns/table/data/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+When data is partitioned, the files under a given partition are grouped into a subdirectory, with that partition key
+and value as the directory name. For example, a table partitioned over a string column `category` might have a data file
+with location:
+
+```txt
+s3://bucket/ns/table/data/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+The SimpleLocationProvider is enabled for a table by explicitly setting its `write.object-storage.enabled` table property to `false`.
+
+### ObjectStoreLocationProvider
+
+When several files are stored under the same prefix, cloud object stores such as S3 often [throttling requests on prefixes](https://repost.aws/knowledge-center/http-5xx-errors-s3),
+resulting in slowdowns.
+
+The ObjectStoreLocationProvider counteracts this by injecting deterministic hashes, in the form of binary directories,
+into file paths, to distribute files across a larger number of object store prefixes.
+
+Paths contain partitions just before the file name, and a `data` directory beneath the table's location, in a similar
+manner to the [SimpleLocationProvider](configuration.md#simplelocationprovider). For example, a table partitioned over a string
+column `category` might have a data file with location: (note the additional binary directories)
+
+```txt
+s3://bucket/ns/table/data/0101/0110/1001/10110010/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+The `write.object-storage.enabled` table property determines whether the ObjectStoreLocationProvider is enabled for a
+table. It is used by default.
+
+#### Partition Exclusion
+
+When the ObjectStoreLocationProvider is used, the table property `write.object-storage.partitioned-paths`, which
+defaults to `true`, can be set to `false` as an additional optimisation for object stores. This omits partition keys and values from data
+file paths *entirely* to further reduce key size. With it disabled, the same data file above would instead be written
+to: (note the absence of `category=orders`)
+
+```txt
+s3://bucket/ns/table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+### Loading a Custom LocationProvider
+
+Similar to FileIO, a custom LocationProvider may be provided for a table by concretely subclassing the abstract base
+class [LocationProvider](../reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider). The

Review Comment:
   yea i think its fine, as long as the hyperlink works when you run it locally



##########
mkdocs/docs/configuration.md:
##########
@@ -195,6 +198,86 @@ PyIceberg uses [S3FileSystem](https://arrow.apache.org/docs/python/generated/pya
 
 <!-- markdown-link-check-enable-->
 
+## Location Providers
+
+Iceberg works with the concept of a LocationProvider that determines file paths for a table's data. PyIceberg
+introduces a pluggable LocationProvider module; the LocationProvider used may be specified on a per-table basis via
+table properties. PyIceberg defaults to the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider),
+which generates file paths that are optimized for object storage.
+
+### SimpleLocationProvider
+
+The SimpleLocationProvider places file names underneath a `data` directory in the table's storage location. For example,
+a non-partitioned table might have a data file with location:
+
+```txt
+s3://bucket/ns/table/data/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+When data is partitioned, files under a given partition are grouped into a subdirectory, with that partition key
+and value as the directory name. For example, a table partitioned over a string column `category` might have a data file
+with location:
+
+```txt
+s3://bucket/ns/table/data/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+The SimpleLocationProvider is enabled for a table by explicitly setting its `write.object-storage.enabled` table
+property to `False`.
+
+### ObjectStoreLocationProvider
+
+When several files are stored under the same prefix, cloud object stores such as S3 often [throttle requests on prefixes](https://repost.aws/knowledge-center/http-5xx-errors-s3),
+resulting in slowdowns.
+
+The ObjectStoreLocationProvider counteracts this by injecting deterministic hashes, in the form of binary directories,
+into file paths, to distribute files across a larger number of object store prefixes.
+
+Paths contain partitions just before the file name and a `data` directory beneath the table's location, in a similar
+manner to the [SimpleLocationProvider](configuration.md#simplelocationprovider). For example, a table partitioned over a string
+column `category` might have a data file with location: (note the additional binary directories)
+
+```txt
+s3://bucket/ns/table/data/0101/0110/1001/10110010/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+The `write.object-storage.enabled` table property determines whether the ObjectStoreLocationProvider is enabled for a
+table. It is used by default.
+
+#### Partition Exclusion
+
+When the ObjectStoreLocationProvider is used, the table property `write.object-storage.partitioned-paths`, which
+defaults to `True`, can be set to `False` as an additional optimization for object stores. This omits partition keys and
+values from data file paths *entirely* to further reduce key size. With it disabled, the same data file above would
+instead be written to: (note the absence of `category=orders`)
+
+```txt
+s3://bucket/ns/table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+### Loading a Custom LocationProvider
+
+Similar to FileIO, a custom LocationProvider may be provided for a table by concretely subclassing the abstract base
+class [LocationProvider](../reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider). The
+table property `write.py-location-provider.impl` should be set to the fully-qualified name of the custom
+LocationProvider (i.e. `module.CustomLocationProvider`). Recall that a LocationProvider is configured per-table,
+permitting different location provision for different tables.

Review Comment:
   also mention that java uses `write.location-provider.impl`
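   To make the review point concrete, a rough sketch of the shape such a subclass might take. The class below deliberately does not import pyiceberg; in real code it would subclass `pyiceberg.table.locations.LocationProvider`, and the constructor and method signatures shown here are assumptions to be checked against that module:

   ```python
   # Illustrative stand-in for a pyiceberg.table.locations.LocationProvider
   # subclass; signatures here are assumptions, not the verified PyIceberg API.
   class FlatLocationProvider:
       def __init__(self, table_location: str, table_properties: dict):
           self.table_location = table_location.rstrip("/")
           self.table_properties = table_properties

       def new_data_location(self, data_file_name: str, partition_key=None) -> str:
           # Ignore partitioning and place every data file directly under data/.
           return f"{self.table_location}/data/{data_file_name}"


   provider = FlatLocationProvider("s3://bucket/ns/table", {})
   print(provider.new_data_location("file.parquet"))
   # s3://bucket/ns/table/data/file.parquet
   ```

   The table property `write.py-location-provider.impl` would then point at the fully-qualified class name (e.g. the hypothetical `mymodule.FlatLocationProvider`); as the reviewer notes, the Java implementation uses `write.location-provider.impl` instead.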



##########
mkdocs/docs/configuration.md:
##########
@@ -195,6 +198,86 @@ PyIceberg uses [S3FileSystem](https://arrow.apache.org/docs/python/generated/pya
 
 <!-- markdown-link-check-enable-->
 
+## Location Providers
+
+Iceberg works with the concept of a LocationProvider that determines file paths for a table's data. PyIceberg
+introduces a pluggable LocationProvider module; the LocationProvider used may be specified on a per-table basis via
+table properties. PyIceberg defaults to the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider),
+which generates file paths that are optimized for object storage.
+
+### SimpleLocationProvider
+
+The SimpleLocationProvider places file names underneath a `data` directory in the table's storage location. For example,
+a non-partitioned table might have a data file with location:
+
+```txt
+s3://bucket/ns/table/data/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+When data is partitioned, files under a given partition are grouped into a subdirectory, with that partition key

Review Comment:
   ```suggestion
   When the table is partitioned, files under a given partition are grouped into a subdirectory, with that partition key
   ```



##########
mkdocs/docs/configuration.md:
##########
@@ -195,6 +198,86 @@ PyIceberg uses [S3FileSystem](https://arrow.apache.org/docs/python/generated/pya
 
 <!-- markdown-link-check-enable-->
 
+## Location Providers
+
+Iceberg works with the concept of a LocationProvider that determines file paths for a table's data. PyIceberg
+introduces a pluggable LocationProvider module; the LocationProvider used may be specified on a per-table basis via
+table properties. PyIceberg defaults to the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider),
+which generates file paths that are optimized for object storage.
+
+### SimpleLocationProvider
+
+The SimpleLocationProvider places file names underneath a `data` directory in the table's storage location. For example,
+a non-partitioned table might have a data file with location:
+
+```txt
+s3://bucket/ns/table/data/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+When data is partitioned, files under a given partition are grouped into a subdirectory, with that partition key
+and value as the directory name. For example, a table partitioned over a string column `category` might have a data file
+with location:
+
+```txt
+s3://bucket/ns/table/data/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+The SimpleLocationProvider is enabled for a table by explicitly setting its `write.object-storage.enabled` table
+property to `False`.
+
+### ObjectStoreLocationProvider
+
+When several files are stored under the same prefix, cloud object stores such as S3 often [throttle requests on prefixes](https://repost.aws/knowledge-center/http-5xx-errors-s3),
+resulting in slowdowns.
+
+The ObjectStoreLocationProvider counteracts this by injecting deterministic hashes, in the form of binary directories,
+into file paths, to distribute files across a larger number of object store prefixes.
+
+Paths contain partitions just before the file name and a `data` directory beneath the table's location, in a similar
+manner to the [SimpleLocationProvider](configuration.md#simplelocationprovider). For example, a table partitioned over a string
+column `category` might have a data file with location: (note the additional binary directories)
+
+```txt
+s3://bucket/ns/table/data/0101/0110/1001/10110010/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+The `write.object-storage.enabled` table property determines whether the ObjectStoreLocationProvider is enabled for a
+table. It is used by default.
+
+#### Partition Exclusion
+
+When the ObjectStoreLocationProvider is used, the table property `write.object-storage.partitioned-paths`, which
+defaults to `True`, can be set to `False` as an additional optimization for object stores. This omits partition keys and
+values from data file paths *entirely* to further reduce key size. With it disabled, the same data file above would
+instead be written to: (note the absence of `category=orders`)
+
+```txt
+s3://bucket/ns/table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```

Review Comment:
   nit: what about giving an example of this set to True and another one set to False
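   The two requested examples can also be sketched programmatically. The toy function below derives a deterministic binary prefix from the file name and either keeps or drops the partition segment; the hashing scheme (SHA-256, first 20 bits split 4/4/4/8) is purely illustrative and is NOT PyIceberg's actual algorithm:

   ```python
   import hashlib

   def toy_object_store_path(table_location, file_name, partition=None,
                             partitioned_paths=True):
       """Toy sketch of hash-prefixed data file paths; not PyIceberg's real layout."""
       digest = hashlib.sha256(file_name.encode()).digest()
       bits = "".join(f"{b:08b}" for b in digest[:3])[:20]   # first 20 bits
       prefix = f"{bits[:4]}/{bits[4:8]}/{bits[8:12]}/{bits[12:20]}"
       if partition and partitioned_paths:
           # partitioned paths enabled: .../data/<hash dirs>/<partition>/<file>
           return f"{table_location}/data/{prefix}/{partition}/{file_name}"
       # partitioned paths disabled: the partition segment is omitted entirely
       return f"{table_location}/data/{prefix}-{file_name}"


   print(toy_object_store_path("s3://bucket/ns/table", "f.parquet",
                               "category=orders", partitioned_paths=True))
   print(toy_object_store_path("s3://bucket/ns/table", "f.parquet",
                               "category=orders", partitioned_paths=False))
   ```

   Running the two calls side by side shows exactly the contrast the docs describe: the first path contains `category=orders`, the second does not, while both share the same deterministic hash prefix.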



##########
mkdocs/docs/configuration.md:
##########
@@ -195,6 +198,86 @@ PyIceberg uses [S3FileSystem](https://arrow.apache.org/docs/python/generated/pya
 
 <!-- markdown-link-check-enable-->
 
+## Location Providers
+
+Iceberg works with the concept of a LocationProvider that determines file paths for a table's data. PyIceberg
+introduces a pluggable LocationProvider module; the LocationProvider used may be specified on a per-table basis via
+table properties. PyIceberg defaults to the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider),
+which generates file paths that are optimized for object storage.
+
+### SimpleLocationProvider
+
+The SimpleLocationProvider places file names underneath a `data` directory in the table's storage location. For example,

Review Comment:
   ```suggestion
   The `SimpleLocationProvider` places a table's data file names underneath a `data` directory in the table's storage location. For example,
   ```



##########
mkdocs/docs/configuration.md:
##########
@@ -195,6 +198,85 @@ PyIceberg uses [S3FileSystem](https://arrow.apache.org/docs/python/generated/pya
 
 <!-- markdown-link-check-enable-->
 
+## Location Providers
+
+Iceberg works with the concept of a LocationProvider that determines the file paths for a table's data. PyIceberg
+introduces a pluggable LocationProvider module; the LocationProvider used may be specified on a per-table basis via
+table properties. PyIceberg defaults to the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider),
+which generates file paths that are optimised for object storage.
+
+### SimpleLocationProvider
+
+The SimpleLocationProvider places file names underneath a `data` directory in the table's storage location. For example,
+a non-partitioned table might have a data file with location:
+
+```txt
+s3://bucket/ns/table/data/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+When data is partitioned, the files under a given partition are grouped into a subdirectory, with that partition key
+and value as the directory name. For example, a table partitioned over a string column `category` might have a data file
+with location:
+
+```txt
+s3://bucket/ns/table/data/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+The SimpleLocationProvider is enabled for a table by explicitly setting its `write.object-storage.enabled` table property to `false`.
+
+### ObjectStoreLocationProvider

Review Comment:
   i think we should link to that for additional context



##########
mkdocs/docs/configuration.md:
##########
@@ -195,6 +198,86 @@ PyIceberg uses [S3FileSystem](https://arrow.apache.org/docs/python/generated/pya
 
 <!-- markdown-link-check-enable-->
 
+## Location Providers
+
+Iceberg works with the concept of a LocationProvider that determines file paths for a table's data. PyIceberg
+introduces a pluggable LocationProvider module; the LocationProvider used may be specified on a per-table basis via
+table properties. PyIceberg defaults to the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider),
+which generates file paths that are optimized for object storage.
+
+### SimpleLocationProvider
+
+The SimpleLocationProvider places file names underneath a `data` directory in the table's storage location. For example,

Review Comment:
   I think we should also call out that the "base location" is the `table.metadata.location`
   
   From the spec, https://iceberg.apache.org/spec/#table-metadata-fields
   
   > The table's base location. This is used by writers to determine where to store data files, manifest files, and table metadata files.



##########
mkdocs/docs/configuration.md:
##########
@@ -54,15 +54,18 @@ Iceberg tables support table properties to configure table behavior.
 
 ### Write options
 
-| Key                                    | Options                           | Default | Description                                                                                 |
-| -------------------------------------- | --------------------------------- | ------- | ------------------------------------------------------------------------------------------- |
-| `write.parquet.compression-codec`      | `{uncompressed,zstd,gzip,snappy}` | zstd    | Sets the Parquet compression coddec.                                                        |
-| `write.parquet.compression-level`      | Integer                           | null    | Parquet compression level for the codec. If not set, it is up to PyIceberg                  |
-| `write.parquet.row-group-limit`        | Number of rows                    | 1048576 | The upper bound of the number of entries within a single row group                          |
-| `write.parquet.page-size-bytes`        | Size in bytes                     | 1MB     | Set a target threshold for the approximate encoded size of data pages within a column chunk |
-| `write.parquet.page-row-limit`         | Number of rows                    | 20000   | Set a target threshold for the maximum number of rows within a column chunk                 |
-| `write.parquet.dict-size-bytes`        | Size in bytes                     | 2MB     | Set the dictionary page size limit per row group                                            |
-| `write.metadata.previous-versions-max` | Integer                           | 100     | The max number of previous version metadata files to keep before deleting after commit.     |
+| Key                                      | Options                           | Default | Description                                                                                                                         |
+|------------------------------------------|-----------------------------------|---------|-------------------------------------------------------------------------------------------------------------------------------------|
+| `write.parquet.compression-codec`        | `{uncompressed,zstd,gzip,snappy}` | zstd    | Sets the Parquet compression coddec.                                                                                                |
+| `write.parquet.compression-level`        | Integer                           | null    | Parquet compression level for the codec. If not set, it is up to PyIceberg                                                          |
+| `write.parquet.row-group-limit`          | Number of rows                    | 1048576 | The upper bound of the number of entries within a single row group                                                                  |
+| `write.parquet.page-size-bytes`          | Size in bytes                     | 1MB     | Set a target threshold for the approximate encoded size of data pages within a column chunk                                         |
+| `write.parquet.page-row-limit`           | Number of rows                    | 20000   | Set a target threshold for the maximum number of rows within a column chunk                                                         |
+| `write.parquet.dict-size-bytes`          | Size in bytes                     | 2MB     | Set the dictionary page size limit per row group                                                                                    |
+| `write.metadata.previous-versions-max`   | Integer                           | 100     | The max number of previous version metadata files to keep before deleting after commit.                                             |
+| `write.object-storage.enabled`           | Boolean                           | True    | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths    |
+| `write.object-storage.partitioned-paths` | Boolean                           | True    | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled |
+| `write.py-location-provider.impl`        | String of form `module.ClassName` | null    | Optional, [custom LocationProvider](configuration.md#loading-a-custom-locationprovider) implementation                              |

Review Comment:
   maybe similar to the custom catalog 
   https://py.iceberg.apache.org/configuration/#custom-catalog-implementations



##########
mkdocs/docs/configuration.md:
##########
@@ -195,6 +198,86 @@ PyIceberg uses [S3FileSystem](https://arrow.apache.org/docs/python/generated/pya
 
 <!-- markdown-link-check-enable-->
 
+## Location Providers
+
+Iceberg works with the concept of a LocationProvider that determines file paths for a table's data. PyIceberg
+introduces a pluggable LocationProvider module; the LocationProvider used may be specified on a per-table basis via
+table properties. PyIceberg defaults to the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider),
+which generates file paths that are optimized for object storage.
+
+### SimpleLocationProvider
+
+The SimpleLocationProvider places file names underneath a `data` directory in the table's storage location. For example,
+a non-partitioned table might have a data file with location:
+
+```txt
+s3://bucket/ns/table/data/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+When data is partitioned, files under a given partition are grouped into a subdirectory, with that partition key

Review Comment:
   nit: maybe also when the `hive-style` partition path



##########
mkdocs/docs/configuration.md:
##########
@@ -54,15 +54,18 @@ Iceberg tables support table properties to configure table behavior.
 
 ### Write options
 
-| Key                                    | Options                           | Default | Description                                                                                 |
-| -------------------------------------- | --------------------------------- | ------- | ------------------------------------------------------------------------------------------- |
-| `write.parquet.compression-codec`      | `{uncompressed,zstd,gzip,snappy}` | zstd    | Sets the Parquet compression coddec.                                                        |
-| `write.parquet.compression-level`      | Integer                           | null    | Parquet compression level for the codec. If not set, it is up to PyIceberg                  |
-| `write.parquet.row-group-limit`        | Number of rows                    | 1048576 | The upper bound of the number of entries within a single row group                          |
-| `write.parquet.page-size-bytes`        | Size in bytes                     | 1MB     | Set a target threshold for the approximate encoded size of data pages within a column chunk |
-| `write.parquet.page-row-limit`         | Number of rows                    | 20000   | Set a target threshold for the maximum number of rows within a column chunk                 |
-| `write.parquet.dict-size-bytes`        | Size in bytes                     | 2MB     | Set the dictionary page size limit per row group                                            |
-| `write.metadata.previous-versions-max` | Integer                           | 100     | The max number of previous version metadata files to keep before deleting after commit.     |
+| Key                                      | Options                           | Default | Description                                                                                                                         |
+|------------------------------------------|-----------------------------------|---------|-------------------------------------------------------------------------------------------------------------------------------------|
+| `write.parquet.compression-codec`        | `{uncompressed,zstd,gzip,snappy}` | zstd    | Sets the Parquet compression coddec.                                                                                                |
+| `write.parquet.compression-level`        | Integer                           | null    | Parquet compression level for the codec. If not set, it is up to PyIceberg                                                          |
+| `write.parquet.row-group-limit`          | Number of rows                    | 1048576 | The upper bound of the number of entries within a single row group                                                                  |
+| `write.parquet.page-size-bytes`          | Size in bytes                     | 1MB     | Set a target threshold for the approximate encoded size of data pages within a column chunk                                         |
+| `write.parquet.page-row-limit`           | Number of rows                    | 20000   | Set a target threshold for the maximum number of rows within a column chunk                                                         |
+| `write.parquet.dict-size-bytes`          | Size in bytes                     | 2MB     | Set the dictionary page size limit per row group                                                                                    |
+| `write.metadata.previous-versions-max`   | Integer                           | 100     | The max number of previous version metadata files to keep before deleting after commit.                                             |
+| `write.object-storage.enabled`           | Boolean                           | True    | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths    |

Review Comment:
   maybe we should add a warning or something about how this default differs from the java implementation
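   One possible wording for that callout, using the `!!! warning` admonition syntax the mkdocs site supports (phrasing is only a suggestion):

   ```markdown
   !!! warning
       The default value of `write.object-storage.enabled` in PyIceberg differs from that of
       the Java implementation, so tables written by the two clients may use different file
       path layouts unless the property is set explicitly.
   ```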



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
