This is an automated email from the ASF dual-hosted git repository.
morningman pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 0f00cd71f7a [feat] add parquet_meta tvf doc (#3222)
0f00cd71f7a is described below
commit 0f00cd71f7a8df52307b4cd4e99c1556897cb4ff
Author: Chenjunwei <[email protected]>
AuthorDate: Thu Dec 25 22:04:31 2025 +0800
[feat] add parquet_meta tvf doc (#3222)
## Versions
- [ ] dev
- [✅] 4.x
- [ ] 3.x
- [ ] 2.1
## Languages
- [✅] Chinese
- [ ] English
## Docs Checklist
- [ ] Checked by AI
- [ ] Test Cases Built
---------
Co-authored-by: Mingyu Chen (Rayner) <[email protected]>
---
.../table-valued-functions/parquet-meta.md | 198 +++++++++++++++++++++
.../table-valued-functions/parquet-meta.md | 198 +++++++++++++++++++++
.../table-valued-functions/parquet-meta.md | 198 +++++++++++++++++++++
sidebars.ts | 1 +
.../table-valued-functions/parquet-meta.md | 198 +++++++++++++++++++++
versioned_sidebars/version-4.x-sidebars.json | 1 +
6 files changed, 794 insertions(+)
diff --git a/docs/sql-manual/sql-functions/table-valued-functions/parquet-meta.md b/docs/sql-manual/sql-functions/table-valued-functions/parquet-meta.md
new file mode 100644
index 00000000000..9ad756f1ac2
--- /dev/null
+++ b/docs/sql-manual/sql-functions/table-valued-functions/parquet-meta.md
@@ -0,0 +1,198 @@
+---
+{
+ "title": "PARQUET_META",
+ "language": "en",
+  "description": "The parquet_meta table-valued function (TVF) reads the Footer metadata of Parquet files without scanning data pages, allowing you to quickly view Row Group statistics, the schema, file-level metadata, KV metadata, and Bloom Filter probe results."
+}
+---
+
+The `parquet_meta` table-valued function (TVF) reads the Footer metadata of Parquet files without scanning data pages. It lets you quickly view Row Group statistics, the schema, file-level metadata, KV metadata, and Bloom Filter probe results.
+
+> This is an experimental feature, supported since version 4.0.3.
+
+## Syntax
+
+```sql
+PARQUET_META(
+ "uri" = "<uri>",
+ "mode" = "<mode>",
+ {OptionalParameters},
+ {ConnectionParameters}
+ );
+```
+
+- `uri`
+
+ File path.
+
+- `mode`
+
+  Metadata query mode. Optional; defaults to `parquet_metadata`. See the "Supported Modes" section for the available values.
+
+- `{OptionalParameters}`
+
+  - `column`: Required when `mode` is `parquet_bloom_probe`; specifies the column name to probe.
+  - `value`: Required when `mode` is `parquet_bloom_probe`; specifies the literal value to probe.
+
+- `{ConnectionParameters}`
+
+  Parameters required to access the storage system where the file is located. For details, see:
+
+ * [HDFS](../../../lakehouse/storages/hdfs.md)
+ * [AWS S3](../../../lakehouse/storages/s3.md)
+ * [Google Cloud Storage](../../../lakehouse/storages/gcs.md)
+ * [Azure Blob](../../../lakehouse/storages/azure-blob.md)
+ * [Alibaba Cloud OSS](../../../lakehouse/storages/aliyun-oss.md)
+ * [Tencent Cloud COS](../../../lakehouse/storages/tencent-cos.md)
+ * [Huawei Cloud OBS](../../../lakehouse/storages/huawei-obs.md)
+ * [MinIO](../../../lakehouse/storages/minio.md)
+
+## Supported Modes
+
+### `parquet_metadata`
+
+Default mode.
+
+This mode queries the metadata contained in Parquet files. The metadata reveals various internal details of a Parquet file, such as per-column statistics, which helps determine what kinds of skip operations can be performed on the file and can even give a quick impression of what each column contains. An example query is shown after the field table below.
+
+| Field Name | Type |
+| --- | --- |
+| file_name | STRING |
+| row_group_id | BIGINT |
+| row_group_num_rows | BIGINT |
+| row_group_num_columns | BIGINT |
+| row_group_bytes | BIGINT |
+| column_id | BIGINT |
+| file_offset | BIGINT |
+| num_values | BIGINT |
+| path_in_schema | STRING |
+| type | STRING |
+| stats_min | STRING |
+| stats_max | STRING |
+| stats_null_count | BIGINT |
+| stats_distinct_count | BIGINT |
+| stats_min_value | STRING |
+| stats_max_value | STRING |
+| compression | STRING |
+| encodings | STRING |
+| index_page_offset | BIGINT |
+| dictionary_page_offset | BIGINT |
+| data_page_offset | BIGINT |
+| total_compressed_size | BIGINT |
+| total_uncompressed_size | BIGINT |
+| key_value_metadata | `MAP<VARBINARY, VARBINARY>` |
+| bloom_filter_offset | BIGINT |
+| bloom_filter_length | BIGINT |
+| min_is_exact | BOOLEAN |
+| max_is_exact | BOOLEAN |
+| row_group_compressed_bytes | BIGINT |
+
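+For example, the following query (the file path is a placeholder) lists the per-column min/max statistics and null counts of each Row Group, which is often enough to judge how effective predicate-based pruning would be:
+
+```sql
+-- List per-column min/max statistics and null counts for each Row Group.
+-- "/path/to/test.parquet" is a placeholder path; replace it with a real file.
+SELECT row_group_id, path_in_schema, stats_min_value, stats_max_value, stats_null_count
+FROM parquet_meta(
+    "uri" = "/path/to/test.parquet",
+    "mode" = "parquet_metadata"
+);
+```
+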
+### `parquet_schema`
+
+This mode queries the internal schema contained in Parquet files. Note that this is the schema as stored in the Parquet file metadata.
+
+| Field Name | Type |
+| --- | --- |
+| file_name | VARCHAR |
+| name | VARCHAR |
+| type | VARCHAR |
+| type_length | BIGINT |
+| repetition_type | VARCHAR |
+| num_children | BIGINT |
+| converted_type | VARCHAR |
+| scale | BIGINT |
+| precision | BIGINT |
+| field_id | BIGINT |
+| logical_type | VARCHAR |
+
+### `parquet_file_metadata`
+
+This mode queries file-level metadata, such as the format version and the encryption algorithm used.
+
+| Field Name | Type |
+| --- | --- |
+| file_name | STRING |
+| created_by | STRING |
+| num_rows | BIGINT |
+| num_row_groups | BIGINT |
+| format_version | BIGINT |
+| encryption_algorithm | STRING |
+| footer_signing_key_metadata | STRING |
+
+### `parquet_kv_metadata`
+
+This mode can be used to query custom metadata defined as key-value pairs.
+
+| Field Name | Type |
+| --- | --- |
+| file_name | STRING |
+| key | STRING |
+| value | STRING |
+
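+For example, the following query (the file path is a placeholder) lists the custom key-value pairs written by the file's producer:
+
+```sql
+-- "/path/to/test.parquet" is a placeholder path; replace it with a real file.
+SELECT key, value
+FROM parquet_meta(
+    "uri" = "/path/to/test.parquet",
+    "mode" = "parquet_kv_metadata"
+);
+```
+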
+### `parquet_bloom_probe`
+
+Doris supports using the Bloom filters in Parquet files for data filtering and pruning. This mode checks whether a given value of a specified column may be present in each Row Group according to the Bloom filter.
+
+| Field Name | Type |
+| --- | --- |
+| file_name | STRING |
+| row_group_id | INT |
+| bloom_filter_excludes | INT |
+
+Meaning of `bloom_filter_excludes`:
+
+- `1`: the Bloom Filter determines that this Row Group definitely does not contain the value
+- `0`: the Bloom Filter determines that the Row Group may contain the value
+- `-1`: the file does not have a Bloom Filter
+
+## Examples
+
+- Local file (without scheme)
+
+ ```sql
+ SELECT * FROM parquet_meta(
+ "uri" = "/path/to/test.parquet"
+ );
+ ```
+
+- S3 file (with scheme + storage parameters)
+
+ ```sql
+ SELECT * FROM parquet_meta(
+ "uri" = "s3://bucket/path/test.parquet",
+ "mode" = "parquet_schema",
+ "s3.access_key" = "...",
+ "s3.secret_key" = "...",
+ "s3.endpoint" = "s3.xxx.com",
+ "s3.region" = "us-east-1"
+ );
+ ```
+
+- Using wildcards (glob)
+
+ ```sql
+ SELECT file_name FROM parquet_meta(
+ "uri" = "s3://bucket/path/*meta.parquet",
+ "mode" = "parquet_file_metadata"
+ );
+ ```
+
+- Using `parquet_bloom_probe` mode
+
+ ```sql
+  SELECT * FROM parquet_meta(
+  "uri" = "${basePath}/bloommeta.parquet",
+  "mode" = "parquet_bloom_probe",
+  "column" = "col",
+  "value" = "500",
+  "s3.access_key" = "${ak}",
+  "s3.secret_key" = "${sk}",
+  "s3.endpoint" = "${endpoint}",
+  "s3.region" = "${region}"
+ );
+ ```
+
+## Notes and Limitations
+
+- `parquet_meta` only reads the Parquet Footer metadata, not the data pages, making it suitable for quickly inspecting metadata.
+- Wildcards (such as `*`, `{}`, `[]`) are supported in `uri`. If no matching files are found, an error is reported.
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/table-valued-functions/parquet-meta.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/table-valued-functions/parquet-meta.md
new file mode 100644
index 00000000000..9b0f433d9c6
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/table-valued-functions/parquet-meta.md
@@ -0,0 +1,198 @@
+---
+{
+ "title": "PARQUET_META",
+ "language": "zh-CN",
+  "description": "parquet_meta 表函数(table-valued-function,tvf)可以用于读取 Parquet 文件的 Footer 元数据,不会扫描数据页。它可以快速查看 Row Group 统计、Schema、文件级元数据、KV 元数据以及 Bloom Filter 探测结果。"
+}
+---
+
+`parquet_meta` 表函数(table-valued-function,tvf)可以用于读取 Parquet 文件的 Footer 元数据,不会扫描数据页。它可以快速查看 Row Group 统计、Schema、文件级元数据、KV 元数据以及 Bloom Filter 探测结果。
+
+> 该功能为实验功能,自 4.0.3 版本支持。
+
+## 语法
+
+```sql
+PARQUET_META(
+ "uri" = "<uri>",
+ "mode" = "<mode>",
+ {OptionalParameters},
+ {ConnectionParameters}
+ );
+```
+
+- `uri`
+
+ 文件路径。
+
+- `mode`
+
+ 元数据查询模式。可选,默认为 `parquet_metadata`。取值见"支持的模式"章节。
+
+- `{OptionalParameters}`
+
+ - `column`:当模式为 `parquet_bloom_probe` 时必填,表示要探测的列名。
+ - `value`:当模式为 `parquet_bloom_probe` 时必填,表示要探测的字面值。
+
+- `{ConnectionParameters}`
+
+ 访问文件所在的存储系统所需的参数,具体可参阅:
+
+ * [HDFS](../../../lakehouse/storages/hdfs.md)
+ * [AWS S3](../../../lakehouse/storages/s3.md)
+ * [Google Cloud Storage](../../../lakehouse/storages/gcs.md)
+ * [Azure Blob](../../../lakehouse/storages/azure-blob.md)
+ * [阿里云 OSS](../../../lakehouse/storages/aliyun-oss.md)
+ * [腾讯云 COS](../../../lakehouse/storages/tencent-cos.md)
+ * [华为云 OBS](../../../lakehouse/storages/huawei-obs.md)
+ * [MinIO](../../../lakehouse/storages/minio.md)
+
+## 支持的模式
+
+### `parquet_metadata`
+
+默认模式。
+
+该模式可用于查询 Parquet 文件中包含的元数据。这些元数据会揭示 Parquet 文件的各种内部细节,例如不同列的统计信息。这有助于确定 Parquet 文件中可以进行何种类型的跳过操作,甚至可以快速了解不同列包含的内容。
+
+| 字段名 | 类型 |
+| --- | --- |
+| file_name | STRING |
+| row_group_id | BIGINT |
+| row_group_num_rows | BIGINT |
+| row_group_num_columns | BIGINT |
+| row_group_bytes | BIGINT |
+| column_id | BIGINT |
+| file_offset | BIGINT |
+| num_values | BIGINT |
+| path_in_schema | STRING |
+| type | STRING |
+| stats_min | STRING |
+| stats_max | STRING |
+| stats_null_count | BIGINT |
+| stats_distinct_count | BIGINT |
+| stats_min_value | STRING |
+| stats_max_value | STRING |
+| compression | STRING |
+| encodings | STRING |
+| index_page_offset | BIGINT |
+| dictionary_page_offset | BIGINT |
+| data_page_offset | BIGINT |
+| total_compressed_size | BIGINT |
+| total_uncompressed_size | BIGINT |
+| key_value_metadata | `MAP<VARBINARY, VARBINARY>` |
+| bloom_filter_offset | BIGINT |
+| bloom_filter_length | BIGINT |
+| min_is_exact | BOOLEAN |
+| max_is_exact | BOOLEAN |
+| row_group_compressed_bytes | BIGINT |
+
+### `parquet_schema`
+
+该模式可用于查询 Parquet 文件中包含的内部架构。请注意,这是 Parquet 文件元数据中包含的结构。
+
+| 字段名 | 类型 |
+| --- | --- |
+| file_name | VARCHAR |
+| name | VARCHAR |
+| type | VARCHAR |
+| type_length | BIGINT |
+| repetition_type | VARCHAR |
+| num_children | BIGINT |
+| converted_type | VARCHAR |
+| scale | BIGINT |
+| precision | BIGINT |
+| field_id | BIGINT |
+| logical_type | VARCHAR |
+
+### `parquet_file_metadata`
+
+该模式可用于查询文件级元数据,例如所使用的格式版本和加密算法。
+
+| 字段名 | 类型 |
+| --- | --- |
+| file_name | STRING |
+| created_by | STRING |
+| num_rows | BIGINT |
+| num_row_groups | BIGINT |
+| format_version | BIGINT |
+| encryption_algorithm | STRING |
+| footer_signing_key_metadata | STRING |
+
+### `parquet_kv_metadata`
+
+该模式可用于查询定义为键值对的自定义元数据。
+
+| 字段名 | 类型 |
+| --- | --- |
+| file_name | STRING |
+| key | STRING |
+| value | STRING |
+
+### `parquet_bloom_probe`
+
+Doris 支持使用 Parquet 文件中的布隆过滤器进行数据过滤和裁剪。该模式用于检测指定列和列值是否可以通过布隆过滤器检测。
+
+| 字段名 | 类型 |
+| --- | --- |
+| file_name | STRING |
+| row_group_id | INT |
+| bloom_filter_excludes | INT |
+
+`bloom_filter_excludes` 的含义:
+
+- `1`:Bloom Filter 判断该 Row Group 一定不包含该值
+- `0`:Bloom Filter 判断可能包含该值
+- `-1`:文件没有 Bloom Filter
+
+## 示例
+
+- 本地文件(不带 scheme)
+
+ ```sql
+ SELECT * FROM parquet_meta(
+ "uri" = "/path/to/test.parquet"
+ );
+ ```
+
+- S3 文件(带 scheme + 存储参数)
+
+ ```sql
+ SELECT * FROM parquet_meta(
+ "uri" = "s3://bucket/path/test.parquet",
+ "mode" = "parquet_schema",
+ "s3.access_key" = "...",
+ "s3.secret_key" = "...",
+ "s3.endpoint" = "s3.xxx.com",
+ "s3.region" = "us-east-1"
+ );
+ ```
+
+- 使用通配符(glob)
+
+ ```sql
+ SELECT file_name FROM parquet_meta(
+ "uri" = "s3://bucket/path/*meta.parquet",
+ "mode" = "parquet_file_metadata"
+ );
+ ```
+
+- 使用 `parquet_bloom_probe` 模式
+
+ ```sql
+  SELECT * FROM parquet_meta(
+  "uri" = "${basePath}/bloommeta.parquet",
+  "mode" = "parquet_bloom_probe",
+  "column" = "col",
+  "value" = "500",
+  "s3.access_key" = "${ak}",
+  "s3.secret_key" = "${sk}",
+  "s3.endpoint" = "${endpoint}",
+  "s3.region" = "${region}"
+ );
+ ```
+
+## 说明与限制
+
+- `parquet_meta` 只读取 Parquet Footer 元数据,不读取数据页,适合快速查看元信息。
+- 支持通配符(如 `*`、`{}`、`[]`),若无匹配文件则会报错。
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/sql-functions/table-valued-functions/parquet-meta.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/sql-functions/table-valued-functions/parquet-meta.md
new file mode 100644
index 00000000000..9b0f433d9c6
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/sql-functions/table-valued-functions/parquet-meta.md
@@ -0,0 +1,198 @@
+---
+{
+ "title": "PARQUET_META",
+ "language": "zh-CN",
+  "description": "parquet_meta 表函数(table-valued-function,tvf)可以用于读取 Parquet 文件的 Footer 元数据,不会扫描数据页。它可以快速查看 Row Group 统计、Schema、文件级元数据、KV 元数据以及 Bloom Filter 探测结果。"
+}
+---
+
+`parquet_meta` 表函数(table-valued-function,tvf)可以用于读取 Parquet 文件的 Footer 元数据,不会扫描数据页。它可以快速查看 Row Group 统计、Schema、文件级元数据、KV 元数据以及 Bloom Filter 探测结果。
+
+> 该功能为实验功能,自 4.0.3 版本支持。
+
+## 语法
+
+```sql
+PARQUET_META(
+ "uri" = "<uri>",
+ "mode" = "<mode>",
+ {OptionalParameters},
+ {ConnectionParameters}
+ );
+```
+
+- `uri`
+
+ 文件路径。
+
+- `mode`
+
+ 元数据查询模式。可选,默认为 `parquet_metadata`。取值见"支持的模式"章节。
+
+- `{OptionalParameters}`
+
+ - `column`:当模式为 `parquet_bloom_probe` 时必填,表示要探测的列名。
+ - `value`:当模式为 `parquet_bloom_probe` 时必填,表示要探测的字面值。
+
+- `{ConnectionParameters}`
+
+ 访问文件所在的存储系统所需的参数,具体可参阅:
+
+ * [HDFS](../../../lakehouse/storages/hdfs.md)
+ * [AWS S3](../../../lakehouse/storages/s3.md)
+ * [Google Cloud Storage](../../../lakehouse/storages/gcs.md)
+ * [Azure Blob](../../../lakehouse/storages/azure-blob.md)
+ * [阿里云 OSS](../../../lakehouse/storages/aliyun-oss.md)
+ * [腾讯云 COS](../../../lakehouse/storages/tencent-cos.md)
+ * [华为云 OBS](../../../lakehouse/storages/huawei-obs.md)
+ * [MinIO](../../../lakehouse/storages/minio.md)
+
+## 支持的模式
+
+### `parquet_metadata`
+
+默认模式。
+
+该模式可用于查询 Parquet 文件中包含的元数据。这些元数据会揭示 Parquet 文件的各种内部细节,例如不同列的统计信息。这有助于确定 Parquet 文件中可以进行何种类型的跳过操作,甚至可以快速了解不同列包含的内容。
+
+| 字段名 | 类型 |
+| --- | --- |
+| file_name | STRING |
+| row_group_id | BIGINT |
+| row_group_num_rows | BIGINT |
+| row_group_num_columns | BIGINT |
+| row_group_bytes | BIGINT |
+| column_id | BIGINT |
+| file_offset | BIGINT |
+| num_values | BIGINT |
+| path_in_schema | STRING |
+| type | STRING |
+| stats_min | STRING |
+| stats_max | STRING |
+| stats_null_count | BIGINT |
+| stats_distinct_count | BIGINT |
+| stats_min_value | STRING |
+| stats_max_value | STRING |
+| compression | STRING |
+| encodings | STRING |
+| index_page_offset | BIGINT |
+| dictionary_page_offset | BIGINT |
+| data_page_offset | BIGINT |
+| total_compressed_size | BIGINT |
+| total_uncompressed_size | BIGINT |
+| key_value_metadata | `MAP<VARBINARY, VARBINARY>` |
+| bloom_filter_offset | BIGINT |
+| bloom_filter_length | BIGINT |
+| min_is_exact | BOOLEAN |
+| max_is_exact | BOOLEAN |
+| row_group_compressed_bytes | BIGINT |
+
+### `parquet_schema`
+
+该模式可用于查询 Parquet 文件中包含的内部架构。请注意,这是 Parquet 文件元数据中包含的结构。
+
+| 字段名 | 类型 |
+| --- | --- |
+| file_name | VARCHAR |
+| name | VARCHAR |
+| type | VARCHAR |
+| type_length | BIGINT |
+| repetition_type | VARCHAR |
+| num_children | BIGINT |
+| converted_type | VARCHAR |
+| scale | BIGINT |
+| precision | BIGINT |
+| field_id | BIGINT |
+| logical_type | VARCHAR |
+
+### `parquet_file_metadata`
+
+该模式可用于查询文件级元数据,例如所使用的格式版本和加密算法。
+
+| 字段名 | 类型 |
+| --- | --- |
+| file_name | STRING |
+| created_by | STRING |
+| num_rows | BIGINT |
+| num_row_groups | BIGINT |
+| format_version | BIGINT |
+| encryption_algorithm | STRING |
+| footer_signing_key_metadata | STRING |
+
+### `parquet_kv_metadata`
+
+该模式可用于查询定义为键值对的自定义元数据。
+
+| 字段名 | 类型 |
+| --- | --- |
+| file_name | STRING |
+| key | STRING |
+| value | STRING |
+
+### `parquet_bloom_probe`
+
+Doris 支持使用 Parquet 文件中的布隆过滤器进行数据过滤和裁剪。该模式用于检测指定列和列值是否可以通过布隆过滤器检测。
+
+| 字段名 | 类型 |
+| --- | --- |
+| file_name | STRING |
+| row_group_id | INT |
+| bloom_filter_excludes | INT |
+
+`bloom_filter_excludes` 的含义:
+
+- `1`:Bloom Filter 判断该 Row Group 一定不包含该值
+- `0`:Bloom Filter 判断可能包含该值
+- `-1`:文件没有 Bloom Filter
+
+## 示例
+
+- 本地文件(不带 scheme)
+
+ ```sql
+ SELECT * FROM parquet_meta(
+ "uri" = "/path/to/test.parquet"
+ );
+ ```
+
+- S3 文件(带 scheme + 存储参数)
+
+ ```sql
+ SELECT * FROM parquet_meta(
+ "uri" = "s3://bucket/path/test.parquet",
+ "mode" = "parquet_schema",
+ "s3.access_key" = "...",
+ "s3.secret_key" = "...",
+ "s3.endpoint" = "s3.xxx.com",
+ "s3.region" = "us-east-1"
+ );
+ ```
+
+- 使用通配符(glob)
+
+ ```sql
+ SELECT file_name FROM parquet_meta(
+ "uri" = "s3://bucket/path/*meta.parquet",
+ "mode" = "parquet_file_metadata"
+ );
+ ```
+
+- 使用 `parquet_bloom_probe` 模式
+
+ ```sql
+  SELECT * FROM parquet_meta(
+  "uri" = "${basePath}/bloommeta.parquet",
+  "mode" = "parquet_bloom_probe",
+  "column" = "col",
+  "value" = "500",
+  "s3.access_key" = "${ak}",
+  "s3.secret_key" = "${sk}",
+  "s3.endpoint" = "${endpoint}",
+  "s3.region" = "${region}"
+ );
+ ```
+
+## 说明与限制
+
+- `parquet_meta` 只读取 Parquet Footer 元数据,不读取数据页,适合快速查看元信息。
+- 支持通配符(如 `*`、`{}`、`[]`),若无匹配文件则会报错。
diff --git a/sidebars.ts b/sidebars.ts
index b845e4013c7..d179dcdc322 100644
--- a/sidebars.ts
+++ b/sidebars.ts
@@ -2012,6 +2012,7 @@ const sidebars: SidebarsConfig = {
'sql-manual/sql-functions/table-valued-functions/local',
'sql-manual/sql-functions/table-valued-functions/mv_infos',
'sql-manual/sql-functions/table-valued-functions/numbers',
+'sql-manual/sql-functions/table-valued-functions/parquet-meta',
'sql-manual/sql-functions/table-valued-functions/partition-values',
'sql-manual/sql-functions/table-valued-functions/partitions',
'sql-manual/sql-functions/table-valued-functions/query',
diff --git a/versioned_docs/version-4.x/sql-manual/sql-functions/table-valued-functions/parquet-meta.md b/versioned_docs/version-4.x/sql-manual/sql-functions/table-valued-functions/parquet-meta.md
new file mode 100644
index 00000000000..9ad756f1ac2
--- /dev/null
+++ b/versioned_docs/version-4.x/sql-manual/sql-functions/table-valued-functions/parquet-meta.md
@@ -0,0 +1,198 @@
+---
+{
+ "title": "PARQUET_META",
+ "language": "en",
+  "description": "The parquet_meta table-valued function (TVF) reads the Footer metadata of Parquet files without scanning data pages, allowing you to quickly view Row Group statistics, the schema, file-level metadata, KV metadata, and Bloom Filter probe results."
+}
+---
+
+The `parquet_meta` table-valued function (TVF) reads the Footer metadata of Parquet files without scanning data pages. It lets you quickly view Row Group statistics, the schema, file-level metadata, KV metadata, and Bloom Filter probe results.
+
+> This is an experimental feature, supported since version 4.0.3.
+
+## Syntax
+
+```sql
+PARQUET_META(
+ "uri" = "<uri>",
+ "mode" = "<mode>",
+ {OptionalParameters},
+ {ConnectionParameters}
+ );
+```
+
+- `uri`
+
+ File path.
+
+- `mode`
+
+  Metadata query mode. Optional; defaults to `parquet_metadata`. See the "Supported Modes" section for the available values.
+
+- `{OptionalParameters}`
+
+  - `column`: Required when `mode` is `parquet_bloom_probe`; specifies the column name to probe.
+  - `value`: Required when `mode` is `parquet_bloom_probe`; specifies the literal value to probe.
+
+- `{ConnectionParameters}`
+
+  Parameters required to access the storage system where the file is located. For details, see:
+
+ * [HDFS](../../../lakehouse/storages/hdfs.md)
+ * [AWS S3](../../../lakehouse/storages/s3.md)
+ * [Google Cloud Storage](../../../lakehouse/storages/gcs.md)
+ * [Azure Blob](../../../lakehouse/storages/azure-blob.md)
+ * [Alibaba Cloud OSS](../../../lakehouse/storages/aliyun-oss.md)
+ * [Tencent Cloud COS](../../../lakehouse/storages/tencent-cos.md)
+ * [Huawei Cloud OBS](../../../lakehouse/storages/huawei-obs.md)
+ * [MinIO](../../../lakehouse/storages/minio.md)
+
+## Supported Modes
+
+### `parquet_metadata`
+
+Default mode.
+
+This mode queries the metadata contained in Parquet files. The metadata reveals various internal details of a Parquet file, such as per-column statistics, which helps determine what kinds of skip operations can be performed on the file and can even give a quick impression of what each column contains. An example query is shown after the field table below.
+
+| Field Name | Type |
+| --- | --- |
+| file_name | STRING |
+| row_group_id | BIGINT |
+| row_group_num_rows | BIGINT |
+| row_group_num_columns | BIGINT |
+| row_group_bytes | BIGINT |
+| column_id | BIGINT |
+| file_offset | BIGINT |
+| num_values | BIGINT |
+| path_in_schema | STRING |
+| type | STRING |
+| stats_min | STRING |
+| stats_max | STRING |
+| stats_null_count | BIGINT |
+| stats_distinct_count | BIGINT |
+| stats_min_value | STRING |
+| stats_max_value | STRING |
+| compression | STRING |
+| encodings | STRING |
+| index_page_offset | BIGINT |
+| dictionary_page_offset | BIGINT |
+| data_page_offset | BIGINT |
+| total_compressed_size | BIGINT |
+| total_uncompressed_size | BIGINT |
+| key_value_metadata | `MAP<VARBINARY, VARBINARY>` |
+| bloom_filter_offset | BIGINT |
+| bloom_filter_length | BIGINT |
+| min_is_exact | BOOLEAN |
+| max_is_exact | BOOLEAN |
+| row_group_compressed_bytes | BIGINT |
+
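+For example, the following query (the file path is a placeholder) lists the per-column min/max statistics and null counts of each Row Group, which is often enough to judge how effective predicate-based pruning would be:
+
+```sql
+-- List per-column min/max statistics and null counts for each Row Group.
+-- "/path/to/test.parquet" is a placeholder path; replace it with a real file.
+SELECT row_group_id, path_in_schema, stats_min_value, stats_max_value, stats_null_count
+FROM parquet_meta(
+    "uri" = "/path/to/test.parquet",
+    "mode" = "parquet_metadata"
+);
+```
+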
+### `parquet_schema`
+
+This mode queries the internal schema contained in Parquet files. Note that this is the schema as stored in the Parquet file metadata.
+
+| Field Name | Type |
+| --- | --- |
+| file_name | VARCHAR |
+| name | VARCHAR |
+| type | VARCHAR |
+| type_length | BIGINT |
+| repetition_type | VARCHAR |
+| num_children | BIGINT |
+| converted_type | VARCHAR |
+| scale | BIGINT |
+| precision | BIGINT |
+| field_id | BIGINT |
+| logical_type | VARCHAR |
+
+### `parquet_file_metadata`
+
+This mode queries file-level metadata, such as the format version and the encryption algorithm used.
+
+| Field Name | Type |
+| --- | --- |
+| file_name | STRING |
+| created_by | STRING |
+| num_rows | BIGINT |
+| num_row_groups | BIGINT |
+| format_version | BIGINT |
+| encryption_algorithm | STRING |
+| footer_signing_key_metadata | STRING |
+
+### `parquet_kv_metadata`
+
+This mode can be used to query custom metadata defined as key-value pairs.
+
+| Field Name | Type |
+| --- | --- |
+| file_name | STRING |
+| key | STRING |
+| value | STRING |
+
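+For example, the following query (the file path is a placeholder) lists the custom key-value pairs written by the file's producer:
+
+```sql
+-- "/path/to/test.parquet" is a placeholder path; replace it with a real file.
+SELECT key, value
+FROM parquet_meta(
+    "uri" = "/path/to/test.parquet",
+    "mode" = "parquet_kv_metadata"
+);
+```
+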
+### `parquet_bloom_probe`
+
+Doris supports using the Bloom filters in Parquet files for data filtering and pruning. This mode checks whether a given value of a specified column may be present in each Row Group according to the Bloom filter.
+
+| Field Name | Type |
+| --- | --- |
+| file_name | STRING |
+| row_group_id | INT |
+| bloom_filter_excludes | INT |
+
+Meaning of `bloom_filter_excludes`:
+
+- `1`: the Bloom Filter determines that this Row Group definitely does not contain the value
+- `0`: the Bloom Filter determines that the Row Group may contain the value
+- `-1`: the file does not have a Bloom Filter
+
+## Examples
+
+- Local file (without scheme)
+
+ ```sql
+ SELECT * FROM parquet_meta(
+ "uri" = "/path/to/test.parquet"
+ );
+ ```
+
+- S3 file (with scheme + storage parameters)
+
+ ```sql
+ SELECT * FROM parquet_meta(
+ "uri" = "s3://bucket/path/test.parquet",
+ "mode" = "parquet_schema",
+ "s3.access_key" = "...",
+ "s3.secret_key" = "...",
+ "s3.endpoint" = "s3.xxx.com",
+ "s3.region" = "us-east-1"
+ );
+ ```
+
+- Using wildcards (glob)
+
+ ```sql
+ SELECT file_name FROM parquet_meta(
+ "uri" = "s3://bucket/path/*meta.parquet",
+ "mode" = "parquet_file_metadata"
+ );
+ ```
+
+- Using `parquet_bloom_probe` mode
+
+ ```sql
+  SELECT * FROM parquet_meta(
+  "uri" = "${basePath}/bloommeta.parquet",
+  "mode" = "parquet_bloom_probe",
+  "column" = "col",
+  "value" = "500",
+  "s3.access_key" = "${ak}",
+  "s3.secret_key" = "${sk}",
+  "s3.endpoint" = "${endpoint}",
+  "s3.region" = "${region}"
+ );
+ ```
+
+## Notes and Limitations
+
+- `parquet_meta` only reads the Parquet Footer metadata, not the data pages, making it suitable for quickly inspecting metadata.
+- Wildcards (such as `*`, `{}`, `[]`) are supported in `uri`. If no matching files are found, an error is reported.
diff --git a/versioned_sidebars/version-4.x-sidebars.json
b/versioned_sidebars/version-4.x-sidebars.json
index 460d4ea50f3..5517aec49e7 100644
--- a/versioned_sidebars/version-4.x-sidebars.json
+++ b/versioned_sidebars/version-4.x-sidebars.json
@@ -2030,6 +2030,7 @@
"sql-manual/sql-functions/table-valued-functions/local",
"sql-manual/sql-functions/table-valued-functions/mv_infos",
"sql-manual/sql-functions/table-valued-functions/numbers",
+"sql-manual/sql-functions/table-valued-functions/parquet-meta",
"sql-manual/sql-functions/table-valued-functions/partition-values",
"sql-manual/sql-functions/table-valued-functions/partitions",
"sql-manual/sql-functions/table-valued-functions/query",
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]