(doris-website) branch master updated: [doc](catalog)add parquet orc reader parameter (#2508)

morningman Thu, 19 Jun 2025 04:32:47 -0700

This is an automated email from the ASF dual-hosted git repository.

morningman pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git



The following commit(s) were added to refs/heads/master by this push:
     new c6fc0d822cf [doc](catalog)add parquet orc reader parameter (#2508)
c6fc0d822cf is described below

commit c6fc0d822cfd27f786db467bb6dc7019ca7fe440
Author: daidai <[email protected]>
AuthorDate: Thu Jun 19 19:31:02 2025 +0800

    [doc](catalog)add parquet orc reader parameter (#2508)
    
    ## Versions
    
    - [x] dev
    - [ ] 3.0
    - [ ] 2.1
    - [ ] 2.0
    
    ## Languages
    
    - [x] Chinese
    - [x] English
    
    ## Docs Checklist
    
    - [ ] Checked by AI
    - [ ] Test Cases Built
    
    ---------
    
    Co-authored-by: Mingyu Chen (Rayner) <[email protected]>
---
 docs/lakehouse/file-formats/orc.md                 | 40 +++++++++++++++++++++-
 docs/lakehouse/file-formats/parquet.md             | 30 ++++++++++++++++
 .../current/lakehouse/file-formats/orc.md          | 37 ++++++++++++++++++++
 .../current/lakehouse/file-formats/parquet.md      | 29 ++++++++++++++++
 4 files changed, 135 insertions(+), 1 deletion(-)

diff --git a/docs/lakehouse/file-formats/orc.md b/docs/lakehouse/file-formats/orc.md
index 169bd51f0a5..4e2f71e45aa 100644
--- a/docs/lakehouse/file-formats/orc.md
+++ b/docs/lakehouse/file-formats/orc.md
@@ -39,4 +39,42 @@ This document introduces the support for reading and writing ORC file formats in
 * lz4
 * zstd
 * lzo
-* zlib
\ No newline at end of file
+* zlib
+
+## Parameters
+
+### Session Variables
+
+* `enable_orc_lazy_mat` (2.1+, 3.0+)
+
+    Controls whether the ORC Reader enables lazy materialization. Default is true.
+
+* `hive_orc_use_column_names` (2.1.6+, 3.0.3+)
+
+    When reading ORC data from Hive tables, Doris will, by default, read data from columns in the ORC file that have the same name as the columns in the Hive table. When this variable is set to `false`, Doris will read data from the ORC file based on the column order in the Hive table, regardless of column names. This is similar to the `orc.force.positional.evolution` variable in Hive. This parameter only applies to top-level column names and has no effect on columns inside Structs.
+
+* `orc_tiny_stripe_threshold_bytes` (2.1.8+, 3.0.3+)
+
+    In ORC files, a Stripe whose byte size is less than `orc_tiny_stripe_threshold_bytes` is considered a Tiny Stripe. Multiple consecutive Tiny Stripes are read at once to reduce the number of IO operations. To disable this optimization, set this value to 0. Default is 8M.
+
+* `orc_once_max_read_bytes` (2.1.8+, 3.0.3+)
+
+    When using Tiny Stripe read optimization, multiple Tiny Stripes are merged into a single IO operation. This parameter controls the maximum number of bytes for each IO request. Do not set this value smaller than `orc_tiny_stripe_threshold_bytes`. Default is 8M.
+
+* `orc_max_merge_distance_bytes` (2.1.8+, 3.0.3+)
+
+    When using Tiny Stripe read optimization, the two Tiny Stripes to be read may not be consecutive. If the distance between them is greater than this parameter, they will not be merged into a single IO operation. Default is 1M.
+
+* `orc_tiny_stripe_amplification_factor` (3.1.0+)
+
+    In Tiny Stripe optimization, if there are many columns in the ORC file but only a few are used in the query, Tiny Stripe optimization may cause severe read amplification. When the proportion of actually read bytes to the entire Stripe exceeds this parameter, Tiny Stripe read optimization will be used. The default value is 0.4, and the minimum value is 0.
+
+* `check_orc_init_sargs_success` (3.1.0+)
+
+    Checks whether ORC predicate pushdown is successful, used for debugging. Default is false.
+
+### BE Configuration
+
+* `orc_natural_read_size_mb` (2.1+, 3.0+)
+
+    The maximum number of bytes that the ORC Reader reads at one time. Default is 8 MB.
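
The Tiny Stripe parameters documented in the hunk above interact with each other, so a small model may help when choosing values. The following Python is only a sketch of the behavior as the text describes it, not the Doris BE implementation: the `Stripe` class, the `plan_stripe_reads` function, and its keyword arguments are hypothetical names, and the defaults simply mirror the documented 8M / 8M / 1M values. It assumes a Tiny Stripe is merged into the current IO range when the gap to that range is at most the merge distance and the merged range stays within the per-IO limit.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass(frozen=True)
class Stripe:
    """Hypothetical stand-in for an ORC stripe: byte offset and byte length."""
    offset: int
    length: int


def plan_stripe_reads(stripes,
                      tiny_threshold_bytes=8 * MB,      # models orc_tiny_stripe_threshold_bytes
                      once_max_read_bytes=8 * MB,       # models orc_once_max_read_bytes
                      max_merge_distance_bytes=1 * MB): # models orc_max_merge_distance_bytes
    """Return a list of (offset, length) IO ranges for the given stripes."""
    ranges = []
    current = None  # [start, end) of the merged range being built, or None
    for s in sorted(stripes, key=lambda st: st.offset):
        end = s.offset + s.length
        # A threshold of 0 means no stripe is ever tiny, which disables merging.
        is_tiny = s.length < tiny_threshold_bytes
        if (current is not None and is_tiny
                and s.offset - current[1] <= max_merge_distance_bytes
                and end - current[0] <= once_max_read_bytes):
            current[1] = end  # extend the merged range to cover this stripe
            continue
        if current is not None:
            ranges.append((current[0], current[1] - current[0]))
            current = None
        if is_tiny:
            current = [s.offset, end]  # start a new merged range
        else:
            ranges.append((s.offset, s.length))  # large stripes are read alone
    if current is not None:
        ranges.append((current[0], current[1] - current[0]))
    return ranges


stripes = [
    Stripe(0, 1 * MB),
    Stripe(1 * MB, 1 * MB),   # adjacent to the first stripe: merged
    Stripe(4 * MB, 1 * MB),   # 2 MB gap exceeds the 1 MB merge distance
    Stripe(5 * MB, 10 * MB),  # not tiny: read on its own
]
merged = plan_stripe_reads(stripes)
unmerged = plan_stripe_reads(stripes, tiny_threshold_bytes=0)
```

In this model, four stripes become three IO requests with the defaults and four when the threshold is 0. The sketch does not model `orc_tiny_stripe_amplification_factor`, which depends on which columns a query actually reads.
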
diff --git a/docs/lakehouse/file-formats/parquet.md b/docs/lakehouse/file-formats/parquet.md
index 370184376c2..463cedbebb0 100644
--- a/docs/lakehouse/file-formats/parquet.md
+++ b/docs/lakehouse/file-formats/parquet.md
@@ -42,3 +42,33 @@ This document introduces the support for reading and writing Parquet file format
 * lzo
 * brotli
 
+## Parameters
+
+### Session Variables
+
+* `enable_parquet_lazy_mat` (2.1+, 3.0+)
+
+    Controls whether the Parquet Reader enables lazy materialization. Default is true.
+
+* `hive_parquet_use_column_names` (2.1.6+, 3.0.3+)
+
+    When reading Parquet data from Hive tables, Doris will, by default, read data from columns in the Parquet file that have the same name as the columns in the Hive table. When this variable is set to `false`, Doris will read data from the Parquet file based on the column order in the Hive table, regardless of column names. This is similar to the `parquet.column.index.access` variable in Hive. This parameter only applies to top-level column names and is ineffective for columns ins [...]
+
+### BE Configuration
+
+* `enable_parquet_page_index` (2.1.5+, 3.0+)
+
+    Determines whether the Parquet Reader uses the Page Index to filter data. This is for debugging only, in case the Page Index sometimes filters data incorrectly. Default is false.
+
+* `parquet_header_max_size_mb` (2.1+, 3.0+)
+
+    The maximum buffer size allocated when reading the Parquet Page header. Default is 1M.
+
+* `parquet_rowgroup_max_buffer_mb` (2.1+, 3.0+)
+
+    The maximum buffer size allocated when reading a Parquet Row Group. Default is 128M.
+
+* `parquet_column_max_buffer_mb` (2.1+, 3.0+)
+
+    The maximum buffer size allocated when reading a Column within a Parquet Row Group. Default is 8M.
+
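
The difference between name-based and positional column matching, as described for `hive_orc_use_column_names` and `hive_parquet_use_column_names`, can be illustrated with a short Python sketch. This is not Doris code: `map_top_level_columns` is a hypothetical helper, it uses exact string matching as a simplification, and it only models top-level columns, which is the scope the documentation gives for these variables.

```python
def map_top_level_columns(table_columns, file_columns, use_column_names=True):
    """Map each table column to the index of the file column it is read from.

    Returns a dict {table column name: file column index or None}.
    None means no matching file column exists in this simplified model.
    """
    if use_column_names:
        # Name-based matching: look up the file column with the same name.
        index_by_name = {name: i for i, name in enumerate(file_columns)}
        return {col: index_by_name.get(col) for col in table_columns}
    # Positional matching: the i-th table column reads the i-th file column,
    # regardless of what the file column is called.
    return {
        col: (i if i < len(file_columns) else None)
        for i, col in enumerate(table_columns)
    }


# A Hive table whose columns were renamed or reordered after the file was written.
table_columns = ["id", "name"]
file_columns = ["name", "id"]

by_name = map_top_level_columns(table_columns, file_columns, use_column_names=True)
by_position = map_top_level_columns(table_columns, file_columns, use_column_names=False)
```

With name-based matching, `id` is read from the second file column; with positional matching it is read from the first, which is only correct if the file was written in the table's column order.
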
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/file-formats/orc.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/file-formats/orc.md
index f17a4a78cc4..a4a83ac27e5 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/file-formats/orc.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/file-formats/orc.md
@@ -51,3 +51,40 @@ under the License.
 
 * zlib
 
+## Related Parameters
+
+### Session Variables
+
+* `enable_orc_lazy_mat` (2.1+, 3.0+)
+
+    Controls whether the ORC Reader enables lazy materialization. Default is true.
+
+* `hive_orc_use_column_names` (2.1.6+, 3.0.3+)
+
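+    When reading ORC data from Hive tables, Doris by default uses the Hive table's column names to find the columns with the same names in the ORC file and reads data from them. When this variable is `false`, Doris reads data from the ORC file based on the column order in the Hive table, regardless of column names. This is similar to the `orc.force.positional.evolution` variable in Hive. This parameter only applies to top-level column names and has no effect inside Structs.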
+
+* `orc_tiny_stripe_threshold_bytes` (2.1.8+, 3.0.3+) 
+
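+    In an ORC file, if the byte size of a Stripe is less than `orc_tiny_stripe_threshold_bytes`, the Stripe is considered a Tiny Stripe. Multiple consecutive Tiny Stripes are read at once to reduce the number of IO operations. If you do not want to use this optimization, set this value to 0. Default is 8M.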
+
+* `orc_once_max_read_bytes` (2.1.8+, 3.0.3+) 
+
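+    When using Tiny Stripe read optimization, multiple Tiny Stripes are merged into a single IO. This parameter controls the maximum number of bytes for each IO request. Do not set this value smaller than `orc_tiny_stripe_threshold_bytes`. Default is 8M.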
+
+* `orc_max_merge_distance_bytes` (2.1.8+, 3.0.3+) 
+
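+    When using Tiny Stripe read optimization, the two Tiny Stripes to be read are not necessarily consecutive. If the distance between two Tiny Stripes is greater than this parameter, they are not merged into a single IO. Default is 1M.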
+
+* `orc_tiny_stripe_amplification_factor` (3.1.0+)
+
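+    In Tiny Stripe optimization, if the ORC file has many columns but the query uses only a few of them, Tiny Stripe optimization can cause severe read amplification. When the ratio of actually read bytes to the entire Stripe is greater than this parameter, Tiny Stripe read optimization is used. The default value is 0.4, and the minimum value is 0.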
+
+* `check_orc_init_sargs_success` (3.1.0+)
+
+    Checks whether ORC predicate pushdown is successful, used for debugging. Default is false.
+
+### BE Configuration Items
+
+* `orc_natural_read_size_mb` (2.1+, 3.0+)
+
+    The maximum number of bytes that the ORC Reader reads at one time. Default is 8 MB.
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/file-formats/parquet.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/file-formats/parquet.md
index 486a965e936..76e92fe626c 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/file-formats/parquet.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/file-formats/parquet.md
@@ -52,3 +52,32 @@ under the License.
 
 * brotli
 
+## Related Parameters
+
+### Session Variables
+
+* `enable_parquet_lazy_mat` (2.1+, 3.0+)
+
+    Controls whether the Parquet Reader enables lazy materialization. Default is true.
+
+* `hive_parquet_use_column_names` (2.1.6+, 3.0.3+)
+
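+    When reading Parquet data from Hive tables, Doris by default uses the Hive table's column names to find the columns with the same names in the Parquet file and reads data from them. When this variable is `false`, Doris reads data from the Parquet file based on the column order in the Hive table, regardless of column names. This is similar to the `parquet.column.index.access` variable in Hive. This parameter only applies to top-level column names and has no effect inside Structs.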
+
+### BE Configuration
+
+* `enable_parquet_page_index` (2.1.5+, 3.0+)
+
+    Whether the Parquet Reader uses the Page Index to filter data. This is for debugging only, in case the Page Index sometimes filters data incorrectly. Default is false.
+
+* `parquet_header_max_size_mb` (2.1+, 3.0+)
+
+    The maximum buffer size allocated when reading a Parquet Page header. Default is 1M.
+
+* `parquet_rowgroup_max_buffer_mb` (2.1+, 3.0+)
+
+    The maximum buffer size allocated when reading a Parquet Row Group. Default is 128M.
+
+* `parquet_column_max_buffer_mb` (2.1+, 3.0+)
+
+    The maximum buffer size allocated when reading a Column within a Parquet Row Group. Default is 8M.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
