This is an automated email from the ASF dual-hosted git repository.

liaoxin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git


The following commit(s) were added to refs/heads/master by this push:
     new 48d2ec015be [doc](load) optimize Pre-filtering doc (#2214)
48d2ec015be is described below

commit 48d2ec015be758292f2741f46c13d771f1006be7
Author: hui lai <1353307...@qq.com>
AuthorDate: Thu Mar 20 19:19:02 2025 +0800

    [doc](load) optimize Pre-filtering doc (#2214)
---
 docs/data-operate/import/load-data-convert.md      | 71 ++++++++++++++++++++-
 .../data-operate/import/load-data-convert.md       | 72 +++++++++++++++++++++-
 .../data-operate/import/load-data-convert.md       | 72 +++++++++++++++++++++-
 .../data-operate/import/load-data-convert.md       | 72 +++++++++++++++++++++-
 .../data-operate/import/load-data-convert.md       | 71 ++++++++++++++++++++-
 .../data-operate/import/load-data-convert.md       | 71 ++++++++++++++++++++-
 6 files changed, 423 insertions(+), 6 deletions(-)

diff --git a/docs/data-operate/import/load-data-convert.md b/docs/data-operate/import/load-data-convert.md
index d7b29d696ab..386b2ad5b6a 100644
--- a/docs/data-operate/import/load-data-convert.md
+++ b/docs/data-operate/import/load-data-convert.md
@@ -619,7 +619,18 @@ Scenarios where filtering is needed before column mapping and transformation, al
 
 For example, source data contains multiple tables' data (or multiple tables' data is written to the same Kafka message queue). Each row of data has a column indicating which table the data belongs to. Users can use pre-filtering conditions to filter out the corresponding table data for loading.
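+
+As a sketch of this scenario (the table name, column names, and Kafka settings below are hypothetical, not from this document), a Routine Load job could keep only the rows tagged for one target table:
+
+```sql
+-- Hypothetical sketch: the Kafka topic mixes rows for several tables, and the
+-- first CSV field (tbl) names the table each row belongs to.
+CREATE ROUTINE LOAD example_db.orders_only_load ON orders
+COLUMNS TERMINATED BY ",",
+-- tbl is read from the source but not mapped into the target table.
+COLUMNS(tbl, k1, k2),
+-- Keep only the rows that belong to the "orders" table.
+PRECEDING FILTER tbl = "orders"
+FROM KAFKA (...);
+```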
 
-### Example
+Pre-filtering has the following limitations:
+- Column filtering restrictions
+
+Pre-filtering can only filter simple, standalone columns in the column list; it cannot filter columns derived from expressions. For example, when the column mapping is `(a, tmp, b = tmp + 1)`, column `b` cannot be used as a filter condition (see the sketch after this list).
+
+- Data processing restrictions
+
+Pre-filtering occurs before data transformation and compares raw data values, which are treated as strings. For example, data such as `\N` is compared directly as the string `\N` rather than being converted to NULL first.
+
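+To make both limitations concrete, here is a minimal, hypothetical Routine Load sketch (the table and column names are placeholders, not from this document):
+
+```sql
+-- Hypothetical sketch: which columns a PRECEDING FILTER may reference.
+CREATE ROUTINE LOAD example_db.example_limit_load ON example_tbl
+COLUMNS TERMINATED BY ",",
+-- a and tmp are read from the source; b is derived from an expression.
+COLUMNS(a, tmp, b = tmp + 1),
+-- Valid: tmp is a simple column read directly from the source data.
+PRECEDING FILTER tmp > 0
+-- Invalid: "PRECEDING FILTER b > 1" would not work, because b is derived.
+-- Note the filter compares raw values as strings: a source field of \N is
+-- matched as the literal string "\N" here, not as NULL.
+FROM KAFKA (...);
+```
+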
+### Example 1: Using Numeric Conditions for Pre-filtering
+
+This example demonstrates how to filter source data using a simple numeric comparison. By setting the filter condition `k1 > 1`, we can drop unwanted records before data transformation.
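+
+As a quick sketch of the statement shape (the label and four-column list here are placeholders assumed from the sample data; the complete statements for this example already appear later in this section):
+
+```sql
+LOAD LABEL example_db.label_prefilter_sketch
+(
+    DATA INFILE("s3://bucket_name/data.csv")
+    INTO TABLE example_table
+    COLUMNS TERMINATED BY ","
+    (k1, k2, k3, k4)
+    -- Rows with k1 <= 1 are dropped before any transformation runs.
+    PRECEDING FILTER k1 > 1
+)
+WITH s3 (...);
+```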
 
 Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header):
 ```plain
@@ -683,6 +694,64 @@ mysql> select * from example_table;
 +------+------+-----------+------+
 ```
 
+### Example 2: Using Intermediate Columns to Filter Invalid Data
+
+This example demonstrates how to handle data import scenarios containing invalid data.
+
+Source data:
+```plain
+1,1
+2,abc
+3,3
+```
+
+#### Table Creation
+```sql
+CREATE TABLE example_table
+(
+    k1 INT,
+    k2 INT NOT NULL
+)
+ENGINE = OLAP
+DUPLICATE KEY(k1)
+DISTRIBUTED BY HASH(k1) BUCKETS 1;
+```
+
+Column `k2` is of type INT, so `abc` is invalid dirty data. To filter it out, we can introduce an intermediate column.
+
+#### Load Statements
+- Broker Load
+```sql
+LOAD LABEL example_db.label1
+(
+    DATA INFILE("s3://bucket_name/data.csv")
+    INTO TABLE example_table
+    COLUMNS TERMINATED BY ","
+    (k1, tmp, k2 = tmp)
+    PRECEDING FILTER tmp != "abc"
+)
+WITH s3 (...);
+```
+
+- Routine Load
+```sql
+CREATE ROUTINE LOAD example_db.example_routine_load ON example_table
+COLUMNS(k1, tmp, k2 = tmp),
+COLUMNS TERMINATED BY ",",
+PRECEDING FILTER tmp != "abc"
+FROM KAFKA (...);
+```
+
+#### Load Results
+```sql
+mysql> select * from example_table;
++------+----+
+| k1   | k2 |
++------+----+
+|    1 |  1 |
+|    3 |  3 |
++------+----+
+```
 
 ## Post-filtering
 
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-data-convert.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-data-convert.md
index 1e9c4d2263e..f789741df3e 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-data-convert.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-data-convert.md
@@ -606,7 +606,19 @@ mysql> select * from example_table;
 
 For example, source data contains multiple tables' data (or multiple tables' data is written to the same Kafka message queue). Each row of data has a column indicating which table the data belongs to. Users can use pre-filtering conditions to filter out the corresponding table data for loading.
 
-### Example
+Pre-filtering has the following limitations:
+- Column filtering restrictions
+
+Pre-filtering can only filter simple, standalone columns in the column list; it cannot filter columns derived from expressions. For example, when the column mapping is `(a, tmp, b = tmp + 1)`, column `b` cannot be used as a filter condition.
+
+- Data processing restrictions
+
+Pre-filtering occurs before data transformation and compares raw data values, which are treated as strings. For example, data such as `\N` is compared directly as the string `\N` rather than being converted to NULL first.
+
+### Example 1: Using Numeric Conditions for Pre-filtering
+
+This example demonstrates how to filter source data using a simple numeric comparison. By setting the filter condition `k1 > 1`, we can drop unwanted records before data transformation.
+
 Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header):
 ```plain
 Column 1,Column 2,Column 3,Column 4
@@ -668,6 +680,64 @@ mysql> select * from example_table;
 +------+------+-----------+------+
 ```
 
+### Example 2: Using Intermediate Columns to Filter Invalid Data
+
+This example demonstrates how to handle data import scenarios containing invalid data.
+
+Source data:
+```plain
+1,1
+2,abc
+3,3
+```
+
+#### Table Creation
+```sql
+CREATE TABLE example_table
+(
+    k1 INT,
+    k2 INT NOT NULL
+)
+ENGINE = OLAP
+DUPLICATE KEY(k1)
+DISTRIBUTED BY HASH(k1) BUCKETS 1;
+```
+
+Column `k2` is of type INT, so `abc` is invalid dirty data. To filter it out, we can introduce an intermediate column.
+
+#### Load Statements
+- Broker Load
+```sql
+LOAD LABEL example_db.label1
+(
+    DATA INFILE("s3://bucket_name/data.csv")
+    INTO TABLE example_table
+    COLUMNS TERMINATED BY ","
+    (k1, tmp, k2 = tmp)
+    PRECEDING FILTER tmp != "abc"
+)
+WITH s3 (...);
+```
+
+- Routine Load
+```sql
+CREATE ROUTINE LOAD example_db.example_routine_load ON example_table
+COLUMNS(k1, tmp, k2 = tmp),
+COLUMNS TERMINATED BY ",",
+PRECEDING FILTER tmp != "abc"
+FROM KAFKA (...);
+```
+
+#### Load Results
+```sql
+mysql> select * from example_table;
++------+----+
+| k1   | k2 |
++------+----+
+|    1 |  1 |
+|    3 |  3 |
++------+----+
+```
 
 ## Post-filtering
 
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/import/load-data-convert.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/import/load-data-convert.md
index 1e9c4d2263e..f789741df3e 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/import/load-data-convert.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/import/load-data-convert.md
@@ -606,7 +606,19 @@ mysql> select * from example_table;
 
 For example, source data contains multiple tables' data (or multiple tables' data is written to the same Kafka message queue). Each row of data has a column indicating which table the data belongs to. Users can use pre-filtering conditions to filter out the corresponding table data for loading.
 
-### Example
+Pre-filtering has the following limitations:
+- Column filtering restrictions
+
+Pre-filtering can only filter simple, standalone columns in the column list; it cannot filter columns derived from expressions. For example, when the column mapping is `(a, tmp, b = tmp + 1)`, column `b` cannot be used as a filter condition.
+
+- Data processing restrictions
+
+Pre-filtering occurs before data transformation and compares raw data values, which are treated as strings. For example, data such as `\N` is compared directly as the string `\N` rather than being converted to NULL first.
+
+### Example 1: Using Numeric Conditions for Pre-filtering
+
+This example demonstrates how to filter source data using a simple numeric comparison. By setting the filter condition `k1 > 1`, we can drop unwanted records before data transformation.
+
 Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header):
 ```plain
 Column 1,Column 2,Column 3,Column 4
@@ -668,6 +680,64 @@ mysql> select * from example_table;
 +------+------+-----------+------+
 ```
 
+### Example 2: Using Intermediate Columns to Filter Invalid Data
+
+This example demonstrates how to handle data import scenarios containing invalid data.
+
+Source data:
+```plain
+1,1
+2,abc
+3,3
+```
+
+#### Table Creation
+```sql
+CREATE TABLE example_table
+(
+    k1 INT,
+    k2 INT NOT NULL
+)
+ENGINE = OLAP
+DUPLICATE KEY(k1)
+DISTRIBUTED BY HASH(k1) BUCKETS 1;
+```
+
+Column `k2` is of type INT, so `abc` is invalid dirty data. To filter it out, we can introduce an intermediate column.
+
+#### Load Statements
+- Broker Load
+```sql
+LOAD LABEL example_db.label1
+(
+    DATA INFILE("s3://bucket_name/data.csv")
+    INTO TABLE example_table
+    COLUMNS TERMINATED BY ","
+    (k1, tmp, k2 = tmp)
+    PRECEDING FILTER tmp != "abc"
+)
+WITH s3 (...);
+```
+
+- Routine Load
+```sql
+CREATE ROUTINE LOAD example_db.example_routine_load ON example_table
+COLUMNS(k1, tmp, k2 = tmp),
+COLUMNS TERMINATED BY ",",
+PRECEDING FILTER tmp != "abc"
+FROM KAFKA (...);
+```
+
+#### Load Results
+```sql
+mysql> select * from example_table;
++------+----+
+| k1   | k2 |
++------+----+
+|    1 |  1 |
+|    3 |  3 |
++------+----+
+```
 
 ## Post-filtering
 
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/import/load-data-convert.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/import/load-data-convert.md
index 1e9c4d2263e..f789741df3e 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/import/load-data-convert.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/import/load-data-convert.md
@@ -606,7 +606,19 @@ mysql> select * from example_table;
 
 For example, source data contains multiple tables' data (or multiple tables' data is written to the same Kafka message queue). Each row of data has a column indicating which table the data belongs to. Users can use pre-filtering conditions to filter out the corresponding table data for loading.
 
-### Example
+Pre-filtering has the following limitations:
+- Column filtering restrictions
+
+Pre-filtering can only filter simple, standalone columns in the column list; it cannot filter columns derived from expressions. For example, when the column mapping is `(a, tmp, b = tmp + 1)`, column `b` cannot be used as a filter condition.
+
+- Data processing restrictions
+
+Pre-filtering occurs before data transformation and compares raw data values, which are treated as strings. For example, data such as `\N` is compared directly as the string `\N` rather than being converted to NULL first.
+
+### Example 1: Using Numeric Conditions for Pre-filtering
+
+This example demonstrates how to filter source data using a simple numeric comparison. By setting the filter condition `k1 > 1`, we can drop unwanted records before data transformation.
+
 Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header):
 ```plain
 Column 1,Column 2,Column 3,Column 4
@@ -668,6 +680,64 @@ mysql> select * from example_table;
 +------+------+-----------+------+
 ```
 
+### Example 2: Using Intermediate Columns to Filter Invalid Data
+
+This example demonstrates how to handle data import scenarios containing invalid data.
+
+Source data:
+```plain
+1,1
+2,abc
+3,3
+```
+
+#### Table Creation
+```sql
+CREATE TABLE example_table
+(
+    k1 INT,
+    k2 INT NOT NULL
+)
+ENGINE = OLAP
+DUPLICATE KEY(k1)
+DISTRIBUTED BY HASH(k1) BUCKETS 1;
+```
+
+Column `k2` is of type INT, so `abc` is invalid dirty data. To filter it out, we can introduce an intermediate column.
+
+#### Load Statements
+- Broker Load
+```sql
+LOAD LABEL example_db.label1
+(
+    DATA INFILE("s3://bucket_name/data.csv")
+    INTO TABLE example_table
+    COLUMNS TERMINATED BY ","
+    (k1, tmp, k2 = tmp)
+    PRECEDING FILTER tmp != "abc"
+)
+WITH s3 (...);
+```
+
+- Routine Load
+```sql
+CREATE ROUTINE LOAD example_db.example_routine_load ON example_table
+COLUMNS(k1, tmp, k2 = tmp),
+COLUMNS TERMINATED BY ",",
+PRECEDING FILTER tmp != "abc"
+FROM KAFKA (...);
+```
+
+#### Load Results
+```sql
+mysql> select * from example_table;
++------+----+
+| k1   | k2 |
++------+----+
+|    1 |  1 |
+|    3 |  3 |
++------+----+
+```
 
 ## Post-filtering
 
diff --git a/versioned_docs/version-2.1/data-operate/import/load-data-convert.md b/versioned_docs/version-2.1/data-operate/import/load-data-convert.md
index d7b29d696ab..386b2ad5b6a 100644
--- a/versioned_docs/version-2.1/data-operate/import/load-data-convert.md
+++ b/versioned_docs/version-2.1/data-operate/import/load-data-convert.md
@@ -619,7 +619,18 @@ Scenarios where filtering is needed before column mapping and transformation, al
 
 For example, source data contains multiple tables' data (or multiple tables' data is written to the same Kafka message queue). Each row of data has a column indicating which table the data belongs to. Users can use pre-filtering conditions to filter out the corresponding table data for loading.
 
-### Example
+Pre-filtering has the following limitations:
+- Column filtering restrictions
+
+Pre-filtering can only filter simple, standalone columns in the column list; it cannot filter columns derived from expressions. For example, when the column mapping is `(a, tmp, b = tmp + 1)`, column `b` cannot be used as a filter condition.
+
+- Data processing restrictions
+
+Pre-filtering occurs before data transformation and compares raw data values, which are treated as strings. For example, data such as `\N` is compared directly as the string `\N` rather than being converted to NULL first.
+
+### Example 1: Using Numeric Conditions for Pre-filtering
+
+This example demonstrates how to filter source data using a simple numeric comparison. By setting the filter condition `k1 > 1`, we can drop unwanted records before data transformation.
 
 Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header):
 ```plain
@@ -683,6 +694,64 @@ mysql> select * from example_table;
 +------+------+-----------+------+
 ```
 
+### Example 2: Using Intermediate Columns to Filter Invalid Data
+
+This example demonstrates how to handle data import scenarios containing invalid data.
+
+Source data:
+```plain
+1,1
+2,abc
+3,3
+```
+
+#### Table Creation
+```sql
+CREATE TABLE example_table
+(
+    k1 INT,
+    k2 INT NOT NULL
+)
+ENGINE = OLAP
+DUPLICATE KEY(k1)
+DISTRIBUTED BY HASH(k1) BUCKETS 1;
+```
+
+Column `k2` is of type INT, so `abc` is invalid dirty data. To filter it out, we can introduce an intermediate column.
+
+#### Load Statements
+- Broker Load
+```sql
+LOAD LABEL example_db.label1
+(
+    DATA INFILE("s3://bucket_name/data.csv")
+    INTO TABLE example_table
+    COLUMNS TERMINATED BY ","
+    (k1, tmp, k2 = tmp)
+    PRECEDING FILTER tmp != "abc"
+)
+WITH s3 (...);
+```
+
+- Routine Load
+```sql
+CREATE ROUTINE LOAD example_db.example_routine_load ON example_table
+COLUMNS(k1, tmp, k2 = tmp),
+COLUMNS TERMINATED BY ",",
+PRECEDING FILTER tmp != "abc"
+FROM KAFKA (...);
+```
+
+#### Load Results
+```sql
+mysql> select * from example_table;
++------+----+
+| k1   | k2 |
++------+----+
+|    1 |  1 |
+|    3 |  3 |
++------+----+
+```
 
 ## Post-filtering
 
diff --git a/versioned_docs/version-3.0/data-operate/import/load-data-convert.md b/versioned_docs/version-3.0/data-operate/import/load-data-convert.md
index d7b29d696ab..386b2ad5b6a 100644
--- a/versioned_docs/version-3.0/data-operate/import/load-data-convert.md
+++ b/versioned_docs/version-3.0/data-operate/import/load-data-convert.md
@@ -619,7 +619,18 @@ Scenarios where filtering is needed before column mapping and transformation, al
 
 For example, source data contains multiple tables' data (or multiple tables' data is written to the same Kafka message queue). Each row of data has a column indicating which table the data belongs to. Users can use pre-filtering conditions to filter out the corresponding table data for loading.
 
-### Example
+Pre-filtering has the following limitations:
+- Column filtering restrictions
+
+Pre-filtering can only filter simple, standalone columns in the column list; it cannot filter columns derived from expressions. For example, when the column mapping is `(a, tmp, b = tmp + 1)`, column `b` cannot be used as a filter condition.
+
+- Data processing restrictions
+
+Pre-filtering occurs before data transformation and compares raw data values, which are treated as strings. For example, data such as `\N` is compared directly as the string `\N` rather than being converted to NULL first.
+
+### Example 1: Using Numeric Conditions for Pre-filtering
+
+This example demonstrates how to filter source data using a simple numeric comparison. By setting the filter condition `k1 > 1`, we can drop unwanted records before data transformation.
 
 Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header):
 ```plain
@@ -683,6 +694,64 @@ mysql> select * from example_table;
 +------+------+-----------+------+
 ```
 
+### Example 2: Using Intermediate Columns to Filter Invalid Data
+
+This example demonstrates how to handle data import scenarios containing invalid data.
+
+Source data:
+```plain
+1,1
+2,abc
+3,3
+```
+
+#### Table Creation
+```sql
+CREATE TABLE example_table
+(
+    k1 INT,
+    k2 INT NOT NULL
+)
+ENGINE = OLAP
+DUPLICATE KEY(k1)
+DISTRIBUTED BY HASH(k1) BUCKETS 1;
+```
+
+Column `k2` is of type INT, so `abc` is invalid dirty data. To filter it out, we can introduce an intermediate column.
+
+#### Load Statements
+- Broker Load
+```sql
+LOAD LABEL example_db.label1
+(
+    DATA INFILE("s3://bucket_name/data.csv")
+    INTO TABLE example_table
+    COLUMNS TERMINATED BY ","
+    (k1, tmp, k2 = tmp)
+    PRECEDING FILTER tmp != "abc"
+)
+WITH s3 (...);
+```
+
+- Routine Load
+```sql
+CREATE ROUTINE LOAD example_db.example_routine_load ON example_table
+COLUMNS(k1, tmp, k2 = tmp),
+COLUMNS TERMINATED BY ",",
+PRECEDING FILTER tmp != "abc"
+FROM KAFKA (...);
+```
+
+#### Load Results
+```sql
+mysql> select * from example_table;
++------+----+
+| k1   | k2 |
++------+----+
+|    1 |  1 |
+|    3 |  3 |
++------+----+
+```
 
 ## Post-filtering
 


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org
