This is an automated email from the ASF dual-hosted git repository. liaoxin pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push: new b91551d37b4 [doc](load) optimize transform data (#1762) b91551d37b4 is described below commit b91551d37b41d50d305053acad3627d0a181d163 Author: Xin Liao <liao...@selectdb.com> AuthorDate: Sun Jan 12 22:23:22 2025 +0800 [doc](load) optimize transform data (#1762) --- docs/data-operate/import/load-data-convert.md | 433 +++++++++++---------- .../data-operate/import/load-data-convert.md | 285 ++++++++------ .../data-operate/import/load-data-convert.md | 285 ++++++++------ .../data-operate/import/load-data-convert.md | 285 ++++++++------ .../data-operate/import/load-data-convert.md | 433 +++++++++++---------- .../data-operate/import/load-data-convert.md | 433 +++++++++++---------- 6 files changed, 1176 insertions(+), 978 deletions(-) diff --git a/docs/data-operate/import/load-data-convert.md b/docs/data-operate/import/load-data-convert.md index dee8d02b3e9..d7b29d696ab 100644 --- a/docs/data-operate/import/load-data-convert.md +++ b/docs/data-operate/import/load-data-convert.md @@ -24,7 +24,7 @@ specific language governing permissions and limitations under the License. --> -Doris provides powerful data transformation capabilities during data loading, allowing you to process data without additional ETL tools. It mainly supports four types of transformations: +Doris provides powerful data transformation capabilities during data loading, which can simplify data processing workflows and reduce dependency on additional ETL tools. It mainly supports four types of transformations: - **Column Mapping**: Map source data columns to different columns in the target table. @@ -34,40 +34,42 @@ Doris provides powerful data transformation capabilities during data loading, al - **Post-filtering**: Filter the final results after column mapping and transformation. 
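Taken together, mapping, transformation, and filtering can also be expressed as ordinary SQL when loading with Insert Into. The sketch below is illustrative only: the `example_staging` table and its `src_*` columns are invented stand-ins for a real data source, not names used elsewhere in this document.

```sql
-- Hypothetical staging table standing in for the raw source data.
INSERT INTO example_table (k1, k2, k3, k4)
SELECT
    src_c1,          -- column mapping: reorder source columns onto target columns
    src_c3,
    src_c2 * 100,    -- column transformation: derive the stored value
    src_c4
FROM example_staging
WHERE src_c1 > 1;    -- filtering: keep only the rows of interest
```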
-Through these built-in data transformation functions, you can significantly simplify the data processing workflow, improve loading efficiency, and ensure consistency in data processing logic. +Through these built-in data transformation functions, you can improve loading efficiency and ensure consistency in data processing logic. ## Load Syntax -**Stream Load** +### Stream Load -Add `columns` and `where` parameters in the HTTP header. +Configure data transformation by setting the following parameters in HTTP headers: -- `columns` specifies column mapping and transformation. +| Parameter | Description | +|-----------|-------------| +| `columns` | Specify column mapping and transformation | +| `where` | Specify post-filtering | -- `where` specifies post-filtering. - -- Pre-filtering is not supported. +> **Note**: Stream Load does not support pre-filtering. +Example: ```shell curl --location-trusted -u user:passwd \ --H "columns: k1, k2, tmp_k3, k3 = tmp_k3 + 1" \ --H "where: k1 > 1"\ --T data.csv \ -http://host:port/api/example_db/example_table/_stream_load + -H "columns: k1, k2, tmp_k3, k3 = tmp_k3 + 1" \ + -H "where: k1 > 1" \ + -T data.csv \ + http://<fe_ip>:<fe_http_port>/api/example_db/example_table/_stream_load ``` -**Broker Load** - -Define data transformations in SQL statements, where: - -- `(k1, k2, tmp_k3)` specifies column mapping. - -- `SET` specifies column transformation. +### Broker Load -- `PRECEDING FILTER` specifies pre-filtering. +Implement data transformation in SQL statements using the following clauses: -- `WHERE` specifies post-filtering. 
+| Clause | Description | +|--------|-------------| +| `column list` | Specify column mapping, format: `(k1, k2, tmp_k3)` | +| `SET` | Specify column transformation | +| `PRECEDING FILTER` | Specify pre-filtering | +| `WHERE` | Specify post-filtering | +Example: ```sql LOAD LABEL test_db.label1 ( @@ -83,57 +85,55 @@ LOAD LABEL test_db.label1 WITH S3 (...); ``` -**Routine Load** +### Routine Load -Define data transformations in SQL statements, where: +Implement data transformation in SQL statements using the following clauses: -- `COLUMNS` specifies column mapping and transformation. - -- `PRECEDING FILTER` specifies pre-filtering. - -- `WHERE` specifies post-filtering. +| Clause | Description | +|--------|-------------| +| `COLUMNS` | Specify column mapping and transformation | +| `PRECEDING FILTER` | Specify pre-filtering | +| `WHERE` | Specify post-filtering | +Example: ```sql CREATE ROUTINE LOAD test_db.label1 ON test_tbl -COLUMNS(k1, k2, tmp_k3, k3 = tmp_k3 + 1), -PRECEDING FILTER k1 = 1, -WHERE k1 > 1 -... + COLUMNS(k1, k2, tmp_k3, k3 = tmp_k3 + 1), + PRECEDING FILTER k1 = 1, + WHERE k1 > 1 + ... ``` -## Column Mapping +### Insert Into -Column mapping is used to define the correspondence between source data columns and target table columns. It can handle the following scenarios: +Insert Into can directly perform data transformation in the `SELECT` statement, using the `WHERE` clause for data filtering. -- The order of source data columns and target table columns is inconsistent. - -- The number of source data columns and target table columns is inconsistent. +## Column Mapping -### Example +Column mapping is used to define the correspondence between source data columns and target table columns. 
It can handle the following scenarios: +- The order of source data columns and target table columns is inconsistent +- The number of source data columns and target table columns is inconsistent -Suppose we have the following source data (column names are only for illustration purposes and do not actually exist): +### Adjusting Column Order -```plain text -Column1,Column2,Column3,Column4 +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 1,100,beijing,1.1 2,200,shanghai,1.2 3,300,guangzhou,1.3 4,\N,chongqing,1.4 ``` -Note: `\N` in the source file represents NULL. - -#### Adjusting Column Order - -Suppose the target table has four columns: k1, k2, k3, and k4. We want to map the columns as follows: - -- Column1 -> k1 -- Column2 -> k3 -- Column3 -> k2 -- Column4 -> k4 - -##### Creating the Target Table +The target table has four columns: k1, k2, k3, and k4. We want to map the columns as follows: +```plain +column1 -> k1 +column2 -> k3 +column3 -> k2 +column4 -> k4 +``` +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -146,10 +146,8 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ @@ -160,7 +158,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label_broker ( @@ -173,7 +170,6 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, k3, k2, k4), @@ -181,8 +177,7 @@ COLUMNS TERMINATED BY "," FROM KAFKA (...); ``` -##### Query Results - +#### Query Results ``` mysql> select * from example_table; +------+-----------+------+------+ @@ -195,16 +190,27 @@ mysql> select * from example_table; +------+-----------+------+------+ ``` -#### Source File Columns Exceed Table Columns +### Source 
File Columns Exceed Table Columns -Suppose the target table has three columns: k1, k2, and k3. We only need the first three columns of the source file. The mapping relationship is as follows: +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` -- Column1 -> k1 -- Column2 -> k2 -- Column4 -> k3 +The target table has three columns: k1, k2, and k3. We only need the first, second, and fourth columns from the source file, with the following mapping relationship: +```plain +column1 -> k1 +column2 -> k2 +column4 -> k3 +``` -##### Creating the Target Table +To skip certain columns in the source file, you can use any column name that does not exist in the target table during column mapping. These column names can be customized and are not restricted. The data in these columns will be automatically ignored during loading. +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -216,12 +222,10 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ -H "columns: k1,k2,tmp_skip,k3" \ -T data.csv \ @@ -229,7 +233,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label_broker ( @@ -247,7 +250,6 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, k2, tmp_skip, k3), @@ -259,10 +261,9 @@ PROPERTIES FROM KAFKA (...); ``` -The third column of the source file is named tmp_skip, which is a custom column name that does not exist in the table. It is ignored during loading and not written to the target table. 
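The same placeholder technique works for any number of unwanted columns. A hedged sketch, assuming a hypothetical five-column source file from which only the first, third, and fifth columns are kept (the label and the `tmp_skip_*` names are invented for illustration):

```sql
LOAD LABEL example_db.label_skip_two
(
    DATA INFILE("s3://bucket/data.csv")
    INTO TABLE example_table
    COLUMNS TERMINATED BY ","
    (k1, tmp_skip_a, k2, tmp_skip_b, k3)
)
WITH s3 (...);
```

Each `tmp_skip_*` name only has to be absent from the target table's column definitions; the corresponding source fields are discarded during loading.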
- -##### Query Results +> Note: The `tmp_skip` in the example can be replaced with any name, as long as it is not in the column definition of the target table. +#### Query Results ``` mysql> select * from example_table; +------+------+------+ @@ -275,18 +276,27 @@ mysql> select * from example_table; +------+------+------+ ``` -#### Source File Columns Less Than Table Columns - -Suppose the target table has five columns: k1, k2, k3, k4, and k5. We only use the first three columns of the source file: +### Source File Columns Less Than Table Columns -- Column1 -> k1 -- Column2 -> k3 -- Column3 -> k2 -- Column4 -> k4 -- k5 uses the default value +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` -##### Creating the Target Table +The target table has five columns: k1, k2, k3, k4, and k5. We only need the first, second, third, and fourth columns from the source file, with the following mapping relationship: +```plain +column1 -> k1 +column2 -> k3 +column3 -> k2 +column4 -> k4 +k5 uses the default value +``` +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -300,10 +310,8 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ @@ -313,7 +321,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label_broker ( @@ -324,7 +331,7 @@ LOAD LABEL example_db.label_broker SET ( k1 = tmp_k1, k3 = tmp_k3, - k2 = tmp_k2 + k2 = tmp_k2, k4 = tmp_k4 ) ) @@ -332,7 +339,6 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, k3, k2, k4), @@ -341,13 +347,11 @@ FROM KAFKA (...); ``` Note: +- If k5 has a default value, it will 
be filled with the default value +- If k5 is a nullable column but has no default value, it will be filled with NULL +- If k5 is a non-nullable column and has no default value, the load will fail -- If k5 has a default value, it will be used. -- If it is a nullable column, it will be filled with NULL. -- If it is a non-nullable column without a default value, the load will fail. - -##### Query Results - +#### Query Results ``` mysql> select * from example_table; +------+-----------+------+------+------+ @@ -358,35 +362,32 @@ mysql> select * from example_table; | 3 | guangzhou | 300 | 1.3 | 2 | | 4 | chongqing | NULL | 1.4 | 2 | +------+-----------+------+------+------+ -``` +``` ## Column Transformation -Column transformation allows you to transform source data column values. It supports most built-in functions. Column transformation is usually defined together with column mapping. - -### Example +Column transformation allows users to transform column values in the source file, supporting the use of most built-in functions. Column transformation is usually defined together with column mapping, i.e., first map the columns and then transform them. -Suppose we have the following source data (column names are only for illustration purposes and do not actually exist): +### Transforming Source File Column Values Before Loading -```plain text -Column1,Column2,Column3,Column4 +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 1,100,beijing,1.1 2,200,shanghai,1.2 3,300,guangzhou,1.3 4,\N,chongqing,1.4 ``` -#### Transforming Source Data Column Values - -Suppose the target table has four columns: k1, k2, k3, and k4. We want to transform the column values as follows: - -- Column1 -> k1 -- Column2 * 100 -> k3 -- Column3 -> k2 -- Column4 -> k4 - -##### Creating the Target Table +The target table has four columns: k1, k2, k3, and k4. 
We want to transform the column values as follows: +```plain +column1 -> k1 +column2 * 100 -> k3 +column3 -> k2 +column4 -> k4 +``` +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -400,10 +401,8 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ @@ -413,7 +412,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label1 ( @@ -429,7 +427,6 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, tmp_k3, k2, k4, k3 = tmp_k3 * 100), @@ -437,12 +434,11 @@ COLUMNS TERMINATED BY "," FROM KAFKA (...); ``` -##### Query Results - +#### Query Results ``` mysql> select * from example_table; +------+------+-------+------+ -| k1 | k2 | k3 | k4 | +| k1 | k2 | k3 | k4 | +------+------+-------+------+ | 1 | beijing | 10000 | 1.1 | | 2 | shanghai | 20000 | 1.2 | @@ -451,17 +447,26 @@ mysql> select * from example_table; +------+------+-------+------+ ``` -#### Using Case When Function for Conditional Transformation - -Suppose the target table has four columns: k1, k2, k3, and k4. We want to transform the column values as follows: +### Using Case When Function for Conditional Column Transformation -- Column1 -> k1 -- Column2 -> k2 -- Column3 -> k3 (transformed to area ID) -- Column4 -> k4 +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` -##### Creating the Target Table +The target table has four columns: k1, k2, k3, and k4. 
We want to transform the column values as follows: +```plain +column1 -> k1 +column2 -> k2 +column3 -> k3 (transformed to area id) +column4 -> k4 +``` +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -475,10 +480,8 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ @@ -488,7 +491,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label1 ( @@ -504,16 +506,13 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table -COLUMNS(k1, k2, tmp_k3, k4, k3 = CASE tmp_k3 WHEN 'beijing' THEN 1 WHEN 'shanghai' THEN 2 WHEN 'guangzhou' THEN 3 WHEN 'chongqing' THEN NULL END), +COLUMNS(k1, k2, tmp_k3, k4, k3 = CASE tmp_k3 WHEN 'beijing' THEN 1 WHEN 'shanghai' THEN 2 WHEN 'guangzhou' THEN 3 WHEN 'chongqing' THEN 4 ELSE NULL END), COLUMNS TERMINATED BY "," FROM KAFKA (...); -``` - -##### Query Results - +``` +#### Query Results ``` mysql> select * from example_table; +------+------+------+------+ @@ -526,17 +525,26 @@ mysql> select * from example_table; +------+------+------+------+ ``` -#### Handling NULL Values in Source Files +### Handling NULL Values in Source Files -Suppose the target table has four columns: k1, k2, k3, and k4. We want to transform the column values as follows: - -- Column1 -> k1 -- Column2 (if NULL, transform to 0) -> k2 -- Column3 -> k3 -- Column4 -> k4 +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` -##### Creating the Target Table +The target table has four columns: k1, k2, k3, and k4. 
We want to transform the column values as follows: +```plain +column1 -> k1 (transform NULL to 0) +column2 -> k2 +column3 -> k3 +column4 -> k4 +``` +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -550,10 +558,8 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ @@ -563,7 +569,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label1 ( @@ -580,16 +585,13 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, tmp_k2, tmp_k3, k4, k2 = ifnull(tmp_k2, 0), k3 = CASE tmp_k3 WHEN 'beijing' THEN 1 WHEN 'shanghai' THEN 2 WHEN 'guangzhou' THEN 3 WHEN 'chongqing' THEN 4 ELSE NULL END), COLUMNS TERMINATED BY "," FROM KAFKA (...); -``` - -##### Query Results - +``` +#### Query Results ``` mysql> select * from example_table; +------+------+------+------+ @@ -605,14 +607,23 @@ mysql> select * from example_table; ## Pre-filtering -Pre-filtering is used to filter out unwanted raw data before column mapping and transformation. It is only supported in Broker Load and Routine Load. +Pre-filtering is the process of filtering out unwanted raw data before column mapping and transformation. This feature is only supported in Broker Load and Routine Load. -### Example +Pre-filtering has the following application scenarios: -Suppose we have the following source data (column names are only for illustration purposes and do not actually exist): +- Filtering before transformation + +Scenarios where filtering is needed before column mapping and transformation, allowing for the removal of unwanted data before processing. 
+ +- Filtering columns that do not exist in the table, only as filtering indicators + +For example, source data contains multiple tables' data (or multiple tables' data is written to the same Kafka message queue). Each row of data has a column indicating which table the data belongs to. Users can use pre-filtering conditions to filter out the corresponding table data for loading. + +### Example -```plain text -Column1,Column2,Column3,Column4 +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 1,100,beijing,1.1 2,200,shanghai,1.2 3,300,guangzhou,1.3 @@ -620,13 +631,11 @@ Column1,Column2,Column3,Column4 ``` Pre-filtering condition: - ``` -Column1 > 1 +column1 > 1, i.e., only load data where column1 > 1, and filter out other data. ``` #### Creating the Target Table - ```sql CREATE TABLE example_table ( @@ -639,11 +648,8 @@ ENGINE = OLAP DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` - #### Loading Data - - Broker Load - ```sql LOAD LABEL example_db.label1 ( @@ -657,17 +663,15 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, k2, k3, k4), COLUMNS TERMINATED BY "," PRECEDING FILTER k1 > 1 -FROM KAFKA (...); +FROM KAFKA (...) ``` #### Query Results - ``` mysql> select * from example_table; +------+------+-----------+------+ @@ -682,26 +686,23 @@ mysql> select * from example_table; ## Post-filtering -Post-filtering is used to filter the final results after column mapping and transformation. +Post-filtering is the process of filtering the final results after column mapping and transformation. 
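Because post-filtering runs after mapping and transformation, the filter condition is evaluated against transformed values rather than raw source fields. For a one-off load, similar behavior can be sketched with Insert Into (the `example_staging` table and `src_*` columns below are hypothetical):

```sql
INSERT INTO example_table
SELECT k1, k2, k3, k4
FROM (
    SELECT
        src_c1 AS k1,
        src_c2 AS k2,
        CASE src_c3
            WHEN 'beijing'   THEN 1
            WHEN 'shanghai'  THEN 2
            WHEN 'guangzhou' THEN 3
            WHEN 'chongqing' THEN 4
            ELSE NULL
        END AS k3,
        src_c4 AS k4
    FROM example_staging
) t
WHERE k3 != 3;   -- evaluated against the transformed area id, not the raw city name
```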
-### Example -Suppose we have the following source data (column names are only for illustration purposes and do not actually exist): +### Filtering Without Column Mapping and Transformation -```plain text -Column1,Column2,Column3,Column4 +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 1,100,beijing,1.1 2,200,shanghai,1.2 3,300,guangzhou,1.3 4,\N,chongqing,1.4 ``` -#### Filtering Without Column Mapping and Transformation - -Suppose the target table has four columns: k1, k2, k3, and k4. We want to filter out rows where Column4 is greater than 1.2. - -##### Creating the Target Table +The target table has four columns: k1, k2, k3, and k4. We want to load only the data where the fourth column is greater than 1.2. +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -715,10 +716,8 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ @@ -729,7 +728,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label1 ( @@ -743,17 +741,15 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, k2, k3, k4), COLUMNS TERMINATED BY "," WHERE k4 > 1.2 FROM KAFKA (...); ``` -##### Query Results - +#### Query Results ``` mysql> select * from example_table; +------+------+-----------+------+ @@ -764,12 +760,28 @@ mysql> select * from example_table; +------+------+-----------+------+ ``` -#### Filtering After Column Transformation +### Filtering Transformed Data -Suppose the target table has four columns: k1, k2, k3, and k4. We want to transform Column3 to area ID and filter out rows where the area ID is 3. 
+Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` + +The target table has four columns: k1, k2, k3, and k4. We want to transform the column values as follows: +```plain +column1 -> k1 +column2 -> k2 +column3 -> k3 (transformed to area id) +column4 -> k4 +``` -##### Creating the Target Table +We want to filter out the data where the transformed k3 value is 3. +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -783,10 +795,8 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ @@ -797,7 +807,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label1 ( @@ -810,11 +819,10 @@ LOAD LABEL example_db.label1 ) WHERE k3 != 3 ) WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, k2, tmp_k3, k4), @@ -823,11 +831,10 @@ SET ( k3 = CASE tmp_k3 WHEN 'beijing' THEN 1 WHEN 'shanghai' THEN 2 WHEN 'guangzhou' THEN 3 WHEN 'chongqing' THEN 4 ELSE NULL END ) WHERE k3 != 3 FROM KAFKA (...); ``` -##### Query Results - +#### Query Results ``` mysql> select * from example_table; +------+------+------+------+ @@ -839,12 +846,20 @@ mysql> select * from example_table; +------+------+------+------+ ``` -#### Multiple Conditions +### Multiple Conditions Filtering -Suppose the target table has four columns: k1, k2, k3, and k4. We want to filter out rows where k1 is NULL and k4 is less than 1.2. 
+Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` -##### Creating the Target Table +The target table has four columns: k1, k2, k3, and k4. We want to load only the rows where k1 is not NULL and k4 is greater than 1.2. +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -858,10 +873,8 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ @@ -872,7 +885,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label1 ( @@ -886,7 +898,6 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, k2, k3, k4), @@ -895,8 +906,7 @@ WHERE k1 is not null and k4 > 1.2 FROM KAFKA (...); ``` -##### Query Results - +#### Query Results ``` mysql> select * from example_table; +------+------+-----------+------+ @@ -904,4 +914,5 @@ mysql> select * from example_table; +------+------+-----------+------+ | 3 | 300 | guangzhou | 1.3 | | 4 | NULL | chongqing | 1.4 | -+------+------+-----------+------+ \ No newline at end of file ++------+------+-----------+------+ +``` \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-data-convert.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-data-convert.md index 6f82f60d6e3..37051936905 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-data-convert.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-data-convert.md @@ -5,7 +5,7 @@ } --- -<!-- +<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license 
agreements. See the NOTICE file distributed with this work for additional information @@ -24,7 +24,7 @@ specific language governing permissions and limitations under the License. --> -Doris 在数据导入时提供了强大的数据转换能力,让您无需额外的 ETL 工具就能完成数据处理。主要支持以下四种转换方式: +Doris 在数据导入时提供了强大的数据转换能力,可以简化部分数据处理流程,减少对额外 ETL 工具的依赖。主要支持以下四种转换方式: - **列映射**:将源数据列映射到目标表的不同列。 @@ -34,40 +34,42 @@ Doris 在数据导入时提供了强大的数据转换能力,让您无需额 - **后置过滤**:在列映射和列变换后数据对最终结果进行过滤。 -通过这些内置的数据转换功能,可以显著简化数据处理流程,提高导入效率,并确保数据处理逻辑的一致性。 +通过这些内置的数据转换功能,可以提高导入效率,并确保数据处理逻辑的一致性。 ## 导入语法 -**Stream Load** +### Stream Load -在 HTTP header 中增加 `columns` 和 `where` 参数。 +通过在 HTTP header 中设置以下参数实现数据转换: -- `columns` 指定列映射和列变换。 +| 参数 | 说明 | +|------|------| +| `columns` | 指定列映射和列变换 | +| `where` | 指定后置过滤 | -- `where` 指定后置过滤。 - -- 不支持前置过滤。 +> **注意**: Stream Load 不支持前置过滤。 +示例: ```shell curl --location-trusted -u user:passwd \ --H "columns: k1, k2, tmp_k3, k3 = tmp_k3 + 1" \ --H "where: k1 > 1"\ --T data.csv \ -http://host:port/api/example_db/example_table/_stream_load + -H "columns: k1, k2, tmp_k3, k3 = tmp_k3 + 1" \ + -H "where: k1 > 1" \ + -T data.csv \ + http://<fe_ip>:<fe_http_port>/api/example_db/example_table/_stream_load ``` -**Broker Load** - -在 SQL 语句中定义数据变换,其中: +### Broker Load -- `(k1, k2, tmp_k3)` 指定列映射。 +在 SQL 语句中通过以下子句实现数据转换: -- `SET` 指定列变换。 - -- `PRECEDING FILTER` 指定前置过滤。 - -- `WHERE` 指定后置过滤。 +| 子句 | 说明 | +|------|------| +| `column list` | 指定列映射,格式为 `(k1, k2, tmp_k3)` | +| `SET` | 指定列变换 | +| `PRECEDING FILTER` | 指定前置过滤 | +| `WHERE` | 指定后置过滤 | +示例: ```sql LOAD LABEL test_db.label1 ( @@ -83,27 +85,28 @@ LOAD LABEL test_db.label1 WITH S3 (...); ``` -**Routine Load** - -在 SQL 语句中定义数据变换,其中: - -- `COLUMNS` 指定列映射和列变换。 +### Routine Load -- `PRECEDING FILTER` 指定前置过滤。 +在 SQL 语句中通过以下子句实现数据转换: -- `WHERE` 指定后置过滤。 +| 子句 | 说明 | +|------|------| +| `COLUMNS` | 指定列映射和列变换 | +| `PRECEDING FILTER` | 指定前置过滤 | +| `WHERE` | 指定后置过滤 | +示例: ```sql CREATE ROUTINE LOAD test_db.label1 ON test_tbl -COLUMNS(k1, k2, tmp_k3, k3 = tmp_k3 + 1), -PRECEDING FILTER k1 = 
1, -WHERE k1 > 1 -... + COLUMNS(k1, k2, tmp_k3, k3 = tmp_k3 + 1), + PRECEDING FILTER k1 = 1, + WHERE k1 > 1 + ... ``` -**Insert Into** +### Insert Into -Insert Into 可以直接在 `SELECT` 语句中完成数据变换,增加 `WHERE` 子句完成数据过滤。 +Insert Into 可以直接在 `SELECT` 语句中完成数据转换,使用 `WHERE` 子句实现数据过滤。 ## 列映射 @@ -111,24 +114,23 @@ Insert Into 可以直接在 `SELECT` 语句中完成数据变换,增加 `WHERE - 源数据与目标表的列顺序不一致 - 源数据与目标表的列数量不匹配 -### 示例 - +### 调整列顺序 假设有以下源数据(表头列名仅为方便表述,实际并无表头): -```plain text +```plain 列1,列2,列3,列4 1,100,beijing,1.1 2,200,shanghai,1.2 3,300,guangzhou,1.3 4,\N,chongqing,1.4 ``` -注:`\N` 在源文件中表示 NULL。 -#### 调整列顺序 -假设目标表有 k1, k2, k3, k4 四列,要实现如下映射: -- 列1 -> k1 -- 列2 -> k3 -- 列3 -> k2 -- 列4 -> k4 +目标表有 k1, k2, k3, k4 四列,要实现如下映射: +```plain +列1 -> k1 +列2 -> k3 +列3 -> k2 +列4 -> k4 +``` ##### 创建目标表 ```sql @@ -173,6 +175,7 @@ COLUMNS(k1, k3, k2, k4), COLUMNS TERMINATED BY "," FROM KAFKA (...); ``` + ##### 查询结果 ``` mysql> select * from example_table; @@ -185,13 +188,25 @@ mysql> select * from example_table; | 1 | beijing | 100 | 1.1 | +------+-----------+------+------+ ``` -#### 源文件列数量多于表列数 -假设目标表有 k1, k2, k3 三列,而源文件包含四列数据。我们只需要源文件的第1、第2、第4列,映射关系如下: -- 列1 -> k1 -- 列2 -> k2 -- 列4 -> k3 -##### 创建示例表 +### 源文件列数量多于表列数 +假设有以下源数据(表头列名仅为方便表述,实际并无表头): +```plain +列1,列2,列3,列4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` +目标表有 k1, k2, k3 三列,而源文件包含四列数据。我们只需要源文件的第1、第2、第4列,映射关系如下: +```plain +列1 -> k1 +列2 -> k2 +列4 -> k3 +``` +要跳过源文件中的某些列,只需在列映射时使用任意不存在于目标表的列名。这些列名可以自定义,不受限制,导入时会自动忽略这些列的数据。 + +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -203,7 +218,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u usr:passwd \ @@ -242,9 +257,9 @@ PROPERTIES FROM KAFKA (...); ``` -源文件的第三列被命名为 tmp_skip,tmp_skip 为一个自定义的、表中不存在的列名,在导入过程中被忽略,不会写入目标表。 +> 注意:示例中的 `tmp_skip` 可以替换为任意名称,只要这些名称不在目标表的列定义中即可。 -##### 查询结果 +#### 查询结果 ``` mysql> select * from example_table; +------+------+------+ @@ -257,15 +272,25 @@ mysql> select * 
from example_table; +------+------+------+ ``` -#### 源文件列数量少于表列数 -假设目标表有 k1, k2, k3, k4, k5 五列,只使用源文件的前三列: -- 列1 -> k1 -- 列2 -> k3 -- 列3 -> k2 -- 列4 -> k4 -- k5 使用默认值 +### 源文件列数量少于表列数 +假设有以下源数据(表头列名仅为方便表述,实际并无表头): +```plain +列1,列2,列3,列4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` +目标表有 k1, k2, k3, k4, k5 五列,而源文件包含四列数据。我们只需要源文件的第1、第2、第3、第4列,映射关系如下: +```plain +列1 -> k1 +列2 -> k3 +列3 -> k2 +列4 -> k4 +k5 使用默认值 +``` -##### 创建示例表 +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -279,7 +304,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u user:passwd \ @@ -300,7 +325,7 @@ LOAD LABEL example_db.label_broker SET ( k1 = tmp_k1, k3 = tmp_k3, - k2 = tmp_k2 + k2 = tmp_k2, k4 = tmp_k4 ) ) @@ -317,8 +342,8 @@ FROM KAFKA (...); 说明: - 如果 k5 列有默认值,将使用默认值填充 -- 如果是可空列(nullable),将填充 NULL 值 -- 如果是非空列且没有默认值,导入会失败 +- 如果 k5 列是可空列(nullable)但没有默认值,将填充 NULL 值 +- 如果 k5 列是非空列且没有默认值,导入会失败 ##### 查询结果 ``` @@ -335,28 +360,26 @@ mysql> select * from example_table; ## 列变换 -列转换功能允许用户对源文件中列值进行变换,支持使用绝大部分内置函数。列变换操作通常是和列映射一起定义的,即先对列进行映射,再进行转换。 +列变换功能允许用户对源文件中列值进行变换,支持使用绝大部分内置函数。列变换操作通常是和列映射一起定义的,即先对列进行映射,再进行变换。 -### 示例 +### 将源文件中的列值经变换后导入表中 假设有以下源数据(表头列名仅为方便表述,实际并无表头): -```plain text +```plain 列1,列2,列3,列4 1,100,beijing,1.1 2,200,shanghai,1.2 3,300,guangzhou,1.3 4,\N,chongqing,1.4 ``` - -#### 将源文件中的列值经转换后导入表中 - -假设表中有 k1,k2,k3,k4 4 列。我们希望的导入映射和转换关系如下: - +表中有 k1,k2,k3,k4 4 列,导入映射和变换关系如下: +```plain 列1 -> k1 列2 * 100 -> k3 列3 -> k2 列4 -> k4 +``` -##### 创建示例表 +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -370,7 +393,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u user:passwd \ @@ -403,7 +426,7 @@ COLUMNS TERMINATED BY "," FROM KAFKA (...); ``` -##### 查询结果 +#### 查询结果 ``` mysql> select * from example_table; +------+-----------+-------+------+ @@ -416,15 +439,24 @@ mysql> select * from 
example_table; +------+-----------+-------+------+ ``` -#### 通过 case when 函数,有条件的进行列转换 -假设表中有 k1,k2,k3,k4 4 列。我们希望对于源数据中的 beijing, shanghai, guangzhou, chongqing 分别转换为对应的地区 id 后导入: - +### 通过 case when 函数,有条件的进行列变换 +假设有以下源数据(表头列名仅为方便表述,实际并无表头): +```plain +列1,列2,列3,列4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` +表中有 k1,k2,k3,k4 4 列。对于源数据中 beijing, shanghai, guangzhou, chongqing 分别转换为对应的地区 id 后导入: +```plain 列1 -> k1 列2 -> k2 列3 进行地区id转换后 -> k3 列4 -> k4 +``` -##### 创建示例表 +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -438,7 +470,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u user:passwd \ @@ -466,33 +498,41 @@ WITH s3 (...); - Routine Load ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table -COLUMNS(k1, k2, tmp_k3, k4, k3 = CASE tmp_k3 WHEN 'beijing' THEN 1 WHEN 'shanghai' THEN 2 WHEN 'guangzhou' THEN 3 WHEN 'chongqing' THEN NULL END), +COLUMNS(k1, k2, tmp_k3, k4, k3 = CASE tmp_k3 WHEN 'beijing' THEN 1 WHEN 'shanghai' THEN 2 WHEN 'guangzhou' THEN 3 WHEN 'chongqing' THEN 4 ELSE NULL END), COLUMNS TERMINATED BY "," FROM KAFKA (...); -``` -##### 查询结果 +``` +#### 查询结果 ``` mysql> select * from example_table; +------+------+------+------+ | k1 | k2 | k3 | k4 | +------+------+------+------+ -| 1 | 100 | 1 | 1.1 | -| 2 | 200 | 2 | 1.2 | -| 3 | 300 | 3 | 1.3 | +| 1 | 100 | 1 | 1.1 | +| 2 | 200 | 2 | 1.2 | +| 3 | 300 | 3 | 1.3 | | 4 | NULL | 4 | 1.4 | +------+------+------+------+ ``` -#### 源文件中的 NULL 值处理 - -假设表中有 k1,k2,k3,k4 4 列。在对地区 id 转换的同时,我们也希望对于源数据中 k1 列的 null 值转换成 0 导入: - +### 源文件中的 NULL 值处理 +假设有以下源数据(表头列名仅为方便表述,实际并无表头): +```plain +列1,列2,列3,列4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` +表中有 k1,k2,k3,k4 4 列。在对地区 id 转换的同时,对于源数据中 k1 列的 null 值转换成 0 导入: +``` 列1 -> k1 列2 如果为null 则转换成0 -> k2 列3 -> k3 列4 -> k4 +``` -##### 创建示例表 +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -506,7 
+546,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u user:passwd \ @@ -539,7 +579,7 @@ COLUMNS(k1, tmp_k2, tmp_k3, k4, k2 = ifnull(tmp_k2, 0), k3 = CASE tmp_k3 WHEN 'b COLUMNS TERMINATED BY "," FROM KAFKA (...); ``` -##### 查询结果 +#### 查询结果 ``` mysql> select * from example_table; +------+------+------+------+ @@ -568,7 +608,7 @@ mysql> select * from example_table; ### 示例 假设有以下源数据(表头列名仅为方便表述,实际并无表头): -```plain text +```plain 列1,列2,列3,列4 1,100,beijing,1.1 2,200,shanghai,1.2 @@ -634,19 +674,18 @@ mysql> select * from example_table; 后置过滤在数据转换后执行,可以根据转换后的结果进行过滤。 -### 示例 +### 在列映射和转换缺省的情况下,直接过滤 假设有以下源数据(表头列名仅为方便表述,实际并无表头): -```plain text +```plain 列1,列2,列3,列4 1,100,beijing,1.1 2,200,shanghai,1.2 3,300,guangzhou,1.3 4,\N,chongqing,1.4 ``` -#### 在列映射和转换缺省的情况下,直接过滤 -假设表中有 k1,k2,k3,k4 4 列,在缺省列映射和转换的情况下,直接定义过滤条件。如我们希望只导入源文件中第 4 列为大于 1.2 的数据行。 +表中有 k1,k2,k3,k4 4 列,在缺省列映射和转换的情况下,只导入源文件中第 4 列为大于 1.2 的数据行。 -##### 创建示例表 +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -660,7 +699,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u user:passwd \ @@ -693,7 +732,7 @@ WHERE k4 > 1.2; FROM KAFKA (...) 
``` -##### 查询结果 +#### 查询结果 ``` mysql> select * from example_table; +------+------+-----------+------+ @@ -704,10 +743,18 @@ mysql> select * from example_table; +------+------+-----------+------+ ``` -#### 对经过列转换的数据进行过滤 -假设表中有 k1,k2,k3,k4 4 列。在列转换示例中,我们将省份名称转换成了 id。这里我们想过滤掉 id 为 3 的数据。 +### 对经过列变换的数据进行过滤 +假设有以下源数据(表头列名仅为方便表述,实际并无表头): +```plain +列1,列2,列3,列4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` +表中有 k1,k2,k3,k4 4 列。在列变换示例中,我们将省份名称转换成了 id。这里我们希望过滤掉 id 为 3 的数据 -##### 创建示例表 +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -721,7 +768,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u user:passwd \ @@ -745,8 +792,8 @@ LOAD LABEL example_db.label1 ) WHERE k3 != 3 ) -WITH s3 (...); -``` +WITH s3 (...); +``` - Routine Load ```sql @@ -760,7 +807,7 @@ WHERE k3 != 3; FROM KAFKA (...) ``` -##### 查询结果 +#### 查询结果 ``` mysql> select * from example_table; +------+------+------+------+ @@ -772,10 +819,18 @@ mysql> select * from example_table; +------+------+------+------+ ``` -#### 多条件过滤 -假设表中有 k1,k2,k3,k4 4 列。我们想过滤掉 k1 列为 null 的数据,同时过滤掉 k4 列小于 1.2 的数据 +### 多条件过滤 +假设有以下源数据(表头列名仅为方便表述,实际并无表头): +```plain +列1,列2,列3,列4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` +表中有 k1,k2,k3,k4 4 列。过滤掉 k1 列为 null 的数据,同时过滤掉 k4 列小于 1.2 的数据 -##### 创建示例表 +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -789,7 +844,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u user:passwd \ @@ -822,7 +877,7 @@ WHERE k1 is not null and k4 > 1.2 FROM KAFKA (...); ``` -##### 查询结果 +#### 查询结果 ``` mysql> select * from example_table; +------+------+-----------+------+ diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/import/load-data-convert.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/import/load-data-convert.md index 
6f82f60d6e3..37051936905 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/import/load-data-convert.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/import/load-data-convert.md @@ -5,7 +5,7 @@ } --- -<!-- +<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information @@ -24,7 +24,7 @@ specific language governing permissions and limitations under the License. --> -Doris 在数据导入时提供了强大的数据转换能力,让您无需额外的 ETL 工具就能完成数据处理。主要支持以下四种转换方式: +Doris 在数据导入时提供了强大的数据转换能力,可以简化部分数据处理流程,减少对额外 ETL 工具的依赖。主要支持以下四种转换方式: - **列映射**:将源数据列映射到目标表的不同列。 @@ -34,40 +34,42 @@ Doris 在数据导入时提供了强大的数据转换能力,让您无需额 - **后置过滤**:在列映射和列变换后数据对最终结果进行过滤。 -通过这些内置的数据转换功能,可以显著简化数据处理流程,提高导入效率,并确保数据处理逻辑的一致性。 +通过这些内置的数据转换功能,可以提高导入效率,并确保数据处理逻辑的一致性。 ## 导入语法 -**Stream Load** +### Stream Load -在 HTTP header 中增加 `columns` 和 `where` 参数。 +通过在 HTTP header 中设置以下参数实现数据转换: -- `columns` 指定列映射和列变换。 +| 参数 | 说明 | +|------|------| +| `columns` | 指定列映射和列变换 | +| `where` | 指定后置过滤 | -- `where` 指定后置过滤。 - -- 不支持前置过滤。 +> **注意**: Stream Load 不支持前置过滤。 +示例: ```shell curl --location-trusted -u user:passwd \ --H "columns: k1, k2, tmp_k3, k3 = tmp_k3 + 1" \ --H "where: k1 > 1"\ --T data.csv \ -http://host:port/api/example_db/example_table/_stream_load + -H "columns: k1, k2, tmp_k3, k3 = tmp_k3 + 1" \ + -H "where: k1 > 1" \ + -T data.csv \ + http://<fe_ip>:<fe_http_port>/api/example_db/example_table/_stream_load ``` -**Broker Load** - -在 SQL 语句中定义数据变换,其中: +### Broker Load -- `(k1, k2, tmp_k3)` 指定列映射。 +在 SQL 语句中通过以下子句实现数据转换: -- `SET` 指定列变换。 - -- `PRECEDING FILTER` 指定前置过滤。 - -- `WHERE` 指定后置过滤。 +| 子句 | 说明 | +|------|------| +| `column list` | 指定列映射,格式为 `(k1, k2, tmp_k3)` | +| `SET` | 指定列变换 | +| `PRECEDING FILTER` | 指定前置过滤 | +| `WHERE` | 指定后置过滤 | +示例: ```sql LOAD LABEL test_db.label1 ( @@ -83,27 +85,28 @@ LOAD LABEL test_db.label1 WITH S3 (...); ``` -**Routine Load** - -在 SQL 
语句中定义数据变换,其中: - -- `COLUMNS` 指定列映射和列变换。 +### Routine Load -- `PRECEDING FILTER` 指定前置过滤。 +在 SQL 语句中通过以下子句实现数据转换: -- `WHERE` 指定后置过滤。 +| 子句 | 说明 | +|------|------| +| `COLUMNS` | 指定列映射和列变换 | +| `PRECEDING FILTER` | 指定前置过滤 | +| `WHERE` | 指定后置过滤 | +示例: ```sql CREATE ROUTINE LOAD test_db.label1 ON test_tbl -COLUMNS(k1, k2, tmp_k3, k3 = tmp_k3 + 1), -PRECEDING FILTER k1 = 1, -WHERE k1 > 1 -... + COLUMNS(k1, k2, tmp_k3, k3 = tmp_k3 + 1), + PRECEDING FILTER k1 = 1, + WHERE k1 > 1 + ... ``` -**Insert Into** +### Insert Into -Insert Into 可以直接在 `SELECT` 语句中完成数据变换,增加 `WHERE` 子句完成数据过滤。 +Insert Into 可以直接在 `SELECT` 语句中完成数据转换,使用 `WHERE` 子句实现数据过滤。 ## 列映射 @@ -111,24 +114,23 @@ Insert Into 可以直接在 `SELECT` 语句中完成数据变换,增加 `WHERE - 源数据与目标表的列顺序不一致 - 源数据与目标表的列数量不匹配 -### 示例 - +### 调整列顺序 假设有以下源数据(表头列名仅为方便表述,实际并无表头): -```plain text +```plain 列1,列2,列3,列4 1,100,beijing,1.1 2,200,shanghai,1.2 3,300,guangzhou,1.3 4,\N,chongqing,1.4 ``` -注:`\N` 在源文件中表示 NULL。 -#### 调整列顺序 -假设目标表有 k1, k2, k3, k4 四列,要实现如下映射: -- 列1 -> k1 -- 列2 -> k3 -- 列3 -> k2 -- 列4 -> k4 +目标表有 k1, k2, k3, k4 四列,要实现如下映射: +```plain +列1 -> k1 +列2 -> k3 +列3 -> k2 +列4 -> k4 +``` ##### 创建目标表 ```sql @@ -173,6 +175,7 @@ COLUMNS(k1, k3, k2, k4), COLUMNS TERMINATED BY "," FROM KAFKA (...); ``` + ##### 查询结果 ``` mysql> select * from example_table; @@ -185,13 +188,25 @@ mysql> select * from example_table; | 1 | beijing | 100 | 1.1 | +------+-----------+------+------+ ``` -#### 源文件列数量多于表列数 -假设目标表有 k1, k2, k3 三列,而源文件包含四列数据。我们只需要源文件的第1、第2、第4列,映射关系如下: -- 列1 -> k1 -- 列2 -> k2 -- 列4 -> k3 -##### 创建示例表 +### 源文件列数量多于表列数 +假设有以下源数据(表头列名仅为方便表述,实际并无表头): +```plain +列1,列2,列3,列4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` +目标表有 k1, k2, k3 三列,而源文件包含四列数据。我们只需要源文件的第1、第2、第4列,映射关系如下: +```plain +列1 -> k1 +列2 -> k2 +列4 -> k3 +``` +要跳过源文件中的某些列,只需在列映射时使用任意不存在于目标表的列名。这些列名可以自定义,不受限制,导入时会自动忽略这些列的数据。 + +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -203,7 +218,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 
+#### 导入数据 - Stream Load ```sql curl --location-trusted -u usr:passwd \ @@ -242,9 +257,9 @@ PROPERTIES FROM KAFKA (...); ``` -源文件的第三列被命名为 tmp_skip,tmp_skip 为一个自定义的、表中不存在的列名,在导入过程中被忽略,不会写入目标表。 +> 注意:示例中的 `tmp_skip` 可以替换为任意名称,只要这些名称不在目标表的列定义中即可。 -##### 查询结果 +#### 查询结果 ``` mysql> select * from example_table; +------+------+------+ @@ -257,15 +272,25 @@ mysql> select * from example_table; +------+------+------+ ``` -#### 源文件列数量少于表列数 -假设目标表有 k1, k2, k3, k4, k5 五列,只使用源文件的前三列: -- 列1 -> k1 -- 列2 -> k3 -- 列3 -> k2 -- 列4 -> k4 -- k5 使用默认值 +### 源文件列数量少于表列数 +假设有以下源数据(表头列名仅为方便表述,实际并无表头): +```plain +列1,列2,列3,列4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` +目标表有 k1, k2, k3, k4, k5 五列,而源文件包含四列数据。我们只需要源文件的第1、第2、第3、第4列,映射关系如下: +```plain +列1 -> k1 +列2 -> k3 +列3 -> k2 +列4 -> k4 +k5 使用默认值 +``` -##### 创建示例表 +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -279,7 +304,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u user:passwd \ @@ -300,7 +325,7 @@ LOAD LABEL example_db.label_broker SET ( k1 = tmp_k1, k3 = tmp_k3, - k2 = tmp_k2 + k2 = tmp_k2, k4 = tmp_k4 ) ) @@ -317,8 +342,8 @@ FROM KAFKA (...); 说明: - 如果 k5 列有默认值,将使用默认值填充 -- 如果是可空列(nullable),将填充 NULL 值 -- 如果是非空列且没有默认值,导入会失败 +- 如果 k5 列是可空列(nullable)但没有默认值,将填充 NULL 值 +- 如果 k5 列是非空列且没有默认值,导入会失败 ##### 查询结果 ``` @@ -335,28 +360,26 @@ mysql> select * from example_table; ## 列变换 -列转换功能允许用户对源文件中列值进行变换,支持使用绝大部分内置函数。列变换操作通常是和列映射一起定义的,即先对列进行映射,再进行转换。 +列变换功能允许用户对源文件中列值进行变换,支持使用绝大部分内置函数。列变换操作通常是和列映射一起定义的,即先对列进行映射,再进行变换。 -### 示例 +### 将源文件中的列值经变换后导入表中 假设有以下源数据(表头列名仅为方便表述,实际并无表头): -```plain text +```plain 列1,列2,列3,列4 1,100,beijing,1.1 2,200,shanghai,1.2 3,300,guangzhou,1.3 4,\N,chongqing,1.4 ``` - -#### 将源文件中的列值经转换后导入表中 - -假设表中有 k1,k2,k3,k4 4 列。我们希望的导入映射和转换关系如下: - +表中有 k1,k2,k3,k4 4 列,导入映射和变换关系如下: +```plain 列1 -> k1 列2 * 100 -> k3 列3 -> k2 列4 -> k4 +``` -##### 创建示例表 +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -370,7 +393,7 
@@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u user:passwd \ @@ -403,7 +426,7 @@ COLUMNS TERMINATED BY "," FROM KAFKA (...); ``` -##### 查询结果 +#### 查询结果 ``` mysql> select * from example_table; +------+-----------+-------+------+ @@ -416,15 +439,24 @@ mysql> select * from example_table; +------+-----------+-------+------+ ``` -#### 通过 case when 函数,有条件的进行列转换 -假设表中有 k1,k2,k3,k4 4 列。我们希望对于源数据中的 beijing, shanghai, guangzhou, chongqing 分别转换为对应的地区 id 后导入: - +### 通过 case when 函数,有条件的进行列变换 +假设有以下源数据(表头列名仅为方便表述,实际并无表头): +```plain +列1,列2,列3,列4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` +表中有 k1,k2,k3,k4 4 列。对于源数据中 beijing, shanghai, guangzhou, chongqing 分别转换为对应的地区 id 后导入: +```plain 列1 -> k1 列2 -> k2 列3 进行地区id转换后 -> k3 列4 -> k4 +``` -##### 创建示例表 +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -438,7 +470,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u user:passwd \ @@ -466,33 +498,41 @@ WITH s3 (...); - Routine Load ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table -COLUMNS(k1, k2, tmp_k3, k4, k3 = CASE tmp_k3 WHEN 'beijing' THEN 1 WHEN 'shanghai' THEN 2 WHEN 'guangzhou' THEN 3 WHEN 'chongqing' THEN NULL END), +COLUMNS(k1, k2, tmp_k3, k4, k3 = CASE tmp_k3 WHEN 'beijing' THEN 1 WHEN 'shanghai' THEN 2 WHEN 'guangzhou' THEN 3 WHEN 'chongqing' THEN 4 ELSE NULL END), COLUMNS TERMINATED BY "," FROM KAFKA (...); -``` -##### 查询结果 +``` +#### 查询结果 ``` mysql> select * from example_table; +------+------+------+------+ | k1 | k2 | k3 | k4 | +------+------+------+------+ -| 1 | 100 | 1 | 1.1 | -| 2 | 200 | 2 | 1.2 | -| 3 | 300 | 3 | 1.3 | +| 1 | 100 | 1 | 1.1 | +| 2 | 200 | 2 | 1.2 | +| 3 | 300 | 3 | 1.3 | | 4 | NULL | 4 | 1.4 | +------+------+------+------+ ``` -#### 源文件中的 NULL 值处理 - -假设表中有 k1,k2,k3,k4 4 列。在对地区 id 转换的同时,我们也希望对于源数据中 k1 列的 null 值转换成 0 导入: - 
+### 源文件中的 NULL 值处理 +假设有以下源数据(表头列名仅为方便表述,实际并无表头): +```plain +列1,列2,列3,列4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` +表中有 k1,k2,k3,k4 4 列。在对地区 id 转换的同时,对于源数据中 k1 列的 null 值转换成 0 导入: +``` 列1 -> k1 列2 如果为null 则转换成0 -> k2 列3 -> k3 列4 -> k4 +``` -##### 创建示例表 +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -506,7 +546,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u user:passwd \ @@ -539,7 +579,7 @@ COLUMNS(k1, tmp_k2, tmp_k3, k4, k2 = ifnull(tmp_k2, 0), k3 = CASE tmp_k3 WHEN 'b COLUMNS TERMINATED BY "," FROM KAFKA (...); ``` -##### 查询结果 +#### 查询结果 ``` mysql> select * from example_table; +------+------+------+------+ @@ -568,7 +608,7 @@ mysql> select * from example_table; ### 示例 假设有以下源数据(表头列名仅为方便表述,实际并无表头): -```plain text +```plain 列1,列2,列3,列4 1,100,beijing,1.1 2,200,shanghai,1.2 @@ -634,19 +674,18 @@ mysql> select * from example_table; 后置过滤在数据转换后执行,可以根据转换后的结果进行过滤。 -### 示例 +### 在列映射和转换缺省的情况下,直接过滤 假设有以下源数据(表头列名仅为方便表述,实际并无表头): -```plain text +```plain 列1,列2,列3,列4 1,100,beijing,1.1 2,200,shanghai,1.2 3,300,guangzhou,1.3 4,\N,chongqing,1.4 ``` -#### 在列映射和转换缺省的情况下,直接过滤 -假设表中有 k1,k2,k3,k4 4 列,在缺省列映射和转换的情况下,直接定义过滤条件。如我们希望只导入源文件中第 4 列为大于 1.2 的数据行。 +表中有 k1,k2,k3,k4 4 列,在缺省列映射和转换的情况下,只导入源文件中第 4 列为大于 1.2 的数据行。 -##### 创建示例表 +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -660,7 +699,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u user:passwd \ @@ -693,7 +732,7 @@ WHERE k4 > 1.2; FROM KAFKA (...) 
``` -##### 查询结果 +#### 查询结果 ``` mysql> select * from example_table; +------+------+-----------+------+ @@ -704,10 +743,18 @@ mysql> select * from example_table; +------+------+-----------+------+ ``` -#### 对经过列转换的数据进行过滤 -假设表中有 k1,k2,k3,k4 4 列。在列转换示例中,我们将省份名称转换成了 id。这里我们想过滤掉 id 为 3 的数据。 +### 对经过列变换的数据进行过滤 +假设有以下源数据(表头列名仅为方便表述,实际并无表头): +```plain +列1,列2,列3,列4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` +表中有 k1,k2,k3,k4 4 列。在列变换示例中,我们将省份名称转换成了 id。这里我们希望过滤掉 id 为 3 的数据 -##### 创建示例表 +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -721,7 +768,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u user:passwd \ @@ -745,8 +792,8 @@ LOAD LABEL example_db.label1 ) WHERE k3 != 3 ) -WITH s3 (...); -``` +WITH s3 (...); +``` - Routine Load ```sql @@ -760,7 +807,7 @@ WHERE k3 != 3; FROM KAFKA (...) ``` -##### 查询结果 +#### 查询结果 ``` mysql> select * from example_table; +------+------+------+------+ @@ -772,10 +819,18 @@ mysql> select * from example_table; +------+------+------+------+ ``` -#### 多条件过滤 -假设表中有 k1,k2,k3,k4 4 列。我们想过滤掉 k1 列为 null 的数据,同时过滤掉 k4 列小于 1.2 的数据 +### 多条件过滤 +假设有以下源数据(表头列名仅为方便表述,实际并无表头): +```plain +列1,列2,列3,列4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` +表中有 k1,k2,k3,k4 4 列。过滤掉 k1 列为 null 的数据,同时过滤掉 k4 列小于 1.2 的数据 -##### 创建示例表 +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -789,7 +844,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u user:passwd \ @@ -822,7 +877,7 @@ WHERE k1 is not null and k4 > 1.2 FROM KAFKA (...); ``` -##### 查询结果 +#### 查询结果 ``` mysql> select * from example_table; +------+------+-----------+------+ diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/import/load-data-convert.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/import/load-data-convert.md index 
6f82f60d6e3..37051936905 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/import/load-data-convert.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/import/load-data-convert.md @@ -5,7 +5,7 @@ } --- -<!-- +<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information @@ -24,7 +24,7 @@ specific language governing permissions and limitations under the License. --> -Doris 在数据导入时提供了强大的数据转换能力,让您无需额外的 ETL 工具就能完成数据处理。主要支持以下四种转换方式: +Doris 在数据导入时提供了强大的数据转换能力,可以简化部分数据处理流程,减少对额外 ETL 工具的依赖。主要支持以下四种转换方式: - **列映射**:将源数据列映射到目标表的不同列。 @@ -34,40 +34,42 @@ Doris 在数据导入时提供了强大的数据转换能力,让您无需额 - **后置过滤**:在列映射和列变换后数据对最终结果进行过滤。 -通过这些内置的数据转换功能,可以显著简化数据处理流程,提高导入效率,并确保数据处理逻辑的一致性。 +通过这些内置的数据转换功能,可以提高导入效率,并确保数据处理逻辑的一致性。 ## 导入语法 -**Stream Load** +### Stream Load -在 HTTP header 中增加 `columns` 和 `where` 参数。 +通过在 HTTP header 中设置以下参数实现数据转换: -- `columns` 指定列映射和列变换。 +| 参数 | 说明 | +|------|------| +| `columns` | 指定列映射和列变换 | +| `where` | 指定后置过滤 | -- `where` 指定后置过滤。 - -- 不支持前置过滤。 +> **注意**: Stream Load 不支持前置过滤。 +示例: ```shell curl --location-trusted -u user:passwd \ --H "columns: k1, k2, tmp_k3, k3 = tmp_k3 + 1" \ --H "where: k1 > 1"\ --T data.csv \ -http://host:port/api/example_db/example_table/_stream_load + -H "columns: k1, k2, tmp_k3, k3 = tmp_k3 + 1" \ + -H "where: k1 > 1" \ + -T data.csv \ + http://<fe_ip>:<fe_http_port>/api/example_db/example_table/_stream_load ``` -**Broker Load** - -在 SQL 语句中定义数据变换,其中: +### Broker Load -- `(k1, k2, tmp_k3)` 指定列映射。 +在 SQL 语句中通过以下子句实现数据转换: -- `SET` 指定列变换。 - -- `PRECEDING FILTER` 指定前置过滤。 - -- `WHERE` 指定后置过滤。 +| 子句 | 说明 | +|------|------| +| `column list` | 指定列映射,格式为 `(k1, k2, tmp_k3)` | +| `SET` | 指定列变换 | +| `PRECEDING FILTER` | 指定前置过滤 | +| `WHERE` | 指定后置过滤 | +示例: ```sql LOAD LABEL test_db.label1 ( @@ -83,27 +85,28 @@ LOAD LABEL test_db.label1 WITH S3 (...); ``` -**Routine Load** - -在 SQL 
语句中定义数据变换,其中: - -- `COLUMNS` 指定列映射和列变换。 +### Routine Load -- `PRECEDING FILTER` 指定前置过滤。 +在 SQL 语句中通过以下子句实现数据转换: -- `WHERE` 指定后置过滤。 +| 子句 | 说明 | +|------|------| +| `COLUMNS` | 指定列映射和列变换 | +| `PRECEDING FILTER` | 指定前置过滤 | +| `WHERE` | 指定后置过滤 | +示例: ```sql CREATE ROUTINE LOAD test_db.label1 ON test_tbl -COLUMNS(k1, k2, tmp_k3, k3 = tmp_k3 + 1), -PRECEDING FILTER k1 = 1, -WHERE k1 > 1 -... + COLUMNS(k1, k2, tmp_k3, k3 = tmp_k3 + 1), + PRECEDING FILTER k1 = 1, + WHERE k1 > 1 + ... ``` -**Insert Into** +### Insert Into -Insert Into 可以直接在 `SELECT` 语句中完成数据变换,增加 `WHERE` 子句完成数据过滤。 +Insert Into 可以直接在 `SELECT` 语句中完成数据转换,使用 `WHERE` 子句实现数据过滤。 ## 列映射 @@ -111,24 +114,23 @@ Insert Into 可以直接在 `SELECT` 语句中完成数据变换,增加 `WHERE - 源数据与目标表的列顺序不一致 - 源数据与目标表的列数量不匹配 -### 示例 - +### 调整列顺序 假设有以下源数据(表头列名仅为方便表述,实际并无表头): -```plain text +```plain 列1,列2,列3,列4 1,100,beijing,1.1 2,200,shanghai,1.2 3,300,guangzhou,1.3 4,\N,chongqing,1.4 ``` -注:`\N` 在源文件中表示 NULL。 -#### 调整列顺序 -假设目标表有 k1, k2, k3, k4 四列,要实现如下映射: -- 列1 -> k1 -- 列2 -> k3 -- 列3 -> k2 -- 列4 -> k4 +目标表有 k1, k2, k3, k4 四列,要实现如下映射: +```plain +列1 -> k1 +列2 -> k3 +列3 -> k2 +列4 -> k4 +``` ##### 创建目标表 ```sql @@ -173,6 +175,7 @@ COLUMNS(k1, k3, k2, k4), COLUMNS TERMINATED BY "," FROM KAFKA (...); ``` + ##### 查询结果 ``` mysql> select * from example_table; @@ -185,13 +188,25 @@ mysql> select * from example_table; | 1 | beijing | 100 | 1.1 | +------+-----------+------+------+ ``` -#### 源文件列数量多于表列数 -假设目标表有 k1, k2, k3 三列,而源文件包含四列数据。我们只需要源文件的第1、第2、第4列,映射关系如下: -- 列1 -> k1 -- 列2 -> k2 -- 列4 -> k3 -##### 创建示例表 +### 源文件列数量多于表列数 +假设有以下源数据(表头列名仅为方便表述,实际并无表头): +```plain +列1,列2,列3,列4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` +目标表有 k1, k2, k3 三列,而源文件包含四列数据。我们只需要源文件的第1、第2、第4列,映射关系如下: +```plain +列1 -> k1 +列2 -> k2 +列4 -> k3 +``` +要跳过源文件中的某些列,只需在列映射时使用任意不存在于目标表的列名。这些列名可以自定义,不受限制,导入时会自动忽略这些列的数据。 + +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -203,7 +218,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 
+#### 导入数据 - Stream Load ```sql curl --location-trusted -u usr:passwd \ @@ -242,9 +257,9 @@ PROPERTIES FROM KAFKA (...); ``` -源文件的第三列被命名为 tmp_skip,tmp_skip 为一个自定义的、表中不存在的列名,在导入过程中被忽略,不会写入目标表。 +> 注意:示例中的 `tmp_skip` 可以替换为任意名称,只要这些名称不在目标表的列定义中即可。 -##### 查询结果 +#### 查询结果 ``` mysql> select * from example_table; +------+------+------+ @@ -257,15 +272,25 @@ mysql> select * from example_table; +------+------+------+ ``` -#### 源文件列数量少于表列数 -假设目标表有 k1, k2, k3, k4, k5 五列,只使用源文件的前三列: -- 列1 -> k1 -- 列2 -> k3 -- 列3 -> k2 -- 列4 -> k4 -- k5 使用默认值 +### 源文件列数量少于表列数 +假设有以下源数据(表头列名仅为方便表述,实际并无表头): +```plain +列1,列2,列3,列4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` +目标表有 k1, k2, k3, k4, k5 五列,而源文件包含四列数据。我们只需要源文件的第1、第2、第3、第4列,映射关系如下: +```plain +列1 -> k1 +列2 -> k3 +列3 -> k2 +列4 -> k4 +k5 使用默认值 +``` -##### 创建示例表 +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -279,7 +304,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u user:passwd \ @@ -300,7 +325,7 @@ LOAD LABEL example_db.label_broker SET ( k1 = tmp_k1, k3 = tmp_k3, - k2 = tmp_k2 + k2 = tmp_k2, k4 = tmp_k4 ) ) @@ -317,8 +342,8 @@ FROM KAFKA (...); 说明: - 如果 k5 列有默认值,将使用默认值填充 -- 如果是可空列(nullable),将填充 NULL 值 -- 如果是非空列且没有默认值,导入会失败 +- 如果 k5 列是可空列(nullable)但没有默认值,将填充 NULL 值 +- 如果 k5 列是非空列且没有默认值,导入会失败 ##### 查询结果 ``` @@ -335,28 +360,26 @@ mysql> select * from example_table; ## 列变换 -列转换功能允许用户对源文件中列值进行变换,支持使用绝大部分内置函数。列变换操作通常是和列映射一起定义的,即先对列进行映射,再进行转换。 +列变换功能允许用户对源文件中列值进行变换,支持使用绝大部分内置函数。列变换操作通常是和列映射一起定义的,即先对列进行映射,再进行变换。 -### 示例 +### 将源文件中的列值经变换后导入表中 假设有以下源数据(表头列名仅为方便表述,实际并无表头): -```plain text +```plain 列1,列2,列3,列4 1,100,beijing,1.1 2,200,shanghai,1.2 3,300,guangzhou,1.3 4,\N,chongqing,1.4 ``` - -#### 将源文件中的列值经转换后导入表中 - -假设表中有 k1,k2,k3,k4 4 列。我们希望的导入映射和转换关系如下: - +表中有 k1,k2,k3,k4 4 列,导入映射和变换关系如下: +```plain 列1 -> k1 列2 * 100 -> k3 列3 -> k2 列4 -> k4 +``` -##### 创建示例表 +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -370,7 +393,7 
@@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u user:passwd \ @@ -403,7 +426,7 @@ COLUMNS TERMINATED BY "," FROM KAFKA (...); ``` -##### 查询结果 +#### 查询结果 ``` mysql> select * from example_table; +------+-----------+-------+------+ @@ -416,15 +439,24 @@ mysql> select * from example_table; +------+-----------+-------+------+ ``` -#### 通过 case when 函数,有条件的进行列转换 -假设表中有 k1,k2,k3,k4 4 列。我们希望对于源数据中的 beijing, shanghai, guangzhou, chongqing 分别转换为对应的地区 id 后导入: - +### 通过 case when 函数,有条件的进行列变换 +假设有以下源数据(表头列名仅为方便表述,实际并无表头): +```plain +列1,列2,列3,列4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` +表中有 k1,k2,k3,k4 4 列。对于源数据中 beijing, shanghai, guangzhou, chongqing 分别转换为对应的地区 id 后导入: +```plain 列1 -> k1 列2 -> k2 列3 进行地区id转换后 -> k3 列4 -> k4 +``` -##### 创建示例表 +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -438,7 +470,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u user:passwd \ @@ -466,33 +498,41 @@ WITH s3 (...); - Routine Load ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table -COLUMNS(k1, k2, tmp_k3, k4, k3 = CASE tmp_k3 WHEN 'beijing' THEN 1 WHEN 'shanghai' THEN 2 WHEN 'guangzhou' THEN 3 WHEN 'chongqing' THEN NULL END), +COLUMNS(k1, k2, tmp_k3, k4, k3 = CASE tmp_k3 WHEN 'beijing' THEN 1 WHEN 'shanghai' THEN 2 WHEN 'guangzhou' THEN 3 WHEN 'chongqing' THEN 4 ELSE NULL END), COLUMNS TERMINATED BY "," FROM KAFKA (...); -``` -##### 查询结果 +``` +#### 查询结果 ``` mysql> select * from example_table; +------+------+------+------+ | k1 | k2 | k3 | k4 | +------+------+------+------+ -| 1 | 100 | 1 | 1.1 | -| 2 | 200 | 2 | 1.2 | -| 3 | 300 | 3 | 1.3 | +| 1 | 100 | 1 | 1.1 | +| 2 | 200 | 2 | 1.2 | +| 3 | 300 | 3 | 1.3 | | 4 | NULL | 4 | 1.4 | +------+------+------+------+ ``` -#### 源文件中的 NULL 值处理 - -假设表中有 k1,k2,k3,k4 4 列。在对地区 id 转换的同时,我们也希望对于源数据中 k1 列的 null 值转换成 0 导入: - 
+### 源文件中的 NULL 值处理 +假设有以下源数据(表头列名仅为方便表述,实际并无表头): +```plain +列1,列2,列3,列4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` +表中有 k1,k2,k3,k4 4 列。在对地区 id 转换的同时,对于源数据中 k1 列的 null 值转换成 0 导入: +``` 列1 -> k1 列2 如果为null 则转换成0 -> k2 列3 -> k3 列4 -> k4 +``` -##### 创建示例表 +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -506,7 +546,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u user:passwd \ @@ -539,7 +579,7 @@ COLUMNS(k1, tmp_k2, tmp_k3, k4, k2 = ifnull(tmp_k2, 0), k3 = CASE tmp_k3 WHEN 'b COLUMNS TERMINATED BY "," FROM KAFKA (...); ``` -##### 查询结果 +#### 查询结果 ``` mysql> select * from example_table; +------+------+------+------+ @@ -568,7 +608,7 @@ mysql> select * from example_table; ### 示例 假设有以下源数据(表头列名仅为方便表述,实际并无表头): -```plain text +```plain 列1,列2,列3,列4 1,100,beijing,1.1 2,200,shanghai,1.2 @@ -634,19 +674,18 @@ mysql> select * from example_table; 后置过滤在数据转换后执行,可以根据转换后的结果进行过滤。 -### 示例 +### 在列映射和转换缺省的情况下,直接过滤 假设有以下源数据(表头列名仅为方便表述,实际并无表头): -```plain text +```plain 列1,列2,列3,列4 1,100,beijing,1.1 2,200,shanghai,1.2 3,300,guangzhou,1.3 4,\N,chongqing,1.4 ``` -#### 在列映射和转换缺省的情况下,直接过滤 -假设表中有 k1,k2,k3,k4 4 列,在缺省列映射和转换的情况下,直接定义过滤条件。如我们希望只导入源文件中第 4 列为大于 1.2 的数据行。 +表中有 k1,k2,k3,k4 4 列,在缺省列映射和转换的情况下,只导入源文件中第 4 列为大于 1.2 的数据行。 -##### 创建示例表 +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -660,7 +699,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u user:passwd \ @@ -693,7 +732,7 @@ WHERE k4 > 1.2; FROM KAFKA (...) 
``` -##### 查询结果 +#### 查询结果 ``` mysql> select * from example_table; +------+------+-----------+------+ @@ -704,10 +743,18 @@ mysql> select * from example_table; +------+------+-----------+------+ ``` -#### 对经过列转换的数据进行过滤 -假设表中有 k1,k2,k3,k4 4 列。在列转换示例中,我们将省份名称转换成了 id。这里我们想过滤掉 id 为 3 的数据。 +### 对经过列变换的数据进行过滤 +假设有以下源数据(表头列名仅为方便表述,实际并无表头): +```plain +列1,列2,列3,列4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` +表中有 k1,k2,k3,k4 4 列。在列变换示例中,我们将省份名称转换成了 id。这里我们希望过滤掉 id 为 3 的数据 -##### 创建示例表 +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -721,7 +768,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u user:passwd \ @@ -745,8 +792,8 @@ LOAD LABEL example_db.label1 ) WHERE k3 != 3 ) -WITH s3 (...); -``` +WITH s3 (...); +``` - Routine Load ```sql @@ -760,7 +807,7 @@ WHERE k3 != 3; FROM KAFKA (...) ``` -##### 查询结果 +#### 查询结果 ``` mysql> select * from example_table; +------+------+------+------+ @@ -772,10 +819,18 @@ mysql> select * from example_table; +------+------+------+------+ ``` -#### 多条件过滤 -假设表中有 k1,k2,k3,k4 4 列。我们想过滤掉 k1 列为 null 的数据,同时过滤掉 k4 列小于 1.2 的数据 +### 多条件过滤 +假设有以下源数据(表头列名仅为方便表述,实际并无表头): +```plain +列1,列2,列3,列4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` +表中有 k1,k2,k3,k4 4 列。过滤掉 k1 列为 null 的数据,同时过滤掉 k4 列小于 1.2 的数据 -##### 创建示例表 +#### 创建示例表 ```sql CREATE TABLE example_table ( @@ -789,7 +844,7 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### 导入数据 +#### 导入数据 - Stream Load ```sql curl --location-trusted -u user:passwd \ @@ -822,7 +877,7 @@ WHERE k1 is not null and k4 > 1.2 FROM KAFKA (...); ``` -##### 查询结果 +#### 查询结果 ``` mysql> select * from example_table; +------+------+-----------+------+ diff --git a/versioned_docs/version-2.1/data-operate/import/load-data-convert.md b/versioned_docs/version-2.1/data-operate/import/load-data-convert.md index dee8d02b3e9..d7b29d696ab 100644 --- 
a/versioned_docs/version-2.1/data-operate/import/load-data-convert.md +++ b/versioned_docs/version-2.1/data-operate/import/load-data-convert.md @@ -24,7 +24,7 @@ specific language governing permissions and limitations under the License. --> -Doris provides powerful data transformation capabilities during data loading, allowing you to process data without additional ETL tools. It mainly supports four types of transformations: +Doris provides powerful data transformation capabilities during data loading, which can simplify data processing workflows and reduce dependency on additional ETL tools. It mainly supports four types of transformations: - **Column Mapping**: Map source data columns to different columns in the target table. @@ -34,40 +34,42 @@ Doris provides powerful data transformation capabilities during data loading, al - **Post-filtering**: Filter the final results after column mapping and transformation. -Through these built-in data transformation functions, you can significantly simplify the data processing workflow, improve loading efficiency, and ensure consistency in data processing logic. +Through these built-in data transformation functions, you can improve loading efficiency and ensure consistency in data processing logic. ## Load Syntax -**Stream Load** +### Stream Load -Add `columns` and `where` parameters in the HTTP header. +Configure data transformation by setting the following parameters in HTTP headers: -- `columns` specifies column mapping and transformation. +| Parameter | Description | +|-----------|-------------| +| `columns` | Specify column mapping and transformation | +| `where` | Specify post-filtering | -- `where` specifies post-filtering. - -- Pre-filtering is not supported. +> **Note**: Stream Load does not support pre-filtering. 
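The `columns` and `where` Stream Load headers in the hunk above are easier to reason about with a concrete trace. The following is an illustrative aside only — plain Python, not Doris code or internals — that emulates what the example's two headers (`columns: k1, k2, tmp_k3, k3 = tmp_k3 + 1` and `where: k1 > 1`) mean for a few CSV rows:

```python
# Illustration of the Stream Load example's semantics (NOT Doris internals):
# source columns are named k1, k2, tmp_k3; a derived column k3 = tmp_k3 + 1 is
# computed; rows failing the post-filter k1 > 1 are discarded.
import csv
import io

def stream_load_transform(text):
    rows = []
    for k1, k2, tmp_k3 in csv.reader(io.StringIO(text)):
        k1, k2, tmp_k3 = int(k1), int(k2), int(tmp_k3)
        k3 = tmp_k3 + 1          # column transformation: k3 = tmp_k3 + 1
        if k1 > 1:               # post-filter: where k1 > 1
            rows.append((k1, k2, k3))
    return rows

print(stream_load_transform("1,100,10\n2,200,20\n3,300,30\n"))
# [(2, 200, 21), (3, 300, 31)]
```

Note that, as the diff's table states, the `where` header is a post-filter: it is evaluated after the `columns` mapping and transformation, so it may reference derived columns such as `k3`.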
+Example: ```shell curl --location-trusted -u user:passwd \ --H "columns: k1, k2, tmp_k3, k3 = tmp_k3 + 1" \ --H "where: k1 > 1"\ --T data.csv \ -http://host:port/api/example_db/example_table/_stream_load + -H "columns: k1, k2, tmp_k3, k3 = tmp_k3 + 1" \ + -H "where: k1 > 1" \ + -T data.csv \ + http://<fe_ip>:<fe_http_port>/api/example_db/example_table/_stream_load ``` -**Broker Load** - -Define data transformations in SQL statements, where: - -- `(k1, k2, tmp_k3)` specifies column mapping. - -- `SET` specifies column transformation. +### Broker Load -- `PRECEDING FILTER` specifies pre-filtering. +Implement data transformation in SQL statements using the following clauses: -- `WHERE` specifies post-filtering. +| Clause | Description | +|--------|-------------| +| `column list` | Specify column mapping, format: `(k1, k2, tmp_k3)` | +| `SET` | Specify column transformation | +| `PRECEDING FILTER` | Specify pre-filtering | +| `WHERE` | Specify post-filtering | +Example: ```sql LOAD LABEL test_db.label1 ( @@ -83,57 +85,55 @@ LOAD LABEL test_db.label1 WITH S3 (...); ``` -**Routine Load** +### Routine Load -Define data transformations in SQL statements, where: +Implement data transformation in SQL statements using the following clauses: -- `COLUMNS` specifies column mapping and transformation. - -- `PRECEDING FILTER` specifies pre-filtering. - -- `WHERE` specifies post-filtering. +| Clause | Description | +|--------|-------------| +| `COLUMNS` | Specify column mapping and transformation | +| `PRECEDING FILTER` | Specify pre-filtering | +| `WHERE` | Specify post-filtering | +Example: ```sql CREATE ROUTINE LOAD test_db.label1 ON test_tbl -COLUMNS(k1, k2, tmp_k3, k3 = tmp_k3 + 1), -PRECEDING FILTER k1 = 1, -WHERE k1 > 1 -... + COLUMNS(k1, k2, tmp_k3, k3 = tmp_k3 + 1), + PRECEDING FILTER k1 = 1, + WHERE k1 > 1 + ... ``` -## Column Mapping +### Insert Into -Column mapping is used to define the correspondence between source data columns and target table columns. 
It can handle the following scenarios: +Insert Into can directly perform data transformation in the `SELECT` statement, using the `WHERE` clause for data filtering. -- The order of source data columns and target table columns is inconsistent. - -- The number of source data columns and target table columns is inconsistent. +## Column Mapping -### Example +Column mapping is used to define the correspondence between source data columns and target table columns. It can handle the following scenarios: +- The order of source data columns and target table columns is inconsistent +- The number of source data columns and target table columns is inconsistent -Suppose we have the following source data (column names are only for illustration purposes and do not actually exist): +### Adjusting Column Order -```plain text -Column1,Column2,Column3,Column4 +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 1,100,beijing,1.1 2,200,shanghai,1.2 3,300,guangzhou,1.3 4,\N,chongqing,1.4 ``` -Note: `\N` in the source file represents NULL. - -#### Adjusting Column Order - -Suppose the target table has four columns: k1, k2, k3, and k4. We want to map the columns as follows: - -- Column1 -> k1 -- Column2 -> k3 -- Column3 -> k2 -- Column4 -> k4 - -##### Creating the Target Table +The target table has four columns: k1, k2, k3, and k4. 
We want to map the columns as follows: +```plain +column1 -> k1 +column2 -> k3 +column3 -> k2 +column4 -> k4 +``` +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -146,10 +146,8 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ @@ -160,7 +158,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label_broker ( @@ -173,7 +170,6 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, k3, k2, k4), @@ -181,8 +177,7 @@ COLUMNS TERMINATED BY "," FROM KAFKA (...); ``` -##### Query Results - +#### Query Results ``` mysql> select * from example_table; +------+-----------+------+------+ @@ -195,16 +190,27 @@ mysql> select * from example_table; +------+-----------+------+------+ ``` -#### Source File Columns Exceed Table Columns +### Source File Columns Exceed Table Columns -Suppose the target table has three columns: k1, k2, and k3. We only need the first three columns of the source file. The mapping relationship is as follows: +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` -- Column1 -> k1 -- Column2 -> k2 -- Column4 -> k3 +The target table has three columns: k1, k2, and k3. We only need the first, second, and fourth columns from the source file, with the following mapping relationship: +```plain +column1 -> k1 +column2 -> k2 +column4 -> k3 +``` -##### Creating the Target Table +To skip certain columns in the source file, you can use any column name that does not exist in the target table during column mapping. These column names can be customized and are not restricted. 
The data in these columns will be automatically ignored during loading. +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -216,12 +222,10 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ -H "columns: k1,k2,tmp_skip,k3" \ -T data.csv \ @@ -229,7 +233,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label_broker ( @@ -247,7 +250,6 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, k2, tmp_skip, k3), @@ -259,10 +261,9 @@ PROPERTIES FROM KAFKA (...); ``` -The third column of the source file is named tmp_skip, which is a custom column name that does not exist in the table. It is ignored during loading and not written to the target table. - -##### Query Results +> Note: The `tmp_skip` in the example can be replaced with any name, as long as it is not in the column definition of the target table. +#### Query Results ``` mysql> select * from example_table; +------+------+------+ @@ -275,18 +276,27 @@ mysql> select * from example_table; +------+------+------+ ``` -#### Source File Columns Less Than Table Columns - -Suppose the target table has five columns: k1, k2, k3, k4, and k5. We only use the first three columns of the source file: +### Source File Columns Less Than Table Columns -- Column1 -> k1 -- Column2 -> k3 -- Column3 -> k2 -- Column4 -> k4 -- k5 uses the default value +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` -##### Creating the Target Table +The target table has five columns: k1, k2, k3, k4, and k5.
We only need the first, second, third, and fourth columns from the source file, with the following mapping relationship: +```plain +column1 -> k1 +column2 -> k3 +column3 -> k2 +column4 -> k4 +k5 uses the default value +``` +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -300,10 +310,8 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ @@ -313,7 +321,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label_broker ( @@ -324,7 +331,7 @@ LOAD LABEL example_db.label_broker SET ( k1 = tmp_k1, k3 = tmp_k3, - k2 = tmp_k2 + k2 = tmp_k2, k4 = tmp_k4 ) ) @@ -332,7 +339,6 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, k3, k2, k4), @@ -341,13 +347,11 @@ FROM KAFKA (...); ``` Note: +- If k5 has a default value, it will be filled with the default value +- If k5 is a nullable column but has no default value, it will be filled with NULL +- If k5 is a non-nullable column and has no default value, the load will fail -- If k5 has a default value, it will be used. -- If it is a nullable column, it will be filled with NULL. -- If it is a non-nullable column without a default value, the load will fail. - -##### Query Results - +#### Query Results ``` mysql> select * from example_table; +------+-----------+------+------+------+ @@ -358,35 +362,32 @@ mysql> select * from example_table; | 3 | guangzhou | 300 | 1.3 | 2 | | 4 | chongqing | NULL | 1.4 | 2 | +------+-----------+------+------+------+ -``` +``` ## Column Transformation -Column transformation allows you to transform source data column values. It supports most built-in functions. Column transformation is usually defined together with column mapping. 
- -### Example +Column transformation allows users to transform column values in the source file, supporting the use of most built-in functions. Column transformation is usually defined together with column mapping, i.e., first map the columns and then transform them. -Suppose we have the following source data (column names are only for illustration purposes and do not actually exist): +### Transforming Source File Column Values Before Loading -```plain text -Column1,Column2,Column3,Column4 +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 1,100,beijing,1.1 2,200,shanghai,1.2 3,300,guangzhou,1.3 4,\N,chongqing,1.4 ``` -#### Transforming Source Data Column Values - -Suppose the target table has four columns: k1, k2, k3, and k4. We want to transform the column values as follows: - -- Column1 -> k1 -- Column2 * 100 -> k3 -- Column3 -> k2 -- Column4 -> k4 - -##### Creating the Target Table +The target table has four columns: k1, k2, k3, and k4. 
We want to transform the column values as follows: +```plain +column1 -> k1 +column2 * 100 -> k3 +column3 -> k2 +column4 -> k4 +``` +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -400,10 +401,8 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ @@ -413,7 +412,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label1 ( @@ -429,7 +427,6 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, tmp_k3, k2, k4, k3 = tmp_k3 * 100), @@ -437,12 +434,11 @@ COLUMNS TERMINATED BY "," FROM KAFKA (...); ``` -##### Query Results - +#### Query Results ``` mysql> select * from example_table; +------+------+-------+------+ -| k1 | k2 | k3 | k4 | +| k1 | k2 | k3 | k4 | +------+------+-------+------+ | 1 | beijing | 10000 | 1.1 | | 2 | shanghai | 20000 | 1.2 | @@ -451,17 +447,26 @@ mysql> select * from example_table; +------+------+-------+------+ ``` -#### Using Case When Function for Conditional Transformation - -Suppose the target table has four columns: k1, k2, k3, and k4. We want to transform the column values as follows: +### Using Case When Function for Conditional Column Transformation -- Column1 -> k1 -- Column2 -> k2 -- Column3 -> k3 (transformed to area ID) -- Column4 -> k4 +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` -##### Creating the Target Table +The target table has four columns: k1, k2, k3, and k4. 
We want to transform the column values as follows: +```plain +column1 -> k1 +column2 -> k2 +column3 -> k3 (transformed to area id) +column4 -> k4 +``` +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -475,10 +480,8 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ @@ -488,7 +491,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label1 ( @@ -504,16 +506,13 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table -COLUMNS(k1, k2, tmp_k3, k4, k3 = CASE tmp_k3 WHEN 'beijing' THEN 1 WHEN 'shanghai' THEN 2 WHEN 'guangzhou' THEN 3 WHEN 'chongqing' THEN NULL END), +COLUMNS(k1, k2, tmp_k3, k4, k3 = CASE tmp_k3 WHEN 'beijing' THEN 1 WHEN 'shanghai' THEN 2 WHEN 'guangzhou' THEN 3 WHEN 'chongqing' THEN 4 ELSE NULL END), COLUMNS TERMINATED BY "," FROM KAFKA (...); -``` - -##### Query Results - +``` +#### Query Results ``` mysql> select * from example_table; +------+------+------+------+ @@ -526,17 +525,26 @@ mysql> select * from example_table; +------+------+------+------+ ``` -#### Handling NULL Values in Source Files +### Handling NULL Values in Source Files -Suppose the target table has four columns: k1, k2, k3, and k4. We want to transform the column values as follows: - -- Column1 -> k1 -- Column2 (if NULL, transform to 0) -> k2 -- Column3 -> k3 -- Column4 -> k4 +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` -##### Creating the Target Table +The target table has four columns: k1, k2, k3, and k4. 
We want to transform the column values as follows: +```plain +column1 -> k1 +column2 -> k2 (transform NULL to 0) +column3 -> k3 +column4 -> k4 +``` +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -550,10 +558,8 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ @@ -563,7 +569,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label1 ( @@ -580,16 +585,13 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, tmp_k2, tmp_k3, k4, k2 = ifnull(tmp_k2, 0), k3 = CASE tmp_k3 WHEN 'beijing' THEN 1 WHEN 'shanghai' THEN 2 WHEN 'guangzhou' THEN 3 WHEN 'chongqing' THEN 4 ELSE NULL END), COLUMNS TERMINATED BY "," FROM KAFKA (...); -``` - -##### Query Results - +``` +#### Query Results ``` mysql> select * from example_table; +------+------+------+------+ @@ -605,14 +607,23 @@ mysql> select * from example_table; ## Pre-filtering -Pre-filtering is used to filter out unwanted raw data before column mapping and transformation. It is only supported in Broker Load and Routine Load. +Pre-filtering is the process of filtering out unwanted raw data before column mapping and transformation. This feature is only supported in Broker Load and Routine Load. -### Example +Pre-filtering has the following application scenarios: -Suppose we have the following source data (column names are only for illustration purposes and do not actually exist): +- Filtering before transformation +
+Scenarios where filtering is needed before column mapping and transformation, allowing for the removal of unwanted data before processing.
+ +- Filtering columns that do not exist in the table, only as filtering indicators + +For example, source data contains multiple tables' data (or multiple tables' data is written to the same Kafka message queue). Each row of data has a column indicating which table the data belongs to. Users can use pre-filtering conditions to filter out the corresponding table data for loading. + ### Example -Suppose we have the following source data (column names are only for illustration purposes and do not actually exist): -```plain text -Column1,Column2,Column3,Column4 +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 1,100,beijing,1.1 2,200,shanghai,1.2 3,300,guangzhou,1.3 @@ -620,13 +631,11 @@ Column1,Column2,Column3,Column4 ``` Pre-filtering condition: - ``` -Column1 > 1 +column1 > 1, i.e., only load data where column1 > 1, and filter out other data. ``` #### Creating the Target Table - ```sql CREATE TABLE example_table ( @@ -639,11 +648,8 @@ ENGINE = OLAP DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` - #### Loading Data - - Broker Load - ```sql LOAD LABEL example_db.label1 ( @@ -657,17 +663,15 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, k2, k3, k4), COLUMNS TERMINATED BY "," PRECEDING FILTER k1 > 1 FROM KAFKA (...); ``` #### Query Results - ``` mysql> select * from example_table; +------+------+-----------+------+ @@ -682,26 +686,23 @@ mysql> select * from example_table; ## Post-filtering -Post-filtering is used to filter the final results after column mapping and transformation. +Post-filtering is the process of filtering the final results after column mapping and transformation.
-### Example -Suppose we have the following source data (column names are only for illustration purposes and do not actually exist): +### Filtering Without Column Mapping and Transformation -```plain text -Column1,Column2,Column3,Column4 +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 1,100,beijing,1.1 2,200,shanghai,1.2 3,300,guangzhou,1.3 4,\N,chongqing,1.4 ``` -#### Filtering Without Column Mapping and Transformation - -Suppose the target table has four columns: k1, k2, k3, and k4. We want to filter out rows where Column4 is greater than 1.2. - -##### Creating the Target Table +The target table has four columns: k1, k2, k3, and k4. We want to load only the data where the fourth column is greater than 1.2. +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -715,10 +716,8 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ @@ -729,7 +728,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label1 ( @@ -743,17 +741,15 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, k2, k3, k4), COLUMNS TERMINATED BY "," WHERE k4 > 1.2 FROM KAFKA (...); ``` -##### Query Results - +#### Query Results ``` mysql> select * from example_table; +------+------+-----------+------+ @@ -764,12 +760,28 @@ mysql> select * from example_table; +------+------+-----------+------+ ``` -#### Filtering After Column Transformation +### Filtering Transformed Data -Suppose the target table has four columns: k1, k2, k3, and k4. We want to transform Column3 to area ID and filter out rows where the area ID is 3.
+Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` + +The target table has four columns: k1, k2, k3, and k4. We want to transform the column values as follows: +```plain +column1 -> k1 +column2 -> k2 +column3 -> k3 (transformed to area id) +column4 -> k4 +``` -##### Creating the Target Table +We want to filter out the data where the transformed k3 value is 3. +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -783,10 +795,8 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ @@ -797,7 +807,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label1 ( @@ -810,11 +819,10 @@ LOAD LABEL example_db.label1 ) WHERE k3 != 3 ) -WITH s3 (...); +WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, k2, tmp_k3, k4), @@ -823,11 +831,10 @@ SET ( k3 = CASE tmp_k3 WHEN 'beijing' THEN 1 WHEN 'shanghai' THEN 2 WHEN 'guangzhou' THEN 3 WHEN 'chongqing' THEN 4 ELSE NULL END ) WHERE k3 != 3 FROM KAFKA (...); ``` -##### Query Results - +#### Query Results ``` mysql> select * from example_table; +------+------+------+------+ @@ -839,12 +846,20 @@ mysql> select * from example_table; +------+------+------+------+ ``` -#### Multiple Conditions +### Multiple Conditions Filtering -Suppose the target table has four columns: k1, k2, k3, and k4. We want to filter out rows where k1 is NULL and k4 is less than 1.2.
+Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` -##### Creating the Target Table +The target table has four columns: k1, k2, k3, and k4. We want to load only the data where k1 is not NULL and k4 is greater than 1.2. +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -858,10 +873,8 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ @@ -872,7 +885,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label1 ( @@ -886,7 +898,6 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, k2, k3, k4), @@ -895,8 +906,7 @@ WHERE k1 is not null and k4 > 1.2 FROM KAFKA (...); ``` -##### Query Results - +#### Query Results ``` mysql> select * from example_table; +------+------+-----------+------+ @@ -904,4 +914,5 @@ mysql> select * from example_table; +------+------+-----------+------+ | 3 | 300 | guangzhou | 1.3 | | 4 | NULL | chongqing | 1.4 | -+------+------+-----------+------+ \ No newline at end of file ++------+------+-----------+------+ +``` \ No newline at end of file diff --git a/versioned_docs/version-3.0/data-operate/import/load-data-convert.md index dee8d02b3e9..d7b29d696ab 100644 --- a/versioned_docs/version-3.0/data-operate/import/load-data-convert.md +++ b/versioned_docs/version-3.0/data-operate/import/load-data-convert.md @@ -24,7 +24,7 @@ specific language governing permissions and limitations under the License.
--> -Doris provides powerful data transformation capabilities during data loading, allowing you to process data without additional ETL tools. It mainly supports four types of transformations: +Doris provides powerful data transformation capabilities during data loading, which can simplify data processing workflows and reduce dependency on additional ETL tools. It mainly supports four types of transformations: - **Column Mapping**: Map source data columns to different columns in the target table. @@ -34,40 +34,42 @@ Doris provides powerful data transformation capabilities during data loading, al - **Post-filtering**: Filter the final results after column mapping and transformation. -Through these built-in data transformation functions, you can significantly simplify the data processing workflow, improve loading efficiency, and ensure consistency in data processing logic. +Through these built-in data transformation functions, you can improve loading efficiency and ensure consistency in data processing logic. ## Load Syntax -**Stream Load** +### Stream Load -Add `columns` and `where` parameters in the HTTP header. +Configure data transformation by setting the following parameters in HTTP headers: -- `columns` specifies column mapping and transformation. +| Parameter | Description | +|-----------|-------------| +| `columns` | Specify column mapping and transformation | +| `where` | Specify post-filtering | -- `where` specifies post-filtering. - -- Pre-filtering is not supported. +> **Note**: Stream Load does not support pre-filtering. 
+Example: ```shell curl --location-trusted -u user:passwd \ --H "columns: k1, k2, tmp_k3, k3 = tmp_k3 + 1" \ --H "where: k1 > 1"\ --T data.csv \ -http://host:port/api/example_db/example_table/_stream_load + -H "columns: k1, k2, tmp_k3, k3 = tmp_k3 + 1" \ + -H "where: k1 > 1" \ + -T data.csv \ + http://<fe_ip>:<fe_http_port>/api/example_db/example_table/_stream_load ``` -**Broker Load** - -Define data transformations in SQL statements, where: - -- `(k1, k2, tmp_k3)` specifies column mapping. - -- `SET` specifies column transformation. +### Broker Load -- `PRECEDING FILTER` specifies pre-filtering. +Implement data transformation in SQL statements using the following clauses: -- `WHERE` specifies post-filtering. +| Clause | Description | +|--------|-------------| +| `column list` | Specify column mapping, format: `(k1, k2, tmp_k3)` | +| `SET` | Specify column transformation | +| `PRECEDING FILTER` | Specify pre-filtering | +| `WHERE` | Specify post-filtering | +Example: ```sql LOAD LABEL test_db.label1 ( @@ -83,57 +85,55 @@ LOAD LABEL test_db.label1 WITH S3 (...); ``` -**Routine Load** +### Routine Load -Define data transformations in SQL statements, where: +Implement data transformation in SQL statements using the following clauses: -- `COLUMNS` specifies column mapping and transformation. - -- `PRECEDING FILTER` specifies pre-filtering. - -- `WHERE` specifies post-filtering. +| Clause | Description | +|--------|-------------| +| `COLUMNS` | Specify column mapping and transformation | +| `PRECEDING FILTER` | Specify pre-filtering | +| `WHERE` | Specify post-filtering | +Example: ```sql CREATE ROUTINE LOAD test_db.label1 ON test_tbl -COLUMNS(k1, k2, tmp_k3, k3 = tmp_k3 + 1), -PRECEDING FILTER k1 = 1, -WHERE k1 > 1 -... + COLUMNS(k1, k2, tmp_k3, k3 = tmp_k3 + 1), + PRECEDING FILTER k1 = 1, + WHERE k1 > 1 + ... ``` -## Column Mapping +### Insert Into -Column mapping is used to define the correspondence between source data columns and target table columns. 
It can handle the following scenarios: +Insert Into can directly perform data transformation in the `SELECT` statement, using the `WHERE` clause for data filtering. -- The order of source data columns and target table columns is inconsistent. - -- The number of source data columns and target table columns is inconsistent. +## Column Mapping -### Example +Column mapping is used to define the correspondence between source data columns and target table columns. It can handle the following scenarios: +- The order of source data columns and target table columns is inconsistent +- The number of source data columns and target table columns is inconsistent -Suppose we have the following source data (column names are only for illustration purposes and do not actually exist): +### Adjusting Column Order -```plain text -Column1,Column2,Column3,Column4 +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 1,100,beijing,1.1 2,200,shanghai,1.2 3,300,guangzhou,1.3 4,\N,chongqing,1.4 ``` -Note: `\N` in the source file represents NULL. - -#### Adjusting Column Order - -Suppose the target table has four columns: k1, k2, k3, and k4. We want to map the columns as follows: - -- Column1 -> k1 -- Column2 -> k3 -- Column3 -> k2 -- Column4 -> k4 - -##### Creating the Target Table +The target table has four columns: k1, k2, k3, and k4. 
We want to map the columns as follows: +```plain +column1 -> k1 +column2 -> k3 +column3 -> k2 +column4 -> k4 +``` +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -146,10 +146,8 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ @@ -160,7 +158,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label_broker ( @@ -173,7 +170,6 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, k3, k2, k4), @@ -181,8 +177,7 @@ COLUMNS TERMINATED BY "," FROM KAFKA (...); ``` -##### Query Results - +#### Query Results ``` mysql> select * from example_table; +------+-----------+------+------+ @@ -195,16 +190,27 @@ mysql> select * from example_table; +------+-----------+------+------+ ``` -#### Source File Columns Exceed Table Columns +### Source File Columns Exceed Table Columns -Suppose the target table has three columns: k1, k2, and k3. We only need the first three columns of the source file. The mapping relationship is as follows: +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` -- Column1 -> k1 -- Column2 -> k2 -- Column4 -> k3 +The target table has three columns: k1, k2, and k3. We only need the first, second, and fourth columns from the source file, with the following mapping relationship: +```plain +column1 -> k1 +column2 -> k2 +column4 -> k3 +``` -##### Creating the Target Table +To skip certain columns in the source file, you can use any column name that does not exist in the target table during column mapping. These column names can be customized and are not restricted. 
The data in these columns will be automatically ignored during loading. +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -216,12 +222,10 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ -H "columns: k1,k2,tmp_skip,k3" \ -T data.csv \ @@ -229,7 +233,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label_broker ( @@ -247,7 +250,6 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, k2, tmp_skip, k3), @@ -259,10 +261,9 @@ PROPERTIES FROM KAFKA (...); ``` -The third column of the source file is named tmp_skip, which is a custom column name that does not exist in the table. It is ignored during loading and not written to the target table. - -##### Query Results +> Note: The `tmp_skip` in the example can be replaced with any name, as long as it is not in the column definition of the target table. +#### Query Results ``` mysql> select * from example_table; +------+------+------+ @@ -275,18 +276,27 @@ mysql> select * from example_table; +------+------+------+ ``` -#### Source File Columns Less Than Table Columns - -Suppose the target table has five columns: k1, k2, k3, k4, and k5. We only use the first three columns of the source file: +### Source File Columns Less Than Table Columns -- Column1 -> k1 -- Column2 -> k3 -- Column3 -> k2 -- Column4 -> k4 -- k5 uses the default value +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` -##### Creating the Target Table +The target table has five columns: k1, k2, k3, k4, and k5.
We only need the first, second, third, and fourth columns from the source file, with the following mapping relationship: +```plain +column1 -> k1 +column2 -> k3 +column3 -> k2 +column4 -> k4 +k5 uses the default value +``` +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -300,10 +310,8 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ @@ -313,7 +321,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label_broker ( @@ -324,7 +331,7 @@ LOAD LABEL example_db.label_broker SET ( k1 = tmp_k1, k3 = tmp_k3, - k2 = tmp_k2 + k2 = tmp_k2, k4 = tmp_k4 ) ) @@ -332,7 +339,6 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, k3, k2, k4), @@ -341,13 +347,11 @@ FROM KAFKA (...); ``` Note: +- If k5 has a default value, it will be filled with the default value +- If k5 is a nullable column but has no default value, it will be filled with NULL +- If k5 is a non-nullable column and has no default value, the load will fail -- If k5 has a default value, it will be used. -- If it is a nullable column, it will be filled with NULL. -- If it is a non-nullable column without a default value, the load will fail. - -##### Query Results - +#### Query Results ``` mysql> select * from example_table; +------+-----------+------+------+------+ @@ -358,35 +362,32 @@ mysql> select * from example_table; | 3 | guangzhou | 300 | 1.3 | 2 | | 4 | chongqing | NULL | 1.4 | 2 | +------+-----------+------+------+------+ -``` +``` ## Column Transformation -Column transformation allows you to transform source data column values. It supports most built-in functions. Column transformation is usually defined together with column mapping. 
- -### Example +Column transformation allows users to transform column values in the source file, supporting the use of most built-in functions. Column transformation is usually defined together with column mapping, i.e., first map the columns and then transform them. -Suppose we have the following source data (column names are only for illustration purposes and do not actually exist): +### Transforming Source File Column Values Before Loading -```plain text -Column1,Column2,Column3,Column4 +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 1,100,beijing,1.1 2,200,shanghai,1.2 3,300,guangzhou,1.3 4,\N,chongqing,1.4 ``` -#### Transforming Source Data Column Values - -Suppose the target table has four columns: k1, k2, k3, and k4. We want to transform the column values as follows: - -- Column1 -> k1 -- Column2 * 100 -> k3 -- Column3 -> k2 -- Column4 -> k4 - -##### Creating the Target Table +The target table has four columns: k1, k2, k3, and k4. 
We want to transform the column values as follows: +```plain +column1 -> k1 +column2 * 100 -> k3 +column3 -> k2 +column4 -> k4 +``` +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -400,10 +401,8 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ @@ -413,7 +412,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label1 ( @@ -429,7 +427,6 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table COLUMNS(k1, tmp_k3, k2, k4, k3 = tmp_k3 * 100), @@ -437,12 +434,11 @@ COLUMNS TERMINATED BY "," FROM KAFKA (...); ``` -##### Query Results - +#### Query Results ``` mysql> select * from example_table; +------+------+-------+------+ -| k1 | k2 | k3 | k4 | +| k1 | k2 | k3 | k4 | +------+------+-------+------+ | 1 | beijing | 10000 | 1.1 | | 2 | shanghai | 20000 | 1.2 | @@ -451,17 +447,26 @@ mysql> select * from example_table; +------+------+-------+------+ ``` -#### Using Case When Function for Conditional Transformation - -Suppose the target table has four columns: k1, k2, k3, and k4. We want to transform the column values as follows: +### Using Case When Function for Conditional Column Transformation -- Column1 -> k1 -- Column2 -> k2 -- Column3 -> k3 (transformed to area ID) -- Column4 -> k4 +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` -##### Creating the Target Table +The target table has four columns: k1, k2, k3, and k4. 
We want to transform the column values as follows: +```plain +column1 -> k1 +column2 -> k2 +column3 -> k3 (transformed to area id) +column4 -> k4 +``` +#### Creating the Target Table ```sql CREATE TABLE example_table ( @@ -475,10 +480,8 @@ DUPLICATE KEY(k1) DISTRIBUTED BY HASH(k1) BUCKETS 1; ``` -##### Loading Data - +#### Loading Data - Stream Load - ```sql curl --location-trusted -u user:passwd \ -H "column_separator:," \ @@ -488,7 +491,6 @@ curl --location-trusted -u user:passwd \ ``` - Broker Load - ```sql LOAD LABEL example_db.label1 ( @@ -504,16 +506,13 @@ WITH s3 (...); ``` - Routine Load - ```sql CREATE ROUTINE LOAD example_db.example_routine_load ON example_table -COLUMNS(k1, k2, tmp_k3, k4, k3 = CASE tmp_k3 WHEN 'beijing' THEN 1 WHEN 'shanghai' THEN 2 WHEN 'guangzhou' THEN 3 WHEN 'chongqing' THEN NULL END), +COLUMNS(k1, k2, tmp_k3, k4, k3 = CASE tmp_k3 WHEN 'beijing' THEN 1 WHEN 'shanghai' THEN 2 WHEN 'guangzhou' THEN 3 WHEN 'chongqing' THEN 4 ELSE NULL END), COLUMNS TERMINATED BY "," FROM KAFKA (...); -``` - -##### Query Results - +``` +#### Query Results ``` mysql> select * from example_table; +------+------+------+------+ @@ -526,17 +525,26 @@ mysql> select * from example_table; +------+------+------+------+ ``` -#### Handling NULL Values in Source Files +### Handling NULL Values in Source Files -Suppose the target table has four columns: k1, k2, k3, and k4. We want to transform the column values as follows: - -- Column1 -> k1 -- Column2 (if NULL, transform to 0) -> k2 -- Column3 -> k3 -- Column4 -> k4 +Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header): +```plain +column1,column2,column3,column4 +1,100,beijing,1.1 +2,200,shanghai,1.2 +3,300,guangzhou,1.3 +4,\N,chongqing,1.4 +``` -##### Creating the Target Table +The target table has four columns: k1, k2, k3, and k4. 
We want to transform the column values as follows:
+```plain
+column1 -> k1
+column2 -> k2 (transform NULL to 0)
+column3 -> k3
+column4 -> k4
+```
+#### Creating the Target Table
 
 ```sql
 CREATE TABLE example_table
 (
@@ -550,10 +558,8 @@ DUPLICATE KEY(k1)
 DISTRIBUTED BY HASH(k1) BUCKETS 1;
 ```
 
-##### Loading Data
-
+#### Loading Data
 - Stream Load
-
 ```sql
 curl --location-trusted -u user:passwd \
 -H "column_separator:," \
@@ -563,7 +569,6 @@ curl --location-trusted -u user:passwd \
 ```
 
 - Broker Load
-
 ```sql
 LOAD LABEL example_db.label1
 (
@@ -580,16 +585,13 @@ WITH s3 (...);
 ```
 
 - Routine Load
-
 ```sql
 CREATE ROUTINE LOAD example_db.example_routine_load ON example_table
 COLUMNS(k1, tmp_k2, tmp_k3, k4, k2 = ifnull(tmp_k2, 0), k3 = CASE tmp_k3 WHEN 'beijing' THEN 1 WHEN 'shanghai' THEN 2 WHEN 'guangzhou' THEN 3 WHEN 'chongqing' THEN 4 ELSE NULL END),
 COLUMNS TERMINATED BY ","
 FROM KAFKA (...);
-```
-
-##### Query Results
-
+```
+#### Query Results
 ```
 mysql> select * from example_table;
 +------+------+------+------+
@@ -605,14 +607,23 @@ mysql> select * from example_table;
 
 ## Pre-filtering
 
-Pre-filtering is used to filter out unwanted raw data before column mapping and transformation. It is only supported in Broker Load and Routine Load.
+Pre-filtering is the process of filtering out unwanted raw data before column mapping and transformation. This feature is only supported in Broker Load and Routine Load.
 
-### Example
+Pre-filtering is useful in the following scenarios:
 
-Suppose we have the following source data (column names are only for illustration purposes and do not actually exist):
+- Filtering before transformation
+
+When rows must be discarded before column mapping and transformation, pre-filtering removes the unwanted data before any further processing takes place.
+
+- Filtering on columns that are not loaded into the table and serve only as filter indicators
+
+For example, the source data contains data belonging to multiple tables (or data from multiple tables is written to the same Kafka message queue). Each row has a column indicating which table the row belongs to. Users can use pre-filtering conditions to keep only the rows of the desired table for loading.
+
+### Example
 
-Suppose we have the following source data (column names are only for illustration purposes and do not actually exist):
+Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header):
-```plain text
-Column1,Column2,Column3,Column4
+```plain
+column1,column2,column3,column4
 1,100,beijing,1.1
 2,200,shanghai,1.2
 3,300,guangzhou,1.3
@@ -620,13 +631,11 @@ Column1,Column2,Column3,Column4
 ```
 
 Pre-filtering condition:
-
 ```
-Column1 > 1
+column1 > 1, i.e., load only rows where column1 > 1 and filter out the rest.
 ```
 
 #### Creating the Target Table
-
 ```sql
 CREATE TABLE example_table
 (
@@ -639,11 +648,8 @@ ENGINE = OLAP
 DUPLICATE KEY(k1)
 DISTRIBUTED BY HASH(k1) BUCKETS 1;
 ```
-
 #### Loading Data
-
 - Broker Load
-
 ```sql
 LOAD LABEL example_db.label1
 (
@@ -657,17 +663,15 @@ WITH s3 (...);
 ```
 
 - Routine Load
-
 ```sql
 CREATE ROUTINE LOAD example_db.example_routine_load ON example_table
 COLUMNS(k1, k2, k3, k4),
 COLUMNS TERMINATED BY ","
 PRECEDING FILTER k1 > 1
 FROM KAFKA (...);
 ```
 
 #### Query Results
-
 ```
 mysql> select * from example_table;
 +------+------+-----------+------+
@@ -682,26 +686,23 @@ mysql> select * from example_table;
 
 ## Post-filtering
 
-Post-filtering is used to filter the final results after column mapping and transformation.
+Post-filtering is the process of filtering the final results after column mapping and transformation.
-### Example
-Suppose we have the following source data (column names are only for illustration purposes and do not actually exist):
+### Filtering Without Column Mapping and Transformation
-```plain text
-Column1,Column2,Column3,Column4
+Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header):
+```plain
+column1,column2,column3,column4
 1,100,beijing,1.1
 2,200,shanghai,1.2
 3,300,guangzhou,1.3
 4,\N,chongqing,1.4
 ```
 
-#### Filtering Without Column Mapping and Transformation
-
-Suppose the target table has four columns: k1, k2, k3, and k4. We want to filter out rows where Column4 is greater than 1.2.
-
-##### Creating the Target Table
+The target table has four columns: k1, k2, k3, and k4. We want to load only the data where the fourth column is greater than 1.2.
+#### Creating the Target Table
 
 ```sql
 CREATE TABLE example_table
 (
@@ -715,10 +716,8 @@ DUPLICATE KEY(k1)
 DISTRIBUTED BY HASH(k1) BUCKETS 1;
 ```
 
-##### Loading Data
-
+#### Loading Data
 - Stream Load
-
 ```sql
 curl --location-trusted -u user:passwd \
 -H "column_separator:," \
@@ -729,7 +728,6 @@ curl --location-trusted -u user:passwd \
 ```
 
 - Broker Load
-
 ```sql
 LOAD LABEL example_db.label1
 (
@@ -743,17 +741,15 @@ WITH s3 (...);
 ```
 
 - Routine Load
-
 ```sql
 CREATE ROUTINE LOAD example_db.example_routine_load ON example_table
 COLUMNS(k1, k2, k3, k4),
 COLUMNS TERMINATED BY ","
-WHERE k4 > 1.2;
-FROM KAFKA (...);
+WHERE k4 > 1.2
+FROM KAFKA (...);
 ```
 
-##### Query Results
-
+#### Query Results
 ```
 mysql> select * from example_table;
 +------+------+-----------+------+
@@ -764,12 +760,28 @@ mysql> select * from example_table;
 +------+------+-----------+------+
 ```
 
-#### Filtering After Column Transformation
+### Filtering Transformed Data
-Suppose the target table has four columns: k1, k2, k3, and k4. We want to transform Column3 to area ID and filter out rows where the area ID is 3.
+Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header):
+```plain
+column1,column2,column3,column4
+1,100,beijing,1.1
+2,200,shanghai,1.2
+3,300,guangzhou,1.3
+4,\N,chongqing,1.4
+```
+
+The target table has four columns: k1, k2, k3, and k4. We want to transform the column values as follows:
+```plain
+column1 -> k1
+column2 -> k2
+column3 -> k3 (transformed to area id)
+column4 -> k4
+```
-##### Creating the Target Table
+We want to filter out the data where the transformed k3 value is 3.
+#### Creating the Target Table
 
 ```sql
 CREATE TABLE example_table
 (
@@ -783,10 +795,8 @@ DUPLICATE KEY(k1)
 DISTRIBUTED BY HASH(k1) BUCKETS 1;
 ```
 
-##### Loading Data
-
+#### Loading Data
 - Stream Load
-
 ```sql
 curl --location-trusted -u user:passwd \
 -H "column_separator:," \
@@ -797,7 +807,6 @@ curl --location-trusted -u user:passwd \
 ```
 
 - Broker Load
-
 ```sql
 LOAD LABEL example_db.label1
 (
@@ -810,11 +819,10 @@ LOAD LABEL example_db.label1
 )
 WHERE k3 != 3
 )
-WITH s3 (...);
+WITH s3 (...);
 ```
 
 - Routine Load
-
 ```sql
 CREATE ROUTINE LOAD example_db.example_routine_load ON example_table
 COLUMNS(k1, k2, tmp_k3, k4),
@@ -823,11 +831,10 @@ SET
 (
 k3 = CASE tmp_k3 WHEN 'beijing' THEN 1 WHEN 'shanghai' THEN 2 WHEN 'guangzhou' THEN 3 WHEN 'chongqing' THEN 4 ELSE NULL END
 )
-WHERE k3 != 3;
-FROM KAFKA (...);
+WHERE k3 != 3
+FROM KAFKA (...);
 ```
 
-##### Query Results
-
+#### Query Results
 ```
 mysql> select * from example_table;
 +------+------+------+------+
@@ -839,12 +846,20 @@ mysql> select * from example_table;
 +------+------+------+------+
 ```
 
-#### Multiple Conditions
+### Filtering with Multiple Conditions
+Suppose we have the following source data (column names are for illustration purposes only, and there is no actual header):
+```plain
+column1,column2,column3,column4
+1,100,beijing,1.1
+2,200,shanghai,1.2
+3,300,guangzhou,1.3
+4,\N,chongqing,1.4
+```
-##### Creating the Target Table
+The target table has four columns: k1, k2, k3, and k4. We want to load only the rows where k1 is not NULL and k4 is greater than 1.2.
+#### Creating the Target Table
 
 ```sql
 CREATE TABLE example_table
 (
@@ -858,10 +873,8 @@ DUPLICATE KEY(k1)
 DISTRIBUTED BY HASH(k1) BUCKETS 1;
 ```
 
-##### Loading Data
-
+#### Loading Data
 - Stream Load
-
 ```sql
 curl --location-trusted -u user:passwd \
 -H "column_separator:," \
@@ -872,7 +885,6 @@ curl --location-trusted -u user:passwd \
 ```
 
 - Broker Load
-
 ```sql
 LOAD LABEL example_db.label1
 (
@@ -886,7 +898,6 @@ WITH s3 (...);
 ```
 
 - Routine Load
-
 ```sql
 CREATE ROUTINE LOAD example_db.example_routine_load ON example_table
 COLUMNS(k1, k2, k3, k4),
@@ -895,8 +906,7 @@ WHERE k1 is not null and k4 > 1.2
 FROM KAFKA (...);
 ```
 
-##### Query Results
-
+#### Query Results
 ```
 mysql> select * from example_table;
 +------+------+-----------+------+
@@ -904,4 +914,5 @@ mysql> select * from example_table;
 +------+------+-----------+------+
 | 3 | 300 | guangzhou | 1.3 |
 | 4 | NULL | chongqing | 1.4 |
-+------+------+-----------+------+
\ No newline at end of file
++------+------+-----------+------+
+```
\ No newline at end of file

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org
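The order of the four stages documented above (pre-filter on raw columns, then column mapping and transformation, then post-filter on the transformed result) can be sanity-checked locally before running a real load. The following is a minimal Python sketch of that pipeline semantics on the sample rows; it is illustrative only — `AREA_IDS` and `load_rows` are hypothetical helper names, not Doris APIs — combining the pre-filter (`column1 > 1`), the area-id CASE-WHEN lookup, the `ifnull(tmp_k2, 0)` default, and the post-filter (`k3 != 3`) used in the examples:

```python
# Illustrative simulation of Doris load-time transformation order:
# pre-filter raw columns -> column mapping/transformation -> post-filter.
# AREA_IDS and load_rows are hypothetical names for this sketch only.

AREA_IDS = {"beijing": 1, "shanghai": 2, "guangzhou": 3, "chongqing": 4}

def load_rows(lines):
    result = []
    for line in lines:
        # \N denotes NULL in the source file, as in the examples above
        raw = [None if v == r"\N" else v for v in line.split(",")]
        # Pre-filter on the raw column: keep only rows where column1 > 1
        if int(raw[0]) <= 1:
            continue
        # Column mapping + transformation
        row = {
            "k1": int(raw[0]),
            "k2": int(raw[1]) if raw[1] is not None else 0,  # ifnull(tmp_k2, 0)
            "k3": AREA_IDS.get(raw[2]),                      # CASE-WHEN area id
            "k4": float(raw[3]),
        }
        # Post-filter on the transformed value: drop rows where k3 == 3
        if row["k3"] == 3:
            continue
        result.append(row)
    return result

source = ["1,100,beijing,1.1", "2,200,shanghai,1.2",
          "3,300,guangzhou,1.3", r"4,\N,chongqing,1.4"]
print(load_rows(source))
```

Note that the pre-filter sees the raw `column1` value while the post-filter sees the transformed `k3`, which is why the two filters cannot be merged into one condition.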