This is an automated email from the ASF dual-hosted git repository.

morningman pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push: new 93780962482 [export] add export best practice doc (#2233) 93780962482 is described below commit 93780962482215b7833d941e2096ccb726d601fe Author: Mingyu Chen (Rayner) <morning...@163.com> AuthorDate: Tue Mar 25 21:11:03 2025 +0800 [export] add export best practice doc (#2233) ## Versions - [x] dev - [x] 3.0 - [x] 2.1 - [ ] 2.0 ## Languages - [x] Chinese - [x] English ## Docs Checklist - [x] Checked by AI - [ ] Test Cases Built --- docs/data-operate/export/export-best-practice.md | 97 +++++++++ docs/data-operate/export/outfile.md | 10 + docs/lakehouse/file-analysis.md | 14 +- .../data-operate/export/export-best-practice.md | 102 ++++++++++ .../current/data-operate/export/outfile.md | 10 + .../current/lakehouse/file-analysis.md | 12 +- .../data-operate/export/export-best-practice.md | 102 ++++++++++ .../version-2.1/data-operate/export/outfile.md | 13 +- .../version-2.1/lakehouse/file.md | 218 +++++++++++--------- .../data-operate/export/export-best-practice.md | 102 ++++++++++ .../version-3.0/data-operate/export/outfile.md | 10 + .../version-3.0/lakehouse/file.md | 218 +++++++++++--------- sidebars.json | 5 +- .../data-operate/export/export-best-practice.md | 97 +++++++++ .../version-2.1/data-operate/export/outfile.md | 26 ++- versioned_docs/version-2.1/lakehouse/file.md | 220 ++++++++++++--------- .../data-operate/export/export-best-practice.md | 97 +++++++++ .../version-3.0/data-operate/export/outfile.md | 10 + versioned_docs/version-3.0/lakehouse/file.md | 220 ++++++++++++--------- versioned_sidebars/version-2.1-sidebars.json | 5 +- versioned_sidebars/version-3.0-sidebars.json | 5 +- 21 files changed, 1189 insertions(+), 404 deletions(-) diff --git a/docs/data-operate/export/export-best-practice.md b/docs/data-operate/export/export-best-practice.md new file mode 100644 index 00000000000..6f3d6e95da9 --- /dev/null +++ b/docs/data-operate/export/export-best-practice.md @@ -0,0 +1,97 @@ +--- +{ + "title": "Best Practices", + "language": "en" +} +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +This document mainly introduces how to determine whether resource utilization is reasonable during data export operations, and how to adjust resource utilization to achieve better data export efficiency. + +## SELECT INTO OUTFILE + +1. Enable parallel export + + The number of rows returned by `SELECT INTO OUTFILE` represents the number of parallel Writers. The more Writers, the higher the export concurrency, but the number of output files will also increase. If you find there is only one Writer, you can try enabling the parallel export feature. 
+ + `SET enable_parallel_outfile=true` + + After enabling, the `SELECT INTO OUTFILE` operation generates the corresponding number of Writers based on the query's parallelism. The query parallelism is controlled by the session variable `parallel_pipeline_task_num`. By default, it's half of the number of CPU cores per BE. + + For example, in a cluster with 3 BE nodes, where each node has 8 CPU cores. When parallel export is enabled, it will generate (4*3=) 12 Writers. + + Note that even if `enable_parallel_outfile` is enabled, not all queries can be exported in parallel. For example, if the query contains global sorting and aggregation semantics, it cannot be exported in parallel. For example: + + ``` + SELECT * FROM table ORDER BY id; + + SELECT SUM(cost) FROM table; + ``` + +2. Determine export speed + + Each row result returned by `SELECT INTO OUTFILE` contains the time (in seconds) and speed (in KB/s) of the corresponding Writer's output. + + Adding up the speeds of multiple Writers on the same node gives you the write speed of a single node. You can compare this speed with disk bandwidth (e.g., when exporting to local) or network bandwidth (e.g., when exporting to object storage) to see if it has reached the bandwidth bottleneck. + +## Export + +1. Determine export execution status from the return results + + The Export command essentially breaks down the task into multiple `SELECT INTO OUTFILE` clauses for execution. + + The results returned by the `SHOW EXPORT` command contain a JSON string, which is a two-dimensional array. The first dimension represents the number of concurrent threads in Export, with the number of concurrent threads representing how many Outfile statements were initiated concurrently. The second dimension represents the return results of a single Outfile statement. Example: + + ``` + [ + [ + { + "fileNumber": "1", + "totalRows": "640321", + "fileSize": "350758307", + "url": "file:///127.0.0.1/mnt/disk2/ftw/tmp/export/exp_59fd917c43874adc-9b1c3e9cd6e655be_*", + "writeTime": "17.989", + "writeSpeed": "19041.66" + }, + {...}, + {...}, + {...} + ], + [ + { + "fileNumber": "1", + "totalRows": "646609", + "fileSize": "354228704", + "url": "file:///127.0.0.1/mnt/disk2/ftw/tmp/export/exp_c75b9d4b59bf4943-92eb94a7b97e46cb_*", + "writeTime": "17.249", + "writeSpeed": "20054.64" + }, + {...}, + {...}, + {...} + ] + ] + ``` + + In the above example, 2 Outfile commands were initiated. Each command has 4 Writers writing concurrently. + + By adjusting the `parallelism` parameter in the Export command properties, you can control the number of concurrent Outfile operations, thereby controlling the concurrency level. + diff --git a/docs/data-operate/export/outfile.md b/docs/data-operate/export/outfile.md index b66772a0823..c41cb32ba33 100644 --- a/docs/data-operate/export/outfile.md +++ b/docs/data-operate/export/outfile.md @@ -117,6 +117,16 @@ The `SELECT INTO OUTFILE` currently supports exporting the following file format * csv\_with\_names * csv\_with\_names\_and\_types +### Export concurrency + +You can enable concurrent export through the session variable `enable_parallel_outfile`. + +`SET enable_parallel_outfile=true;` + +Concurrent export will use multi-node and multi-thread to export result data to improve the overall export throughout. However, concurrent export may generate more files. + +Note that some queries cannot perform concurrent export even if this variable is turned on, such as queries containing global sorting. 
If the number of rows returned by the export command is greater than 1, it means that concurrent export is enabled. + ## Export Examples ### Export to an HDFS Cluster with High Availability Enabled diff --git a/docs/lakehouse/file-analysis.md b/docs/lakehouse/file-analysis.md index ed54ca9e1dc..fff76453e16 100644 --- a/docs/lakehouse/file-analysis.md +++ b/docs/lakehouse/file-analysis.md @@ -41,9 +41,10 @@ Here we illustrate how to analyze files on object storage using the S3 Table Val ```sql SELECT * FROM S3 ( 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', 's3.region' = 'us-east-1', - 's3.access_key' = 'ak' + 's3.access_key' = 'ak', 's3.secret_key'='sk' ) ``` @@ -154,9 +155,10 @@ You can use any SQL statement for file analysis, such as: ```sql SELECT * FROM s3( 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', 's3.region' = 'us-east-1', - 's3.access_key' = 'ak' + 's3.access_key' = 'ak', 's3.secret_key'='sk' ) ORDER BY p_partkey LIMIT 5; @@ -180,9 +182,10 @@ You can also create a logical view for a TVF using the `CREATE VIEW` statement. CREATE VIEW tvf_view AS SELECT * FROM s3( 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', 's3.region' = 'us-east-1', - 's3.access_key' = 'ak' + 's3.access_key' = 'ak', 's3.secret_key'='sk' ); @@ -216,9 +219,10 @@ INSERT INTO test_table (id,name,age) SELECT cast(id as INT) as id, name, cast (age as INT) as age FROM s3( 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', 's3.region' = 'us-east-1', - 's3.access_key' = 'ak' + 's3.access_key' = 'ak', 's3.secret_key'='sk' ); ``` @@ -227,4 +231,4 @@ FROM s3( 1. If the specified `uri` does not match any files, or all matched files are empty, the TVF will return an empty result set. In this case, using `DESC FUNCTION` to view the Schema of this TVF will yield a virtual column `__dummy_col`, which is meaningless and only serves as a placeholder. -2. If the specified file format is `csv`, and the file read is not empty but the first line of the file is empty, an error `The first line is empty, can not parse column numbers` will be prompted, as the Schema cannot be parsed from the first line of the file. \ No newline at end of file +2. If the specified file format is `csv`, and the file read is not empty but the first line of the file is empty, an error `The first line is empty, can not parse column numbers` will be prompted, as the Schema cannot be parsed from the first line of the file. diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/export-best-practice.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/export-best-practice.md new file mode 100644 index 00000000000..fd48a64f81e --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/export-best-practice.md @@ -0,0 +1,102 @@ +--- +{ + "title": "最佳实践", + "language": "zh-CN" +} +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. 
You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +本文档主要用于介绍在进行数据导出操作中,如何判断资源利用是否合理,以及如何调整资源利用率已达到更好的数据导出效率。 + +## SELECT INTO OUTFILE + +1. 开启并行导出 + + `SELECT INTO OUTFILE` 返回的行数即代表并行的 Writer 数量。Writer 的数量越多,则导出的并发度越高,但写出的文件数量也会越多。如果发现只有一个 Writer,则可以尝试开启并行导出功能。 + + `SET enable_parallel_outfile=true` + + 开启后,`SELECT INTO OUTFILE` 操作为根据查询的并行度来生成对应数量的 Writer。查询并行度由会话变量 `parallel_pipeline_task_num` 控制。默认为单 BE CPU 核数的一半。 + + 比如在一个 3 BE 节点的集群中,每个节点的 CPU 核数为 8。则开启并行导出情况下,会产生(4*3=)12 个 Writer。 + + 注意,即使开启了并行导出功能,也不是所有查询都能够并行导出。比如查询中包含全局排序、聚合语义时,则是无法并行导出的。如: + + ``` + SELECT * FROM table ORDER BY id; + + SELECT SUM(cost) FROM table; + ``` + +2. 判断导出速度 + + `SELECT INTO OUTFILE` 返回的每一行结果中,都带有对应的 Writer 的写出时的时间(单位:秒)和速度(单位:KB/s)。 + + 将同一个节点的多个 Writer 的速度相加,即为单个节点的写出速度。可以用这个速度,和磁盘带宽(比如导出到本地)或网络带宽(比如导出到对象存储)进行比较,看是否已经达到带宽瓶颈。 + +## Export + +1. 根据返回结果判断导出执行情况 + + Export 命令本质上是将任务拆分成多个 `SELECT INTO OUTFILE` 子句进行执行。 + + 通过 `SHOW EXPORT` 命令返回的结果中包含一个 Json 字符串,是一个二维数组。第一维代表 Export 并发的线程数,并发多少个线程代表并发发起了多少个 Outfile 语句。第二维代表单个 Outfile 语句的返回结果。示例: + + ``` + [ + [ + { + "fileNumber": "1", + "totalRows": "640321", + "fileSize": "350758307", + "url": "file:///127.0.0.1/mnt/disk2/ftw/tmp/export/exp_59fd917c43874adc-9b1c3e9cd6e655be_*", + "writeTime": "17.989", + "writeSpeed": "19041.66" + }, + {...}, + {...}, + {...} + ], + [ + { + "fileNumber": "1", + "totalRows": "646609", + "fileSize": "354228704", + "url": "file:///127.0.0.1/mnt/disk2/ftw/tmp/export/exp_c75b9d4b59bf4943-92eb94a7b97e46cb_*", + "writeTime": "17.249", + "writeSpeed": "20054.64" + }, + {...}, + {...}, + {...} + ] + ] + ``` + + 上面的示例中,发起了 2 个 Outfile 命令。每个命令有 4 个 Writer 并发写出。 + + 通过调整 Export 命令属性中的 `parallelism` 参数,可以控制并发 Outfile 的个数,从而控制并发度。 + + + + + + diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/outfile.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/outfile.md index c2ac0439b9b..3b46b1d8f9d 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/outfile.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/outfile.md @@ -116,6 +116,16 @@ PROPERTIES( * csv\_with\_names * csv\_with\_names\_and\_types +### 导出并发度 + +可以通过会话参数 `enable_parallel_outfile` 开启并发导出。 + +`SET enable_parallel_outfile=true;` + +并发导出会利用多节点、多线程导出结果数据,以提升整体的导出效率。但并发导出可能会产生更多的文件。 + +注意,某些查询即使打开此参数,也无法执行并发导出,如包含全局排序的查询。如果导出命令返回的行数大于 1 行,则表示开启了并发导出。 + ## 导出示例 ### 导出到开启了高可用的 HDFS 集群 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/file-analysis.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/file-analysis.md index e0ae2429727..71b578923eb 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/file-analysis.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/file-analysis.md @@ -41,9 +41,10 @@ under the License. 
```sql SELECT * FROM S3 ( 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', 's3.region' = 'us-east-1', - 's3.access_key' = 'ak' + 's3.access_key' = 'ak', 's3.secret_key'='sk' ) ``` @@ -154,9 +155,10 @@ TVF 非常适用于对存储系统上的独立文件进行直接分析,而无 ```sql SELECT * FROM s3( 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', 's3.region' = 'us-east-1', - 's3.access_key' = 'ak' + 's3.access_key' = 'ak', 's3.secret_key'='sk' ) ORDER BY p_partkey LIMIT 5; @@ -180,9 +182,10 @@ TVF 可以出现在 SQL 中,Table 能出现的任意位置。如 `CTE` 的 `WI CREATE VIEW tvf_view AS SELECT * FROM s3( 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', 's3.region' = 'us-east-1', - 's3.access_key' = 'ak' + 's3.access_key' = 'ak', 's3.secret_key'='sk' ); @@ -216,9 +219,10 @@ INSERT INTO test_table (id,name,age) SELECT cast(id as INT) as id, name, cast (age as INT) as age FROM s3( 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', 's3.region' = 'us-east-1', - 's3.access_key' = 'ak' + 's3.access_key' = 'ak', 's3.secret_key'='sk' ); ``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/export/export-best-practice.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/export/export-best-practice.md new file mode 100644 index 00000000000..fd48a64f81e --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/export/export-best-practice.md @@ -0,0 +1,102 @@ +--- +{ + "title": "最佳实践", + "language": "zh-CN" +} +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +本文档主要用于介绍在进行数据导出操作中,如何判断资源利用是否合理,以及如何调整资源利用率已达到更好的数据导出效率。 + +## SELECT INTO OUTFILE + +1. 开启并行导出 + + `SELECT INTO OUTFILE` 返回的行数即代表并行的 Writer 数量。Writer 的数量越多,则导出的并发度越高,但写出的文件数量也会越多。如果发现只有一个 Writer,则可以尝试开启并行导出功能。 + + `SET enable_parallel_outfile=true` + + 开启后,`SELECT INTO OUTFILE` 操作为根据查询的并行度来生成对应数量的 Writer。查询并行度由会话变量 `parallel_pipeline_task_num` 控制。默认为单 BE CPU 核数的一半。 + + 比如在一个 3 BE 节点的集群中,每个节点的 CPU 核数为 8。则开启并行导出情况下,会产生(4*3=)12 个 Writer。 + + 注意,即使开启了并行导出功能,也不是所有查询都能够并行导出。比如查询中包含全局排序、聚合语义时,则是无法并行导出的。如: + + ``` + SELECT * FROM table ORDER BY id; + + SELECT SUM(cost) FROM table; + ``` + +2. 判断导出速度 + + `SELECT INTO OUTFILE` 返回的每一行结果中,都带有对应的 Writer 的写出时的时间(单位:秒)和速度(单位:KB/s)。 + + 将同一个节点的多个 Writer 的速度相加,即为单个节点的写出速度。可以用这个速度,和磁盘带宽(比如导出到本地)或网络带宽(比如导出到对象存储)进行比较,看是否已经达到带宽瓶颈。 + +## Export + +1. 
根据返回结果判断导出执行情况 + + Export 命令本质上是将任务拆分成多个 `SELECT INTO OUTFILE` 子句进行执行。 + + 通过 `SHOW EXPORT` 命令返回的结果中包含一个 Json 字符串,是一个二维数组。第一维代表 Export 并发的线程数,并发多少个线程代表并发发起了多少个 Outfile 语句。第二维代表单个 Outfile 语句的返回结果。示例: + + ``` + [ + [ + { + "fileNumber": "1", + "totalRows": "640321", + "fileSize": "350758307", + "url": "file:///127.0.0.1/mnt/disk2/ftw/tmp/export/exp_59fd917c43874adc-9b1c3e9cd6e655be_*", + "writeTime": "17.989", + "writeSpeed": "19041.66" + }, + {...}, + {...}, + {...} + ], + [ + { + "fileNumber": "1", + "totalRows": "646609", + "fileSize": "354228704", + "url": "file:///127.0.0.1/mnt/disk2/ftw/tmp/export/exp_c75b9d4b59bf4943-92eb94a7b97e46cb_*", + "writeTime": "17.249", + "writeSpeed": "20054.64" + }, + {...}, + {...}, + {...} + ] + ] + ``` + + 上面的示例中,发起了 2 个 Outfile 命令。每个命令有 4 个 Writer 并发写出。 + + 通过调整 Export 命令属性中的 `parallelism` 参数,可以控制并发 Outfile 的个数,从而控制并发度。 + + + + + + diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/export/outfile.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/export/outfile.md index 7d067fcd026..3b46b1d8f9d 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/export/outfile.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/export/outfile.md @@ -116,6 +116,16 @@ PROPERTIES( * csv\_with\_names * csv\_with\_names\_and\_types +### 导出并发度 + +可以通过会话参数 `enable_parallel_outfile` 开启并发导出。 + +`SET enable_parallel_outfile=true;` + +并发导出会利用多节点、多线程导出结果数据,以提升整体的导出效率。但并发导出可能会产生更多的文件。 + +注意,某些查询即使打开此参数,也无法执行并发导出,如包含全局排序的查询。如果导出命令返回的行数大于 1 行,则表示开启了并发导出。 + ## 导出示例 ### 导出到开启了高可用的 HDFS 集群 @@ -229,9 +239,6 @@ PROPERTIES( ## 注意事项 -- 使用并发 Outfile 时的限制 - 当前版本 pipeline 引擎不支持并发 Outfile。所以若开启了 pipeline 引擎,并发 Outfile 将回退到单并发导出。 - - 导出数据量和导出效率 `SELECT INTO OUTFILE`功能本质上是执行一个 SQL 查询命令。如果不开启并发导出,查询结果是由单个 BE 节点,单线程导出的,因此整个导出的耗时包括查询本身的耗时和最终结果集写出的耗时。开启并发导出可以降低导出的时间。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/file.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/file.md index 51cf2e86b26..71b578923eb 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/file.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/file.md @@ -5,7 +5,7 @@ } --- -<!-- +<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information @@ -24,29 +24,62 @@ specific language governing permissions and limitations under the License. 
--> +通过 Table Value Function 功能,Doris 可以直接将对象存储或 HDFS 上的文件作为 Table 进行查询分析。并且支持自动的列类型推断。 +更多使用方式可参阅 Table Value Function 文档: -通过 Table Value Function 功能,Doris 可以直接将对象存储或 HDFS 上的文件作为 Table 进行查询分析。并且支持自动的列类型推断。 +* [S3](../sql-manual/sql-functions/table-valued-functions/s3.md):支持 S3 兼容的对象存储上的文件分析。 -:::tip +* [HDFS](../sql-manual/sql-functions/table-valued-functions/hdfs.md):支持 HDFS 上的文件分析。 -**使用方式** +## 基础使用 -更多使用方式可参阅 Table Value Function 文档: +这里我们通过 S3 Table Value Function 举例说明如何对对象存储上的文件进行分析。 + +### 查询 + +```sql +SELECT * FROM S3 ( + 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.region' = 'us-east-1', + 's3.access_key' = 'ak', + 's3.secret_key'='sk' +) +``` + +其中 `S3(...)`是一个 TVF(Table Value Function)。Table Value Function 本质上是一张表,因此他可以出现在任意 SQL 语句中“表”可以出现的位置上。 + +TVF 的属性包括要分析的文件路径,文件格式、对象存储的连接信息等。其中文件路径(URI)可以使用通配符匹配多个文件,以下的文件路径都是合法的: + +* 匹配指定的文件 -* [S3](../sql-manual/sql-functions/table-valued-functions/s3):支持 S3 兼容的对象存储上的文件分析。 + `s3://bucket/path/to/tvf_test/test.parquet` -* [HDFS](../data-operate/import/data-source/hdfs):支持 HDFS 上的文件分析。 +* 匹配所有 `test_` 开头的文件 -::: + `s3://bucket/path/to/tvf_test/test_*` -这里我们通过 S3 Table Value Function 举例说明如何进行文件分析。 +* 匹配所有 `.parquet` 后缀的文件 -## 自动推断文件列类型 + `s3://bucket/path/to/tvf_test/*.parquet` + +* 匹配 `tvf_test`目录下的所有文件 + + `s3://bucket/path/to/tvf_test/*` + +* 匹配文件名中包含 `test`的文件 + + `s3://bucket/path/to/tvf_test/*test*` + +### 自动推断文件列类型 + +可以通过 `DESC FUNCTION` 语法可以查看 TVF 的 Schema: ```sql -> DESC FUNCTION s3 ( - "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet", +DESC FUNCTION s3 ( + "URI" = "s3://bucket/path/to/tvf_test/test.parquet", "s3.access_key"= "ak", "s3.secret_key" = "sk", "format" = "parquet", @@ -66,79 +99,69 @@ under the License. 
| p_comment | TEXT | Yes | false | NULL | NONE | +---------------+--------------+------+-------+---------+-------+ ``` - -这里我们定义了一个 S3 Table Value Function: - -```sql -s3( - "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet", - "s3.access_key"= "ak", - "s3.secret_key" = "sk", - "format" = "parquet", - "use_path_style"="true") -``` -其中指定了文件的路径、连接信息、认证信息等。 +Doris 根据以下规则推断 Schema: -之后,通过 `DESC FUNCTION` 语法可以查看这个文件的 Schema。 +* 对于 Parquet、ORC 格式,Doris 会根据文件元信息获取 Schema。 -可以看到,对于 Parquet 文件,Doris 会根据文件内的元信息自动推断列类型。 +* 对于匹配多个文件的情况,会使用第一个文件的 Schema 作为 TVF 的 Schema。 -目前支持对 Parquet、ORC、CSV、JSON 格式进行分析和列类型推断。 +* 对于 CSV、JSON 格式,Doris 会根据字段、分隔符等属性,解析**第一行数据**获取 Schema。 -**CSV Schema** + 默认情况下,所有列类型均为 `string`。可以通过 `csv_schema` 属性单独指定列名和列类型。Doris 会使用指定的列类型进行文件读取。格式如下:`name1:type1;name2:type2;...`。如: -在默认情况下,对 CSV 格式文件,所有列类型均为 String。可以通过 `csv_schema` 属性单独指定列名和列类型。Doris 会使用指定的列类型进行文件读取。格式如下: + ```sql + S3 ( + 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.region' = 'us-east-1', + 's3.access_key' = 'ak' + 's3.secret_key'='sk', + 'format' = 'csv', + 'column_separator' = '|', + 'csv_schema' = 'k1:int;k2:int;k3:int;k4:decimal(38,10)' + ) + ``` -`name1:type1;name2:type2;...` + 当前支持的列类型名称如下: -对于格式不匹配的列(比如文件中为字符串,用户定义为 int),或缺失列(比如文件中有 4 列,用户定义了 5 列),则这些列将返回 null。 + | 列类型名称 | + | ------------ | + | tinyint | + | smallint | + | int | + | bigint | + | largeint | + | float | + | double | + | decimal(p,s) | + | date | + | datetime | + | char | + | varchar | + | string | + | boolean | -当前支持的列类型为: +* 对于格式不匹配的列(比如文件中为字符串,用户定义为 `int`;或者其他文件和第一个文件的 Schema 不相同),或缺失列(比如文件中有 4 列,用户定义了 5 列),则这些列将返回 `null`。 -| 名称 | 映射类型 | -| --- | --- | -|tinyint |tinyint | -|smallint |smallint | -|int |int | -| bigint | bigint | -| largeint | largeint | -| float| float | -| double| double| -| decimal(p,s) | decimalv3(p,s) | -| date | datev2 | -| datetime | datetimev2 | -| char |string | -|varchar |string | -|string|string | -|boolean| boolean | +## 适用场景 -示例: +### 查询分析 -``` -s3 ( - "URI" = "https://bucket1/inventory.dat", - "s3.access_key"= "ak", - "s3.secret_key" = "sk", - "format" = "csv", - "column_separator" = "|", - "csv_schema" = "k1:int;k2:int;k3:int;k4:decimal(38,10)", - "use_path_style"="true" -) -``` +TVF 非常适用于对存储系统上的独立文件进行直接分析,而无需事先将数据导入到 Doris 中。 -## 查询分析 - -你可以使用任意的 SQL 语句对这个文件进行分析 +可以使用任意的 SQL 语句进行文件分析,如: ```sql SELECT * FROM s3( - "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet", - "s3.access_key"= "ak", - "s3.secret_key" = "sk", - "format" = "parquet", - "use_path_style"="true") -LIMIT 5; + 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.region' = 'us-east-1', + 's3.access_key' = 'ak', + 's3.secret_key'='sk' +) +ORDER BY p_partkey LIMIT 5; +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+ | p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment | +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+ @@ -150,34 +173,38 @@ LIMIT 5; +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+ ``` -Table Value Function 可以出现在 SQL 中,Table 能出现的任意位置。如 CTE 的 WITH 子句中,FROM 子句中。 
+TVF 可以出现在 SQL 中,Table 能出现的任意位置。如 `CTE` 的 `WITH` 子句中,`FROM` 子句中等等。这样,您可以把文件当做一张普通的表进行任意分析。 -这样,你可以把文件当做一张普通的表进行任意分析。 - -你也可以用过 `CREATE VIEW` 语句为 Table Value Function 创建一个逻辑视图。这样,你可以想其他视图一样,对这个 Table Value Function 进行访问、权限管理等操作,也可以让其他用户访问这个 Table Value Function。 +您也可以用过 `CREATE VIEW` 语句为 TVF 创建一个逻辑视图。之后,可以像其他视图一样,对这个 TVF 进行访问、权限管理等操作,也可以让其他用户访问这个 View,而无需重复书写连接信息等属性。 ```sql -CREATE VIEW v1 AS +-- Create a view based on a TVF +CREATE VIEW tvf_view AS SELECT * FROM s3( - "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet", - "s3.access_key"= "ak", - "s3.secret_key" = "sk", - "format" = "parquet", - "use_path_style"="true"); + 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.region' = 'us-east-1', + 's3.access_key' = 'ak', + 's3.secret_key'='sk' +); -DESC v1; +-- Describe the view as usual +DESC tvf_view; -SELECT * FROM v1; +-- Query the view as usual +SELECT * FROM tvf_view; -GRANT SELECT_PRIV ON db1.v1 TO user1; +-- Grant SELECT priv to other user on this view +GRANT SELECT_PRIV ON db.tvf_view TO other_user; ``` -## 数据导入 +### 数据导入 -配合 `INSERT INTO SELECT` 语法,我们可以方便将文件导入到 Doris 表中进行更快速的分析: +TVF 可以作为 Doris 数据导入方式的一种。配合 `INSERT INTO SELECT` 语法,我们可以很方便的将文件导入到 Doris 中。 ```sql -// 1. 创建doris内部表 +-- Create a Doris table CREATE TABLE IF NOT EXISTS test_table ( id int, @@ -187,20 +214,21 @@ CREATE TABLE IF NOT EXISTS test_table DISTRIBUTED BY HASH(id) BUCKETS 4 PROPERTIES("replication_num" = "1"); -// 2. 使用 S3 Table Value Function 插入数据 +-- 2. Load data into table from TVF INSERT INTO test_table (id,name,age) SELECT cast(id as INT) as id, name, cast (age as INT) as age FROM s3( - "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet", - "s3.access_key"= "ak", - "s3.secret_key" = "sk", - "format" = "parquet", - "use_path_style" = "true"); -``` + 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.region' = 'us-east-1', + 's3.access_key' = 'ak', + 's3.secret_key'='sk' +); +``` ## 注意事项 -1. 如果 `S3 / hdfs` tvf 指定的 uri 匹配不到文件,或者匹配到的所有文件都是空文件,那么 `S3 / hdfs` tvf 将会返回空结果集。在这种情况下使用`DESC FUNCTION`查看这个文件的 Schema,会得到一列虚假的列`__dummy_col`,可忽略这一列。 - -2. 如果指定 tvf 的 format 为 csv,所读文件不为空文件但文件第一行为空,则会提示错误`The first line is empty, can not parse column numbers`, 这因为无法通过该文件的第一行解析出 schema。 +1. 如果指定的 `uri` 匹配不到文件,或者匹配到的所有文件都是空文件,那么 TVF 将会返回空结果集。在这种情况下使用`DESC FUNCTION`查看这个 TVF 的 Schema,会得到一列虚拟的列`__dummy_col`,该列无意义,仅作为占位符使用。 +2. 如果指定的文件格式为 `csv`,所读文件不为空文件但文件第一行为空,则会提示错误`The first line is empty, can not parse column numbers`,这是因为无法通过该文件的第一行解析出 Schema。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/export/export-best-practice.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/export/export-best-practice.md new file mode 100644 index 00000000000..fd48a64f81e --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/export/export-best-practice.md @@ -0,0 +1,102 @@ +--- +{ + "title": "最佳实践", + "language": "zh-CN" +} +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. 
You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +本文档主要用于介绍在进行数据导出操作中,如何判断资源利用是否合理,以及如何调整资源利用率已达到更好的数据导出效率。 + +## SELECT INTO OUTFILE + +1. 开启并行导出 + + `SELECT INTO OUTFILE` 返回的行数即代表并行的 Writer 数量。Writer 的数量越多,则导出的并发度越高,但写出的文件数量也会越多。如果发现只有一个 Writer,则可以尝试开启并行导出功能。 + + `SET enable_parallel_outfile=true` + + 开启后,`SELECT INTO OUTFILE` 操作为根据查询的并行度来生成对应数量的 Writer。查询并行度由会话变量 `parallel_pipeline_task_num` 控制。默认为单 BE CPU 核数的一半。 + + 比如在一个 3 BE 节点的集群中,每个节点的 CPU 核数为 8。则开启并行导出情况下,会产生(4*3=)12 个 Writer。 + + 注意,即使开启了并行导出功能,也不是所有查询都能够并行导出。比如查询中包含全局排序、聚合语义时,则是无法并行导出的。如: + + ``` + SELECT * FROM table ORDER BY id; + + SELECT SUM(cost) FROM table; + ``` + +2. 判断导出速度 + + `SELECT INTO OUTFILE` 返回的每一行结果中,都带有对应的 Writer 的写出时的时间(单位:秒)和速度(单位:KB/s)。 + + 将同一个节点的多个 Writer 的速度相加,即为单个节点的写出速度。可以用这个速度,和磁盘带宽(比如导出到本地)或网络带宽(比如导出到对象存储)进行比较,看是否已经达到带宽瓶颈。 + +## Export + +1. 根据返回结果判断导出执行情况 + + Export 命令本质上是将任务拆分成多个 `SELECT INTO OUTFILE` 子句进行执行。 + + 通过 `SHOW EXPORT` 命令返回的结果中包含一个 Json 字符串,是一个二维数组。第一维代表 Export 并发的线程数,并发多少个线程代表并发发起了多少个 Outfile 语句。第二维代表单个 Outfile 语句的返回结果。示例: + + ``` + [ + [ + { + "fileNumber": "1", + "totalRows": "640321", + "fileSize": "350758307", + "url": "file:///127.0.0.1/mnt/disk2/ftw/tmp/export/exp_59fd917c43874adc-9b1c3e9cd6e655be_*", + "writeTime": "17.989", + "writeSpeed": "19041.66" + }, + {...}, + {...}, + {...} + ], + [ + { + "fileNumber": "1", + "totalRows": "646609", + "fileSize": "354228704", + "url": "file:///127.0.0.1/mnt/disk2/ftw/tmp/export/exp_c75b9d4b59bf4943-92eb94a7b97e46cb_*", + "writeTime": "17.249", + "writeSpeed": "20054.64" + }, + {...}, + {...}, + {...} + ] + ] + ``` + + 上面的示例中,发起了 2 个 Outfile 命令。每个命令有 4 个 Writer 并发写出。 + + 通过调整 Export 命令属性中的 `parallelism` 参数,可以控制并发 Outfile 的个数,从而控制并发度。 + + + + + + diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/export/outfile.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/export/outfile.md index c2ac0439b9b..3b46b1d8f9d 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/export/outfile.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/export/outfile.md @@ -116,6 +116,16 @@ PROPERTIES( * csv\_with\_names * csv\_with\_names\_and\_types +### 导出并发度 + +可以通过会话参数 `enable_parallel_outfile` 开启并发导出。 + +`SET enable_parallel_outfile=true;` + +并发导出会利用多节点、多线程导出结果数据,以提升整体的导出效率。但并发导出可能会产生更多的文件。 + +注意,某些查询即使打开此参数,也无法执行并发导出,如包含全局排序的查询。如果导出命令返回的行数大于 1 行,则表示开启了并发导出。 + ## 导出示例 ### 导出到开启了高可用的 HDFS 集群 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/file.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/file.md index 51cf2e86b26..71b578923eb 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/file.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/file.md @@ -5,7 +5,7 @@ } --- -<!-- +<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information @@ -24,29 +24,62 @@ specific language governing permissions and limitations under the License. 
--> +通过 Table Value Function 功能,Doris 可以直接将对象存储或 HDFS 上的文件作为 Table 进行查询分析。并且支持自动的列类型推断。 +更多使用方式可参阅 Table Value Function 文档: -通过 Table Value Function 功能,Doris 可以直接将对象存储或 HDFS 上的文件作为 Table 进行查询分析。并且支持自动的列类型推断。 +* [S3](../sql-manual/sql-functions/table-valued-functions/s3.md):支持 S3 兼容的对象存储上的文件分析。 -:::tip +* [HDFS](../sql-manual/sql-functions/table-valued-functions/hdfs.md):支持 HDFS 上的文件分析。 -**使用方式** +## 基础使用 -更多使用方式可参阅 Table Value Function 文档: +这里我们通过 S3 Table Value Function 举例说明如何对对象存储上的文件进行分析。 + +### 查询 + +```sql +SELECT * FROM S3 ( + 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.region' = 'us-east-1', + 's3.access_key' = 'ak', + 's3.secret_key'='sk' +) +``` + +其中 `S3(...)`是一个 TVF(Table Value Function)。Table Value Function 本质上是一张表,因此他可以出现在任意 SQL 语句中“表”可以出现的位置上。 + +TVF 的属性包括要分析的文件路径,文件格式、对象存储的连接信息等。其中文件路径(URI)可以使用通配符匹配多个文件,以下的文件路径都是合法的: + +* 匹配指定的文件 -* [S3](../sql-manual/sql-functions/table-valued-functions/s3):支持 S3 兼容的对象存储上的文件分析。 + `s3://bucket/path/to/tvf_test/test.parquet` -* [HDFS](../data-operate/import/data-source/hdfs):支持 HDFS 上的文件分析。 +* 匹配所有 `test_` 开头的文件 -::: + `s3://bucket/path/to/tvf_test/test_*` -这里我们通过 S3 Table Value Function 举例说明如何进行文件分析。 +* 匹配所有 `.parquet` 后缀的文件 -## 自动推断文件列类型 + `s3://bucket/path/to/tvf_test/*.parquet` + +* 匹配 `tvf_test`目录下的所有文件 + + `s3://bucket/path/to/tvf_test/*` + +* 匹配文件名中包含 `test`的文件 + + `s3://bucket/path/to/tvf_test/*test*` + +### 自动推断文件列类型 + +可以通过 `DESC FUNCTION` 语法可以查看 TVF 的 Schema: ```sql -> DESC FUNCTION s3 ( - "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet", +DESC FUNCTION s3 ( + "URI" = "s3://bucket/path/to/tvf_test/test.parquet", "s3.access_key"= "ak", "s3.secret_key" = "sk", "format" = "parquet", @@ -66,79 +99,69 @@ under the License. 
| p_comment | TEXT | Yes | false | NULL | NONE | +---------------+--------------+------+-------+---------+-------+ ``` - -这里我们定义了一个 S3 Table Value Function: - -```sql -s3( - "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet", - "s3.access_key"= "ak", - "s3.secret_key" = "sk", - "format" = "parquet", - "use_path_style"="true") -``` -其中指定了文件的路径、连接信息、认证信息等。 +Doris 根据以下规则推断 Schema: -之后,通过 `DESC FUNCTION` 语法可以查看这个文件的 Schema。 +* 对于 Parquet、ORC 格式,Doris 会根据文件元信息获取 Schema。 -可以看到,对于 Parquet 文件,Doris 会根据文件内的元信息自动推断列类型。 +* 对于匹配多个文件的情况,会使用第一个文件的 Schema 作为 TVF 的 Schema。 -目前支持对 Parquet、ORC、CSV、JSON 格式进行分析和列类型推断。 +* 对于 CSV、JSON 格式,Doris 会根据字段、分隔符等属性,解析**第一行数据**获取 Schema。 -**CSV Schema** + 默认情况下,所有列类型均为 `string`。可以通过 `csv_schema` 属性单独指定列名和列类型。Doris 会使用指定的列类型进行文件读取。格式如下:`name1:type1;name2:type2;...`。如: -在默认情况下,对 CSV 格式文件,所有列类型均为 String。可以通过 `csv_schema` 属性单独指定列名和列类型。Doris 会使用指定的列类型进行文件读取。格式如下: + ```sql + S3 ( + 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.region' = 'us-east-1', + 's3.access_key' = 'ak' + 's3.secret_key'='sk', + 'format' = 'csv', + 'column_separator' = '|', + 'csv_schema' = 'k1:int;k2:int;k3:int;k4:decimal(38,10)' + ) + ``` -`name1:type1;name2:type2;...` + 当前支持的列类型名称如下: -对于格式不匹配的列(比如文件中为字符串,用户定义为 int),或缺失列(比如文件中有 4 列,用户定义了 5 列),则这些列将返回 null。 + | 列类型名称 | + | ------------ | + | tinyint | + | smallint | + | int | + | bigint | + | largeint | + | float | + | double | + | decimal(p,s) | + | date | + | datetime | + | char | + | varchar | + | string | + | boolean | -当前支持的列类型为: +* 对于格式不匹配的列(比如文件中为字符串,用户定义为 `int`;或者其他文件和第一个文件的 Schema 不相同),或缺失列(比如文件中有 4 列,用户定义了 5 列),则这些列将返回 `null`。 -| 名称 | 映射类型 | -| --- | --- | -|tinyint |tinyint | -|smallint |smallint | -|int |int | -| bigint | bigint | -| largeint | largeint | -| float| float | -| double| double| -| decimal(p,s) | decimalv3(p,s) | -| date | datev2 | -| datetime | datetimev2 | -| char |string | -|varchar |string | -|string|string | -|boolean| boolean | +## 适用场景 -示例: +### 查询分析 -``` -s3 ( - "URI" = "https://bucket1/inventory.dat", - "s3.access_key"= "ak", - "s3.secret_key" = "sk", - "format" = "csv", - "column_separator" = "|", - "csv_schema" = "k1:int;k2:int;k3:int;k4:decimal(38,10)", - "use_path_style"="true" -) -``` +TVF 非常适用于对存储系统上的独立文件进行直接分析,而无需事先将数据导入到 Doris 中。 -## 查询分析 - -你可以使用任意的 SQL 语句对这个文件进行分析 +可以使用任意的 SQL 语句进行文件分析,如: ```sql SELECT * FROM s3( - "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet", - "s3.access_key"= "ak", - "s3.secret_key" = "sk", - "format" = "parquet", - "use_path_style"="true") -LIMIT 5; + 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.region' = 'us-east-1', + 's3.access_key' = 'ak', + 's3.secret_key'='sk' +) +ORDER BY p_partkey LIMIT 5; +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+ | p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment | +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+ @@ -150,34 +173,38 @@ LIMIT 5; +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+ ``` -Table Value Function 可以出现在 SQL 中,Table 能出现的任意位置。如 CTE 的 WITH 子句中,FROM 子句中。 
+TVF 可以出现在 SQL 中,Table 能出现的任意位置。如 `CTE` 的 `WITH` 子句中,`FROM` 子句中等等。这样,您可以把文件当做一张普通的表进行任意分析。 -这样,你可以把文件当做一张普通的表进行任意分析。 - -你也可以用过 `CREATE VIEW` 语句为 Table Value Function 创建一个逻辑视图。这样,你可以想其他视图一样,对这个 Table Value Function 进行访问、权限管理等操作,也可以让其他用户访问这个 Table Value Function。 +您也可以用过 `CREATE VIEW` 语句为 TVF 创建一个逻辑视图。之后,可以像其他视图一样,对这个 TVF 进行访问、权限管理等操作,也可以让其他用户访问这个 View,而无需重复书写连接信息等属性。 ```sql -CREATE VIEW v1 AS +-- Create a view based on a TVF +CREATE VIEW tvf_view AS SELECT * FROM s3( - "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet", - "s3.access_key"= "ak", - "s3.secret_key" = "sk", - "format" = "parquet", - "use_path_style"="true"); + 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.region' = 'us-east-1', + 's3.access_key' = 'ak', + 's3.secret_key'='sk' +); -DESC v1; +-- Describe the view as usual +DESC tvf_view; -SELECT * FROM v1; +-- Query the view as usual +SELECT * FROM tvf_view; -GRANT SELECT_PRIV ON db1.v1 TO user1; +-- Grant SELECT priv to other user on this view +GRANT SELECT_PRIV ON db.tvf_view TO other_user; ``` -## 数据导入 +### 数据导入 -配合 `INSERT INTO SELECT` 语法,我们可以方便将文件导入到 Doris 表中进行更快速的分析: +TVF 可以作为 Doris 数据导入方式的一种。配合 `INSERT INTO SELECT` 语法,我们可以很方便的将文件导入到 Doris 中。 ```sql -// 1. 创建doris内部表 +-- Create a Doris table CREATE TABLE IF NOT EXISTS test_table ( id int, @@ -187,20 +214,21 @@ CREATE TABLE IF NOT EXISTS test_table DISTRIBUTED BY HASH(id) BUCKETS 4 PROPERTIES("replication_num" = "1"); -// 2. 使用 S3 Table Value Function 插入数据 +-- 2. Load data into table from TVF INSERT INTO test_table (id,name,age) SELECT cast(id as INT) as id, name, cast (age as INT) as age FROM s3( - "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet", - "s3.access_key"= "ak", - "s3.secret_key" = "sk", - "format" = "parquet", - "use_path_style" = "true"); -``` + 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.region' = 'us-east-1', + 's3.access_key' = 'ak', + 's3.secret_key'='sk' +); +``` ## 注意事项 -1. 如果 `S3 / hdfs` tvf 指定的 uri 匹配不到文件,或者匹配到的所有文件都是空文件,那么 `S3 / hdfs` tvf 将会返回空结果集。在这种情况下使用`DESC FUNCTION`查看这个文件的 Schema,会得到一列虚假的列`__dummy_col`,可忽略这一列。 - -2. 如果指定 tvf 的 format 为 csv,所读文件不为空文件但文件第一行为空,则会提示错误`The first line is empty, can not parse column numbers`, 这因为无法通过该文件的第一行解析出 schema。 +1. 如果指定的 `uri` 匹配不到文件,或者匹配到的所有文件都是空文件,那么 TVF 将会返回空结果集。在这种情况下使用`DESC FUNCTION`查看这个 TVF 的 Schema,会得到一列虚拟的列`__dummy_col`,该列无意义,仅作为占位符使用。 +2. 
如果指定的文件格式为 `csv`,所读文件不为空文件但文件第一行为空,则会提示错误`The first line is empty, can not parse column numbers`,这是因为无法通过该文件的第一行解析出 Schema。 diff --git a/sidebars.json b/sidebars.json index d51e3a8a06c..2adf64e17e6 100644 --- a/sidebars.json +++ b/sidebars.json @@ -239,7 +239,8 @@ "data-operate/export/export-overview", "data-operate/export/export-manual", "data-operate/export/outfile", - "data-operate/export/export-with-mysql-dump" + "data-operate/export/export-with-mysql-dump", + "data-operate/export/export-best-practice" ] }, { @@ -2127,4 +2128,4 @@ ] } ] -} \ No newline at end of file +} diff --git a/versioned_docs/version-2.1/data-operate/export/export-best-practice.md b/versioned_docs/version-2.1/data-operate/export/export-best-practice.md new file mode 100644 index 00000000000..6f3d6e95da9 --- /dev/null +++ b/versioned_docs/version-2.1/data-operate/export/export-best-practice.md @@ -0,0 +1,97 @@ +--- +{ + "title": "Best Practices", + "language": "en" +} +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +This document mainly introduces how to determine whether resource utilization is reasonable during data export operations, and how to adjust resource utilization to achieve better data export efficiency. + +## SELECT INTO OUTFILE + +1. Enable parallel export + + The number of rows returned by `SELECT INTO OUTFILE` represents the number of parallel Writers. The more Writers, the higher the export concurrency, but the number of output files will also increase. If you find there is only one Writer, you can try enabling the parallel export feature. + + `SET enable_parallel_outfile=true` + + After enabling, the `SELECT INTO OUTFILE` operation generates the corresponding number of Writers based on the query's parallelism. The query parallelism is controlled by the session variable `parallel_pipeline_task_num`. By default, it's half of the number of CPU cores per BE. + + For example, in a cluster with 3 BE nodes, where each node has 8 CPU cores. When parallel export is enabled, it will generate (4*3=) 12 Writers. + + Note that even if `enable_parallel_outfile` is enabled, not all queries can be exported in parallel. For example, if the query contains global sorting and aggregation semantics, it cannot be exported in parallel. For example: + + ``` + SELECT * FROM table ORDER BY id; + + SELECT SUM(cost) FROM table; + ``` + +2. Determine export speed + + Each row result returned by `SELECT INTO OUTFILE` contains the time (in seconds) and speed (in KB/s) of the corresponding Writer's output. + + Adding up the speeds of multiple Writers on the same node gives you the write speed of a single node. You can compare this speed with disk bandwidth (e.g., when exporting to local) or network bandwidth (e.g., when exporting to object storage) to see if it has reached the bandwidth bottleneck. 
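To tie the two points above together, here is a minimal, hypothetical sketch — the table name, output path, and credentials are placeholders, not values from this commit — of enabling parallel export and running an export whose result rows (one per Writer) carry the write time and speed described above:

```sql
-- Hypothetical example: table name, bucket, and credentials are placeholders.
SET enable_parallel_outfile = true;

SELECT * FROM sales_table
INTO OUTFILE "s3://bucket/export/result_"
FORMAT AS PARQUET
PROPERTIES
(
    "s3.endpoint" = "https://s3.us-east-1.amazonaws.com",
    "s3.region" = "us-east-1",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk"
);
-- Each returned row corresponds to one Writer; sum the speeds of the Writers
-- on a node and compare the total with that node's disk or network bandwidth.
```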
+ +## Export + +1. Determine export execution status from the return results + + The Export command essentially breaks down the task into multiple `SELECT INTO OUTFILE` clauses for execution. + + The results returned by the `SHOW EXPORT` command contain a JSON string, which is a two-dimensional array. The first dimension represents the number of concurrent threads in Export, with the number of concurrent threads representing how many Outfile statements were initiated concurrently. The second dimension represents the return results of a single Outfile statement. Example: + + ``` + [ + [ + { + "fileNumber": "1", + "totalRows": "640321", + "fileSize": "350758307", + "url": "file:///127.0.0.1/mnt/disk2/ftw/tmp/export/exp_59fd917c43874adc-9b1c3e9cd6e655be_*", + "writeTime": "17.989", + "writeSpeed": "19041.66" + }, + {...}, + {...}, + {...} + ], + [ + { + "fileNumber": "1", + "totalRows": "646609", + "fileSize": "354228704", + "url": "file:///127.0.0.1/mnt/disk2/ftw/tmp/export/exp_c75b9d4b59bf4943-92eb94a7b97e46cb_*", + "writeTime": "17.249", + "writeSpeed": "20054.64" + }, + {...}, + {...}, + {...} + ] + ] + ``` + + In the above example, 2 Outfile commands were initiated. Each command has 4 Writers writing concurrently. + + By adjusting the `parallelism` parameter in the Export command properties, you can control the number of concurrent Outfile operations, thereby controlling the concurrency level. + diff --git a/versioned_docs/version-2.1/data-operate/export/outfile.md b/versioned_docs/version-2.1/data-operate/export/outfile.md index 8d774d8fa32..c41cb32ba33 100644 --- a/versioned_docs/version-2.1/data-operate/export/outfile.md +++ b/versioned_docs/version-2.1/data-operate/export/outfile.md @@ -117,6 +117,16 @@ The `SELECT INTO OUTFILE` currently supports exporting the following file format * csv\_with\_names * csv\_with\_names\_and\_types +### Export concurrency + +You can enable concurrent export through the session variable `enable_parallel_outfile`. + +`SET enable_parallel_outfile=true;` + +Concurrent export will use multi-node and multi-thread to export result data to improve the overall export throughout. However, concurrent export may generate more files. + +Note that some queries cannot perform concurrent export even if this variable is turned on, such as queries containing global sorting. If the number of rows returned by the export command is greater than 1, it means that concurrent export is enabled. + ## Export Examples ### Export to an HDFS Cluster with High Availability Enabled @@ -231,33 +241,29 @@ Since `"max_file_size" = "2048MB"` is specified, if the final generated file is ## Notice -1. Limitations when using concurrent Outfile - - The current version of the pipeline engine does not support concurrent Outfile. Therefore, if the pipeline engine is enabled, concurrent Outfile will revert to single-threaded export. - -2. Export Data Volume and Export Efficiency +1. Export Data Volume and Export Efficiency The `SELECT INTO OUTFILE` function is essentially executing an SQL query command. If concurrent exports are not enabled, the query results are exported by a single BE node in a single thread. Therefore, the entire export time includes the time consumed by the query itself and the time consumed by writing out the final result set. Enabling concurrent exports can reduce the export time. -3. Export Timeout +2. Export Timeout The timeout period of the export command is the same as that of the query. 
If the export data times out due to a large amount of data, you can set the session variable `query_timeout` to appropriately extend the query timeout period. -4. Management of Exported Files +3. Management of Exported Files Doris does not manage the exported files. Whether the files are successfully exported or left over after a failed export, users need to handle them on their own. In addition, the `SELECT INTO OUTFILE` command does not check whether files or file paths exist. Whether the `SELECT INTO OUTFILE` command will automatically create paths or overwrite existing files is completely determined by the semantics of the remote storage system. -5. If the Query Result Set Is Empty +4. If the Query Result Set Is Empty For an export with an empty result set, an empty file will still be generated. -6. File Splitting +5. File Splitting File splitting ensures that a row of data is completely stored in a single file. Therefore, the size of the file is not strictly equal to `max_file_size`. -7. Functions with Non-visible Characters +6. Functions with Non-visible Characters For some functions whose output is non-visible characters, such as BITMAP and HLL types, when exported to the CSV file format, the output is `\N`. diff --git a/versioned_docs/version-2.1/lakehouse/file.md b/versioned_docs/version-2.1/lakehouse/file.md index 5694086386a..fff76453e16 100644 --- a/versioned_docs/version-2.1/lakehouse/file.md +++ b/versioned_docs/version-2.1/lakehouse/file.md @@ -1,11 +1,11 @@ --- { - "title": "Querying Files on S3/HDFS", + "title": "Analyze Files on S3/HDFS", "language": "en" } --- -<!-- +<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information @@ -24,25 +24,62 @@ specific language governing permissions and limitations under the License. --> -With the Table Value Function feature, Doris is able to query files in object storage or HDFS as simply as querying Tables. In addition, it supports automatic column type inference. +Through the Table Value Function feature, Doris can directly query and analyze files on object storage or HDFS as a Table. It also supports automatic column type inference. -## Usage +For more usage methods, refer to the Table Value Function documentation: -For more usage details, please see the documentation: +* [S3](../sql-manual/sql-functions/table-valued-functions/s3.md): Supports file analysis on S3-compatible object storage. -* [S3](../sql-manual/sql-functions/table-valued-functions/s3.md): supports file analysis on object storage compatible with S3 +* [HDFS](../sql-manual/sql-functions/table-valued-functions/hdfs.md): Supports file analysis on HDFS. -* [HDFS](../sql-manual/sql-functions/table-valued-functions/hdfs.md): supports file analysis on HDFS +## Basic Usage -* [LOCAL](../sql-manual/sql-functions/table-valued-functions/local.md): supports file analysis on local file system +Here we illustrate how to analyze files on object storage using the S3 Table Value Function as an example. -The followings illustrate how file analysis is conducted with the example of S3 Table Value Function. +### Query -### Automatic Column Type Inference +```sql +SELECT * FROM S3 ( + 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.region' = 'us-east-1', + 's3.access_key' = 'ak', + 's3.secret_key'='sk' +) +``` + +The `S3(...)` is a TVF (Table Value Function). 
A Table Value Function is essentially a table, so it can appear in any SQL statement where a "table" can appear. + +The attributes of a TVF include the file path to be analyzed, file format, connection information of the object storage, etc. The file path (URI) can use wildcards to match multiple files. The following file paths are valid: + +* Match a specific file + + `s3://bucket/path/to/tvf_test/test.parquet` + +* Match all files starting with `test_` + + `s3://bucket/path/to/tvf_test/test_*` + +* Match all files with the `.parquet` suffix + + `s3://bucket/path/to/tvf_test/*.parquet` + +* Match all files in the `tvf_test` directory + + `s3://bucket/path/to/tvf_test/*` + +* Match files with `test` in the filename + + `s3://bucket/path/to/tvf_test/*test*` + +### Automatic Inference of File Column Types + +You can view the Schema of a TVF using the `DESC FUNCTION` syntax: ```sql -> DESC FUNCTION s3 ( - "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet", +DESC FUNCTION s3 ( + "URI" = "s3://bucket/path/to/tvf_test/test.parquet", "s3.access_key"= "ak", "s3.secret_key" = "sk", "format" = "parquet", @@ -63,78 +100,68 @@ The followings illustrate how file analysis is conducted with the example of S3 +---------------+--------------+------+-------+---------+-------+ ``` -An S3 Table Value Function is defined as follows: +Doris infers the Schema based on the following rules: -```sql -s3( - "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet", - "s3.access_key"= "ak", - "s3.secret_key" = "sk", - "Format" = "parquet", - "use_path_style"="true") -``` +* For Parquet and ORC formats, Doris obtains the Schema from the file metadata. -It specifies the file path, connection, and authentication. +* In the case of matching multiple files, the Schema of the first file is used as the TVF's Schema. -After defining, you can view the schema of this file using the `DESC FUNCTION` statement. +* For CSV and JSON formats, Doris parses the **first line of data** to obtain the Schema based on fields, delimiters, etc. -As can be seen, Doris is able to automatically infer column types based on the metadata of the Parquet file. + By default, all column types are `string`. You can specify column names and types individually using the `csv_schema` attribute. Doris will use the specified column types for file reading. The format is: `name1:type1;name2:type2;...`. For example: -Besides Parquet, Doris supports analysis and auto column type inference of ORC, CSV, and Json files. + ```sql + S3 ( + 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.region' = 'us-east-1', + 's3.access_key' = 'ak' + 's3.secret_key'='sk', + 'format' = 'csv', + 'column_separator' = '|', + 'csv_schema' = 'k1:int;k2:int;k3:int;k4:decimal(38,10)' + ) + ``` -**CSV Schema** + The currently supported column type names are as follows: -By default, for CSV format files, all columns are of type String. Column names and column types can be specified individually via the `csv_schema` attribute. Doris will use the specified column type for file reading. 
The format is as follows: + | Column Type Name | + | ------------ | + | tinyint | + | smallint | + | int | + | bigint | + | largeint | + | float | + | double | + | decimal(p,s) | + | date | + | datetime | + | char | + | varchar | + | string | + | boolean | -`name1:type1;name2:type2;...` +* For columns with mismatched formats (e.g., the file contains a string, but the user defines it as `int`; or other files have a different Schema than the first file), or missing columns (e.g., the file has 4 columns, but the user defines 5 columns), these columns will return `null`. -For columns with mismatched formats (such as string in the file and int defined by the user), or missing columns (such as 4 columns in the file and 5 columns defined by the user), these columns will return null. +## Applicable Scenarios -Currently supported column types are: +### Query Analysis -| name | mapping type | -| --- | --- | -|tinyint |tinyint | -|smallint |smallint | -|int |int | -| bigint | bigint | -| largeint | largeint | -| float| float | -| double| double| -| decimal(p,s) | decimalv3(p,s) | -| date | datev2 | -| datetime | datetimev2 | -| char |string | -|varchar |string | -|string|string | -|boolean| boolean | +TVF is very suitable for directly analyzing independent files on storage systems without having to import the data into Doris in advance. -Example: - -```sql -s3 ( - "uri" = "https://bucket1/inventory.dat", - "s3.access_key"= "ak", - "s3.secret_key" = "sk", - "format" = "csv", - "column_separator" = "|", - "csv_schema" = "k1:int;k2:int;k3:int;k4:decimal(38,10)", - "use_path_style"="true" -) -``` - -### Query and Analysis - -You can conduct queries and analysis on this Parquet file using any SQL statements: +You can use any SQL statement for file analysis, such as: ```sql SELECT * FROM s3( - "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet", - "s3.access_key"= "ak", - "s3.secret_key" = "sk", - "format" = "parquet", - "use_path_style"="true") -LIMIT 5; + 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.region' = 'us-east-1', + 's3.access_key' = 'ak', + 's3.secret_key'='sk' +) +ORDER BY p_partkey LIMIT 5; +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+ | p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment | +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+ @@ -146,32 +173,38 @@ LIMIT 5; +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+ ``` -You can put the Table Value Function anywhere that you used to put Table in the SQL, such as in the WITH or FROM clause in CTE. In this way, you can treat the file as a normal table and conduct analysis conveniently. +TVF can appear in any position in SQL where a Table can appear, such as in the `WITH` clause of a `CTE`, in the `FROM` clause, etc. This way, you can treat the file as a regular table for any analysis. -You can also create a logic view by using `CREATE VIEW` statement for a Table Value Function. So that you can query this view, grant priv on this view or allow other user to access this Table Value Function. 
+You can also create a logical view for a TVF using the `CREATE VIEW` statement. After that, you can access this TVF like other views, manage permissions, etc., and allow other users to access this View without having to repeatedly write connection information and other attributes. ```sql -CREATE VIEW v1 AS +-- Create a view based on a TVF +CREATE VIEW tvf_view AS SELECT * FROM s3( - "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet", - "s3.access_key"= "ak", - "s3.secret_key" = "sk", - "format" = "parquet", - "use_path_style"="true"); + 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.region' = 'us-east-1', + 's3.access_key' = 'ak', + 's3.secret_key'='sk' +); -DESC v1; +-- Describe the view as usual +DESC tvf_view; -SELECT * FROM v1; +-- Query the view as usual +SELECT * FROM tvf_view; -GRANT SELECT_PRIV ON db1.v1 TO user1; +-- Grant SELECT priv to other user on this view +GRANT SELECT_PRIV ON db.tvf_view TO other_user; ``` -### Data Ingestion +### Data Import -Users can ingest files into Doris tables via `INSERT INTO SELECT` for faster file analysis: +TVF can be used as a method for data import into Doris. With the `INSERT INTO SELECT` syntax, we can easily import files into Doris. ```sql -// 1. Create Doris internal table +-- Create a Doris table CREATE TABLE IF NOT EXISTS test_table ( id int, @@ -181,20 +214,21 @@ CREATE TABLE IF NOT EXISTS test_table DISTRIBUTED BY HASH(id) BUCKETS 4 PROPERTIES("replication_num" = "1"); -// 2. Insert data using S3 Table Value Function +-- 2. Load data into table from TVF INSERT INTO test_table (id,name,age) SELECT cast(id as INT) as id, name, cast (age as INT) as age FROM s3( - "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet", - "s3.access_key"= "ak", - "s3.secret_key" = "sk", - "format" = "parquet", - "use_path_style" = "true"); + 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.region' = 'us-east-1', + 's3.access_key' = 'ak', + 's3.secret_key'='sk' +); ``` +## Notes -### Note - -1. If the URI specified by the `S3 / HDFS` TVF is not matched with the file, or all the matched files are empty files, then the` S3 / HDFS` TVF will return to the empty result set. In this case, using the `DESC FUNCTION` to view the schema of this file, you will get a dummy column` __dummy_col`, which can be ignored. +1. If the specified `uri` does not match any files, or all matched files are empty, the TVF will return an empty result set. In this case, using `DESC FUNCTION` to view the Schema of this TVF will yield a virtual column `__dummy_col`, which is meaningless and only serves as a placeholder. -2. If the format of the TVF is specified to `CSV`, and the read file is not a empty file but the first line of this file is empty, then it will prompt the error `The first line is empty, can not parse column numbers`. This is because the schema cannot be parsed from the first line of the file +2. If the specified file format is `csv`, and the file read is not empty but the first line of the file is empty, an error `The first line is empty, can not parse column numbers` will be prompted, as the Schema cannot be parsed from the first line of the file. 
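Related to the import example above, a table can also be created and populated in a single step with `CREATE TABLE ... AS SELECT` over a TVF, which avoids declaring the target schema by hand. This is a hedged sketch, assuming CTAS is available in your Doris version; the table name, bucket, and credentials are illustrative placeholders reused from the earlier examples.

```sql
-- Sketch: create a Doris table directly from a TVF query (assumes CTAS support).
-- Distribution and properties follow the conventions used elsewhere in this document.
CREATE TABLE test_table_from_tvf
DISTRIBUTED BY HASH(p_partkey) BUCKETS 4
PROPERTIES("replication_num" = "1")
AS
SELECT p_partkey, p_name, p_size
FROM s3(
    'uri' = 's3://bucket/path/to/tvf_test/test.parquet',
    'format' = 'parquet',
    's3.endpoint' = 'https://s3.us-east-1.amazonaws.com',
    's3.region' = 'us-east-1',
    's3.access_key' = 'ak',
    's3.secret_key' = 'sk'
);
```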
diff --git a/versioned_docs/version-3.0/data-operate/export/export-best-practice.md b/versioned_docs/version-3.0/data-operate/export/export-best-practice.md new file mode 100644 index 00000000000..6f3d6e95da9 --- /dev/null +++ b/versioned_docs/version-3.0/data-operate/export/export-best-practice.md @@ -0,0 +1,97 @@ +--- +{ + "title": "Best Practices", + "language": "en" +} +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +This document mainly introduces how to determine whether resource utilization is reasonable during data export operations, and how to adjust resource utilization to achieve better data export efficiency. + +## SELECT INTO OUTFILE + +1. Enable parallel export + + The number of rows returned by `SELECT INTO OUTFILE` represents the number of parallel Writers. The more Writers, the higher the export concurrency, but the number of output files will also increase. If you find there is only one Writer, you can try enabling the parallel export feature. + + `SET enable_parallel_outfile=true` + + After enabling, the `SELECT INTO OUTFILE` operation generates the corresponding number of Writers based on the query's parallelism. The query parallelism is controlled by the session variable `parallel_pipeline_task_num`. By default, it's half of the number of CPU cores per BE. + + For example, in a cluster with 3 BE nodes where each node has 8 CPU cores, enabling parallel export will generate 4 * 3 = 12 Writers. + + Note that even if `enable_parallel_outfile` is enabled, not all queries can be exported in parallel. If the query contains global sorting or aggregation semantics, it cannot be exported in parallel. For example: + + ``` + SELECT * FROM table ORDER BY id; + + SELECT SUM(cost) FROM table; + ``` + +2. Determine export speed + + Each row result returned by `SELECT INTO OUTFILE` contains the time (in seconds) and speed (in KB/s) of the corresponding Writer's output. + + Adding up the speeds of multiple Writers on the same node gives you the write speed of a single node. You can compare this speed with disk bandwidth (e.g., when exporting to local) or network bandwidth (e.g., when exporting to object storage) to see if it has reached the bandwidth bottleneck. + +## Export + +1. Determine export execution status from the return results + + The Export command essentially breaks down the task into multiple `SELECT INTO OUTFILE` statements for execution. + + The results returned by the `SHOW EXPORT` command contain a JSON string, which is a two-dimensional array. The first dimension represents the number of concurrent threads in the Export job, i.e., how many Outfile statements were initiated concurrently. The second dimension represents the return results of a single Outfile statement.
Example: + + ``` + [ + [ + { + "fileNumber": "1", + "totalRows": "640321", + "fileSize": "350758307", + "url": "file:///127.0.0.1/mnt/disk2/ftw/tmp/export/exp_59fd917c43874adc-9b1c3e9cd6e655be_*", + "writeTime": "17.989", + "writeSpeed": "19041.66" + }, + {...}, + {...}, + {...} + ], + [ + { + "fileNumber": "1", + "totalRows": "646609", + "fileSize": "354228704", + "url": "file:///127.0.0.1/mnt/disk2/ftw/tmp/export/exp_c75b9d4b59bf4943-92eb94a7b97e46cb_*", + "writeTime": "17.249", + "writeSpeed": "20054.64" + }, + {...}, + {...}, + {...} + ] + ] + ``` + + In the above example, 2 Outfile commands were initiated. Each command has 4 Writers writing concurrently. + + By adjusting the `parallelism` parameter in the Export command properties, you can control the number of concurrent Outfile operations, thereby controlling the concurrency level. + diff --git a/versioned_docs/version-3.0/data-operate/export/outfile.md b/versioned_docs/version-3.0/data-operate/export/outfile.md index b66772a0823..c41cb32ba33 100644 --- a/versioned_docs/version-3.0/data-operate/export/outfile.md +++ b/versioned_docs/version-3.0/data-operate/export/outfile.md @@ -117,6 +117,16 @@ The `SELECT INTO OUTFILE` currently supports exporting the following file format * csv\_with\_names * csv\_with\_names\_and\_types +### Export concurrency + +You can enable concurrent export through the session variable `enable_parallel_outfile`. + +`SET enable_parallel_outfile=true;` + +Concurrent export uses multiple nodes and multiple threads to export the result data, improving the overall export throughput. However, concurrent export may generate more files. + +Note that some queries cannot be exported concurrently even if this variable is turned on, such as queries containing global sorting. If the number of rows returned by the export command is greater than 1, concurrent export took effect. + ## Export Examples ### Export to an HDFS Cluster with High Availability Enabled diff --git a/versioned_docs/version-3.0/lakehouse/file.md b/versioned_docs/version-3.0/lakehouse/file.md index 5694086386a..fff76453e16 100644 --- a/versioned_docs/version-3.0/lakehouse/file.md +++ b/versioned_docs/version-3.0/lakehouse/file.md @@ -1,11 +1,11 @@ --- { - "title": "Querying Files on S3/HDFS", + "title": "Analyze Files on S3/HDFS", "language": "en" } --- -<!-- +<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information @@ -24,25 +24,62 @@ specific language governing permissions and limitations under the License. --> -With the Table Value Function feature, Doris is able to query files in object storage or HDFS as simply as querying Tables. In addition, it supports automatic column type inference. +Through the Table Value Function feature, Doris can directly query and analyze files on object storage or HDFS as a Table. It also supports automatic column type inference. -## Usage +For more usage methods, refer to the Table Value Function documentation: -For more usage details, please see the documentation: +* [S3](../sql-manual/sql-functions/table-valued-functions/s3.md): Supports file analysis on S3-compatible object storage. -* [S3](../sql-manual/sql-functions/table-valued-functions/s3.md): supports file analysis on object storage compatible with S3 +* [HDFS](../sql-manual/sql-functions/table-valued-functions/hdfs.md): Supports file analysis on HDFS.
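To make the `### Export concurrency` section above more concrete, here is a minimal usage sketch of a parallel `SELECT INTO OUTFILE`. The S3 path and credentials are placeholders, `test_table` is the illustrative table created in the import example of this documentation, and whether the export actually runs in parallel still depends on the query shape, as noted above.

```sql
-- Enable parallel export for the current session.
SET enable_parallel_outfile = true;

-- A plain filtered scan usually qualifies for parallel export;
-- each parallel Writer produces its own file with the given prefix,
-- and one result row is returned per Writer.
SELECT * FROM test_table WHERE age > 20
INTO OUTFILE "s3://bucket/path/to/export/result_"
FORMAT AS parquet
PROPERTIES(
    "s3.endpoint" = "https://s3.us-east-1.amazonaws.com",
    "s3.region" = "us-east-1",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk"
);
```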
-* [HDFS](../sql-manual/sql-functions/table-valued-functions/hdfs.md): supports file analysis on HDFS +## Basic Usage -* [LOCAL](../sql-manual/sql-functions/table-valued-functions/local.md): supports file analysis on local file system +Here we illustrate how to analyze files on object storage using the S3 Table Value Function as an example. -The followings illustrate how file analysis is conducted with the example of S3 Table Value Function. +### Query -### Automatic Column Type Inference +```sql +SELECT * FROM S3 ( + 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.region' = 'us-east-1', + 's3.access_key' = 'ak', + 's3.secret_key'='sk' +) +``` + +The `S3(...)` is a TVF (Table Value Function). A Table Value Function is essentially a table, so it can appear in any SQL statement where a "table" can appear. + +The attributes of a TVF include the file path to be analyzed, file format, connection information of the object storage, etc. The file path (URI) can use wildcards to match multiple files. The following file paths are valid: + +* Match a specific file + + `s3://bucket/path/to/tvf_test/test.parquet` + +* Match all files starting with `test_` + + `s3://bucket/path/to/tvf_test/test_*` + +* Match all files with the `.parquet` suffix + + `s3://bucket/path/to/tvf_test/*.parquet` + +* Match all files in the `tvf_test` directory + + `s3://bucket/path/to/tvf_test/*` + +* Match files with `test` in the filename + + `s3://bucket/path/to/tvf_test/*test*` + +### Automatic Inference of File Column Types + +You can view the Schema of a TVF using the `DESC FUNCTION` syntax: ```sql -> DESC FUNCTION s3 ( - "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet", +DESC FUNCTION s3 ( + "URI" = "s3://bucket/path/to/tvf_test/test.parquet", "s3.access_key"= "ak", "s3.secret_key" = "sk", "format" = "parquet", @@ -63,78 +100,68 @@ The followings illustrate how file analysis is conducted with the example of S3 +---------------+--------------+------+-------+---------+-------+ ``` -An S3 Table Value Function is defined as follows: +Doris infers the Schema based on the following rules: -```sql -s3( - "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet", - "s3.access_key"= "ak", - "s3.secret_key" = "sk", - "Format" = "parquet", - "use_path_style"="true") -``` +* For Parquet and ORC formats, Doris obtains the Schema from the file metadata. -It specifies the file path, connection, and authentication. +* In the case of matching multiple files, the Schema of the first file is used as the TVF's Schema. -After defining, you can view the schema of this file using the `DESC FUNCTION` statement. +* For CSV and JSON formats, Doris parses the **first line of data** to obtain the Schema based on fields, delimiters, etc. -As can be seen, Doris is able to automatically infer column types based on the metadata of the Parquet file. + By default, all column types are `string`. You can specify column names and types individually using the `csv_schema` attribute. Doris will use the specified column types for file reading. The format is: `name1:type1;name2:type2;...`. For example: -Besides Parquet, Doris supports analysis and auto column type inference of ORC, CSV, and Json files. 
+ ```sql + S3 ( + 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.region' = 'us-east-1', + 's3.access_key' = 'ak', + 's3.secret_key' = 'sk', + 'format' = 'csv', + 'column_separator' = '|', + 'csv_schema' = 'k1:int;k2:int;k3:int;k4:decimal(38,10)' + ) + ``` -**CSV Schema** + The currently supported column type names are as follows: -By default, for CSV format files, all columns are of type String. Column names and column types can be specified individually via the `csv_schema` attribute. Doris will use the specified column type for file reading. The format is as follows: + | Column Type Name | + | ------------ | + | tinyint | + | smallint | + | int | + | bigint | + | largeint | + | float | + | double | + | decimal(p,s) | + | date | + | datetime | + | char | + | varchar | + | string | + | boolean | -`name1:type1;name2:type2;...` +* For columns with mismatched formats (e.g., the file contains a string, but the user defines it as `int`; or other files have a different Schema than the first file), or missing columns (e.g., the file has 4 columns, but the user defines 5 columns), these columns will return `null`. -For columns with mismatched formats (such as string in the file and int defined by the user), or missing columns (such as 4 columns in the file and 5 columns defined by the user), these columns will return null. +## Applicable Scenarios -Currently supported column types are: +### Query Analysis -| name | mapping type | -| --- | --- | -|tinyint |tinyint | -|smallint |smallint | -|int |int | -| bigint | bigint | -| largeint | largeint | -| float| float | -| double| double| -| decimal(p,s) | decimalv3(p,s) | -| date | datev2 | -| datetime | datetimev2 | -| char |string | -|varchar |string | -|string|string | -|boolean| boolean | +TVF is very suitable for directly analyzing independent files on storage systems without having to import the data into Doris in advance.
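For example, combining the wildcard matching and the `csv_schema` attribute described above, a direct filter-and-aggregate query over a set of CSV files might look like the following sketch. The bucket, credentials, and columns `k1` through `k4` are the same illustrative placeholders used in the schema example above.

```sql
-- Sketch: analyze pipe-separated CSV files in place, with an explicit schema.
SELECT k1, SUM(k4) AS total_k4
FROM s3(
    'uri' = 's3://bucket/path/to/tvf_test/*.csv',
    'format' = 'csv',
    'column_separator' = '|',
    'csv_schema' = 'k1:int;k2:int;k3:int;k4:decimal(38,10)',
    's3.endpoint' = 'https://s3.us-east-1.amazonaws.com',
    's3.region' = 'us-east-1',
    's3.access_key' = 'ak',
    's3.secret_key' = 'sk'
)
WHERE k2 > 0
GROUP BY k1
ORDER BY total_k4 DESC;
```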
-Example: - -```sql -s3 ( - "uri" = "https://bucket1/inventory.dat", - "s3.access_key"= "ak", - "s3.secret_key" = "sk", - "format" = "csv", - "column_separator" = "|", - "csv_schema" = "k1:int;k2:int;k3:int;k4:decimal(38,10)", - "use_path_style"="true" -) -``` - -### Query and Analysis - -You can conduct queries and analysis on this Parquet file using any SQL statements: +You can use any SQL statement for file analysis, such as: ```sql SELECT * FROM s3( - "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet", - "s3.access_key"= "ak", - "s3.secret_key" = "sk", - "format" = "parquet", - "use_path_style"="true") -LIMIT 5; + 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.region' = 'us-east-1', + 's3.access_key' = 'ak', + 's3.secret_key'='sk' +) +ORDER BY p_partkey LIMIT 5; +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+ | p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment | +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+ @@ -146,32 +173,38 @@ LIMIT 5; +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+ ``` -You can put the Table Value Function anywhere that you used to put Table in the SQL, such as in the WITH or FROM clause in CTE. In this way, you can treat the file as a normal table and conduct analysis conveniently. +TVF can appear in any position in SQL where a Table can appear, such as in the `WITH` clause of a `CTE`, in the `FROM` clause, etc. This way, you can treat the file as a regular table for any analysis. -You can also create a logic view by using `CREATE VIEW` statement for a Table Value Function. So that you can query this view, grant priv on this view or allow other user to access this Table Value Function. +You can also create a logical view for a TVF using the `CREATE VIEW` statement. After that, you can access this TVF like other views, manage permissions, etc., and allow other users to access this View without having to repeatedly write connection information and other attributes. ```sql -CREATE VIEW v1 AS +-- Create a view based on a TVF +CREATE VIEW tvf_view AS SELECT * FROM s3( - "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet", - "s3.access_key"= "ak", - "s3.secret_key" = "sk", - "format" = "parquet", - "use_path_style"="true"); + 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.region' = 'us-east-1', + 's3.access_key' = 'ak', + 's3.secret_key'='sk' +); -DESC v1; +-- Describe the view as usual +DESC tvf_view; -SELECT * FROM v1; +-- Query the view as usual +SELECT * FROM tvf_view; -GRANT SELECT_PRIV ON db1.v1 TO user1; +-- Grant SELECT priv to other user on this view +GRANT SELECT_PRIV ON db.tvf_view TO other_user; ``` -### Data Ingestion +### Data Import -Users can ingest files into Doris tables via `INSERT INTO SELECT` for faster file analysis: +TVF can be used as a method for data import into Doris. With the `INSERT INTO SELECT` syntax, we can easily import files into Doris. ```sql -// 1. 
Create Doris internal table +-- Create a Doris table CREATE TABLE IF NOT EXISTS test_table ( id int, @@ -181,20 +214,21 @@ CREATE TABLE IF NOT EXISTS test_table DISTRIBUTED BY HASH(id) BUCKETS 4 PROPERTIES("replication_num" = "1"); -// 2. Insert data using S3 Table Value Function +-- 2. Load data into table from TVF INSERT INTO test_table (id,name,age) SELECT cast(id as INT) as id, name, cast (age as INT) as age FROM s3( - "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet", - "s3.access_key"= "ak", - "s3.secret_key" = "sk", - "format" = "parquet", - "use_path_style" = "true"); + 'uri' = 's3://bucket/path/to/tvf_test/test.parquet', + 'format' = 'parquet', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.region' = 'us-east-1', + 's3.access_key' = 'ak', + 's3.secret_key'='sk' +); ``` +## Notes -### Note - -1. If the URI specified by the `S3 / HDFS` TVF is not matched with the file, or all the matched files are empty files, then the` S3 / HDFS` TVF will return to the empty result set. In this case, using the `DESC FUNCTION` to view the schema of this file, you will get a dummy column` __dummy_col`, which can be ignored. +1. If the specified `uri` does not match any files, or all matched files are empty, the TVF will return an empty result set. In this case, using `DESC FUNCTION` to view the Schema of this TVF will yield a virtual column `__dummy_col`, which is meaningless and only serves as a placeholder. -2. If the format of the TVF is specified to `CSV`, and the read file is not a empty file but the first line of this file is empty, then it will prompt the error `The first line is empty, can not parse column numbers`. This is because the schema cannot be parsed from the first line of the file +2. If the specified file format is `csv`, and the file read is not empty but the first line of the file is empty, an error `The first line is empty, can not parse column numbers` will be prompted, as the Schema cannot be parsed from the first line of the file. diff --git a/versioned_sidebars/version-2.1-sidebars.json b/versioned_sidebars/version-2.1-sidebars.json index c105480dc65..96a2ac0a413 100644 --- a/versioned_sidebars/version-2.1-sidebars.json +++ b/versioned_sidebars/version-2.1-sidebars.json @@ -215,7 +215,8 @@ "data-operate/export/export-overview", "data-operate/export/export-manual", "data-operate/export/outfile", - "data-operate/export/export-with-mysql-dump" + "data-operate/export/export-with-mysql-dump", + "data-operate/export/export-best-practice" ] }, { @@ -2094,4 +2095,4 @@ ] } ] -} \ No newline at end of file +} diff --git a/versioned_sidebars/version-3.0-sidebars.json b/versioned_sidebars/version-3.0-sidebars.json index c60236a1c6b..40910a20718 100644 --- a/versioned_sidebars/version-3.0-sidebars.json +++ b/versioned_sidebars/version-3.0-sidebars.json @@ -239,7 +239,8 @@ "data-operate/export/export-overview", "data-operate/export/export-manual", "data-operate/export/outfile", - "data-operate/export/export-with-mysql-dump" + "data-operate/export/export-with-mysql-dump", + "data-operate/export/export-best-practice" ] }, { @@ -2172,4 +2173,4 @@ ] } ] -} \ No newline at end of file +} --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org