This is an automated email from the ASF dual-hosted git repository.

morningman pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
     new e9b689b0f98 [mod](catalog) update doc for 2.1.7 and 3.0.3 (#1262)
e9b689b0f98 is described below

commit e9b689b0f987f59e64bc671961d0a1f8def153a1
Author: Mingyu Chen (Rayner) <morning...@163.com>
AuthorDate: Fri Nov 1 18:23:46 2024 +0800

    [mod](catalog) update doc for 2.1.7 and 3.0.3 (#1262)

    # Versions
    - [x] dev
    - [x] 3.0
    - [x] 2.1
    - [ ] 2.0

    # Languages
    - [x] Chinese
    - [x] English
---
 docs/lakehouse/database/gbase-8a.md                |  2 +-
 docs/lakehouse/datalake-analytics/hive.md          | 16 ++++++++
 docs/lakehouse/datalake-analytics/paimon.md        | 24 +++++++++++-
 docs/lakehouse/datalake-building/hive-build.md     | 44 +++++++++++++++++++---
 .../current/lakehouse/database/gbase-8a.md         |  4 +-
 .../current/lakehouse/datalake-analytics/hive.md   | 16 ++++++++
 .../current/lakehouse/datalake-analytics/paimon.md | 20 ++++++++++
 .../lakehouse/datalake-building/hive-build.md      | 44 +++++++++++++++++++---
 .../lakehouse/datalake-analytics/hive.md           | 16 ++++++++
 .../lakehouse/datalake-analytics/paimon.md         | 20 ++++++++++
 .../lakehouse/datalake-building/hive-build.md      | 44 +++++++++++++++++++---
 .../lakehouse/datalake-analytics/hive.md           | 16 ++++++++
 .../lakehouse/datalake-analytics/paimon.md         | 20 ++++++++++
 .../lakehouse/datalake-building/hive-build.md      | 44 +++++++++++++++++++---
 .../lakehouse/datalake-analytics/hive.md           | 16 ++++++++
 .../lakehouse/datalake-analytics/paimon.md         | 24 +++++++++++-
 .../lakehouse/datalake-building/hive-build.md      | 44 +++++++++++++++++++---
 .../lakehouse/datalake-analytics/hive.md           | 16 ++++++++
 .../lakehouse/datalake-analytics/paimon.md         | 25 +++++++++++-
 .../lakehouse/datalake-building/hive-build.md      | 44 +++++++++++++++++++---
 20 files changed, 455 insertions(+), 44 deletions(-)

diff --git a/docs/lakehouse/database/gbase-8a.md b/docs/lakehouse/database/gbase-8a.md
index d018bf292fe..66641b5f69c 100644
--- a/docs/lakehouse/database/gbase-8a.md
+++ b/docs/lakehouse/database/gbase-8a.md
@@ -40,7 +40,7 @@ To connect to a GBase 8a database you need
 
 - Doris Network connection between each FE and BE node and the GBase 8a server, default port is 5258.
 
-::: warning Notice
+:::warning Notice
 When this Catalog is tested against GBase 8a, the versions used are as follows:
 - GBase 8a: GBase8a_MPP_Cluster-NoLicense-FREE-9.5.3.28.12-redhat7-x86_64
 - JDBC driver: gbase-connector-java-9.5.0.7-build1-bin.jar
diff --git a/docs/lakehouse/datalake-analytics/hive.md b/docs/lakehouse/datalake-analytics/hive.md
index c00767150fe..2b037a57c40 100644
--- a/docs/lakehouse/datalake-analytics/hive.md
+++ b/docs/lakehouse/datalake-analytics/hive.md
@@ -326,6 +326,22 @@ For Hive/Iceberg/Hudi
 
 > The variable default is false.
 
+## Query Hive partitions
+
+You can query Hive partition information in the following two ways.
+
+- `SHOW PARTITIONS FROM hive_table`
+
+  This statement lists all partitions and their partition values for the specified Hive table.
+
+- Use the `table$partitions` metadata table
+
+  Since versions 2.1.7 and 3.0.3, users can query Hive partition information through the `table$partitions` metadata table. `table$partitions` is essentially a relational table, so it can be used in any SELECT statement.
+
+  ```
+  SELECT * FROM hive_table$partitions;
+  ```
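+
+  For example, since the metadata table behaves like an ordinary relational table, it can
+  also be filtered and aggregated (a sketch; the partition column name `dt` is hypothetical
+  and assumes partition columns are exposed as columns of the metadata table):
+
+  ```
+  -- Count the partitions of the table
+  SELECT count(*) FROM hive_table$partitions;
+  -- Prune to a subset of partitions via a hypothetical partition column `dt`
+  SELECT * FROM hive_table$partitions WHERE dt >= '2024-01-01';
+  ```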
+
 ## Access HMS with broker
 
 Add the following setting when creating an HMS catalog; file splitting and scanning for Hive external tables will then be completed by the broker named `test_broker`
diff --git a/docs/lakehouse/datalake-analytics/paimon.md b/docs/lakehouse/datalake-analytics/paimon.md
index 7c4869d0e61..c40fed83946 100644
--- a/docs/lakehouse/datalake-analytics/paimon.md
+++ b/docs/lakehouse/datalake-analytics/paimon.md
@@ -38,7 +38,7 @@ Paimon Catalog Currently supports two types of Metastore creation catalogs:
 * filesystem (default): stores both metadata and data in the file system.
 * hive metastore: also stores metadata in Hive Metastore. Users can access these tables directly from Hive.
 
-### Creating a Catalog Based on FileSystem
+### Creating a Catalog based on FileSystem
 
 #### HDFS
 
@@ -116,7 +116,7 @@ CREATE CATALOG `paimon_oss` PROPERTIES (
 );
 ```
 
-### Creating a Catalog Based on Hive Metastore
+### Creating a Catalog based on Hive Metastore
 
 ```sql
 CREATE CATALOG `paimon_hms` PROPERTIES (
@@ -150,6 +150,26 @@ CREATE CATALOG `paimon_kerberos` PROPERTIES (
 );
 ```
 
+### Creating a Catalog based on Aliyun DLF
+
+This feature is supported since versions 2.1.7 and 3.0.3.
+
+```
+CREATE CATALOG `paimon_dlf` PROPERTIES (
+    "type" = "paimon",
+    "paimon.catalog.type" = "dlf",
+    "warehouse" = "oss://xx/yy/",
+    "dlf.proxy.mode" = "DLF_ONLY",
+    "dlf.uid" = "xxxxx",
+    "dlf.region" = "cn-beijing",
+    "dlf.access_key" = "ak",
+    "dlf.secret_key" = "sk"
+
+    -- "dlf.endpoint" = "dlf.cn-beijing.aliyuncs.com", -- optional
+    -- "dlf.catalog.id" = "xxxx", -- optional
+);
+```
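+
+Once created, the catalog is used like any other catalog (a brief usage sketch reusing the
+`paimon_dlf` name from the example above):
+
+```
+SWITCH paimon_dlf;
+SHOW DATABASES;
+```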
+
 ## Column Type Mapping
 
 | Paimon Data Type | Doris Data Type | Comment |
diff --git a/docs/lakehouse/datalake-building/hive-build.md b/docs/lakehouse/datalake-building/hive-build.md
index b3c83a39d91..a63e550e826 100644
--- a/docs/lakehouse/datalake-building/hive-build.md
+++ b/docs/lakehouse/datalake-building/hive-build.md
@@ -146,6 +146,21 @@ For clusters upgraded from old versions, these variables may change.
     'file_format'='orc',
     'compression'='zlib'
 );
+
+-- Create a text format table (since 2.1.7 & 3.0.3)
+CREATE TABLE text_table (
+    `id` INT,
+    `name` STRING
+) PROPERTIES (
+    'file_format'='text',
+    'compression'='gzip',
+    'field.delim'='\t',
+    'line.delim'='\n',
+    'collection.delim'=';',
+    'mapkey.delim'=':',
+    'serialization.null.format'='\\N',
+    'escape.delim'='\\'
+);
 ```
 
 After creation, you can view the Hive table creation statement using the `SHOW CREATE TABLE` command.
 
@@ -209,13 +224,24 @@ For clusters upgraded from old versions, these variables may change.
 
 - File Formats
 
-  - Parquet
   - ORC (default)
-
+  - Parquet
+  - Text (supported since versions 2.1.7 & 3.0.3)
+
+    The Text format also supports the following table properties (see the sketch after this list):
+
+    - `field.delim`: column delimiter. Default `\1`.
+    - `line.delim`: row delimiter. Default `\n`.
+    - `collection.delim`: delimiter between elements in complex types. Default `\2`.
+    - `mapkey.delim`: key/value delimiter of the Map type. Default `\3`.
+    - `serialization.null.format`: storage format of NULL values. Default `\N`.
+    - `escape.delim`: escape character. Default `\`.
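+
+    As an illustration (a sketch; `csv_style_table` is a hypothetical table name), the
+    delimiters can be overridden at creation time to produce a classic CSV-like layout:
+
+    ```
+    CREATE TABLE csv_style_table (
+        `id` INT,
+        `name` STRING
+    ) PROPERTIES (
+        'file_format'='text',
+        'field.delim'=',',
+        'line.delim'='\n'
+    );
+    ```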
+
 - Compression Formats
 
-  - Parquet: snappy(default), zstd, plain. (plain means no compression is used.)
-  - ORC: snappy, zlib(default), zstd, plain. (plain means no compression is used.)
+  - Parquet: snappy(default), zstd, plain. (plain means no compression)
+  - ORC: snappy, zlib(default), zstd, plain. (plain means no compression)
+  - Text: gzip, deflate, bzip2, zstd, lz4, lzo, snappy, plain (default). (plain means no compression)
 
 - Storage Medium
 
@@ -234,7 +260,7 @@ Currently, writing to specific partitions is not supported.
 
 ### INSERT
 
-The INSERT operation appends data to the target table.
+The INSERT operation appends data to the target table. Currently, writing to a specific partition is not supported.
 
 ```
 INSERT INTO hive_tbl values (val1, val2, val3, val4);
@@ -246,13 +272,19 @@ INSERT INTO hive_tbl(col1, col2, partition_col1, partition_col2) values (1, 2, "
 
 ### INSERT OVERWRITE
 
-The INSERT OVERWRITE operation completely overwrites the existing data in the table with new data.
+The INSERT OVERWRITE operation completely overwrites the existing data in the table with new data. Currently, writing to a specific partition is not supported.
 
 ```
 INSERT OVERWRITE TABLE VALUES(val1, val2, val3, val4)
 INSERT OVERWRITE TABLE hive.hive_db.hive_tbl(col1, col2) SELECT col1, col2 FROM internal.db1.tbl1;
 ```
 
+The semantics of INSERT OVERWRITE are consistent with Hive, with the following behaviors (a worked sketch follows this list):
+
+- When the target table is a partitioned table and the source table is empty, the operation has no effect; the target table data does not change.
+- When the target table is a non-partitioned table and the source table is empty, the target table is cleared.
+- Currently, writing to a specified partition is not supported, so INSERT OVERWRITE automatically determines the target table partitions from the values in the source data. If the target table is a partitioned table, only the partitions involved are overwritten, and the data of the partitions not involved does not change.
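+
+For example (a sketch; `hive_tbl`, its partition column `dt`, and the partition values are
+hypothetical), overwriting from a source that only produces rows for one partition leaves
+all other partitions untouched:
+
+```
+-- hive_tbl is partitioned by `dt` and holds partitions dt='2024-11-01' and dt='2024-11-02'.
+-- The SELECT below only produces rows with dt='2024-11-01', so only that partition is
+-- overwritten; dt='2024-11-02' keeps its existing data.
+INSERT OVERWRITE TABLE hive_tbl SELECT col1, col2, '2024-11-01' FROM internal.db1.tbl1;
+```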
+
 ### CTAS (CREATE TABLE AS SELECT)
 
 A Hive table can be created and populated with data using the `CTAS (CREATE TABLE AS SELECT)` statement:
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/database/gbase-8a.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/database/gbase-8a.md
index 40c486d6fb5..14c4a343dcb 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/database/gbase-8a.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/database/gbase-8a.md
@@ -40,10 +40,12 @@ Doris JDBC Catalog 支持通过标准 JDBC 接口连接 GBase 8a 数据库。本
 
 - Doris 每个 FE 和 BE 节点和 GBase 8a 服务器之间的网络连接,默认端口为 5258。
 
-::: warning 注意
+:::warning 注意
 此 Catalog 对接测试 GBase 8a 时,使用的版本如下:
+
 - GBase 8a: GBase8a_MPP_Cluster-NoLicense-FREE-9.5.3.28.12-redhat7-x86_64
 - JDBC 驱动程序: gbase-connector-java-9.5.0.7-build1-bin.jar
+
 其他版本未经测试,可能会有兼容性问题。
 :::
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-analytics/hive.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-analytics/hive.md
index 25ff1592978..647a1b17265 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-analytics/hive.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-analytics/hive.md
@@ -306,6 +306,22 @@ CREATE CATALOG hive PROPERTIES (
 
 > 该变量默认为 `false`。
 
+## 查询 Hive 分区
+
+可以通过下面两种方式查询 Hive 分区信息。
+
+- `SHOW PARTITIONS FROM hive_table`
+
+  该语句可以列出指定 Hive 表的所有分区以及分区值信息。
+
+- 使用 `table$partitions` 元数据表
+
+  自 2.1.7 和 3.0.3 版本开始,用户可以通过 `table$partitions` 元数据表查询 Hive 分区信息。`table$partitions` 本质上是一个关系表,所以可以使用在任意 SELECT 语句中。
+
+  ```
+  SELECT * FROM hive_table$partitions;
+  ```
+
 ## 使用 broker 访问 HMS
 
 创建 HMS Catalog 时增加如下配置,Hive 外表文件分片和文件扫描将会由名为 `test_broker` 的 Broker 完成
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-analytics/paimon.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-analytics/paimon.md
index e66cc50b317..ecc2484ad6e 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-analytics/paimon.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-analytics/paimon.md
@@ -150,6 +150,26 @@ CREATE CATALOG `paimon_kerberos` PROPERTIES (
 );
 ```
 
+### 基于 Aliyun DLF 创建 Catalog
+
+该功能自 2.1.7 和 3.0.3 版本支持。
+
+```
+CREATE CATALOG `paimon_dlf` PROPERTIES (
+    "type" = "paimon",
+    "paimon.catalog.type" = "dlf",
+    "warehouse" = "oss://xx/yy/",
+    "dlf.proxy.mode" = "DLF_ONLY",
+    "dlf.uid" = "xxxxx",
+    "dlf.region" = "cn-beijing",
+    "dlf.access_key" = "ak",
+    "dlf.secret_key" = "sk"
+
+    -- "dlf.endpoint" = "dlf.cn-beijing.aliyuncs.com", -- optional
+    -- "dlf.catalog.id" = "xxxx", -- optional
+);
+```
+
 ## 列类型映射
 
 | Paimon Data Type | Doris Data Type | Comment |
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-building/hive-build.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-building/hive-build.md
index 2b76690cd82..f16186a0c1c 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-building/hive-build.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-building/hive-build.md
@@ -147,8 +147,23 @@ set global enable_fallback_to_original_planner = false;
     'file_format'='orc',
     'compression'='zlib'
 );
+
+-- Create a text format table (since 2.1.7 & 3.0.3)
+CREATE TABLE text_table (
+    `id` INT,
+    `name` STRING
+) PROPERTIES (
+    'file_format'='text',
+    'compression'='gzip',
+    'field.delim'='\t',
+    'line.delim'='\n',
+    'collection.delim'=';',
+    'mapkey.delim'=':',
+    'serialization.null.format'='\\N',
+    'escape.delim'='\\'
+);
 ```
-
+
 创建后,可以通过 `SHOW CREATE TABLE` 命令查看 Hive 的建表语句。
 
 注意,不同于 Hive 中的建表语句。在 Apache Doris 中创建 Hive 分区表时,分区列也必须写到 Table 的 Schema 中。同时,分区列必须在所有 Schema 的最后,且顺序保持一致。
 
@@ -208,13 +223,24 @@ set global enable_fallback_to_original_planner = false;
 
 - 文件格式
 
-  - Parquet
   - ORC(默认)
-
+  - Parquet
+  - Text (自 2.1.7 和 3.0.3 版本开始支持)
+
+    Text 格式还支持以下表属性:
+
+    - `field.delim`:列分隔符。默认 `\1`。
+    - `line.delim`:行分隔符。默认 `\n`。
+    - `collection.delim`:复杂类型中各元素之间的分隔符。默认 `\2`。
+    - `mapkey.delim`:Map 类型的键值分隔符。默认 `\3`。
+    - `serialization.null.format`:NULL 值的存储格式。默认 `\N`。
+    - `escape.delim`:转义字符。默认 `\`。
 
 - 压缩格式
 
-  - Parquet:snappy(默认),zstd,plain。(Plain 就是不采用压缩)
-  - ORC:snappy,zlib(默认),zstd,plain。(Plain 就是不采用压缩)
+  - Parquet:snappy(默认)、zstd、plain。(Plain 就是不采用压缩)
+  - ORC:snappy、zlib(默认)、zstd、plain。(Plain 就是不采用压缩)
+  - Text:gzip、deflate、bzip2、zstd、lz4、lzo、snappy、plain(默认)。(Plain 就是不采用压缩)
 
 - 存储介质
 
@@ -233,7 +259,7 @@ set global enable_fallback_to_original_planner = false;
 
 ### INSERT
 
-INSERT 操作会数据以追加的方式写入到目标表中。
+INSERT 操作会将数据以追加的方式写入到目标表中。当前不支持指定分区写入。
 
 ```
 INSERT INTO hive_tbl values (val1, val2, val3, val4);
@@ -245,13 +271,19 @@ INSERT INTO hive_tbl(col1, col2, partition_col1, partition_col2) values (1, 2, "
 
 ### INSERT OVERWRITE
 
-INSERT OVERWRITE 会使用新的数据完全覆盖原有表中的数据。
+INSERT OVERWRITE 会使用新的数据完全覆盖原有表中的数据。当前不支持指定分区写入。
 
 ```
 INSERT OVERWRITE TABLE VALUES(val1, val2, val3, val4)
 INSERT OVERWRITE TABLE hive.hive_db.hive_tbl(col1, col2) SELECT col1, col2 FROM internal.db1.tbl1;
 ```
 
+INSERT OVERWRITE 的语义与 Hive 一致,有如下行为:
+
+- 当目的表是分区表,而源表为空表时,操作不会产生任何影响。目的表数据无变化。
+- 当目的表是非分区表,而源表为空表时,目的表会被清空。
+- 当前不支持指定分区写入,因此 INSERT OVERWRITE 会根据源表中的数值,自动处理对应的目的表分区。如果目的表是分区表,则只会覆盖涉及到的分区,不涉及的分区,数据无变化。
+
 ### CTAS(CREATE TABLE AS SELECT)
 
 可以通过 `CTAS(CREATE TABLE AS SELECT)` 语句创建 Hive 表并写入数据:
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/datalake-analytics/hive.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/datalake-analytics/hive.md
index 25ff1592978..647a1b17265 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/datalake-analytics/hive.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/datalake-analytics/hive.md
@@ -306,6 +306,22 @@ CREATE CATALOG hive PROPERTIES (
 
 > 该变量默认为 `false`。
 
+## 查询 Hive 分区
+
+可以通过下面两种方式查询 Hive 分区信息。
+
+- `SHOW PARTITIONS FROM hive_table`
+
+  该语句可以列出指定 Hive 表的所有分区以及分区值信息。
+
+- 使用 `table$partitions` 元数据表
+
+  自 2.1.7 和 3.0.3 版本开始,用户可以通过 `table$partitions` 元数据表查询 Hive 分区信息。`table$partitions` 本质上是一个关系表,所以可以使用在任意 SELECT 语句中。
+
+  ```
+  SELECT * FROM hive_table$partitions;
+  ```
+
 ## 使用 broker 访问 HMS
 
 创建 HMS Catalog 时增加如下配置,Hive 外表文件分片和文件扫描将会由名为 `test_broker` 的 Broker 完成
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/datalake-analytics/paimon.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/datalake-analytics/paimon.md
index e66cc50b317..ecc2484ad6e 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/datalake-analytics/paimon.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/datalake-analytics/paimon.md
@@ -150,6 +150,26 @@ CREATE CATALOG `paimon_kerberos` PROPERTIES (
 );
 ```
 
+### 基于 Aliyun DLF 创建 Catalog
+
+该功能自 2.1.7 和 3.0.3 版本支持。
+
+```
+CREATE CATALOG `paimon_dlf` PROPERTIES (
+    "type" = "paimon",
+    "paimon.catalog.type" = "dlf",
+    "warehouse" = "oss://xx/yy/",
+    "dlf.proxy.mode" = "DLF_ONLY",
+    "dlf.uid" = "xxxxx",
+    "dlf.region" = "cn-beijing",
+    "dlf.access_key" = "ak",
+    "dlf.secret_key" = "sk"
+
+    -- "dlf.endpoint" = "dlf.cn-beijing.aliyuncs.com", -- optional
+    -- "dlf.catalog.id" = "xxxx", -- optional
+);
+```
+
 ## 列类型映射
 
 | Paimon Data Type | Doris Data Type | Comment |
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/datalake-building/hive-build.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/datalake-building/hive-build.md
index 2b76690cd82..f16186a0c1c 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/datalake-building/hive-build.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/datalake-building/hive-build.md
@@ -147,8 +147,23 @@ set global enable_fallback_to_original_planner = false;
     'file_format'='orc',
     'compression'='zlib'
 );
+
+-- Create a text format table (since 2.1.7 & 3.0.3)
+CREATE TABLE text_table (
+    `id` INT,
+    `name` STRING
+) PROPERTIES (
+    'file_format'='text',
+    'compression'='gzip',
+    'field.delim'='\t',
+    'line.delim'='\n',
+    'collection.delim'=';',
+    'mapkey.delim'=':',
+    'serialization.null.format'='\\N',
+    'escape.delim'='\\'
+);
 ```
-
+
 创建后,可以通过 `SHOW CREATE TABLE` 命令查看 Hive 的建表语句。
 
 注意,不同于 Hive 中的建表语句。在 Apache Doris 中创建 Hive 分区表时,分区列也必须写到 Table 的 Schema 中。同时,分区列必须在所有 Schema 的最后,且顺序保持一致。
 
@@ -208,13 +223,24 @@ set global enable_fallback_to_original_planner = false;
 
 - 文件格式
 
-  - Parquet
   - ORC(默认)
-
+  - Parquet
+  - Text (自 2.1.7 和 3.0.3 版本开始支持)
+
+    Text 格式还支持以下表属性:
+
+    - `field.delim`:列分隔符。默认 `\1`。
+    - `line.delim`:行分隔符。默认 `\n`。
+    - `collection.delim`:复杂类型中各元素之间的分隔符。默认 `\2`。
+    - `mapkey.delim`:Map 类型的键值分隔符。默认 `\3`。
+    - `serialization.null.format`:NULL 值的存储格式。默认 `\N`。
+    - `escape.delim`:转义字符。默认 `\`。
 
 - 压缩格式
 
-  - Parquet:snappy(默认),zstd,plain。(Plain 就是不采用压缩)
-  - ORC:snappy,zlib(默认),zstd,plain。(Plain 就是不采用压缩)
+  - Parquet:snappy(默认)、zstd、plain。(Plain 就是不采用压缩)
+  - ORC:snappy、zlib(默认)、zstd、plain。(Plain 就是不采用压缩)
+  - Text:gzip、deflate、bzip2、zstd、lz4、lzo、snappy、plain(默认)。(Plain 就是不采用压缩)
 
 - 存储介质
 
@@ -233,7 +259,7 @@ set global enable_fallback_to_original_planner = false;
 
 ### INSERT
 
-INSERT 操作会数据以追加的方式写入到目标表中。
+INSERT 操作会将数据以追加的方式写入到目标表中。当前不支持指定分区写入。
 
 ```
 INSERT INTO hive_tbl values (val1, val2, val3, val4);
@@ -245,13 +271,19 @@ INSERT INTO hive_tbl(col1, col2, partition_col1, partition_col2) values (1, 2, "
 
 ### INSERT OVERWRITE
 
-INSERT OVERWRITE 会使用新的数据完全覆盖原有表中的数据。
+INSERT OVERWRITE 会使用新的数据完全覆盖原有表中的数据。当前不支持指定分区写入。
 
 ```
 INSERT OVERWRITE TABLE VALUES(val1, val2, val3, val4)
 INSERT OVERWRITE TABLE hive.hive_db.hive_tbl(col1, col2) SELECT col1, col2 FROM internal.db1.tbl1;
 ```
 
+INSERT OVERWRITE 的语义与 Hive 一致,有如下行为:
+
+- 当目的表是分区表,而源表为空表时,操作不会产生任何影响。目的表数据无变化。
+- 当目的表是非分区表,而源表为空表时,目的表会被清空。
+- 当前不支持指定分区写入,因此 INSERT OVERWRITE 会根据源表中的数值,自动处理对应的目的表分区。如果目的表是分区表,则只会覆盖涉及到的分区,不涉及的分区,数据无变化。
+
 ### CTAS(CREATE TABLE AS SELECT)
 
 可以通过 `CTAS(CREATE TABLE AS SELECT)` 语句创建 Hive 表并写入数据:
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/datalake-analytics/hive.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/datalake-analytics/hive.md
index 25ff1592978..647a1b17265 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/datalake-analytics/hive.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/datalake-analytics/hive.md
@@ -306,6 +306,22 @@ CREATE CATALOG hive PROPERTIES (
 
 > 该变量默认为 `false`。
 
+## 查询 Hive 分区
+
+可以通过下面两种方式查询 Hive 分区信息。
+
+- `SHOW PARTITIONS FROM hive_table`
+
+  该语句可以列出指定 Hive 表的所有分区以及分区值信息。
+
+- 使用 `table$partitions` 元数据表
+
+  自 2.1.7 和 3.0.3 版本开始,用户可以通过 `table$partitions` 元数据表查询 Hive 分区信息。`table$partitions` 本质上是一个关系表,所以可以使用在任意 SELECT 语句中。
+
+  ```
+  SELECT * FROM hive_table$partitions;
+  ```
+
 ## 使用 broker 访问 HMS
 
 创建 HMS Catalog 时增加如下配置,Hive 外表文件分片和文件扫描将会由名为 `test_broker` 的 Broker 完成
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/datalake-analytics/paimon.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/datalake-analytics/paimon.md
index e66cc50b317..ecc2484ad6e 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/datalake-analytics/paimon.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/datalake-analytics/paimon.md
@@ -150,6 +150,26 @@ CREATE CATALOG `paimon_kerberos` PROPERTIES (
 );
 ```
 
+### 基于 Aliyun DLF 创建 Catalog
+
+该功能自 2.1.7 和 3.0.3 版本支持。
+
+```
+CREATE CATALOG `paimon_dlf` PROPERTIES (
+    "type" = "paimon",
+    "paimon.catalog.type" = "dlf",
+    "warehouse" = "oss://xx/yy/",
+    "dlf.proxy.mode" = "DLF_ONLY",
+    "dlf.uid" = "xxxxx",
+    "dlf.region" = "cn-beijing",
+    "dlf.access_key" = "ak",
+    "dlf.secret_key" = "sk"
+
+    -- "dlf.endpoint" = "dlf.cn-beijing.aliyuncs.com", -- optional
+    -- "dlf.catalog.id" = "xxxx", -- optional
+);
+```
+
 ## 列类型映射
 
 | Paimon Data Type | Doris Data Type | Comment |
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/datalake-building/hive-build.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/datalake-building/hive-build.md
index 2b76690cd82..f16186a0c1c 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/datalake-building/hive-build.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/datalake-building/hive-build.md
@@ -147,8 +147,23 @@ set global enable_fallback_to_original_planner = false;
     'file_format'='orc',
     'compression'='zlib'
 );
+
+-- Create a text format table (since 2.1.7 & 3.0.3)
+CREATE TABLE text_table (
+    `id` INT,
+    `name` STRING
+) PROPERTIES (
+    'file_format'='text',
+    'compression'='gzip',
+    'field.delim'='\t',
+    'line.delim'='\n',
+    'collection.delim'=';',
+    'mapkey.delim'=':',
+    'serialization.null.format'='\\N',
+    'escape.delim'='\\'
+);
 ```
-
+
 创建后,可以通过 `SHOW CREATE TABLE` 命令查看 Hive 的建表语句。
 
 注意,不同于 Hive 中的建表语句。在 Apache Doris 中创建 Hive 分区表时,分区列也必须写到 Table 的 Schema 中。同时,分区列必须在所有 Schema 的最后,且顺序保持一致。
 
@@ -208,13 +223,24 @@ set global enable_fallback_to_original_planner = false;
 
 - 文件格式
 
-  - Parquet
   - ORC(默认)
-
+  - Parquet
+  - Text (自 2.1.7 和 3.0.3 版本开始支持)
+
+    Text 格式还支持以下表属性:
+
+    - `field.delim`:列分隔符。默认 `\1`。
+    - `line.delim`:行分隔符。默认 `\n`。
+    - `collection.delim`:复杂类型中各元素之间的分隔符。默认 `\2`。
+    - `mapkey.delim`:Map 类型的键值分隔符。默认 `\3`。
+    - `serialization.null.format`:NULL 值的存储格式。默认 `\N`。
+    - `escape.delim`:转义字符。默认 `\`。
 
 - 压缩格式
 
-  - Parquet:snappy(默认),zstd,plain。(Plain 就是不采用压缩)
-  - ORC:snappy,zlib(默认),zstd,plain。(Plain 就是不采用压缩)
+  - Parquet:snappy(默认)、zstd、plain。(Plain 就是不采用压缩)
+  - ORC:snappy、zlib(默认)、zstd、plain。(Plain 就是不采用压缩)
+  - Text:gzip、deflate、bzip2、zstd、lz4、lzo、snappy、plain(默认)。(Plain 就是不采用压缩)
 
 - 存储介质
 
@@ -233,7 +259,7 @@ set global enable_fallback_to_original_planner = false;
 
 ### INSERT
 
-INSERT 操作会数据以追加的方式写入到目标表中。
+INSERT 操作会将数据以追加的方式写入到目标表中。当前不支持指定分区写入。
 
 ```
 INSERT INTO hive_tbl values (val1, val2, val3, val4);
@@ -245,13 +271,19 @@ INSERT INTO hive_tbl(col1, col2, partition_col1, partition_col2) values (1, 2, "
 
 ### INSERT OVERWRITE
 
-INSERT OVERWRITE 会使用新的数据完全覆盖原有表中的数据。
+INSERT OVERWRITE 会使用新的数据完全覆盖原有表中的数据。当前不支持指定分区写入。
 
 ```
 INSERT OVERWRITE TABLE VALUES(val1, val2, val3, val4)
 INSERT OVERWRITE TABLE hive.hive_db.hive_tbl(col1, col2) SELECT col1, col2 FROM internal.db1.tbl1;
 ```
 
+INSERT OVERWRITE 的语义与 Hive 一致,有如下行为:
+
+- 当目的表是分区表,而源表为空表时,操作不会产生任何影响。目的表数据无变化。
+- 当目的表是非分区表,而源表为空表时,目的表会被清空。
+- 当前不支持指定分区写入,因此 INSERT OVERWRITE 会根据源表中的数值,自动处理对应的目的表分区。如果目的表是分区表,则只会覆盖涉及到的分区,不涉及的分区,数据无变化。
+
 ### CTAS(CREATE TABLE AS SELECT)
 
 可以通过 `CTAS(CREATE TABLE AS SELECT)` 语句创建 Hive 表并写入数据:
diff --git a/versioned_docs/version-2.1/lakehouse/datalake-analytics/hive.md b/versioned_docs/version-2.1/lakehouse/datalake-analytics/hive.md
index c00767150fe..2b037a57c40 100644
--- a/versioned_docs/version-2.1/lakehouse/datalake-analytics/hive.md
+++ b/versioned_docs/version-2.1/lakehouse/datalake-analytics/hive.md
@@ -326,6 +326,22 @@ For Hive/Iceberg/Hudi
 
 > The variable default is false.
 
+## Query Hive partitions
+
+You can query Hive partition information in the following two ways.
+
+- `SHOW PARTITIONS FROM hive_table`
+
+  This statement lists all partitions and their partition values for the specified Hive table.
+
+- Use the `table$partitions` metadata table
+
+  Since versions 2.1.7 and 3.0.3, users can query Hive partition information through the `table$partitions` metadata table. `table$partitions` is essentially a relational table, so it can be used in any SELECT statement.
+
+  ```
+  SELECT * FROM hive_table$partitions;
+  ```
+
 ## Access HMS with broker
 
 Add the following setting when creating an HMS catalog; file splitting and scanning for Hive external tables will then be completed by the broker named `test_broker`
diff --git a/versioned_docs/version-2.1/lakehouse/datalake-analytics/paimon.md b/versioned_docs/version-2.1/lakehouse/datalake-analytics/paimon.md
index 7c4869d0e61..c40fed83946 100644
--- a/versioned_docs/version-2.1/lakehouse/datalake-analytics/paimon.md
+++ b/versioned_docs/version-2.1/lakehouse/datalake-analytics/paimon.md
@@ -38,7 +38,7 @@ Paimon Catalog Currently supports two types of Metastore creation catalogs:
 * filesystem (default): stores both metadata and data in the file system.
 * hive metastore: also stores metadata in Hive Metastore. Users can access these tables directly from Hive.
 
-### Creating a Catalog Based on FileSystem
+### Creating a Catalog based on FileSystem
 
 #### HDFS
 
@@ -116,7 +116,7 @@ CREATE CATALOG `paimon_oss` PROPERTIES (
 );
 ```
 
-### Creating a Catalog Based on Hive Metastore
+### Creating a Catalog based on Hive Metastore
 
 ```sql
 CREATE CATALOG `paimon_hms` PROPERTIES (
@@ -150,6 +150,26 @@ CREATE CATALOG `paimon_kerberos` PROPERTIES (
 );
 ```
 
+### Creating a Catalog based on Aliyun DLF
+
+This feature is supported since versions 2.1.7 and 3.0.3.
+
+```
+CREATE CATALOG `paimon_dlf` PROPERTIES (
+    "type" = "paimon",
+    "paimon.catalog.type" = "dlf",
+    "warehouse" = "oss://xx/yy/",
+    "dlf.proxy.mode" = "DLF_ONLY",
+    "dlf.uid" = "xxxxx",
+    "dlf.region" = "cn-beijing",
+    "dlf.access_key" = "ak",
+    "dlf.secret_key" = "sk"
+
+    -- "dlf.endpoint" = "dlf.cn-beijing.aliyuncs.com", -- optional
+    -- "dlf.catalog.id" = "xxxx", -- optional
+);
+```
+
 ## Column Type Mapping
 
 | Paimon Data Type | Doris Data Type | Comment |
diff --git a/versioned_docs/version-2.1/lakehouse/datalake-building/hive-build.md b/versioned_docs/version-2.1/lakehouse/datalake-building/hive-build.md
index b3c83a39d91..a63e550e826 100644
--- a/versioned_docs/version-2.1/lakehouse/datalake-building/hive-build.md
+++ b/versioned_docs/version-2.1/lakehouse/datalake-building/hive-build.md
@@ -146,6 +146,21 @@ For clusters upgraded from old versions, these variables may change.
     'file_format'='orc',
     'compression'='zlib'
 );
+
+-- Create a text format table (since 2.1.7 & 3.0.3)
+CREATE TABLE text_table (
+    `id` INT,
+    `name` STRING
+) PROPERTIES (
+    'file_format'='text',
+    'compression'='gzip',
+    'field.delim'='\t',
+    'line.delim'='\n',
+    'collection.delim'=';',
+    'mapkey.delim'=':',
+    'serialization.null.format'='\\N',
+    'escape.delim'='\\'
+);
 ```
 
 After creation, you can view the Hive table creation statement using the `SHOW CREATE TABLE` command.
 
@@ -209,13 +224,24 @@ For clusters upgraded from old versions, these variables may change.
 
 - File Formats
 
-  - Parquet
   - ORC (default)
-
+  - Parquet
+  - Text (supported since versions 2.1.7 & 3.0.3)
+
+    The Text format also supports the following table properties:
+
+    - `field.delim`: column delimiter. Default `\1`.
+    - `line.delim`: row delimiter. Default `\n`.
+    - `collection.delim`: delimiter between elements in complex types. Default `\2`.
+    - `mapkey.delim`: key/value delimiter of the Map type. Default `\3`.
+    - `serialization.null.format`: storage format of NULL values. Default `\N`.
+    - `escape.delim`: escape character. Default `\`.
 
 - Compression Formats
 
-  - Parquet: snappy(default), zstd, plain. (plain means no compression is used.)
-  - ORC: snappy, zlib(default), zstd, plain. (plain means no compression is used.)
+  - Parquet: snappy(default), zstd, plain. (plain means no compression)
+  - ORC: snappy, zlib(default), zstd, plain. (plain means no compression)
+  - Text: gzip, deflate, bzip2, zstd, lz4, lzo, snappy, plain (default). (plain means no compression)
 
 - Storage Medium
 
@@ -234,7 +260,7 @@ Currently, writing to specific partitions is not supported.
 
 ### INSERT
 
-The INSERT operation appends data to the target table.
+The INSERT operation appends data to the target table. Currently, writing to a specific partition is not supported.
 
 ```
 INSERT INTO hive_tbl values (val1, val2, val3, val4);
@@ -246,13 +272,19 @@ INSERT INTO hive_tbl(col1, col2, partition_col1, partition_col2) values (1, 2, "
 
 ### INSERT OVERWRITE
 
-The INSERT OVERWRITE operation completely overwrites the existing data in the table with new data.
+The INSERT OVERWRITE operation completely overwrites the existing data in the table with new data. Currently, writing to a specific partition is not supported.
 
 ```
 INSERT OVERWRITE TABLE VALUES(val1, val2, val3, val4)
 INSERT OVERWRITE TABLE hive.hive_db.hive_tbl(col1, col2) SELECT col1, col2 FROM internal.db1.tbl1;
 ```
 
+The semantics of INSERT OVERWRITE are consistent with Hive, with the following behaviors:
+
+- When the target table is a partitioned table and the source table is empty, the operation has no effect; the target table data does not change.
+- When the target table is a non-partitioned table and the source table is empty, the target table is cleared.
+- Currently, writing to a specified partition is not supported, so INSERT OVERWRITE automatically determines the target table partitions from the values in the source data. If the target table is a partitioned table, only the partitions involved are overwritten, and the data of the partitions not involved does not change.
+
 ### CTAS (CREATE TABLE AS SELECT)
 
 A Hive table can be created and populated with data using the `CTAS (CREATE TABLE AS SELECT)` statement:
diff --git a/versioned_docs/version-3.0/lakehouse/datalake-analytics/hive.md b/versioned_docs/version-3.0/lakehouse/datalake-analytics/hive.md
index c00767150fe..2b037a57c40 100644
--- a/versioned_docs/version-3.0/lakehouse/datalake-analytics/hive.md
+++ b/versioned_docs/version-3.0/lakehouse/datalake-analytics/hive.md
@@ -326,6 +326,22 @@ For Hive/Iceberg/Hudi
 
 > The variable default is false.
 
+## Query Hive partitions
+
+You can query Hive partition information in the following two ways.
+
+- `SHOW PARTITIONS FROM hive_table`
+
+  This statement lists all partitions and their partition values for the specified Hive table.
+
+- Use the `table$partitions` metadata table
+
+  Since versions 2.1.7 and 3.0.3, users can query Hive partition information through the `table$partitions` metadata table. `table$partitions` is essentially a relational table, so it can be used in any SELECT statement.
+
+  ```
+  SELECT * FROM hive_table$partitions;
+  ```
+
 ## Access HMS with broker
 
 Add the following setting when creating an HMS catalog; file splitting and scanning for Hive external tables will then be completed by the broker named `test_broker`
diff --git a/versioned_docs/version-3.0/lakehouse/datalake-analytics/paimon.md b/versioned_docs/version-3.0/lakehouse/datalake-analytics/paimon.md
index 0b809470a73..c40fed83946 100644
--- a/versioned_docs/version-3.0/lakehouse/datalake-analytics/paimon.md
+++ b/versioned_docs/version-3.0/lakehouse/datalake-analytics/paimon.md
@@ -24,6 +24,7 @@ specific language governing permissions and limitations
 under the License.
 -->
 
+[Apache Doris & Paimon Quick Start](https://doris.apache.org/docs/gettingStarted/tutorials/doris-paimon)
 
 ## Instructions for use
 
@@ -37,7 +38,7 @@ Paimon Catalog Currently supports two types of Metastore creation catalogs:
 * filesystem (default): stores both metadata and data in the file system.
 * hive metastore: also stores metadata in Hive Metastore. Users can access these tables directly from Hive.
 
-### Creating a Catalog Based on FileSystem
+### Creating a Catalog based on FileSystem
 
 #### HDFS
 
@@ -115,7 +116,7 @@ CREATE CATALOG `paimon_oss` PROPERTIES (
 );
 ```
 
-### Creating a Catalog Based on Hive Metastore
+### Creating a Catalog based on Hive Metastore
 
 ```sql
 CREATE CATALOG `paimon_hms` PROPERTIES (
@@ -149,6 +150,26 @@ CREATE CATALOG `paimon_kerberos` PROPERTIES (
 );
 ```
 
+### Creating a Catalog based on Aliyun DLF
+
+This feature is supported since versions 2.1.7 and 3.0.3.
+
+```
+CREATE CATALOG `paimon_dlf` PROPERTIES (
+    "type" = "paimon",
+    "paimon.catalog.type" = "dlf",
+    "warehouse" = "oss://xx/yy/",
+    "dlf.proxy.mode" = "DLF_ONLY",
+    "dlf.uid" = "xxxxx",
+    "dlf.region" = "cn-beijing",
+    "dlf.access_key" = "ak",
+    "dlf.secret_key" = "sk"
+
+    -- "dlf.endpoint" = "dlf.cn-beijing.aliyuncs.com", -- optional
+    -- "dlf.catalog.id" = "xxxx", -- optional
+);
+```
+
 ## Column Type Mapping
 
 | Paimon Data Type | Doris Data Type | Comment |
diff --git a/versioned_docs/version-3.0/lakehouse/datalake-building/hive-build.md b/versioned_docs/version-3.0/lakehouse/datalake-building/hive-build.md
index b3c83a39d91..a63e550e826 100644
--- a/versioned_docs/version-3.0/lakehouse/datalake-building/hive-build.md
+++ b/versioned_docs/version-3.0/lakehouse/datalake-building/hive-build.md
@@ -146,6 +146,21 @@ For clusters upgraded from old versions, these variables may change.
     'file_format'='orc',
     'compression'='zlib'
 );
+
+-- Create a text format table (since 2.1.7 & 3.0.3)
+CREATE TABLE text_table (
+    `id` INT,
+    `name` STRING
+) PROPERTIES (
+    'file_format'='text',
+    'compression'='gzip',
+    'field.delim'='\t',
+    'line.delim'='\n',
+    'collection.delim'=';',
+    'mapkey.delim'=':',
+    'serialization.null.format'='\\N',
+    'escape.delim'='\\'
+);
 ```
 
 After creation, you can view the Hive table creation statement using the `SHOW CREATE TABLE` command.
 
@@ -209,13 +224,24 @@ For clusters upgraded from old versions, these variables may change.
 
 - File Formats
 
-  - Parquet
   - ORC (default)
-
+  - Parquet
+  - Text (supported since versions 2.1.7 & 3.0.3)
+
+    The Text format also supports the following table properties:
+
+    - `field.delim`: column delimiter. Default `\1`.
+    - `line.delim`: row delimiter. Default `\n`.
+    - `collection.delim`: delimiter between elements in complex types. Default `\2`.
+    - `mapkey.delim`: key/value delimiter of the Map type. Default `\3`.
+    - `serialization.null.format`: storage format of NULL values. Default `\N`.
+    - `escape.delim`: escape character. Default `\`.
 
 - Compression Formats
 
-  - Parquet: snappy(default), zstd, plain. (plain means no compression is used.)
-  - ORC: snappy, zlib(default), zstd, plain. (plain means no compression is used.)
+  - Parquet: snappy(default), zstd, plain. (plain means no compression)
+  - ORC: snappy, zlib(default), zstd, plain. (plain means no compression)
+  - Text: gzip, deflate, bzip2, zstd, lz4, lzo, snappy, plain (default). (plain means no compression)
 
 - Storage Medium
 
@@ -234,7 +260,7 @@ Currently, writing to specific partitions is not supported.
 
 ### INSERT
 
-The INSERT operation appends data to the target table.
+The INSERT operation appends data to the target table. Currently, writing to a specific partition is not supported.
 
 ```
 INSERT INTO hive_tbl values (val1, val2, val3, val4);
@@ -246,13 +272,19 @@ INSERT INTO hive_tbl(col1, col2, partition_col1, partition_col2) values (1, 2, "
 
 ### INSERT OVERWRITE
 
-The INSERT OVERWRITE operation completely overwrites the existing data in the table with new data.
+The INSERT OVERWRITE operation completely overwrites the existing data in the table with new data. Currently, writing to a specific partition is not supported.
 
 ```
 INSERT OVERWRITE TABLE VALUES(val1, val2, val3, val4)
 INSERT OVERWRITE TABLE hive.hive_db.hive_tbl(col1, col2) SELECT col1, col2 FROM internal.db1.tbl1;
 ```
 
+The semantics of INSERT OVERWRITE are consistent with Hive, with the following behaviors:
+
+- When the target table is a partitioned table and the source table is empty, the operation has no effect; the target table data does not change.
+- When the target table is a non-partitioned table and the source table is empty, the target table is cleared.
+- Currently, writing to a specified partition is not supported, so INSERT OVERWRITE automatically determines the target table partitions from the values in the source data. If the target table is a partitioned table, only the partitions involved are overwritten, and the data of the partitions not involved does not change.
+
 ### CTAS (CREATE TABLE AS SELECT)
 
 A Hive table can be created and populated with data using the `CTAS (CREATE TABLE AS SELECT)` statement:
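+
+A minimal sketch (table and column names are hypothetical; with no 'file_format' property
+the default ORC format applies):
+
+```
+CREATE TABLE hive.hive_db.hive_ctas_tbl
+AS SELECT col1, col2 FROM internal.db1.tbl1;
+```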