This is an automated email from the ASF dual-hosted git repository. morningman pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push: new d1e2e2cb979 [opt](catalog) add english version for part of catalog docs (#1771) d1e2e2cb979 is described below commit d1e2e2cb979226b4e7b5cb8b77c83c585897d46f Author: Mingyu Chen (Rayner) <morning...@163.com> AuthorDate: Mon Jan 13 15:42:17 2025 +0800 [opt](catalog) add english version for part of catalog docs (#1771) ## Versions - [x] dev - [ ] 3.0 - [ ] 2.1 - [ ] 2.0 ## Languages - [x] Chinese - [x] English ## Docs Checklist - [ ] Checked by AI - [ ] Test Cases Built --- docs/lakehouse/catalog-overview.md | 242 +++++++- docs/lakehouse/catalogs/hive-catalog.md | 658 ++++++++++++++++++++- docs/lakehouse/catalogs/hudi-catalog.md | 203 ++++++- docs/lakehouse/catalogs/iceberg-catalog.md | 495 +++++++++++++++- docs/lakehouse/catalogs/jdbc-catalog-overview.md | 2 +- docs/lakehouse/catalogs/paimon-catalog.md | 188 +++++- .../docusaurus-plugin-content-docs/current.json | 4 +- .../current/lakehouse/catalog-overview.md | 4 +- .../current/lakehouse/catalogs/hive-catalog.md | 4 +- .../current/lakehouse/catalogs/hudi-catalog.md | 27 +- .../lakehouse/catalogs/jdbc-catalog-overview.md | 2 +- .../current/lakehouse/catalogs/paimon-catalog.md | 18 +- sidebars.json | 2 +- 13 files changed, 1819 insertions(+), 30 deletions(-) diff --git a/docs/lakehouse/catalog-overview.md b/docs/lakehouse/catalog-overview.md index 4af8cf97b78..0cd088b8ef6 100644 --- a/docs/lakehouse/catalog-overview.md +++ b/docs/lakehouse/catalog-overview.md @@ -1,6 +1,6 @@ --- { - "title": "Catalog Overview", + "title": "Data Catalog Overview", "language": "en" } --- @@ -24,5 +24,243 @@ specific language governing permissions and limitations under the License. --> -The document is under development, please refer to versioned doc 2.1 or 3.0 +A Data Catalog is used to describe the attributes of a data source. + +In Doris, multiple catalogs can be created to point to different data sources (such as Hive, Iceberg, MySQL). Doris will automatically obtain the databases, tables, columns, partitions, data locations, etc. of the corresponding data sources through the catalogs. Users can access these catalogs for data analysis through standard SQL statements and can conduct join queries on the data from multiple catalogs. + +There are two types of catalogs in Doris: + +| Type | Description | +| ---------------- | -------------------------------------------------------- | +| Internal Catalog | The built-in catalog, named `internal`, used to store Doris internal table data. It cannot be created, modified, or deleted. | +| External Catalog | External catalogs refer to all catalogs other than the Internal Catalog. Users can create, modify, and delete external catalogs. | + +Catalogs are mainly applicable to the following three scenarios, but different catalogs are suitable for different scenarios. For details, see the documentation for the corresponding catalog. + +| Scenario | Description | +| ---- | ------------------------------------------- | +| Query Acceleration | Direct query acceleration for data lakes such as Hive, Iceberg, Paimon, etc. | +| Data Integration | ZeroETL solution, directly accessing different data sources to generate result data, or facilitating data flow between different data sources. | +| Data Write-back | After data processing via Doris, write back to external data sources. | + +This document uses [Iceberg Catalog](./catalogs/iceberg-catalog.md) as an example to focus on the basic operations of catalogs. 
For detailed descriptions of different catalogs, please refer to the documentation of the corresponding catalog. + +## Creating Catalog + +Create an Iceberg Catalog using the `CREATE CATALOG` statement. + +```sql +CREATE CATALOG iceberg_catalog PROPERTIES ( + 'type' = 'iceberg', + 'iceberg.catalog.type' = 'hadoop', + 'warehouse' = 's3://bucket/dir/key', + 's3.endpoint' = 's3.us-east-1.amazonaws.com', + 's3.access_key' = 'ak', + 's3.secret_key' = 'sk' +); +``` + +Essentially, a catalog created in Doris acts as a "proxy" to access the metadata services (such as Hive Metastore) and storage services (such as HDFS/S3) of the corresponding data source. Doris only stores connection properties and other information of the catalog, not the actual metadata and data of the corresponding data source. + +### Common Properties + +In addition to the set of properties specific to each catalog, here are the common properties for all catalogs `{CommonProperties}`. + +| Property Name | Description | Example | +| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------ | -------------------------------------- | +| `include_database_list` | Supports synchronizing only specified Databases, separated by `,`. By default, all Databases are synchronized. Database names are case-sensitive. Use this parameter when there are many Databases in the external data source but only a few need to be accessed, to avoid synchronizing a large amount of metadata. | `'include_database_list' = 'db1,db2'` | +| `exclude_database_list` | Supports specifying multiple Databases that do not need to be synchronized, separated by `,`. By default, no filtering is applied, and all Databases are synchronized. Database names are case-sensitive. This is used in the same scenario as above, to exclude databases that do not need to be accessed. If there is a conflict, `exclude` takes precedence over `include`. | `'exclude_database_list' = 'db1,db2'` | + + +### Column Type Mapping + +After a user creates a catalog, Doris automatically synchronizes the databases, tables, and columns of the catalog. For column type mapping rules of different catalogs, please refer to the documentation of the corresponding catalog. + +For external data types that cannot currently be mapped to Doris column types, such as `UNION`, `INTERVAL`, etc., Doris will map the column type to `UNSUPPORTED`. For queries involving `UNSUPPORTED` types, see the example below: + +Assume the synchronized table schema is: + +```text +k1 INT, +k2 INT, +k3 UNSUPPORTED, +k4 INT +``` + +The query behavior is as follows: + +```sql +SELECT * FROM table; -- Error: Unsupported type 'UNSUPPORTED_TYPE' in 'k3' +SELECT * EXCEPT(k3) FROM table; -- Query OK. +SELECT k1, k3 FROM table; -- Error: Unsupported type 'UNSUPPORTED_TYPE' in 'k3' +SELECT k1, k4 FROM table; -- Query OK. 
+``` + +## Using Catalog + +### Viewing Catalog + +After creation, you can view the catalog using the `SHOW CATALOGS` command: + +```text +mysql> SHOW CATALOGS; ++-----------+-----------------+----------+-----------+-------------------------+---------------------+------------------------+ +| CatalogId | CatalogName | Type | IsCurrent | CreateTime | LastUpdateTime | Comment | ++-----------+-----------------+----------+-----------+-------------------------+---------------------+------------------------+ +| 10024 | iceberg_catalog | hms | yes | 2023-12-25 16:11:41.687 | 2023-12-25 20:43:18 | NULL | +| 0 | internal | internal | | NULL | NULL | Doris internal catalog | ++-----------+-----------------+----------+-----------+-------------------------+---------------------+------------------------+ +``` + +You can view the statement to create a catalog using [SHOW CREATE CATALOG](../sql-manual/sql-statements/Show-Statements/SHOW-CREATE-CATALOG.md). + +### Switching Catalog + +Doris provides the `SWITCH` statement to switch the connection session context to the corresponding catalog. This is similar to using the `USE` statement to switch databases. + +After switching to a catalog, you can use the `USE` statement to continue switching to a specified database, or use `SHOW DATABASES` to view the databases under the current catalog. + +```sql +SWITCH iceberg_catalog; + +SHOW DATABASES; ++--------------------+ +| Database | ++--------------------+ +| information_schema | +| mysql | +| test | +| iceberg_db | ++--------------------+ + +USE iceberg_db; +``` + +You can also use the `USE` statement with the fully qualified name `catalog_name.database_name` to switch directly to a specified database within a specified catalog: + +```sql +USE iceberg_catalog.iceberg_db; +``` + +Fully qualified names can also be used in MySQL command line or JDBC connection strings to be compatible with the MySQL connection protocol. + +```sql +# Command line tool +mysql -h host -P9030 -uroot -Diceberg_catalog.iceberg_db + +# JDBC url +jdbc:mysql://host:9030/iceberg_catalog.iceberg_db +``` + +The fixed name for the built-in catalog is `internal`. The switching method is the same as for external catalogs. + +### Simple Query + +You can query tables in external catalogs using any SQL statement supported by Doris. + +```sql +SELECT id, SUM(cost) FROM iceberg_db.table1 +GROUP BY id ORDER BY id; +``` + +### Cross-Catalog Query + +Doris supports join queries across different catalogs. + +Here, let's create another MySQL Catalog: + +```sql +CREATE CATALOG mysql_catalog properties( + 'type' = 'jdbc', + 'user' = 'root', + 'password' = '123456', + 'jdbc_url' = 'jdbc:mysql://host:3306/mysql_db', + 'driver_url' = 'mysql-connector-java-8.0.25.jar', + 'driver_class' = 'com.mysql.cj.jdbc.Driver' +); +``` + +Then, perform a join query between Iceberg tables and MySQL tables using SQL: + +```sql +SELECT * FROM FROM +iceberg_catalog.iceberg_db.table1 tbl1 JOIN mysql_catalog.mysql_db.dim_table tbl2 +ON tbl1.id = tbl2.id; +``` + +### Data Import + +You can import data from a data source into Doris using the `INSERT` command. 
+ +```sql +INSERT INTO internal.doris_db.tbl1 +SELECT * FROM iceberg_catalog.iceberg_db.table1; +``` + +You can also use the `CTAS (Create Table As Select)` statement to create an internal Doris table from an external data source and import the data: + +```sql +CREATE TABLE internal.doris_db.tbl1 +PROPERTIES('replication_num' = '1') +AS +SELECT * FROM iceberg_catalog.iceberg_db.table1; +``` + +### Data Write-Back + +Doris supports writing data back to external data sources using the `INSERT` statement. For more details, refer to: + +* [Hive Catalog](./catalogs/hive-catalog.md) +* [Iceberg Catalog](./catalogs/iceberg-catalog.md) +* [JDBC Catalog](./catalogs/jdbc-catalog-overview.md) + +## Refreshing Catalog + +Catalogs created in Doris act as "proxy" to access the metadata services of corresponding data sources. Doris caches some metadata to improve access performance and reduce frequent cross-network requests. However, the cache has a validity period, and without refreshing, you cannot access the latest metadata. Therefore, Doris provides several ways to refresh catalogs. + +```sql +-- Refresh catalog +REFRESH CATALOG catalog_name; + +-- Refresh specified database +REFRESH DATABASE catalog_name.db_name; + +-- Refresh specified table +REFRESH TABLE catalog_name.db_name.table_name; +``` + +Doris also supports disabling metadata caching to access the latest metadata in real-time. + +For detailed information and configuration of metadata caching, please refer to: [Metadata Cache](./meta-cache.md) + +## Modifying Catalog + +You can modify the properties or name of a catalog using the `ALTER CATALOG` statement: + +```sql +-- Rename a catalog +ALTER CATALOG iceberg_catalog RENAME iceberg_catalog2; + +-- Modify properties of a catalog +ALTER CATALOG iceberg_catalog SET PROPERTIES ('key1' = 'value1' [, 'key' = 'value2']); + +-- Modify the comment of a catalog +ALTER CATALOG iceberg_catalog MODIFY COMMENT 'my iceberg catalog'; +``` + +## Deleting Catalog + +You can delete a specified external catalog using the `DROP CATALOG` statement. + +```sql +DROP CATALOG [IF EXISTS] iceberg_catalog; +``` + +Deleting an external catalog from Doris does not remove the actual data; it only deletes the mapping relationship stored in Doris. + +## Permission Management + +The permission management for databases and tables in an external catalog is the same as for internal tables. For details, refer to the [Authentication and Authorization](../admin-manual/auth/authentication-and-authorization.md) documentation. + + diff --git a/docs/lakehouse/catalogs/hive-catalog.md b/docs/lakehouse/catalogs/hive-catalog.md index df0f344774c..98938f3fcd3 100644 --- a/docs/lakehouse/catalogs/hive-catalog.md +++ b/docs/lakehouse/catalogs/hive-catalog.md @@ -24,5 +24,661 @@ specific language governing permissions and limitations under the License. --> -The document is under development, please refer to versioned doc 2.1 or 3.0 +By connecting to Hive Metastore or metadata services compatible with Hive Metastore, Doris can automatically retrieve Hive database and table information for data querying. +In addition to Hive, many other systems use Hive Metastore to store metadata. Therefore, through the Hive Catalog, we can access not only Hive tables but also other table formats that use Hive Metastore for metadata storage, such as Iceberg and Hudi. 
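For instance, a single Hive Catalog can expose Hive, Iceberg, and Hudi tables that are registered in the same Hive Metastore. The sketch below uses hypothetical host, database, and table names:

```sql
-- A minimal sketch: one catalog backed by one Hive Metastore (the address is a placeholder)
CREATE CATALOG hive_all PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://172.0.0.1:9083'
);

-- Tables of different formats registered in that Metastore are queried the same way
SELECT * FROM hive_all.sales_db.orders_hive LIMIT 10;  -- plain Hive table
SELECT * FROM hive_all.sales_db.orders_hudi LIMIT 10;  -- Hudi table in the same Metastore
```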
+ +## Applicable Scenarios + +| Scenario | Description | +|--------------|--------------------------------------------------------------| +| Query Acceleration | Use Doris's distributed computing engine to directly access Hive data for query acceleration. | +| Data Integration | Read Hive data and write it to Doris internal tables, or perform ZeroETL operations using the Doris computing engine. | +| Data Write-back | Process data from any source supported by Doris and write it back to Hive tables. | + +## Configuring Catalog + +### Syntax + +```sql +CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( + 'type'='hms', -- required + 'hive.metastore.type' = '<hive_metastore_type>', -- optional + 'hive.version' = '<hive_version>', -- optional + 'fs.defaultFS' = '<fs_defaultfs>', -- optional + {MetaStoreProperties}, + {StorageProperties}, + {CommonProperties} +); +``` + +* `<hive_metastore_type>` + + Specify the type of Hive Metastore. + + * `hms`: Standard Hive Metastore service. + * `glue`: Access AWS Glue metadata service using Hive Metastore compatible interface. + * `dlf`: Access Alibaba Cloud DLF metadata service using Hive Metastore compatible interface. + +* `<fs_defaultfs>` + + This parameter is required when writing data from Doris to tables in this Hive Catalog. Example: + + `'fs.defaultFS' = 'hdfs://namenode:port'` + +* `{MetaStoreProperties}` + + The MetaStoreProperties section is for entering connection and authentication information for the Metastore metadata service. Refer to the "Supported Metadata Services" section for details. + +* `{StorageProperties}` + + The StorageProperties section is for entering connection and authentication information related to the storage system. Refer to the "Supported Storage Systems" section for details. + +* `{CommonProperties}` + + The CommonProperties section is for entering common attributes. Please see the "Common Properties" section in the [Catalog Overview](../catalog-overview.md). + +### Supported Hive Versions + +Supports Hive 1.x, 2.x, 3.x, and 4.x. + +Hive transactional tables are supported from version 3.x onwards. For details, refer to the "Hive Transactional Tables" section. + +### Supported Metadata Services + +* [Hive Metastore](../metastores/hive-metastore.md) +* [AWS Glue](../metastores/aws-glue.md) +* [Aliyun DLF](../metastores/aliyun-dlf.md) + +### Supported Storage Systems + +* [HDFS](../storages/hdfs.md) +* [AWS S3](../storages/s3.md) +* [Google Cloud Storage](../storages/google-cloud-storage.md) +* [Alibaba Cloud OSS](../storages/aliyun-oss.md) +* [Tencent Cloud COS](../storages/tencent-cos.md) +* [Huawei Cloud OBS](../storages/huawei-obs.md) +* [MINIO](../storages/minio.md) + +> To create Hive tables and write data through Doris, you need to explicitly add the `fs.defaultFS` property in the Catalog attributes. If the Catalog is created only for querying, this parameter can be omitted. 
+ +### Supported Data Formats + +* Hive + + * [ Parquet](../file-formats/parquet.md) + + * [ ORC](../file-formats/orc.md) + + * [ Text/CSV/JSON](../file-formats/text.md) + +* Hudi + + * [ Parquet](../file-formats/parquet.md) + + * [ ORC](../file-formats/orc.md) + +* Iceberg + + * [ Parquet](../file-formats/parquet.md) + + * [ ORC](../file-formats/orc.md) + +## Column Type Mapping + +| Hive Type | Doris Type | Comment | +| ------------- | ------------- | ---------------------------------------------- | +| boolean | boolean | | +| tinyint | tinyint | | +| smallint | smallint | | +| int | int | | +| bigint | bigint | | +| date | date | | +| timestamp | datetime(6) | Mapped to datetime with precision 6 | +| float | float | | +| double | double | | +| decimal(P, S) | decimal(P, S) | Defaults to decimal(9, 0) if precision not specified | +| char(N) | char(N) | | +| varchar(N) | varchar(N) | | +| string | string | | +| binary | string | | +| array | array | | +| map | map | | +| struct | struct | | +| other | unsupported | | + + +## Examples + +### Hive on HDFS + +```sql +CREATE CATALOG hive_hdfs PROPERTIES ( + 'type' = 'hms', + 'hive.metastore.uris' = 'thrift://172.0.0.1:9083' +); +``` + +### Hive on HDFS with HA + +```sql +CREATE CATALOG hive_hdfs_ha PROPERTIES ( + 'type' = 'hms', + 'hive.metastore.uris' = 'thrift://172.0.0.1:9083', + 'dfs.nameservices' = 'your-nameservice', + 'dfs.ha.namenodes.your-nameservice' = 'nn1,nn2', + 'dfs.namenode.rpc-address.your-nameservice.nn1' = '172.21.0.2:8088', + 'dfs.namenode.rpc-address.your-nameservice.nn2' = '172.21.0.3:8088', + 'dfs.client.failover.proxy.provider.your-nameservice' = 'org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider' +); +``` + +### Hive on ViewFS + +```sql +CREATE CATALOG hive_viewfs PROPERTIES ( + 'type' = 'hms', + 'hive.metastore.uris' = 'thrift://172.0.0.1:9083', + 'dfs.nameservices' = 'your-nameservice', + 'dfs.ha.namenodes.your-nameservice' = 'nn1,nn2', + 'dfs.namenode.rpc-address.your-nameservice.nn1' = '172.21.0.2:8088', + 'dfs.namenode.rpc-address.your-nameservice.nn2' = '172.21.0.3:8088', + 'dfs.client.failover.proxy.provider.your-nameservice' = 'org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider', + 'fs.defaultFS' = 'viewfs://your-cluster', + 'fs.viewfs.mounttable.your-cluster.link./ns1' = 'hdfs://your-nameservice/', + 'fs.viewfs.mounttable.your-cluster.homedir' = '/ns1' +); +``` + +### Hive on S3 + +```sql +CREATE CATALOG hive_s3 PROPERTIES ( + 'type' = 'hms', + 'hive.metastore.uris' = 'thrift://172.0.0.1:9083', + 's3.endpoint' = 's3.us-east-1.amazonaws.com', + 's3.region' = 'us-east-1', + 's3.access_key' = 'ak', + 's3.secret_key' = 'sk', + 'use_path_style' = 'true' +); +``` + +## Query Operations + +### Basic Query + +After configuring the Catalog, you can query the table data within the Catalog using the following method: + +```sql +-- 1. switch to catalog, use database and query +SWITCH hive_ctl; +USE hive_db; +SELECT * FROM hive_tbl LIMIT 10; + +-- 2. use hive database directly +USE hive_ctl.hive_db; +SELECT * FROM hive_tbl LIMIT 10; + +-- 3. use full qualified name to query +SELECT * FROM hive_ctl.hive_db.hive_tbl LIMIT 10; +``` + +### Querying Hive Partitions + +You can query Hive partition information using the following two methods: + +* `SHOW PARTITIONS FROM [catalog.][db.]hive_table` + + This statement lists all partitions and their values for the specified Hive table. 
+ + ```sql + SHOW PARTITIONS FROM hive_table; + + +--------------------------------+ + | Partition | + +--------------------------------+ + | pt1=2024-10-10/pt2=beijing | + | pt1=2024-10-10/pt2=shanghai | + | pt1=2024-10-11/pt2=beijing | + | pt1=2024-10-11/pt2=shanghai | + | pt1=2024-10-12/pt2=nanjing | + +--------------------------------+ + ``` + +* Using the `table$partitions` Metadata Table + + Starting from versions 2.1.7 and 3.0.3, you can query Hive partition information through the `table$partitions` metadata table. This table is essentially relational, with each partition column represented as a column, allowing it to be used in any SELECT statement. + + ```sql + SELECT * FROM hive_table$partitions; + + +------------+-------------+ + | pt1 | pt2 | + +------------+-------------+ + | 2024-10-10 | beijing | + | 2024-10-10 | shanghai | + | 2024-10-12 | nanjing | + | 2024-10-11 | beijing | + | 2024-10-11 | shanghai | + +------------+-------------+ + ``` + +### Querying Hive Transactional Tables + +Hive Transactional tables support ACID semantics. For more details, see [Hive Transactions](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions). + +* Support for Hive Transactional Tables + + | Table Type | Supported Operations in Hive | Hive Table Properties | Supported Hive Versions | + | ------------------------------- | ----------------------------------- | ------------------------------------------------------------------- | ---------------------------------------------------------------- | + | Full-ACID Transactional Table | Supports Insert, Update, Delete | `'transactional'='true'` | 4.x, 3.x, 2.x (2.x requires Major Compaction in Hive to read) | + | Insert-Only Transactional Table | Supports Insert only | `'transactional'='true'`, `'transactional_properties'='insert_only'` | 4.x, 3.x, 2.x (specify `hive.version` when creating the catalog) | + +* Current Limitations + + Original Files scenarios are not supported. When a table is converted to a Transactional table, new data files will use the Hive Transactional table schema, but existing data files will not be converted. These files are referred to as Original Files. + +### Querying Hive Views + +You can query Hive Views, but there are some limitations: + +* The Hive View definition (HiveQL) must be compatible with SQL statements supported by Doris. Otherwise, a parsing error will occur. + +* Some functions supported by HiveQL may have the same name as those in Doris but behave differently. This could lead to discrepancies between the results obtained from Hive and Doris. If you encounter such issues, please report them to the community. + +## Write Operations + +Data can be written to Hive tables using the INSERT statement. This is supported for Hive tables created by Doris or existing Hive tables with compatible formats. + +For partitioned tables, data will automatically be written to the corresponding partition or a new partition will be created based on the data. Currently, specifying a partition for writing is not supported. + +### INSERT INTO + +The INSERT operation appends data to the target table. Specifying a partition for writing is currently not supported. 
+ +```sql +INSERT INTO hive_tbl VALUES (val1, val2, val3, val4); +INSERT INTO hive_ctl.hive_db.hive_tbl SELECT col1, col2 FROM internal.db1.tbl1; + +INSERT INTO hive_tbl(col1, col2) VALUES (val1, val2); +INSERT INTO hive_tbl(col1, col2, partition_col1, partition_col2) VALUES (1, 2, "beijing", "2023-12-12"); +``` + +### INSERT OVERWRITE + +INSERT OVERWRITE completely replaces the existing data in the table with new data. Specifying a partition for writing is currently not supported. + +```sql +INSERT OVERWRITE TABLE hive_tbl VALUES (val1, val2, val3, val4); +INSERT OVERWRITE TABLE hive_ctl.hive_db.hive_tbl(col1, col2) SELECT col1, col2 FROM internal.db1.tbl1; +``` + +The semantics of INSERT OVERWRITE are consistent with Hive, with the following behaviors: + +* If the target table is partitioned and the source table is empty, the operation has no effect. The target table remains unchanged. + +* If the target table is non-partitioned and the source table is empty, the target table will be cleared. + +* Since specifying a partition for writing is not supported, INSERT OVERWRITE automatically handles the relevant partitions in the target table based on the source table values. If the target table is partitioned, only the affected partitions will be overwritten; unaffected partitions remain unchanged. + +### CTAS + +You can create a Hive table and insert data using the `CTAS (CREATE TABLE AS SELECT)` statement: + +```sql +CREATE TABLE hive_ctas ENGINE=hive AS SELECT * FROM other_table; +``` + +CTAS supports specifying file formats, partitioning methods, and more, as shown below: + +```sql +CREATE TABLE hive_ctas ENGINE=hive +PARTITION BY LIST (pt1, pt2) () +AS SELECT col1, pt1, pt2 FROM part_ctas_src WHERE col1 > 0; + +CREATE TABLE hive_ctl.hive_db.hive_ctas (col1, col2, pt1) ENGINE=hive +PARTITION BY LIST (pt1) () +PROPERTIES ( + "file_format"="parquet", + "compression"="zstd" +) +AS SELECT col1, pt1 AS col2, pt2 AS pt1 FROM test_ctas.part_ctas_src WHERE col1 > 0; +``` + +### Related Parameters + +* BE + + | Parameter Name | Default Value | Description | + | ----------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- | ----------- | + | `hive_sink_max_file_size` | 1GB | Maximum data file size. When the data size exceeds this limit, the current file will be closed, and a new file will be created for further writing. | + | `table_sink_partition_write_max_partition_nums_per_writer` | 128 | Maximum number of partitions that each instance can write to on a BE node. | + | `table_sink_non_partition_write_scaling_data_processed_threshold` | 25MB | Data volume threshold for starting scaling-write for non-partitioned tables. A new writer (instance) will be used for every additional `table_sink_non_partition_write_scaling_data_processed_threshold` of data. This mechanism adjusts the number of writers (instances) based on data volume to e [...] + | `table_sink_partition_write_min_data_processed_rebalance_threshold` | 25MB | Minimum data volume threshold to trigger rebalancing for partitioned tables. Rebalancing is triggered if `current accumulated data volume` - `data volume since last rebalancing or initial accumulation` >= `table_sink_partition_write_min_data_processed_rebalance_threshold`. If the final file [...] + | `table_sink_partition_write_min_partition_data_processed_rebalance_threshold` | 15MB | Minimum partition data volume threshold to trigger rebalancing. 
Rebalancing occurs if `current partition data volume` >= `threshold` \* `number of tasks already allocated to the partition`. If the final file size varies significantly, reduce this threshold to improve balance. However, a lowe [...] + + +## Database and Table Management + +Users can create and delete databases and tables in the Hive Metastore through Doris. Note that Doris only calls the Hive Metastore API for these operations and does not store or persist any Hive metadata itself. + +### Creating and Dropping Databases + +You can switch to the appropriate Catalog using the `SWITCH` statement and execute the `CREATE DATABASE` statement: + +```sql +SWITCH hive_ctl; +CREATE DATABASE [IF NOT EXISTS] hive_db; +``` + +You can also create a database using a fully qualified name or specify a location, such as: + +```sql +CREATE DATABASE [IF NOT EXISTS] hive_ctl.hive_db; + +CREATE DATABASE [IF NOT EXISTS] hive_ctl.hive_db +PROPERTIES ('location'='hdfs://172.21.16.47:4007/path/to/db/'); +``` + +You can view the location information of the Database using the `SHOW CREATE DATABASE` command: + +```sql +mysql> SHOW CREATE DATABASE hive_db; ++----------+---------------------------------------------------------------------------------------------+ +| Database | Create Database | ++----------+---------------------------------------------------------------------------------------------+ +| hive_db | CREATE DATABASE hive_db LOCATION 'hdfs://172.21.16.47:4007/usr/hive/warehouse/hive_db.db' | ++----------+---------------------------------------------------------------------------------------------+ +``` + +To drop a database: + +```sql +DROP DATABASE [IF EXISTS] hive_ctl.hive_db; +``` + +:::caution +For a Hive Database, you must first delete all tables under that Database before you can delete the Database itself; otherwise, an error will occur. This operation will also delete the corresponding Database in Hive. +::: + +### Creating and Dropping Tables + +- Creating Tables + + Doris supports creating both partitioned and non-partitioned tables in Hive. + + ```sql + -- Create unpartitioned hive table + CREATE TABLE unpartitioned_table ( + `col1` BOOLEAN COMMENT 'col1', + `col2` INT COMMENT 'col2', + `col3` BIGINT COMMENT 'col3', + `col4` CHAR(10) COMMENT 'col4', + `col5` FLOAT COMMENT 'col5', + `col6` DOUBLE COMMENT 'col6', + `col7` DECIMAL(9,4) COMMENT 'col7', + `col8` VARCHAR(11) COMMENT 'col8', + `col9` STRING COMMENT 'col9' + ) ENGINE=hive + PROPERTIES ( + 'file_format'='parquet' + ); + + -- Create partitioned hive table + -- The partition columns must be in table's column definition list + CREATE TABLE partition_table ( + `col1` BOOLEAN COMMENT 'col1', + `col2` INT COMMENT 'col2', + `col3` BIGINT COMMENT 'col3', + `col4` DECIMAL(2,1) COMMENT 'col4', + `pt1` VARCHAR COMMENT 'pt1', + `pt2` VARCHAR COMMENT 'pt2' + ) ENGINE=hive + PARTITION BY LIST (pt1, pt2) () + PROPERTIES ( + 'file_format'='orc', + 'compression'='zlib' + ); + + -- Create text format table(Since 2.1.7 & 3.0.3) + CREATE TABLE text_table ( + `id` INT, + `name` STRING + ) PROPERTIES ( + 'file_format'='text', + 'compression'='gzip', + 'field.delim'='\t', + 'line.delim'='\n', + 'collection.delim'=';', + 'mapkey.delim'=':', + 'serialization.null.format'='\\N', + 'escape.delim'='\\' + ); + ``` + + After creating a table, you can view the Hive table creation statement using the `SHOW CREATE TABLE` command. 
+ + Note that unlike Hive's table creation syntax, when creating a partitioned table in Doris, partition columns must be included in the table schema. Additionally, partition columns must be placed at the end of the schema and maintain the same order. + + :::tip + For Hive clusters where ACID transaction features are enabled by default, tables created by Doris will have the `transactional` property set to `true`. Since Doris only supports certain features of Hive transactional tables, this may lead to issues where Doris cannot read the Hive tables it creates. To avoid this, explicitly set `"transactional" = "false"` in the table properties to create non-transactional Hive tables: + + ```sql + CREATE TABLE non_acid_table( + `col1` BOOLEAN COMMENT 'col1', + `col2` INT COMMENT 'col2', + `col3` BIGINT COMMENT 'col3' + ) ENGINE=hive + PROPERTIES ( + 'transactional'='false' + ); + ``` + ::: + +- Dropping Tables + + You can delete a Hive table using the `DROP TABLE` statement. When a table is deleted, all data, including partition data, is also removed. + +- Column Type Mapping + + Refer to the [Column Type Mapping] section for details. Note the following restrictions: + + - Columns must be of the default nullable type; `NOT NULL` is not supported. + - Hive 3.0 supports setting default values. To set default values, explicitly add `"hive.version" = "3.0.0"` in the catalog properties. + - If inserted data types are incompatible (e.g., inserting `'abc'` into a numeric type), the value will be converted to `null`. + +- Partitioning + + In Hive, partition types correspond to List partitions in Doris. Therefore, when creating a Hive partitioned table in Doris, use the List partition syntax, but there is no need to explicitly enumerate each partition. Doris will automatically create the corresponding Hive partition based on data values during data insertion. Single-column or multi-column partitioned tables are supported. + +- File Formats + + - **ORC** (default) + - **Parquet** + - **Text** (supported from versions 2.1.7 and 3.0.3) + + Text format supports the following table properties: + + - `field.delim`: Column delimiter. Default is `\1`. + - `line.delim`: Line delimiter. Default is `\n`. + - `collection.delim`: Delimiter for elements in complex types. Default is `\2`. + - `mapkey.delim`: Delimiter for map key-value pairs. Default is `\3`. + - `serialization.null.format`: Format for storing `NULL` values. Default is `\N`. + - `escape.delim`: Escape character. Default is `\`. + +- Compression Formats + + - **Parquet**: snappy (default), zstd, plain (no compression) + - **ORC**: snappy, zlib (default), zstd, plain (no compression) + - **Text**: gzip, deflate, bzip2, zstd, lz4, lzo, snappy, plain (default, no compression) + +- Storage Medium + + - **HDFS** + - **Object Storage** + +## Subscribing to Hive Metastore Events + +By having the FE nodes periodically read Notification Events from the HMS, Doris can detect real-time changes in Hive table metadata, improving metadata timeliness. Currently, the following events are supported: + +| Event | Action and Corresponding Behavior | +| ---------------- | ------------------------------------------------------------------------- | +| CREATE DATABASE | Creates a database in the corresponding data directory. | +| DROP DATABASE | Deletes a database in the corresponding data directory. | +| ALTER DATABASE | Mainly affects changes to database properties, comments, and default storage locations. 
These changes do not affect Doris's ability to query external data directories, so this event is currently ignored. | +| CREATE TABLE | Creates a table in the corresponding database. | +| DROP TABLE | Deletes a table in the corresponding database and invalidates the table cache. | +| ALTER TABLE | If renamed, deletes the old table and creates a new one with the new name; otherwise, invalidates the table cache. | +| ADD PARTITION | Adds a partition to the cached partition list of the corresponding table. | +| DROP PARTITION | Removes a partition from the cached partition list and invalidates the partition cache. | +| ALTER PARTITION | If renamed, deletes the old partition and creates a new one with the new name; otherwise, invalidates the partition cache. | + +:::tip +1. When data import causes file changes, partitioned tables trigger an `ALTER PARTITION` event, while non-partitioned tables trigger an `ALTER TABLE` event. + +2. If you bypass HMS and directly manipulate the file system, HMS will not generate corresponding events, and Doris will not detect metadata changes. +::: + +The following parameters in `fe.conf` are related to this feature: + +1. `enable_hms_events_incremental_sync`: Enables automatic incremental metadata synchronization. Disabled by default. + +2. `hms_events_polling_interval_ms`: Interval for reading events, default is 10000 milliseconds. + +3. `hms_events_batch_size_per_rpc`: Maximum number of events to read per RPC, default is 500. + +To use this feature (excluding Huawei Cloud MRS), you need to modify the `hive-site.xml` of HMS and restart both HMS and HiveServer2. + +```xml +<property> + <name>hive.metastore.event.db.notification.api.auth</name> + <value>false</value> +</property> +<property> + <name>hive.metastore.dml.events</name> + <value>true</value> +</property> +<property> + <name>hive.metastore.transactional.event.listeners</name> + <value>org.apache.hive.hcatalog.listener.DbNotificationListener</value> +</property> +``` + +For Huawei Cloud MRS, you need to modify the `hivemetastore-site.xml` and restart both HMS and HiveServer2. + +```xml +<property> + <name>metastore.transactional.event.listeners</name> + <value>org.apache.hive.hcatalog.listener.DbNotificationListener</value> +</property> +``` + +## Appendix + +### Transaction Mechanism + +Write operations to Hive are placed in a separate transaction. Before the transaction is committed, the data is not visible externally. Only after the transaction is committed do the related table operations become visible to others. + +Transactions ensure the atomicity of operations, meaning all operations within a transaction either succeed together or fail together. + +Transactions cannot fully guarantee isolation of operations, but they strive to minimize inconsistencies by separating file system operations from Hive Metastore metadata operations. + +For example, in a transaction that requires modifying multiple partitions of a Hive table, if the task is divided into two batches, the first batch might be visible externally before the second batch is completed. This means the first batch of partitions can be read, but the second batch cannot. + +If any exceptions occur during the transaction commit process, the transaction will be rolled back completely, including modifications to HDFS files and Hive Metastore metadata, without requiring any additional user intervention. + +### Concurrent Writing Mechanism + +Apache Doris currently supports concurrent writing using multiple insert statements. 
However, users need to ensure that concurrent writes do not result in potential conflicts. + +Since regular non-transactional Hive tables lack a complete transaction mechanism, the Apache Doris transaction mechanism aims to minimize inconsistency windows but cannot guarantee true ACID properties. Therefore, concurrent writes to Hive tables in Apache Doris may lead to data consistency issues. + +1. **Concurrent `INSERT` Operations** + +2. `INSERT` operations append data and do not conflict when executed concurrently, producing the expected results. + +3. **Concurrent `INSERT OVERWRITE` Operations** + +4. Concurrent `INSERT OVERWRITE` operations on the same table or partition may lead to data loss or corruption, resulting in unpredictable outcomes. + +5. Common solutions include: + + * For partitioned tables, write data to different partitions. Concurrent operations on different partitions do not conflict. + + * For non-partitioned tables, use `INSERT` instead of `INSERT OVERWRITE` to avoid conflicts. + + * For operations that may conflict, ensure that only one write operation occurs at a time on the business side. + +### HDFS File Operations + +For Hive table data on HDFS, data is typically first written to a temporary directory, then finalized using file system operations like `rename`. Below is a detailed explanation of the specific file operations on HDFS for different data operations. + +The temporary directory format for data is: `/tmp/.doris_staging/<username>/<uuid>` + +The format for the written data file names is: `<query-id>_<uuid>-<index>.<compress-type>.<file-type>` + +Here are examples of file operations under various scenarios: + +1. **Non-Partitioned Table** + + * **Append (Add Data)** + + * Target table directory: `hdfs://ns/usr/hive/warehouse/example.db/table1` + + * Temporary file: `hdfs://ns/tmp/.doris_staging/root/f02247cb662846038baae272af5eeb05/b35fdbcea3a4e39-86d1f36987ef1492_7e3985bf-9de9-4fc7-b84e-adf11aa08756-0.orc` + + * During the commit phase, all temporary files are moved to the target table directory. + + * **Overwrite (Replace Data)** + + * Target table directory: `hdfs://ns/usr/hive/warehouse/example.db/table1` + + * Temporary file: `hdfs://ns/tmp/.doris_staging/root/f02247cb662846038baae272af5eeb05/b35fdbcea3a4e39-86d1f36987ef1492_7e3985bf-9de9-4fc7-b84e-adf11aa08756-0.orc` + + * Commit phase steps: + + 1. Rename the target table directory to a temporary directory: `hdfs://ns/usr/hive/warehouse/example.db/_temp_b35fdbcea3a4e39-86d1f36987ef1492_table1` + + 2. Rename the temporary directory to the target table directory. + + 3. Delete the temporary target table directory. + +2. **Partitioned Table** + + * **Add (Add to New Partition)** + + * Target table directory: `hdfs://ns/usr/hive/warehouse/example.db/table2/part_col=2024-01-01` + + * Temporary file: `hdfs://ns/tmp/.doris_staging/root/a7eac7505d7a42fdb06cb9ef1ea3e912/par1=a/d678a74d232345e0-b659e2fb58e86ffd_549ad677-ee75-4fa1-b8a6-3e821e1dae61-0.orc` + + * During the commit phase, the temporary directory is renamed to the target table directory. + + * **Append (Add Data to Existing Partition)** + + * Target table directory: `hdfs://ns/usr/hive/warehouse/example.db/table2/part_col=2024-01-01` + + * Temporary file: `hdfs://ns/tmp/.doris_staging/root/a7eac7505d7a42fdb06cb9ef1ea3e912/par1=a/d678a74d232345e0-b659e2fb58e86ffd_549ad677-ee75-4fa1-b8a6-3e821e1dae61-0.orc` + + * During the commit phase, files from the temporary directory are moved to the target table directory. 
+ + * **Overwrite (Replace Existing Partition)** + + * Target table directory: `hdfs://ns/usr/hive/warehouse/example.db/table2/part_col=2024-01-01` + + * Temporary file: `hdfs://ns/tmp/.doris_staging/root/a7eac7505d7a42fdb06cb9ef1ea3e912/par1=a/d678a74d232345e0-b659e2fb58e86ffd_549ad677-ee75-4fa1-b8a6-3e821e1dae61-0.orc` + + * Commit phase steps: + + 1. Rename the target partition directory to a temporary partition directory: `hdfs://ns/usr/hive/warehouse/example.db/table2/_temp_d678a74d232345e0-b659e2fb58e86ffd_part_col=2024-01-01` + + 2. Rename the temporary partition directory to the target partition directory. + + 3. Delete the temporary target partition directory. + +### Change Log + +| Doris Version | Feature Support | +| ------------- | --------------------------------------------- | +| 2.1.6 | Support for writing back to Hive tables | +| 3.0.4 | Support for Hive tables in JsonSerDe format. Support for transactional tables in Hive4. | \ No newline at end of file diff --git a/docs/lakehouse/catalogs/hudi-catalog.md b/docs/lakehouse/catalogs/hudi-catalog.md index af0cc8ee1ca..7ad02eb0573 100644 --- a/docs/lakehouse/catalogs/hudi-catalog.md +++ b/docs/lakehouse/catalogs/hudi-catalog.md @@ -24,5 +24,206 @@ specific language governing permissions and limitations under the License. --> -The document is under development, please refer to versioned doc 2.1 or 3.0 +Hudi Catalog reuses the Hive Catalog. By connecting to the Hive Metastore, or a metadata service compatible with the Hive Metastore, Doris can automatically obtain Hudi's database and table information and perform data queries. +## Applicable Scenarios + +| Scenario | Description | +| -------- | ----------- | +| Query Acceleration | Use Doris's distributed computing engine to directly access Hudi data for query acceleration. | +| Data Integration | Read Hudi data and write it into Doris internal tables, or perform ZeroETL operations using the Doris computing engine. | +| Data Write-back | Not supported. | + +## Configuring Catalog + +### Syntax + +```sql +CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( + 'type' = 'hms', -- required + 'hive.metastore.uris' = '<metastore_thrift_url>', -- required + {MetaStoreProperties}, + {StorageProperties}, + {HudiProperties}, + {CommonProperties} +); +``` + +* `{MetaStoreProperties}` + + The MetaStoreProperties section is used to fill in the connection and authentication information for the Metastore metadata service. See the section [Supported Metadata Services] for details. + +* `{StorageProperties}` + + The StorageProperties section is used to fill in the connection and authentication information related to the storage system. See the section [Supported Storage Systems] for details. + +* `{CommonProperties}` + + The CommonProperties section is used to fill in common properties. Please refer to the [Data Catalog Overview](../catalog-overview.md) section on [Common Properties]. + +* `{HudiProperties}` + + | Parameter Name | Former Name | Description | Default Value | + | ------------------------------- | -------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------- | + | `hudi.use_hive_sync_partition` | `use_hive_sync_partition` | Whether to use the partition information already synchronized by Hive Metastore. If true, partition information will be obtained directly from Hive Metastore. 
Otherwise, it will be obtained from the metadata file of the file system. Obtaining information from Hive Metastore is more efficient, but users need to ensure that the latest metadata has been synchronized to Hive Metastore. | false | + +### Supported Hudi Versions + +The current dependent Hudi version is 0.15. It is recommended to access Hudi data version 0.14 and above. + +### Supported Query Types + +| Table Type | Supported Query Types | +| ------------- | ------------------------------------------------------------------------ | +| Copy On Write | Snapshot Query, Time Travel, Incremental Read | +| Merge On Read | Snapshot Queries, Read Optimized Queries, Time Travel, Incremental Read | + +### Supported Metadata Services + +* [Hive Metastore](../metastores/hive-metastore.md) + +### Supported Storage Systems + +* [HDFS](../storages/hdfs.md) +* [AWS S3](../storages/s3.md) +* [Google Cloud Storage](../storages/google-cloud-storage.md) +* [Alibaba Cloud OSS](../storages/aliyun-oss.md) +* [Tencent Cloud COS](../storages/tencent-cos.md) +* [Huawei Cloud OBS](../storages/huawei-obs.md) +* [MINIO](../storages/minio.md) + +### Supported Data Formats + +* [Parquet](../file-formats/parquet.md) +* [ORC](../file-formats/orc.md) + +## Column Type Mapping + +| Hudi Type | Doris Type | Comment | +| ------------- | ------------- | --------------------------------------------------------- | +| boolean | boolean | | +| int | int | | +| long | bigint | | +| float | float | | +| double | double | | +| decimal(P, S) | decimal(P, S) | | +| bytes | string | | +| string | string | | +| date | date | | +| timestamp | datetime(N) | Automatically maps to datetime(3) or datetime(6) based on precision | +| array | array | | +| map | map | | +| struct | struct | | +| other | UNSUPPORTED | | + +## Examples + +The creation of a Hudi Catalog is similar to a Hive Catalog. For more examples, please refer to [Hive Catalog](./hive-catalog.md). + +```sql +CREATE CATALOG hudi_hms PROPERTIES ( + 'type'='hms', + 'hive.metastore.uris' = 'thrift://172.21.0.1:7004', + 'hadoop.username' = 'hive', + 'dfs.nameservices'='your-nameservice', + 'dfs.ha.namenodes.your-nameservice'='nn1,nn2', + 'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007', + 'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007', + 'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider' +); +``` + +## Query Operations + +### Basic Query + +Once the Catalog is configured, you can query the tables within the Catalog using the following method: + +```sql +-- 1. switch to catalog, use database and query +SWITCH hudi_ctl; +USE hudi_db; +SELECT * FROM hudi_tbl LIMIT 10; + +-- 2. use hudi database directly +USE hudi_ctl.hudi_db; +SELECT * FROM hudi_tbl LIMIT 10; + +-- 3. use full qualified name to query +SELECT * FROM hudi_ctl.hudi_db.hudi_tbl LIMIT 10; +``` + +### Time Travel + +Every write operation to a Hudi table creates a new snapshot. Doris supports reading a specified snapshot of a Hudi table. By default, query requests only read the latest snapshot. 
+ +You can query the timeline of a specified Hudi table using the `hudi_meta()` table function: + +```sql +SELECT * FROM hudi_meta( + 'table' = 'hudi_ctl.hudi_db.hudi_tbl', + 'query_type' = 'timeline' +); + ++-------------------+--------+--------------------------+-----------+-----------------------+ +| timestamp | action | file_name | state | state_transition_time | ++-------------------+--------+--------------------------+-----------+-----------------------+ +| 20241202171214902 | commit | 20241202171214902.commit | COMPLETED | 20241202171215756 | +| 20241202171217258 | commit | 20241202171217258.commit | COMPLETED | 20241202171218127 | +| 20241202171219557 | commit | 20241202171219557.commit | COMPLETED | 20241202171220308 | +| 20241202171221769 | commit | 20241202171221769.commit | COMPLETED | 20241202171222541 | +| 20241202171224269 | commit | 20241202171224269.commit | COMPLETED | 20241202171224995 | +| 20241202171226401 | commit | 20241202171226401.commit | COMPLETED | 20241202171227155 | +| 20241202171228827 | commit | 20241202171228827.commit | COMPLETED | 20241202171229570 | +| 20241202171230907 | commit | 20241202171230907.commit | COMPLETED | 20241202171231686 | +| 20241202171233356 | commit | 20241202171233356.commit | COMPLETED | 20241202171234288 | +| 20241202171235940 | commit | 20241202171235940.commit | COMPLETED | 20241202171236757 | ++-------------------+--------+--------------------------+-----------+-----------------------+ +``` + +You can use the `FOR TIME AS OF` statement to read historical versions of data based on the snapshot's timestamp. The time format is consistent with the Hudi documentation. Here are some examples: + +```sql +SELECT * FROM hudi_tbl FOR TIME AS OF "2022-10-07 17:20:37"; +SELECT * FROM hudi_tbl FOR TIME AS OF "20221007172037"; +SELECT * FROM hudi_tbl FOR TIME AS OF "2022-10-07"; +``` + +Note that Hudi tables do not support the `FOR VERSION AS OF` statement. Attempting to use this syntax with a Hudi table will result in an error. + +### Incremental Query + +Incremental Read allows querying data changes within a specified time range, returning the final state of the data at the end of that period. + +Doris provides the `@incr` syntax to support Incremental Read: + +```sql +SELECT * from hudi_table@incr('beginTime'='xxx', ['endTime'='xxx'], ['hoodie.read.timeline.holes.resolution.policy'='FAIL'], ...); +``` + +* `beginTime` + + Required. The time format must be consistent with the Hudi official [hudi\_table\_changes](https://hudi.apache.org/docs/0.14.0/quick-start-guide/#incremental-query), supporting "earliest". + +* `endTime` + + Optional, defaults to the latest commitTime. + +You can add more options in the `@incr` function, compatible with [Spark Read Options](https://hudi.apache.org/docs/0.14.0/configurations#Read-Options). + +By using `desc` to view the execution plan, you can see that Doris converts `@incr` into predicates pushed down to `VHUDI_SCAN_NODE`: + +```text +| 0:VHUDI_SCAN_NODE(113) | +| table: lineitem_mor | +| predicates: (_hoodie_commit_time[#0] >= '20240311151019723'), (_hoodie_commit_time[#0] <= '20240311151606605') | +| inputSplitNum=1, totalFileSize=13099711, scanRanges=1 +``` + +## Appendix + +### Change Log + +| Doris Version | Feature Support | +| ------------- | ---------------------------------------------- | +| 2.1.8/3.0.4 | Hudi dependency upgraded to 0.15. Added Hadoop Hudi JNI Scanner. 
| \ No newline at end of file diff --git a/docs/lakehouse/catalogs/iceberg-catalog.md b/docs/lakehouse/catalogs/iceberg-catalog.md index 0e874411c8e..ae651c8b539 100644 --- a/docs/lakehouse/catalogs/iceberg-catalog.md +++ b/docs/lakehouse/catalogs/iceberg-catalog.md @@ -24,5 +24,498 @@ specific language governing permissions and limitations under the License. --> -The document is under development, please refer to versioned doc 2.1 or 3.0 +Doris supports accessing Iceberg table data through various metadata services. In addition to reading data, Doris also supports writing to Iceberg tables. +:::tip +Users can access Iceberg tables using Hive Metastore as metadata through the Hive Catalog. However, it is still recommended to use the Iceberg Catalog directly to avoid some compatibility issues. +::: + +## Applicable Scenarios + +| Scenario | Description | +| ----------- | --------------------------------------------------------------------------- | +| Query Acceleration | Use Doris's distributed computing engine to directly access Iceberg data for faster queries. | +| Data Integration | Read Iceberg data and write it to internal Doris tables, or perform ZeroETL operations using the Doris computing engine. | +| Data Write-back | Process data from any Doris-supported data source and write it back to Iceberg table storage. | + +## Configuring Catalog + +### Syntax + +```sql +CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( + 'type' = 'iceberg', + 'iceberg.catalog.type' = '<iceberg_catalog_type>', + 'warehouse' = '<warehouse>' --optional + {MetaStoreProperties}, + {StorageProperties}, + {CommonProperties} +); +``` + +* `<iceberg_catalog_type>` + + The type of Iceberg Catalog, supporting the following options: + + * `hms`: Uses Hive Metastore as the metadata service. + + * `rest`: Metadata service compatible with the Iceberg Rest Catalog interface. + + * `hadoop`: Directly accesses metadata stored on the file system. + + * `glue`: Uses AWS Glue as the metadata service. + + * `dlf`: Uses Alibaba Cloud DLF as the metadata service. + +* `<warehouse>` + + The warehouse path for Iceberg. This parameter must be specified when `<iceberg_catalog_type>` is `hadoop`. + + The `warehouse` path must point to the level above the `Database` path. For example, if your table path is `s3://bucket/path/to/db1/table1`, then the `warehouse` should be `s3://bucket/path/to/`. + +* `{MetaStoreProperties}` + + The MetaStoreProperties section is for entering connection and authentication information for the Metastore metadata service. Refer to the section on [Supported Metadata Services]. + +* `{StorageProperties}` + + The StorageProperties section is for entering connection and authentication information related to the storage system. Refer to the section on [Supported Storage Systems]. + +* `{CommonProperties}` + + The CommonProperties section is for entering general properties. See the [Catalog Overview](../catalog-overview.md) for details on common properties. + +### Supported Iceberg Versions + +The current Iceberg dependency is version 1.4.3, which is compatible with higher versions of Iceberg. + +### Supported Iceberg Formats + +* Supports Iceberg V1/V2 formats. +* Supports Position Delete and Equality Delete. 
+ +### Supported Metadata Services + +* [Hive Metastore](../metastores/hive-metastore.md) +* [AWS Glue](../metastores/aws-glue.md) +* [Aliyun DLF](../metastores/aliyun-dlf.md) +* [Iceberg Rest Catalog](../metastores/iceberg-rest.md) +* [FileSystem](../metastores/filesystem.md) + +### Supported Storage Systems + +* [HDFS](../storages/hdfs.md) +* [AWS S3](../storages/s3.md) +* [Google Cloud Storage](../storages/google-cloud-storage.md) +* [Aliyun OSS](../storages/aliyun-oss.md) +* [Tencent COS](../storages/tencent-cos.md) +* [Huawei OBS](../storages/huawei-obs.md) +* [MINIO](../storages/minio.md) + +### Supported Data Formats + +* [Parquet](../file-formats/parquet.md) +* [ORC](../file-formats/orc.md) + +## Column Type Mapping + +| Iceberg Type | Doris Type | Comment | +| -------------------------------------- | -------------------- | --------------------------------------- | +| boolean | boolean | | +| integer | int | | +| long | bigint | | +| float | float | | +| double | double | | +| decimal(P, S) | decimal(P, S) | | +| date | date | | +| timestamp (Timestamp without timezone) | datetime(6) | Mapped to datetime with precision 6 | +| timestamptz (Timestamp with timezone) | datetime(6) | Mapped to datetime with precision 6 | +| fixed(N) | char(N) | | +| string | string | | +| binary | string | | +| uuid | string | | +| struct | struct (supported from version 2.1.3) | | +| map | map (supported from version 2.1.3) | | +| list | array | | +| other | UNSUPPORTED | | + +## Examples + +### Iceberg on Hive Metastore + +```sql +CREATE CATALOG iceberg_hms PROPERTIES ( + 'type' = 'iceberg', + 'iceberg.catalog.type' = 'hms', + 'hive.metastore.uris' = 'thrift://hms-host:7004', + 'hadoop.username' = 'hive', + 'dfs.nameservices' = 'your-nameservice', + 'dfs.ha.namenodes.your-nameservice' = 'nn1,nn2', + 'dfs.namenode.rpc-address.your-nameservice.nn1' = '172.21.0.2:4007', + 'dfs.namenode.rpc-address.your-nameservice.nn2' = '172.21.0.3:4007', + 'dfs.client.failover.proxy.provider.your-nameservice' = 'org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider' +); +``` + +### Iceberg on Hadoop + +```sql +CREATE CATALOG iceberg_hadoop PROPERTIES ( + 'type' = 'iceberg', + 'iceberg.catalog.type' = 'hadoop', + 'warehouse' = 'hdfs://namenode:8020/dir/key' +); +``` + +### Iceberg on S3 + +```sql +CREATE CATALOG iceberg_s3 PROPERTIES ( + 'type' = 'iceberg', + 'iceberg.catalog.type' = 'hadoop', + 'warehouse' = 's3://bucket/dir/key', + 's3.endpoint' = 's3.us-east-1.amazonaws.com', + 's3.access_key' = 'ak', + 's3.secret_key' = 'sk' +); +``` + +### Iceberg on Glue + +```sql +-- Using access key and secret key +CREATE CATALOG iceberg_glue PROPERTIES ( + 'type' = 'iceberg', + 'iceberg.catalog.type' = 'glue', + 'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com/', + 'client.credentials-provider' = 'com.amazonaws.glue.catalog.credentials.ConfigAWSProvider', + 'client.credentials-provider.glue.access_key' = 'ak', + 'client.credentials-provider.glue.secret_key' = 'sk' +); +``` + +### Iceberg on DLF + +```sql +CREATE CATALOG iceberg_dlf PROPERTIES ( + 'type' = 'hms', + 'hive.metastore.type' = 'dlf', + 'dlf.proxy.mode' = 'DLF_ONLY', + 'dlf.endpoint' = 'datalake-vpc.cn-beijing.aliyuncs.com', + 'dlf.region' = 'cn-beijing', + 'dlf.uid' = 'uid', + 'dlf.catalog.id' = 'catalog_id', + 'dlf.access_key' = 'ak', + 'dlf.secret_key' = 'sk' +); +``` + +### Iceberg on Rest + +```sql +CREATE CATALOG iceberg_rest PROPERTIES ( + 'type' = 'iceberg', + 'iceberg.catalog.type' = 'rest', + 'uri' = 'http://172.21.0.1:8181' +); 
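-- Depending on the REST service, storage access properties (for example the s3.* keys
-- shown in the MINIO example below) may also be required so that Doris can read the
-- underlying data files.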
+``` + +### Iceberg on Rest with MINIO + +```sql +CREATE CATALOG iceberg_minio PROPERTIES ( + 'type' = 'iceberg', + 'iceberg.catalog.type' = 'rest', + 'uri' = 'http://172.21.0.1:8181', + 's3.access_key' = 'ak', + 's3.secret_key' = 'sk', + 's3.endpoint' = 'http://10.0.0.1:9000', + 's3.region' = 'us-east-1' +); +``` + +### Iceberg on Google Dataproc Metastore + +```sql +CREATE CATALOG iceberg_gdm PROPERTIES ( + 'type' = 'iceberg', + 'iceberg.catalog.type' = 'hms', + 'hive.metastore.uris' = 'thrift://172.21.0.1:9083', + 'gs.endpoint' = 'https://storage.googleapis.com', + 'gs.region' = 'us-east-1', + 'gs.access_key' = 'ak', + 'gs.secret_key' = 'sk', + 'use_path_style' = 'true' +); +``` + +## Query Operations + +### Basic Query + +```sql +-- 1. switch to catalog, use database and query +SWITCH iceberg; +USE iceberg_db; +SELECT * FROM iceberg_tbl LIMIT 10; + +-- 2. use iceberg database directly +USE iceberg.iceberg_db; +SELECT * FROM iceberg_tbl LIMIT 10; + +-- 3. use full qualified name to query +SELECT * FROM iceberg.iceberg_db.iceberg_tbl LIMIT 10; +``` + +### Time Travel + +You can read a specific snapshot of an Iceberg table. + +By default, read requests will access the latest snapshot version. + +You can query a specific snapshot of an Iceberg table using the `iceberg_meta()` table function: + +```sql +SELECT * FROM iceberg_meta( + 'table' = 'iceberg_ctl.iceberg_db.iceberg_tbl', + 'query_type' = 'snapshots' +)\G + +*************************** 1. row *************************** + committed_at: 2024-11-28 11:07:29 + snapshot_id: 8903826400153112036 + parent_id: -1 + operation: append +manifest_list: oss://path/to/metadata/snap-8903826400153112036-1-3835e66d-9a18-4cb0-b9b0-9ec80527ad8d.avro + summary: {"added-data-files":"2","added-records":"3","added-files-size":"2742","changed-partition-count":"2","total-records":"3","total-files-size":"2742","total-data-files":"2","total-delete-files":"0","total-position-deletes":"0","total-equality-deletes":"0"} +*************************** 2. row *************************** + committed_at: 2024-11-28 11:10:11 + snapshot_id: 6099853805930794326 + parent_id: 8903826400153112036 + operation: append +manifest_list: oss://path/to/metadata/snap-6099853805930794326-1-dd46a1bd-219b-4fb0-bb46-ac441d8b3105.avro + summary: {"added-data-files":"1","added-records":"1","added-files-size":"1367","changed-partition-count":"1","total-records":"4","total-files-size":"4109","total-data-files":"3","total-delete-files":"0","total-position-deletes":"0","total-equality-deletes":"0"} +``` + +You can use the `FOR TIME AS OF` and `FOR VERSION AS OF` clauses to read historical data based on snapshot ID or the time the snapshot was created. Here are some examples: + +```sql +-- Read data as of a specific timestamp +SELECT * FROM iceberg_table FOR TIME AS OF '2023-01-01 00:00:00'; + +-- Read data as of a specific snapshot ID +SELECT * FROM iceberg_table FOR VERSION AS OF 123456789; +``` + +## Write Operations + +### INSERT INTO + +The INSERT operation appends data to the target table. + +For example: + +```sql +INSERT INTO iceberg_tbl VALUES (val1, val2, val3, val4); +INSERT INTO iceberg.iceberg_db.iceberg_tbl SELECT col1, col2 FROM internal.db1.tbl1; + +INSERT INTO iceberg_tbl(col1, col2) VALUES (val1, val2); +INSERT INTO iceberg_tbl(col1, col2, partition_col1, partition_col2) VALUES (1, 2, 'beijing', '2023-12-12'); +``` + +### INSERT OVERWRITE + +The INSERT OVERWRITE operation completely replaces the existing data in the table with new data. 
+ +```sql +INSERT OVERWRITE TABLE iceberg_tbl VALUES (val1, val2, val3, val4); +INSERT OVERWRITE TABLE iceberg.iceberg_db.iceberg_tbl(col1, col2) SELECT col1, col2 FROM internal.db1.tbl1; +``` + +### CTAS + +You can create an Iceberg table and write data using the `CTAS` (Create Table As Select) statement: + +```sql +CREATE TABLE iceberg_ctas AS SELECT * FROM other_table; +``` + +CTAS supports specifying file formats, partitioning, and other properties: + +```sql +CREATE TABLE iceberg_ctas +PARTITION BY LIST (pt1, pt2) () +AS SELECT col1, pt1, pt2 FROM part_ctas_src WHERE col1 > 0; + +CREATE TABLE iceberg.iceberg_db.iceberg_ctas (col1, col2, pt1) +PARTITION BY LIST (pt1) () +PROPERTIES ( + 'write-format'='parquet', + 'compression-codec'='zstd' +) +AS SELECT col1, pt1 AS col2, pt2 AS pt1 FROM test_ctas.part_ctas_src WHERE col1 > 0; +``` + +### Related Parameters + +* BE (Backend) + + | Parameter Name | Default Value | Description | + | ----------------------------------------------------------------------------- | ------------- | ----------- | + | `iceberg_sink_max_file_size` | 1GB | Maximum data file size. When the written data exceeds this size, the current file is closed and a new file is created to continue writing. | + | `table_sink_partition_write_max_partition_nums_per_writer` | 128 | Maximum number of partitions each instance can write to on a BE node. | + | `table_sink_non_partition_write_scaling_data_processed_threshold` | 25MB | Data threshold for starting scaling-write in non-partitioned tables. A new writer (instance) is used for every additional `table_sink_non_partition_write_scaling_data_processed_threshold` of data. This mechanism adjusts the number of writers based on data volume to enhance throughput while conserving resources and minimizing file numbers for smaller data volumes. | + | `table_sink_partition_write_min_data_processed_rebalance_threshold` | 25MB | Minimum data volume threshold to trigger rebalancing for partitioned tables. Rebalancing starts if `current accumulated data volume` - `data volume since last rebalancing` >= `table_sink_partition_write_min_data_processed_rebalance_threshold`. Lowering this threshold can improve balance if file size differences are significant, but may increase rebalancing costs and impact performance. | + | `table_sink_partition_write_min_partition_data_processed_rebalance_threshold` | | Minimum partition data volume threshold to trigger rebalancing. Rebalancing starts if `current partition data volume` >= `threshold` \* `number of tasks already allocated to the partition`. Lowering this threshold can improve balance if file size differences are significant, but may increase rebalancing costs and impact performance. 
| + +## Database and Table Management + +### Creating and Dropping Databases + +You can switch to the desired catalog using the `SWITCH` statement and execute the `CREATE DATABASE` command: + +```sql +SWITCH iceberg; +CREATE DATABASE [IF NOT EXISTS] iceberg_db; +``` + +You can also create a database using a fully qualified name or specify a location (currently, only HMS-type catalogs support specifying a location), such as: + +```sql +CREATE DATABASE [IF NOT EXISTS] iceberg.iceberg_db; + +CREATE DATABASE [IF NOT EXISTS] iceberg.iceberg_db +PROPERTIES ('location'='hdfs://172.21.16.47:4007/path/to/db/'); +``` + +You can view the database's location information using the `SHOW CREATE DATABASE` command: + +```sql +mysql> SHOW CREATE DATABASE iceberg_db; ++-------------+-------------------------------------------------------------------------------------------------+ +| Database | Create Database | ++-------------+-------------------------------------------------------------------------------------------------+ +| iceberg_db | CREATE DATABASE iceberg_db LOCATION 'hdfs://172.21.16.47:4007/usr/hive/warehouse/iceberg_db.db' | ++-------------+-------------------------------------------------------------------------------------------------+ +``` + +To drop a database: + +```sql +DROP DATABASE [IF EXISTS] iceberg.iceberg_db; +``` + +:::caution +For an Iceberg Database, you must first delete all tables under the database before you can delete the database itself; otherwise, an error will occur. +::: + +### Creating and Dropping Tables + +* Creating Tables + + Doris supports creating both partitioned and non-partitioned tables in Iceberg. + + For example: + + ```sql + -- Create unpartitioned iceberg table + CREATE TABLE unpartitioned_table ( + `col1` BOOLEAN COMMENT 'col1', + `col2` INT COMMENT 'col2', + `col3` BIGINT COMMENT 'col3', + `col4` FLOAT COMMENT 'col4', + `col5` DOUBLE COMMENT 'col5', + `col6` DECIMAL(9,4) COMMENT 'col6', + `col7` STRING COMMENT 'col7', + `col8` DATE COMMENT 'col8', + `col9` DATETIME COMMENT 'col9' + ) + PROPERTIES ( + 'write-format'='parquet' + ); + + -- Create partitioned iceberg table + -- The partition columns must be in table's column definition list + CREATE TABLE partition_table ( + `ts` DATETIME COMMENT 'ts', + `col1` BOOLEAN COMMENT 'col1', + `col2` INT COMMENT 'col2', + `col3` BIGINT COMMENT 'col3', + `col4` FLOAT COMMENT 'col4', + `col5` DOUBLE COMMENT 'col5', + `col6` DECIMAL(9,4) COMMENT 'col6', + `col7` STRING COMMENT 'col7', + `col8` DATE COMMENT 'col8', + `col9` DATETIME COMMENT 'col9', + `pt1` STRING COMMENT 'pt1', + `pt2` STRING COMMENT 'pt2' + ) + PARTITION BY LIST (day(ts), pt1, pt2) () + PROPERTIES ( + 'write-format'='orc', + 'compression-codec'='zlib' + ); + ``` + + After creation, you can use the `SHOW CREATE TABLE` command to view the Iceberg table creation statement. For details about partition functions, see the [Partitioning](#) section. + +* Dropping Tables + + You can drop an Iceberg table using the `DROP TABLE` statement. Dropping a table will also remove its data, including partition data. + + For example: + + ```sql + DROP TABLE [IF EXISTS] iceberg_tbl; + ``` + +* Column Type Mapping + + Refer to the [Column Type Mapping](#) section. + +* Partitioning + + Partition types in Iceberg correspond to List partitions in Doris. Therefore, when creating an Iceberg partitioned table in Doris, you should use the List partitioning syntax, but you don't need to explicitly enumerate each partition. 
Doris will automatically create the corresponding Iceberg partitions based on the data values during data insertion. + + * Supports creating single-column or multi-column partitioned tables. + + * Supports partition transformation functions to enable Iceberg implicit partitioning and partition evolution. For specific Iceberg partition transformation functions, see [Iceberg partition transforms](https://iceberg.apache.org/spec/#partition-transforms): + + * `year(ts)` or `years(ts)` + + * `month(ts)` or `months(ts)` + + * `day(ts)` or `days(ts)` or `date(ts)` + + * `hour(ts)` or `hours(ts)` or `date_hour(ts)` + + * `bucket(N, col)` + + * `truncate(L, col)` + +* File Formats + + * Parquet (default) + + * ORC + +* Compression Formats + + * Parquet: snappy, zstd (default), plain (no compression). + + * ORC: snappy, zlib (default), zstd, plain (no compression). + +* Storage Medium + + * HDFS + + * Object storage + +## Appendix + +### Change Log + +| Doris Version | Feature Support | +| -------------- | -------------------------------------- | +| 2.1.3 | Support for ORC file format, Equality Delete | +| 2.1.6 | Support for DDL, DML | \ No newline at end of file diff --git a/docs/lakehouse/catalogs/jdbc-catalog-overview.md b/docs/lakehouse/catalogs/jdbc-catalog-overview.md index 6be0ac5bb1c..45c80a031a6 100644 --- a/docs/lakehouse/catalogs/jdbc-catalog-overview.md +++ b/docs/lakehouse/catalogs/jdbc-catalog-overview.md @@ -1,6 +1,6 @@ --- { - "title": "JDBC Catalog Overview", + "title": "JDBC Catalog", "language": "en" } --- diff --git a/docs/lakehouse/catalogs/paimon-catalog.md b/docs/lakehouse/catalogs/paimon-catalog.md index 413431cdb08..3b68b1ff892 100644 --- a/docs/lakehouse/catalogs/paimon-catalog.md +++ b/docs/lakehouse/catalogs/paimon-catalog.md @@ -24,5 +24,191 @@ specific language governing permissions and limitations under the License. --> -The document is under development, please refer to versioned doc 2.1 or 3.0 +Doris currently supports accessing Paimon table metadata through various metadata services and querying Paimon data. + +At present, only read operations on Paimon tables are supported. Write operations to Paimon tables will be supported in the future. + +## Applicable Scenarios + +| Scenario | Description | +| ------------ | ------------------------------------------------------------ | +| Query Acceleration | Use Doris's distributed computing engine to directly access Paimon data for query acceleration. | +| Data Integration | Read Paimon data and write it into Doris internal tables, or perform ZeroETL operations using the Doris computing engine. | +| Data Write-back | Not supported yet. | + +## Configuring Catalog + +### Syntax + +```sql +CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( + 'type' = 'paimon', + 'paimon.catalog.type' = '<paimon_catalog_type>', + 'warehouse' = '<paimon_warehouse>' + {MetaStoreProperties}, + {StorageProperties}, + {CommonProperties} +); +``` + +* `<paimon_catalog_type>` + + The type of Paimon Catalog, supporting the following: + + * `filesystem`: Default. Directly accesses metadata stored on the file system. + + * `hms`: Uses Hive Metastore as the metadata service. + + * `dlf`: Uses Alibaba Cloud DLF as the metadata service. + +* `<paimon_warehouse>` + + The warehouse path for Paimon. This parameter must be specified when `<paimon_catalog_type>` is `filesystem`. + + The `warehouse` path must point to the level above the `Database` path. 
For example, if your table path is: `s3://bucket/path/to/db1/table1`, then `warehouse` should be: `s3://bucket/path/to/`. + +* `{MetaStoreProperties}` + + The MetaStoreProperties section is used to fill in connection and authentication information for the Metastore metadata service. Refer to the section on [Supported Metadata Services] for details. + +* `{StorageProperties}` + + The StorageProperties section is used to fill in connection and authentication information related to the storage system. Refer to the section on [Supported Storage Systems] for details. + +* `{CommonProperties}` + + The CommonProperties section is used to fill in common properties. Please refer to the [Catalog Overview](../catalog-overview.md) section on [Common Properties]. + +### Supported Paimon Versions + +The currently dependent Paimon version is 0.8.1. Higher versions of Paimon tables are supported for reading. + +### Supported Paimon Formats + +* Supports reading Paimon Deletion Vector + +### Supported Metadata Services + +* [Hive Metastore](../metastores/hive-metastore.md) + +* [Aliyun DLF](../metastores/aliyun-dlf.md) + +* [FileSystem](../metastores/filesystem.md) + +### Supported Storage Systems + +* [HDFS](../storages/hdfs.md) + +* [AWS S3](../storages/s3.md) + +* [Google Cloud Storage](../storages/google-cloud-storage.md) + +* [Alibaba Cloud OSS](../storages/aliyun-oss.md) + +* [Tencent Cloud COS](../storages/tencent-cos.md) + +* [Huawei Cloud OBS](../storages/huawei-obs.md) + +* [MINIO](../storages/minio.md) + +### Supported Data Formats + +* [Parquet](../file-formats/parquet.md) + +* [ORC](../file-formats/orc.md) + +## Column Type Mapping + +| Paimon Type | Doris Type | Comment | +| ---------------------------------- | ------------- | ----------------------------------------------------------------------- | +| boolean | boolean | | +| tinyint | tinyint | | +| smallint | smallint | | +| integer | int | | +| bigint | bigint | | +| float | float | | +| double | double | | +| decimal(P, S) | decimal(P, S) | | +| varchar | string | | +| char | string | | +| binary | string | | +| varbinary | string | | +| date | date | | +| timestamp\_without\_time\_zone | datetime(N) | Mapped according to precision. If precision is greater than 6, it maps to a maximum of 6 (may cause precision loss). | +| timestamp\_with\_local\_time\_zone | datetime(N) | Mapped according to precision. If precision is greater than 6, it maps to a maximum of 6 (may cause precision loss). 
| +| array | array | | +| map | map | | +| row | struct | | +| other | UNSUPPORTED | | + +## Examples + +### Paimon on HDFS + +```sql +CREATE CATALOG paimon_hdfs PROPERTIES ( + 'type' = 'paimon', + 'warehouse' = 'hdfs://HDFS8000871/user/paimon', + 'dfs.nameservices' = 'HDFS8000871', + 'dfs.ha.namenodes.HDFS8000871' = 'nn1,nn2', + 'dfs.namenode.rpc-address.HDFS8000871.nn1' = '172.21.0.1:4007', + 'dfs.namenode.rpc-address.HDFS8000871.nn2' = '172.21.0.2:4007', + 'dfs.client.failover.proxy.provider.HDFS8000871' = 'org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider', + 'hadoop.username' = 'hadoop' +); +``` + +### Paimon on HMS + +```sql +CREATE CATALOG paimon_hms PROPERTIES ( + 'type' = 'paimon', + 'paimon.catalog.type' = 'hms', + 'warehouse' = 'hdfs://HDFS8000871/user/zhangdong/paimon2', + 'hive.metastore.uris' = 'thrift://172.21.0.44:7004', + 'dfs.nameservices' = 'HDFS8000871', + 'dfs.ha.namenodes.HDFS8000871' = 'nn1,nn2', + 'dfs.namenode.rpc-address.HDFS8000871.nn1' = '172.21.0.1:4007', + 'dfs.namenode.rpc-address.HDFS8000871.nn2' = '172.21.0.2:4007', + 'dfs.client.failover.proxy.provider.HDFS8000871' = 'org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider', + 'hadoop.username' = 'hadoop' +); +``` + +### Paimon on DLF + +```sql +CREATE CATALOG paimon_dlf PROPERTIES ( + 'type' = 'paimon', + 'paimon.catalog.type' = 'dlf', + 'warehouse' = 'oss://xx/yy/', + 'dlf.proxy.mode' = 'DLF_ONLY', + 'dlf.uid' = 'xxxxx', + 'dlf.region' = 'cn-beijing', + 'dlf.access_key' = 'ak', + 'dlf.secret_key' = 'sk' +); +``` + +## Query Operations + +### Basic Query + +Once the Catalog is configured, you can query the table data in the Catalog as follows: + +```sql +-- 1. Switch to catalog, use database, and query +SWITCH paimon_ctl; +USE paimon_db; +SELECT * FROM paimon_tbl LIMIT 10; + +-- 2. Use Paimon database directly +USE paimon_ctl.paimon_db; +SELECT * FROM paimon_tbl LIMIT 10; + +-- 3. Use fully qualified name to query +SELECT * FROM paimon_ctl.paimon_db.paimon_tbl LIMIT 10; +``` + +## Appendix diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current.json b/i18n/zh-CN/docusaurus-plugin-content-docs/current.json index b22ef352426..eb8aa06c547 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current.json +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current.json @@ -143,9 +143,9 @@ "message": "湖仓一体", "description": "The label for category Data Lakehouse in sidebar docs" }, - "sidebar.docs.category.Catalogs": { + "sidebar.docs.category.Data Catalogs": { "message": "数据目录", - "description": "The label for category Catalogs in sidebar docs" + "description": "The label for category Data Catalogs in sidebar docs" }, "sidebar.docs.category.Metastores": { "message": "元数据服务", diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalog-overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalog-overview.md index ab5cfc8e859..d6d76667761 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalog-overview.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalog-overview.md @@ -24,7 +24,7 @@ specific language governing permissions and limitations under the License. --> -数据目录(Catalog)用于描述一个数据源的属性。 +数据目录(Data Catalog)用于描述一个数据源的属性。 在 Doris 中,可以创建多个数据目录指向不同的数据源(如 Hive、Iceberg、MySQL)。Doris 会通过数据目录,自动获取对应数据源的库、表、Schema、分区、数据位置等。用户可以通过标准的 SQL 语句访问这些数据目录进行数据分析,并且可以对多个数据目录中的数据进行关联查询。 @@ -101,7 +101,7 @@ SELECT k1, k4 FROM table; -- Query OK. 
创建后,可以通过 `SHOW CATALOGS` 命令查看 catalog: -```plain text +```text mysql> SHOW CATALOGS; +-----------+-----------------+----------+-----------+-------------------------+---------------------+------------------------+ | CatalogId | CatalogName | Type | IsCurrent | CreateTime | LastUpdateTime | Comment | diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hive-catalog.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hive-catalog.md index 8ee11401447..c035d30e1a3 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hive-catalog.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hive-catalog.md @@ -110,7 +110,7 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( * [ MINIO](../storages/minio.md) -> 如果需要通过 Doris 创建 Hive 表或写入数据,需要在 Catalog 属性中显式增加 `fs.defaultFS` 属性。如果创建 Catalog 仅用于查询,则该参数可以省略。 +> 如果需要通过 Doris 创建 Hive 表并写入数据,需要在 Catalog 属性中显式增加 `fs.defaultFS` 属性。如果创建 Catalog 仅用于查询,则该参数可以省略。 ### 支持的数据格式 @@ -355,7 +355,7 @@ PROPERTIES ( AS SELECT col1,pt1 as col2,pt2 as pt1 FROM test_ctas.part_ctas_src WHERE col1>0; ``` -### 相关参数(Configurations) +### 相关参数 * BE diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hudi-catalog.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hudi-catalog.md index 258c4533e36..89b33ec2738 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hudi-catalog.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hudi-catalog.md @@ -49,19 +49,19 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( ); ``` -* {MetaStoreProperties} +* `{MetaStoreProperties}` MetaStoreProperties 部分用于填写 Metastore 元数据服务连接和认证信息。具体可参阅【支持的元数据服务】部分。 -* {StorageProperties} +* `{StorageProperties}` StorageProperties 部分用于填写存储系统相关的连接和认证信息。具体可参阅【支持的存储系统】部分。 -* {CommonProperties} +* `{CommonProperties}` CommonProperties 部分用于填写通用属性。请参阅[ 数据目录概述 ](../catalog-overview.md)中【通用属性】部分。 -* {HudiProperties} +* `{HudiProperties}` | 参数名称 | 曾用名 | 说明 | 默认值 | | ------------------------------- | -------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | ----- | @@ -167,10 +167,25 @@ SELECT * FROM hudi_ctl.hudi_db.hudi_tbl LIMIT 10; 可以通过 `hudi_meta()` 表函数查询查询指定 Hudi 表的 Timeline: ```sql -SELECT * FROM iceberg_meta( +SELECT * FROM hudi_meta( 'table' = 'hudi_ctl.hudi_db.hudi_tbl', 'query_type' = 'timeline' ); + ++-------------------+--------+--------------------------+-----------+-----------------------+ +| timestamp | action | file_name | state | state_transition_time | ++-------------------+--------+--------------------------+-----------+-----------------------+ +| 20241202171214902 | commit | 20241202171214902.commit | COMPLETED | 20241202171215756 | +| 20241202171217258 | commit | 20241202171217258.commit | COMPLETED | 20241202171218127 | +| 20241202171219557 | commit | 20241202171219557.commit | COMPLETED | 20241202171220308 | +| 20241202171221769 | commit | 20241202171221769.commit | COMPLETED | 20241202171222541 | +| 20241202171224269 | commit | 20241202171224269.commit | COMPLETED | 20241202171224995 | +| 20241202171226401 | commit | 20241202171226401.commit | COMPLETED | 20241202171227155 | +| 20241202171228827 | commit | 20241202171228827.commit | COMPLETED | 20241202171229570 | +| 20241202171230907 | commit | 20241202171230907.commit | COMPLETED | 20241202171231686 | +| 
20241202171233356 | commit | 20241202171233356.commit | COMPLETED | 20241202171234288 | +| 20241202171235940 | commit | 20241202171235940.commit | COMPLETED | 20241202171236757 | ++-------------------+--------+--------------------------+-----------+-----------------------+ ``` 可以使用 `FOR TIME AS OF` 语句,根据快照的时间 ([时间格式](https://hudi.apache.org/docs/0.14.0/quick-start-guide/#timetravel)和 Hudi 官网保持一致) 读取历史版本的数据。示例如下: @@ -212,7 +227,7 @@ SELECT * from hudi_table@incr('beginTime'='xxx', ['endTime'='xxx'], ['hoodie.rea | inputSplitNum=1, totalFileSize=13099711, scanRanges=1 ``` -## 附录(Appendix) +## 附录 ### 版本更新记录 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/jdbc-catalog-overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/jdbc-catalog-overview.md index 2a510d07d67..aa230022efb 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/jdbc-catalog-overview.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/jdbc-catalog-overview.md @@ -1,6 +1,6 @@ --- { - "title": "JDBC Catalog 概述", + "title": "JDBC Catalog", "language": "zh-CN" } --- diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/paimon-catalog.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/paimon-catalog.md index 6314977c64b..5380c53fc8f 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/paimon-catalog.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/paimon-catalog.md @@ -26,7 +26,7 @@ under the License. Doris 支持通过多种元数据服务访问 Paimon 表元数据,并进行 Paimon 数据查询。 -目前只支持 Paimon 表的读操作,暂时不支持的写入 Paimon 表。 +目前只支持 Paimon 表的读操作,未来会支持的写入 Paimon 表。 ## 适用场景 @@ -34,7 +34,7 @@ Doris 支持通过多种元数据服务访问 Paimon 表元数据,并进行 Pa | ---- | ------------------------------------------------------ | | 查询加速 | 利用 Doris 分布式计算引擎,直接访问 Paimon 数据进行查询加速。 | | 数据集成 | 读取 Paimon 数据并写入到 Doris 内表。或通过 Doris 计算引擎进行 ZeroETL 操作。 | -| 数据写回 | 不支持。 | +| 数据写回 | 暂不支持。 | ## 配置 Catalog @@ -51,7 +51,7 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( ); ``` -* \<paimon\_catalog\_type> +* `<paimon_catalog_type>` Paimon Catalog 的类型,支持以下几种: @@ -61,21 +61,21 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( * `dlf`:使用阿里云 DLF 作为元数据服务。 -* \<paimon\_warehouse> +* `<paimon_warehouse>` Paimon 的仓库路径。当 `<paimon_catalog_type>` 为 `filesystem` 时,需指定这个参数。 - `warehouse` 的路径必须指向 `Database` 路径的上一级。如你的表路径是:`s3://bucket/path/to/db1/table1`,那么 `warehouse` 应该是:`s3://bucket/path/to/`。 + `warehouse` 的路径必须指向 `Database` 路径的上一级。如您的表路径是:`s3://bucket/path/to/db1/table1`,那么 `warehouse` 应该是:`s3://bucket/path/to/`。 -* {MetaStoreProperties} +* `{MetaStoreProperties}` MetaStoreProperties 部分用于填写 Metastore 元数据服务连接和认证信息。具体可参阅【支持的元数据服务】部分。 -* {StorageProperties} +* `{StorageProperties}` StorageProperties 部分用于填写存储系统相关的连接和认证信息。具体可参阅【支持的存储系统】部分。 -* {CommonProperties} +* `{CommonProperties}` CommonProperties 部分用于填写通用属性。请参阅[ 数据目录概述 ](../catalog-overview.md)中【通用属性】部分。 @@ -210,5 +210,5 @@ SELECT * FROM paimon_tbl LIMIT 10; SELECT * FROM paimon_ctl.paimon_db.paimon_tbl LIMIT 10; ``` -## 附录(Appendix) +## 附录 diff --git a/sidebars.json b/sidebars.json index 3b986e7d69e..1591422e881 100644 --- a/sidebars.json +++ b/sidebars.json @@ -370,7 +370,7 @@ "lakehouse/catalog-overview", { "type": "category", - "label": "Catalogs", + "label": "Data Catalogs", "items": [ "lakehouse/catalogs/hive-catalog", "lakehouse/catalogs/iceberg-catalog", --------------------------------------------------------------------- To 
unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org