This is an automated email from the ASF dual-hosted git repository.

jiafengzheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git
The following commit(s) were added to refs/heads/master by this push: new b621d1d68a [docs](docs) update en docs (#16257) b621d1d68a is described below commit b621d1d68a00d2c402228e95fe57fdb9a52eff89 Author: Hu Yanjun <100749531+httpshir...@users.noreply.github.com> AuthorDate: Fri Feb 3 22:01:43 2023 +0800 [docs](docs) update en docs (#16257) --- docs/en/docs/lakehouse/external-table/jdbc.md | 245 ++++++++++++- docs/en/docs/lakehouse/file.md | 111 +++++- docs/en/docs/lakehouse/multi-catalog/es.md | 418 ++++++++++++++++++++++- docs/en/docs/lakehouse/multi-catalog/faq.md | 20 +- docs/en/docs/lakehouse/multi-catalog/jdbc.md | 313 ++++++++++++++++- docs/zh-CN/docs/lakehouse/external-table/jdbc.md | 2 +- 6 files changed, 1103 insertions(+), 6 deletions(-) diff --git a/docs/en/docs/lakehouse/external-table/jdbc.md b/docs/en/docs/lakehouse/external-table/jdbc.md index 95a08174d8..8344503cac 100644 --- a/docs/en/docs/lakehouse/external-table/jdbc.md +++ b/docs/en/docs/lakehouse/external-table/jdbc.md @@ -28,7 +28,250 @@ under the License. <version deprecated="1.2.2"> -Please use [JDBC Catalog](../multi-catalog/jdbc) to visit JDBC data source. +Please use [JDBC Catalog](https://doris.apache.org/docs/dev/lakehouse/multi-catalog/jdbc/) to access JDBC data sources. </version> +<version since="1.2.0"> + +By creating JDBC External Tables, Doris can access external tables via JDBC, the standard database access inferface. This allows Doris to visit various databases without tedious data ingestion, and give full play to its own OLAP capabilities to perform data analysis on external tables: + +1. Multiple data sources can be connected to Doris; +2. It enables Join queries across Doris and other data sources and thus allows more complex analysis. + +This topic introduces how to use JDBC External Tables in Doris. + +</version> + +### Create JDBC External Table in Doris + +See [CREATE TABLE](https://doris.apache.org/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE/) for syntax details. + +#### 1. Create JDBC External Table by Creating JDBC_Resource + +```sql +CREATE EXTERNAL RESOURCE jdbc_resource +properties ( + "type"="jdbc", + "user"="root", + "password"="123456", + "jdbc_url"="jdbc:mysql://192.168.0.1:3306/test?useCursorFetch=true", + "driver_url"="http://IP:port/mysql-connector-java-5.1.47.jar", + "driver_class"="com.mysql.jdbc.Driver" +); + +CREATE EXTERNAL TABLE `baseall_mysql` ( + `k1` tinyint(4) NULL, + `k2` smallint(6) NULL, + `k3` int(11) NULL, + `k4` bigint(20) NULL, + `k5` decimal(9, 3) NULL +) ENGINE=JDBC +PROPERTIES ( +"resource" = "jdbc_resource", +"table" = "baseall", +"table_type"="mysql" +); +``` + +Parameter Description: + +| Parameter | Description | +| ---------------- | ------------------------------------------------------------ | +| **type** | "jdbc"; required; specifies the type of the Resource | +| **user** | Username for accessing the external database | +| **password** | Password of the user | +| **jdbc_url** | JDBC URL protocol, including the database type, IP address, port number, and database name; Please be aware of the different formats of different database protocols. For example, MySQL: "jdbc:mysql://127.0.0.1:3306/test?useCursorFetch=true". | +| **driver_class** | Class of the driver used to access the external database. For example, to access MySQL data: com.mysql.jdbc.Driver. 
| +| **driver_url** | Driver URL for downloading the Jar file package that is used to access the external database, for example, http://IP:port/mysql-connector-java-5.1.47.jar. For local stand-alone testing, you can put the Jar file package in a local path: "driver_url"="file:///home/disk1/pathTo/mysql-connector-java-5.1.47.jar"; for local multi-machine testing, please ensure the consistency of the paths. | +| **resource** | Name of the Resource that the Doris External Table depends on; should be the same as the name set in Resource creation. | +| **table** | Name of the external table to be mapped in Doris | +| **table_type** | The database from which the external table comes, such as mysql, postgresql, sqlserver, and oracle. | + +> **Note:** +> +> For local testing, please make sure you put the Jar file package in the FE and BE nodes, too. + +<version since="1.2.1"> + +> In Doris 1.2.1 and newer versions, if you have put the driver in the `jdbc_drivers` directory of FE/BE, you can simply specify the file name in the driver URL: `"driver_url" = "mysql-connector-java-5.1.47.jar"`, and the system will automatically find the file in the `jdbc_drivers` directory. + +</version> + +### Query + +``` +select * from mysql_table where k1 > 1000 and k3 ='term'; +``` + +In some cases, the keywords in the database might be used as the field names. For queries to function normally in these cases, Doris will add escape characters to the field names and tables names in SQL statements based on the rules of different databases, such as (``) for MySQL, ([]) for SQLServer, and ("") for PostgreSQL and Oracle. This might require extra attention on case sensitivity. You can view the query statements sent to these various databases via ```explain sql```. + +### Write Data + +After creating a JDBC External Table in Doris, you can write data or query results to it using the `insert into` statement. You can also ingest data from one JDBC External Table to another JDBC External Table. + + +``` +insert into mysql_table values(1, "doris"); +insert into mysql_table select * from table; +``` + +#### Transaction + +In Doris, data is written to External Tables in batches. If the ingestion process is interrupted, rollbacks might be required. That's why JDBC External Tables support data writing transactions. You can utilize this feature by setting the session variable: `enable_odbc_transcation ` (ODBC transactions are also controlled by this variable). + +``` +set enable_odbc_transcation = true; +``` + +The transaction mechanism ensures the atomicity of data writing to JDBC External Tables, but it reduces performance to a certain extent. You may decide whether to enable transactions based on your own tradeoff. 
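For example, a minimal sketch of a transactional batch write, reusing the `baseall_mysql` table created above (the inserted values are purely illustrative):

```sql
-- Enable write transactions for this session, then write a small batch
-- to the JDBC external table created earlier.
set enable_odbc_transcation = true;
insert into baseall_mysql values (1, 1, 1, 1, 1.000), (2, 2, 2, 2, 2.000);
```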
+ +#### 1.MySQL Test + +| MySQL Version | MySQL JDBC Driver Version | +| ------------- | ------------------------------- | +| 8.0.30 | mysql-connector-java-5.1.47.jar | + +#### 2.PostgreSQL Test + +| PostgreSQL Version | PostgreSQL JDBC Driver Version | +| ------------------ | ------------------------------ | +| 14.5 | postgresql-42.5.0.jar | + +```sql +CREATE EXTERNAL RESOURCE jdbc_pg +properties ( + "type"="jdbc", + "user"="postgres", + "password"="123456", + "jdbc_url"="jdbc:postgresql://127.0.0.1:5442/postgres?currentSchema=doris_test", + "driver_url"="http://127.0.0.1:8881/postgresql-42.5.0.jar", + "driver_class"="org.postgresql.Driver" +); + +CREATE EXTERNAL TABLE `ext_pg` ( + `k1` int +) ENGINE=JDBC +PROPERTIES ( +"resource" = "jdbc_pg", +"table" = "pg_tbl", +"table_type"="postgresql" +); +``` + +#### 3.SQLServer Test + +| SQLServer Version | SQLServer JDBC Driver Version | +| ----------------- | ----------------------------- | +| 2022 | mssql-jdbc-11.2.0.jre8.jar | + +#### 4.Oracle Test + +| Oracle Version | Oracle JDBC Driver Version | +| -------------- | -------------------------- | +| 11 | ojdbc6.jar | + +Test information on more versions will be provided in the future. + +#### 5.ClickHouse Test + +| ClickHouse Version | ClickHouse JDBC Driver Version | +| ------------------ | ------------------------------------- | +| 22 | clickhouse-jdbc-0.3.2-patch11-all.jar | + + +## Type Mapping + +The followings list how data types in different databases are mapped in Doris. + +### MySQL + +| MySQL | Doris | +| :-------------: | :------: | +| BOOLEAN | BOOLEAN | +| BIT(1) | BOOLEAN | +| TINYINT | TINYINT | +| SMALLINT | SMALLINT | +| INT | INT | +| BIGINT | BIGINT | +| BIGINT UNSIGNED | LARGEINT | +| VARCHAR | VARCHAR | +| DATE | DATE | +| FLOAT | FLOAT | +| DATETIME | DATETIME | +| DOUBLE | DOUBLE | +| DECIMAL | DECIMAL | + + +### PostgreSQL + +| PostgreSQL | Doris | +| :--------: | :------: | +| BOOLEAN | BOOLEAN | +| SMALLINT | SMALLINT | +| INT | INT | +| BIGINT | BIGINT | +| VARCHAR | VARCHAR | +| DATE | DATE | +| TIMESTAMP | DATETIME | +| REAL | FLOAT | +| FLOAT | DOUBLE | +| DECIMAL | DECIMAL | + +### Oracle + +| Oracle | Doris | +| :------: | :------: | +| VARCHAR | VARCHAR | +| DATE | DATETIME | +| SMALLINT | SMALLINT | +| INT | INT | +| REAL | DOUBLE | +| FLOAT | DOUBLE | +| NUMBER | DECIMAL | + + +### SQL server + +| SQLServer | Doris | +| :-------: | :------: | +| BIT | BOOLEAN | +| TINYINT | TINYINT | +| SMALLINT | SMALLINT | +| INT | INT | +| BIGINT | BIGINT | +| VARCHAR | VARCHAR | +| DATE | DATE | +| DATETIME | DATETIME | +| REAL | FLOAT | +| FLOAT | DOUBLE | +| DECIMAL | DECIMAL | + +### ClickHouse + +| ClickHouse | Doris | +| :--------: | :------: | +| BOOLEAN | BOOLEAN | +| CHAR | CHAR | +| VARCHAR | VARCHAR | +| STRING | STRING | +| DATE | DATE | +| Float32 | FLOAT | +| Float64 | DOUBLE | +| Int8 | TINYINT | +| Int16 | SMALLINT | +| Int32 | INT | +| Int64 | BIGINT | +| Int128 | LARGEINT | +| DATETIME | DATETIME | +| DECIMAL | DECIMAL | + +Note: + +- Some data types in ClickHouse, such as UUID, IPv4, IPv6, and Enum8, will be mapped to Varchar/String in Doris. IPv4 and IPv6 will be displayed with an `/` as a prefix. You can use the `split_part` function to remove the `/` . +- The Point Geo type in ClickHouse cannot be mapped in Doris by far. + +## Q&A + +See the FAQ section in [JDBC](https://doris.apache.org/docs/dev/lakehouse/multi-catalog/jdbc/). 
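One related usage note worth illustrating here: the ClickHouse mapping section above mentions that IPv4/IPv6 values arrive in Doris with a leading `/`, which can be removed with `split_part`. A hedged sketch, with hypothetical table and column names:

```sql
-- Hypothetical example: client_ip is a ClickHouse IPv4 field mapped to
-- VARCHAR, so values read like "/10.0.0.1"; split_part keeps the part
-- after the leading "/".
select split_part(client_ip, '/', 2) as client_ip from ext_clickhouse;
```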
+ diff --git a/docs/en/docs/lakehouse/file.md b/docs/en/docs/lakehouse/file.md index 4d3c0fed9a..4d12847fc6 100644 --- a/docs/en/docs/lakehouse/file.md +++ b/docs/en/docs/lakehouse/file.md @@ -27,5 +27,114 @@ under the License. # File Analysis -TODO: translate +<version since="1.2.0"> + +With the Table Value Function feature, Doris is able to query files in object storage or HDFS as simply as querying Tables. In addition, it supports automatic column type inference. + +</version> + +## Usage + +For more usage details, please see the documentation: + +* [S3](https://doris.apache.org/docs/dev/sql-manual/sql-functions/table-functions/s3/): supports file analysis on object storage compatible with S3 +* [HDFS](https://doris.apache.org/docs/dev/sql-manual/sql-functions/table-functions/hdfs/): supports file analysis on HDFS + +The followings illustrate how file analysis is conducted with the example of S3 Table Value Function. + +### Automatic Column Type Inference + +``` +MySQL [(none)]> DESC FUNCTION s3( + "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet", + "ACCESS_KEY"= "minioadmin", + "SECRET_KEY" = "minioadmin", + "Format" = "parquet", + "use_path_style"="true"); ++---------------+--------------+------+-------+---------+-------+ +| Field | Type | Null | Key | Default | Extra | ++---------------+--------------+------+-------+---------+-------+ +| p_partkey | INT | Yes | false | NULL | NONE | +| p_name | TEXT | Yes | false | NULL | NONE | +| p_mfgr | TEXT | Yes | false | NULL | NONE | +| p_brand | TEXT | Yes | false | NULL | NONE | +| p_type | TEXT | Yes | false | NULL | NONE | +| p_size | INT | Yes | false | NULL | NONE | +| p_container | TEXT | Yes | false | NULL | NONE | +| p_retailprice | DECIMAL(9,0) | Yes | false | NULL | NONE | +| p_comment | TEXT | Yes | false | NULL | NONE | ++---------------+--------------+------+-------+---------+-------+ +``` + +An S3 Table Value Function is defined as follows: + +``` +s3( + "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet", + "ACCESS_KEY"= "minioadmin", + "SECRET_KEY" = "minioadmin", + "Format" = "parquet", + "use_path_style"="true") +``` + +It specifies the file path, connection, and authentication. + +After defining, you can view the schema of this file using the `DESC FUNCTION` statement. + +As can be seen, Doris is able to automatically infer column types based on the metadata of the Parquet file. + +Besides Parquet, Doris supports analysis and auto column type inference of ORC, CSV, and Json files. + +### Query and Analysis + +You can conduct queries and analysis on this Parquet file using any SQL statements: + +``` +SELECT * FROM s3( + "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet", + "ACCESS_KEY"= "minioadmin", + "SECRET_KEY" = "minioadmin", + "Format" = "parquet", + "use_path_style"="true") +LIMIT 5; ++-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+ +| p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment | ++-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+ +| 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. 
slyly ironi | +| 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo | +| 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag | +| 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r | +| 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully | ++-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+ +``` + +You can put the Table Value Function anywhere that you used to put Table in the SQL, such as in the WITH or FROM clause in CTE. In this way, you can treat the file as a normal table and conduct analysis conveniently. + +### Data Ingestion + +Users can ingest files into Doris tables via `INSERT INTO SELECT` for faster file analysis: + +``` +// 1. Create Doris internal table +CREATE TABLE IF NOT EXISTS test_table +( + id int, + name varchar(50), + age int +) +DISTRIBUTED BY HASH(id) BUCKETS 4 +PROPERTIES("replication_num" = "1"); + +// 2. Insert data using S3 Table Value Function +INSERT INTO test_table (id,name,age) +SELECT cast(id as INT) as id, name, cast (age as INT) as age +FROM s3( + "uri" = "${uri}", + "ACCESS_KEY"= "${ak}", + "SECRET_KEY" = "${sk}", + "format" = "${format}", + "strip_outer_array" = "true", + "read_json_by_line" = "true", + "use_path_style" = "true"); +``` diff --git a/docs/en/docs/lakehouse/multi-catalog/es.md b/docs/en/docs/lakehouse/multi-catalog/es.md index 80d24db103..0ad296899b 100644 --- a/docs/en/docs/lakehouse/multi-catalog/es.md +++ b/docs/en/docs/lakehouse/multi-catalog/es.md @@ -26,4 +26,420 @@ under the License. # Elasticsearch -TODO: translate +Elasticsearch (ES) Catalogs in Doris support auto-mapping of ES metadata. Users can utilize the full-text search capability of ES in combination of the distributed query planning capability of Doris to provide a full-fledged OLAP solution that is able to perform: + +1. Multi-index distributed Join queries in ES; +2. Join queries across Doris and ES as well as full-text search and filter. + +## Usage + +1. Doris supports Elasticsearch 5.x and newer versions. + +## Create Catalog + +```sql +CREATE CATALOG es PROPERTIES ( + "type"="es", + "hosts"="http://127.0.0.1:9200" +); +``` + +Since there is no concept of "database" in ES, after connecting to ES, Doris will automatically generate a unique database: `default_db`. + +After switching to the ES Catalog, you will be in the `dafault_db` so you don't need to execute the `USE default_db` command. + +### Parameter Description + +| Parameter | Required or Not | Default Value | Description | +| ----------------- | --------------- | ------------- | ------------------------------------------------------------ | +| `hosts` | Yes | | ES address, can be one or multiple addresses, or the load balancer address of ES | +| `user` | No | Empty | ES username | +| `password` | No | Empty | Password of the corresponding user | +| `doc_value_scan` | No | true | Whether to obtain value of the target field by ES/Lucene columnar storage | +| `keyword_sniff` | No | true | Whether to sniff the text.fields in ES based on keyword; If this is set to false, the system will perform matching after tokenization. 
| +| `nodes_discovery` | No | true | Whether to enable ES node discovery, set to true by default; set to false in network isolation environments and only connected to specified nodes | +| `ssl` | No | false | Whether to enable HTTPS access mode for ES, currently follows a "Trust All" method in FE/BE | +| `mapping_es_id` | No | false | Whether to map the `_id` field in the ES index | + +> 1. In terms of authentication, only HTTP Basic authentication is supported and it requires the user to have read privilege for the index and paths including `/_cluster/state/` and `_nodes/http` ; if you have not enabled security authentication for the cluster, you don't need to set the `user` and `password`. +> +> 2. If there are multiple types in the index in 5.x and 6.x, the first type is taken by default. + +## Column Type Mapping + +| ES Type | Doris Type | Comment | +| ------------- | ----------- | ------- | +| null | null | | +| boolean | boolean | | +| byte | tinyint | | +| short | smallint | | +| integer | int | | +| long | bigint | | +| unsigned_long | largeint | | +| float | float | | +| half_float | float | | +| double | double | | +| scaled_float | double | | +| date | date | | +| keyword | string | | +| text | string | | +| ip | string | | +| nested | string | | +| object | string | | +| other | unsupported | | + +## Best Practice + +### Predicate Pushdown + +ES Catalogs support predicate pushdown to ES, which means only the filtered data will be returned. This can markedly improve query performance and reduce usage of CPU, memory, and IO in both Doris and ES. + +For the sake of optimization, operators will be converted into the following ES queries: + +| SQL syntax | ES 5.x+ syntax | +| -------------- | :--------------------------: | +| = | term query | +| in | terms query | +| > , < , >= , ⇐ | range query | +| and | bool.filter | +| or | bool.should | +| not | bool.must_not | +| not in | bool.must_not + terms query | +| is\_not\_null | exists query | +| is\_null | bool.must_not + exists query | +| esquery | ES-native JSON QueryDSL | + +### Columnar Scan for Faster Queries (enable\_docvalue\_scan=true) + +Set `"enable_docvalue_scan" = "true"`. + +After this, when obtaining data from ES, Doris will follow these rules: + +* **Try and see**: Doris will automatically check if columnar storage is enabled for the target fields (doc_value: true), if it is, Doris will obtain all values in the fields from the columnar storage. +* **Auto-downgrading**: If any one of the target fields is not available in columnar storage, Doris will parse and obtain all target data from row storage (`_source`). + +**Benefits** + +By default, Doris On ES obtains all target columns from `_source`, which is in row storage and JSON format. Compared to columnar storage, `_source` is slow in batch read. In particular, when the system only needs to read small number of columns, the performance of `docvalue` can be about a dozen times faster than that of `_source`. + +**Note** + +1. Columnar storage is not available for `text` fields in ES. Thus, if you need to obtain fields containing `text` values, you will need to obtain them from `_source`. +2. When obtaining large numbers of fields (`>= 25`), the performances of `docvalue` and `_source` are basically equivalent. + +### Sniff Keyword Fields + +Set `"enable_keyword_sniff" = "true"`. + +ES allows direct data ingestion without an index since it will automatically create an index after ingestion. For string fields, ES will create a field with both `text` and `keyword` types. 
This is how the Multi-Field feature of ES works. The mapping is as follows: + +``` +"k4": { + "type": "text", + "fields": { + "keyword": { + "type": "keyword", + "ignore_above": 256 + } + } +} +``` + +For example, to conduct "=" filtering on `k4`, Doris on ES will convert the filtering operation into an ES TermQuery. + +The original SQL filter: + +``` +k4 = "Doris On ES" +``` + +The converted ES query DSL: + +``` +"term" : { + "k4": "Doris On ES" + +} +``` + +Since the first field of `k4` is `text`, it will be tokenized by the analyzer set for `k4` (or by the standard analyzer if no analyzer has been set for `k4`) after data ingestion. As a result, it will be tokenized into three terms: "Doris", "on", and "ES". + +The details are as follows: + +``` +POST /_analyze +{ + "analyzer": "standard", + "text": "Doris On ES" +} +``` + +The tokenization results: + +``` +{ + "tokens": [ + { + "token": "doris", + "start_offset": 0, + "end_offset": 5, + "type": "<ALPHANUM>", + "position": 0 + }, + { + "token": "on", + "start_offset": 6, + "end_offset": 8, + "type": "<ALPHANUM>", + "position": 1 + }, + { + "token": "es", + "start_offset": 9, + "end_offset": 11, + "type": "<ALPHANUM>", + "position": 2 + } + ] +} +``` + +If you conduct a query as follows: + +``` +"term" : { + "k4": "Doris On ES" +} +``` + +Since there is no term in the dictionary that matches the term `Doris On ES`, no result will be returned. + +However, if you have set `enable_keyword_sniff: true`, the system will convert `k4 = "Doris On ES"` to `k4.keyword = "Doris On ES"` to match the SQL semantics. The converted ES query DSL will be: + +``` +"term" : { + "k4.keyword": "Doris On ES" +} +``` + +`k4.keyword` is of `keyword` type, so the data is written in ES as a complete term, allowing for successful matching. + +### Auto Node Discovery, Set to True by Default (nodes\_discovery=true) + +Set `"nodes_discovery" = "true"`. + +Then, Doris will discover all available data nodes (the allocated shards) in ES. If Doris BE hasn't accessed the ES data node addresses, then set `"nodes_discovery" = "false"` . ES clusters are deployed in private networks that are isolated from public Internet, so users will need proxy access. + +### HTTPS Access Mode for ES Clusters + +Set `"ssl" = "true"`. + +A temporary solution is to implement a "Trust All" method in FE/BE. In the future, the real user configuration certificates will be used. + +### Query Usage + +You can use the ES external tables in Doris the same way as using Doris internal tables, except that the Doris data models (Rollup, Pre-Aggregation, and Materialized Views) are unavailable. + +#### Basic Query + +``` +select * from es_table where k1 > 1000 and k3 ='term' or k4 like 'fu*z_' +``` + +#### Extended esquery(field, QueryDSL) + +The `esquery(field, QueryDSL)` function can be used to push queries that cannot be expressed in SQL, such as `match_phrase` and `geoshape` , to ES for filtering. + +In `esquery`, the first parameter (the column name) is used to associate with `index`, while the second parameter is the JSON expression of basic `Query DSL` in ES, which is surrounded by `{}`. The `root key` in JSON is unique, which can be `match_phrase`, `geo_shape` or `bool` , etc. 
+ +A `match_phrase` query: + +``` +select * from es_table where esquery(k4, '{ + "match_phrase": { + "k4": "doris on es" + } + }'); +``` + +A `geo` query: + +``` +select * from es_table where esquery(k4, '{ + "geo_shape": { + "location": { + "shape": { + "type": "envelope", + "coordinates": [ + [ + 13, + 53 + ], + [ + 14, + 52 + ] + ] + }, + "relation": "within" + } + } + }'); +``` + +A `bool` query: + +``` +select * from es_table where esquery(k4, ' { + "bool": { + "must": [ + { + "terms": { + "k1": [ + 11, + 12 + ] + } + }, + { + "terms": { + "k2": [ + 100 + ] + } + } + ] + } + }'); +``` + +### Suggestions for Time Fields + +> These are only applicable for ES external tables. Time fields will be automatically mapped to Date or Datetime type in ES Catalogs. + +ES boasts flexible usage of time fields, but in ES external tables, improper type setting of time fields will result in predicate pushdown failures. + +It is recommended to allow the highest level of format compatibility for time fields when creating an index: + +``` + "dt": { + "type": "date", + "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis" + } +``` + +When creating this field in Doris, it is recommended to set its type to `date` or `datetime` (or `varchar` ) . You can use the following SQL statements to push the filters down to ES. + +``` +select * from doe where k2 > '2020-06-21'; + +select * from doe where k2 < '2020-06-21 12:00:00'; + +select * from doe where k2 < 1593497011; + +select * from doe where k2 < now(); + +select * from doe where k2 < date_format(now(), '%Y-%m-%d'); +``` + +Note: + +* The default format of time fields in ES is: + +``` +strict_date_optional_time||epoch_millis +``` + +* Timestamps ingested into ES need to be converted into `ms`, which is the internal processing format in ES; otherwise errors will occur in ES external tables. + +### Obtain ES Metadata Field `_id` + +Each ingested files, if not specified with an `_id` , will be given a globally unique `_id`, which is the primary key. Users can assign an `_id` with unique business meanings to the files during ingestion. + +To obtain such field values from ES external tables, you can add an `_id` field of `varchar` type when creating tables. + +``` +CREATE EXTERNAL TABLE `doe` ( + `_id` varchar COMMENT "", + `city` varchar COMMENT "" +) ENGINE=ELASTICSEARCH +PROPERTIES ( +"hosts" = "http://127.0.0.1:8200", +"user" = "root", +"password" = "root", +"index" = "doe" +} +``` + +To obtain such field values from ES Catalogs, please set `"mapping_es_id" = "true"`. + +Note: + +1. The `_id` field only supports `=` and `in` filtering. +2. The`_id` field must be of `varchar` type. + +## FAQ + +1. Are X-Pack authenticated ES clusters supported? + + All ES clusters with HTTP Basic authentications are supported. + +2. Why are some queries require longer response time than those in ES? + + For `_count ` queries, ES can directly read the metadata regarding the number of the specified files instead of filtering the original data. This is a huge time saver. + +3. Can aggregation operations be pushed down? + + Currently, Doris On ES does not support pushdown for aggregations such as sum, avg, and min/max. In such operations, Doris obtains all files that met the specified conditions from ES and then conducts computing internally. 
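   To make this concrete, a hedged sketch reusing the `es_table` example from above: the filter is pushed down to ES, while the aggregation itself is computed inside Doris on the returned rows.

   ```
   -- k3 = 'term' becomes an ES term query; sum() and group by run in Doris.
   select k1, sum(k2) from es_table where k3 = 'term' group by k1;
   ```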
+ + +## Appendix + +### How Doris Conducts Queries in ES + +``` ++----------------------------------------------+ +| | +| Doris +------------------+ | +| | FE +--------------+-------+ +| | | Request Shard Location +| +--+-------------+-+ | | +| ^ ^ | | +| | | | | +| +-------------------+ +------------------+ | | +| | | | | | | | | +| | +----------+----+ | | +--+-----------+ | | | +| | | BE | | | | BE | | | | +| | +---------------+ | | +--------------+ | | | ++----------------------------------------------+ | + | | | | | | | + | | | | | | | + | HTTP SCROLL | | HTTP SCROLL | | ++-----------+---------------------+------------+ | +| | v | | v | | | +| | +------+--------+ | | +------+-------+ | | | +| | | | | | | | | | | +| | | DataNode | | | | DataNode +<-----------+ +| | | | | | | | | | | +| | | +<--------------------------------+ +| | +---------------+ | | |--------------| | | | +| +-------------------+ +------------------+ | | +| Same Physical Node | | +| | | +| +-----------------------+ | | +| | | | | +| | MasterNode +<-----------------+ +| ES | | | +| +-----------------------+ | ++----------------------------------------------+ + + +``` + +1. Doris FE sends a request to the specified host for table creation in order to obtain information about the HTTP port and the index shard allocation. + +2. Based on the information about node and index metadata from FE, Doris generates a query plan and send it to the corresponding BE node. + +3. Following the principle of proximity, the BE node sends request to the locally deployed ES node, and obtain data from `_source` or `docvalue` from each shard of ES index concurrently by way of `HTTP Scroll`. + +4. Doris returns the computing results to the user. diff --git a/docs/en/docs/lakehouse/multi-catalog/faq.md b/docs/en/docs/lakehouse/multi-catalog/faq.md index 0a5d39be6b..bfe06dc27b 100644 --- a/docs/en/docs/lakehouse/multi-catalog/faq.md +++ b/docs/en/docs/lakehouse/multi-catalog/faq.md @@ -27,4 +27,22 @@ under the License. # FAQ -TODO: translate +1. What to do with errors such as `failed to get schema` and `Storage schema reading not supported` when accessing Icerberg tables via Hive Metastore? + + To fix this, please place the Jar file package of `iceberg` runtime in the `lib/` directory of Hive. + + And configure as follows in `hive-site.xml` : + + ``` + metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader + ``` + + After configuring, please restart Hive Metastore. + +2. What to do with the `GSS initiate failed` error when connecting to Hive Metastore with Kerberos authentication? + + In Doris 1.2.1 and the older versions, gsasl is disabled for libhdfs3, so please update to Doris 1.2.2 or newer. + +3. What to do with the`java.lang.VerifyError: xxx` error when accessing HDFS 3.x? + + Doris 1.2.1 and the older versions rely on Hadoop 2.8. Please update Hadoop to 2.10.2 or update Doris to 1.2.2 or newer. diff --git a/docs/en/docs/lakehouse/multi-catalog/jdbc.md b/docs/en/docs/lakehouse/multi-catalog/jdbc.md index 8e1ae204cc..bf8d765ce4 100644 --- a/docs/en/docs/lakehouse/multi-catalog/jdbc.md +++ b/docs/en/docs/lakehouse/multi-catalog/jdbc.md @@ -27,4 +27,315 @@ under the License. # JDBC -TODO: translate +JDBC Catalogs in Doris are connected to external data sources using the standard JDBC protocol. + +Once connected, Doris will ingest metadata of databases and tables from the external data sources in order to enable quick access to external data. + +## Usage + +1. 
Supported datas sources include MySQL, PostgreSQL, Oracle, SQLServer, and Clickhouse. + +## Create Catalog + +1. MySQL + + ```sql + CREATE CATALOG jdbc_mysql PROPERTIES ( + "type"="jdbc", + "user"="root", + "password"="123456", + "jdbc_url" = "jdbc:mysql://127.0.0.1:3306/demo", + "driver_url" = "mysql-connector-java-5.1.47.jar", + "driver_class" = "com.mysql.jdbc.Driver" + ) + ``` + +2. PostgreSQL + + ```sql + CREATE CATALOG jdbc_postgresql PROPERTIES ( + "type"="jdbc", + "user"="root", + "password"="123456", + "jdbc_url" = "jdbc:postgresql://127.0.0.1:5449/demo", + "driver_url" = "postgresql-42.5.1.jar", + "driver_class" = "org.postgresql.Driver" + ); + ``` + + As for data mapping from PostgreSQL to Doris, one Database in Doris corresponds to one schema in the specified database in PostgreSQL (for example, "demo" in `jdbc_url` above), and one Table in that Database corresponds to one table in that schema. To make it more intuitive, the mapping relations are as follows: + + | Doris | PostgreSQL | + | -------- | ---------- | + | Catalog | Database | + | Database | Schema | + | Table | Table | + +3. Oracle + + ```sql + CREATE CATALOG jdbc_oracle PROPERTIES ( + "type"="jdbc", + "user"="root", + "password"="123456", + "jdbc_url" = "jdbc:oracle:thin:@127.0.0.1:1521:helowin", + "driver_url" = "ojdbc6.jar", + "driver_class" = "oracle.jdbc.driver.OracleDriver" + ); + ``` + + As for data mapping from Oracle to Doris, one Database in Doris corresponds to one User (for example, "helowin" in `jdbc_url` above), and one Table in that Database corresponds to one table that the User has access to. In conclusion, the mapping relations are as follows: + + | Doris | PostgreSQL | + | -------- | ---------- | + | Catalog | Database | + | Database | User | + | Table | Table | + +4. Clickhouse + + ```sql + CREATE CATALOG jdbc_clickhouse PROPERTIES ( + "type"="jdbc", + "user"="root", + "password"="123456", + "jdbc_url" = "jdbc:clickhouse://127.0.0.1:8123/demo", + "driver_url" = "clickhouse-jdbc-0.3.2-patch11-all.jar", + "driver_class" = "com.clickhouse.jdbc.ClickHouseDriver" + ); + ``` + +5. SQLServer + + ```sql + CREATE CATALOG sqlserver_catalog PROPERTIES ( + "type"="jdbc", + "user"="SA", + "password"="Doris123456", + "jdbc_url" = "jdbc:sqlserver://localhost:1433;DataBaseName=doris_test", + "driver_url" = "mssql-jdbc-11.2.3.jre8.jar", + "driver_class" = "com.microsoft.sqlserver.jdbc.SQLServerDriver" + ); + ``` + + As for data mapping from SQLServer to Doris, one Database in Doris corresponds to one schema in the specified database in SQLServer (for example, "doris_test" in `jdbc_url` above), and one Table in that Database corresponds to one table in that schema. The mapping relations are as follows: + + | Doris | SQLServer | + | -------- | --------- | + | Catalog | Database | + | Database | Schema | + | Table | Table | + +### Parameter Description + +| Parameter | Required or Not | Default Value | Description | +| --------------- | --------------- | ------------- | -------------------------------------------------- | +| `user` | Yes | | Username in relation to the corresponding database | +| `password` | Yes | | Password for the corresponding database | +| `jdbc_url ` | Yes | | JDBC connection string | +| `driver_url ` | Yes | | JDBC Driver Jar | +| `driver_class ` | Yes | | JDBC Driver Class | + +> `driver_url` can be specified in three ways: +> +> 1. File name. For example, `mysql-connector-java-5.1.47.jar`. 
Please place the Jar file package in `jdbc_drivers/` under the FE/BE deployment directory in advance so the system can locate the file. You can change the location of the file by modifying `jdbc_drivers_dir` in fe.conf and be.conf. +> +> 2. Local absolute path. For example, `file:///path/to/mysql-connector-java-5.1.47.jar`. Please place the Jar file package in the specified paths of FE/BE node. +> +> 3. HTTP address. For example, `https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/jdbc_driver/mysql-connector-java-5.1.47.jar`. The system will download the Driver file from the HTTP address. This only supports HTTP services with no authentication requirements. + +## Column Type Mapping + +### MySQL + +| MYSQL Type | Doris Type | Comment | +| ------------------------------------------------------------ | ----------- | ------------------------------------------------------------ | +| BOOLEAN | BOOLEAN | | +| TINYINT | TINYINT | | +| SMALLINT | SMALLINT | | +| MEDIUMINT | INT | | +| INT | INT | | +| BIGINT | BIGINT | | +| UNSIGNED TINYINT | SMALLINT | Doris does not support UNSIGNED data types so UNSIGNED TINYINT will be mapped to SMALLINT. | +| UNSIGNED MEDIUMINT | INT | Doris does not support UNSIGNED data types so UNSIGNED MEDIUMINT will be mapped to INT. | +| UNSIGNED INT | BIGINT | Doris does not support UNSIGNED data types so UNSIGNED INT will be mapped to BIGINT. | +| UNSIGNED BIGINT | STRING | | +| FLOAT | FLOAT | | +| DOUBLE | DOUBLE | | +| DECIMAL | DECIMAL | | +| DATE | DATE | | +| TIMESTAMP | DATETIME | | +| DATETIME | DATETIME | | +| YEAR | SMALLINT | | +| TIME | STRING | | +| CHAR | CHAR | | +| VARCHAR | STRING | | +| TINYTEXT、TEXT、MEDIUMTEXT、LONGTEXT、TINYBLOB、BLOB、MEDIUMBLOB、LONGBLOB、TINYSTRING、STRING、MEDIUMSTRING、LONGSTRING、BINARY、VARBINARY、JSON、SET、BIT | STRING | | +| Other | UNSUPPORTED | | + +### PostgreSQL + + POSTGRESQL Type | Doris Type | Comment | +|---|---|---| +| boolean | BOOLEAN | | +| smallint/int2 | SMALLINT | | +| integer/int4 | INT | | +| bigint/int8 | BIGINT | | +| decimal/numeric | DECIMAL | | +| real/float4 | FLOAT | | +| double precision | DOUBLE | | +| smallserial | SMALLINT | | +| serial | INT | | +| bigserial | BIGINT | | +| char | CHAR | | +| varchar/text | STRING | | +| timestamp | DATETIME | | +| date | DATE | | +| time | STRING | | +| interval | STRING | | +| point/line/lseg/box/path/polygon/circle | STRING | | +| cidr/inet/macaddr | STRING | | +| bit/bit(n)/bit varying(n) | STRING | `bit ` will be mapped to `STRING` in Doris. 
It will be read as `true/false` instead of `1/0` | +| uuid/josnb | STRING | | +|Other| UNSUPPORTED | + +### Oracle + +| ORACLE Type | Doris Type | Comment | +| ----------------------------- | ----------- | ------------------------------------------------------------ | +| number(p) / number(p,0) | | Doris will determine the type to map to based on the value of p: `p < 3` -> `TINYINT`; `p < 5` -> `SMALLINT`; `p < 10` -> `INT`; `p < 19` -> `BIGINT`; `p > 19` -> `LARGEINT` | +| number(p,s) | DECIMAL | | +| decimal | DECIMAL | | +| float/real | DOUBLE | | +| DATE | DATETIME | | +| TIMESTAMP | DATETIME | | +| CHAR/NCHAR | STRING | | +| VARCHAR2/NVARCHAR2 | STRING | | +| LONG/ RAW/ LONG RAW/ INTERVAL | STRING | | +| Other | UNSUPPORTED | | + +### SQLServer + +| SQLServer Type | Doris Type | Comment | +| -------------------------------------- | ----------- | ------------------------------------------------------------ | +| bit | BOOLEAN | | +| tinyint | SMALLINT | The tinyint type in SQLServer is an unsigned number so it will be mapped to SMALLINT in Doris. | +| smallint | SMALLINT | | +| int | INT | | +| bigint | BIGINT | | +| real | FLOAT | | +| float/money/smallmoney | DOUBLE | | +| decimal/numeric | DECIMAL | | +| date | DATE | | +| datetime/datetime2/smalldatetime | DATETIMEV2 | | +| char/varchar/text/nchar/nvarchar/ntext | STRING | | +| binary/varbinary | STRING | | +| time/datetimeoffset | STRING | | +| Other | UNSUPPORTED | | + + +### Clickhouse + +| ClickHouse Type | Doris Type | Comment | +| ---------------------- | ----------- | ------------------------------------------------------------ | +| Bool | BOOLEAN | | +| String | STRING | | +| Date/Date32 | DATE | | +| DateTime/DateTime64 | DATETIME | Data beyond the maximum precision of DateTime in Doris will be truncated. | +| Float32 | FLOAT | | +| Float64 | DOUBLE | | +| Int8 | TINYINT | | +| Int16/UInt8 | SMALLINT | Doris does not support UNSIGNED data types so UInt8 will be mapped to SMALLINT. | +| Int32/UInt16 | INT | Doris does not support UNSIGNED data types so UInt16 will be mapped to INT. | +| Int64/Uint32 | BIGINT | Doris does not support UNSIGNED data types so UInt32 will be mapped to BIGINT. | +| Int128/UInt64 | LARGEINT | Doris does not support UNSIGNED data types so UInt64 will be mapped to LARGEINT. | +| Int256/UInt128/UInt256 | STRING | Doris does not support data types of such orders of magnitude so these will be mapped to STRING. | +| DECIMAL | DECIMAL | Data beyond the maximum decimal precision in Doris will be truncated. | +| Enum/IPv4/IPv6/UUID | STRING | Data of IPv4 and IPv6 type will be displayed with an extra `/` as a prefix. To remove the `/`, you can use the `split_part`function. | +| Other | UNSUPPORTED | | + +## FAQ + +1. Are there any other databases supported besides MySQL, Oracle, PostgreSQL, SQLServer, and ClickHouse? + + Currently, Doris supports MySQL, Oracle, PostgreSQL, SQLServer, and ClickHouse. We are planning to expand this list. Technically, any databases that support JDBC access can be connected to Doris in the form of JDBC external tables. You are more than welcome to be a Doris contributor to expedite this effort. + +2. Why does Mojibake occur when Doris tries to read emojis from MySQL external tables? + + In MySQL, utf8mb3 is the default utf8 format. It cannot represent emojis, which require 4-byte encoding. 
To solve this, when creating MySQL external tables, you need to set utf8mb4 encoding for the corresponding columns, set the server encoding to utf8mb4, and leave the characterEncoding in JDBC URl empty (because utf8mb4 is not supported for this property, and anything other than utf8mb4 will cause a failure to write the emojis). + + You can modify the configuration items globally: + + ``` + Modify the my.ini file in the mysql directory (for linux system, modify the my.cnf file in the etc directory) + [client] + default-character-set=utf8mb4 + + [mysql] + Set the mysql default character set + default-character-set=utf8mb4 + + [mysqld] + Set the mysql character set server + character-set-server=utf8mb4 + collation-server=utf8mb4_unicode_ci + init_connect='SET NAMES utf8mb4 + + Modify the type for the corresponding tables and columns + ALTER TABLE table_name MODIFY colum_name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; + ALTER TABLE table_name CHARSET=utf8mb4; + SET NAMES utf8mb4 + ``` + +3. Why does the error message "CAUSED BY: DataReadException: Zero date value prohibited" pop up when DateTime="0000:00:00 00:00:00" while reading MySQL external tables? + + This error occurs because of an illegal DateTime. It can be fixed by modifying the `zeroDateTimeBehavior` parameter. + + The options for this parameter include: `EXCEPTION`,`CONVERT_TO_NULL`,`ROUND`. Respectively, they mean to report error, convert to null, and round the DateTime to "0001-01-01 00:00:00" when encountering an illegal DateTime. + + You can add `"jdbc_url"="jdbc:mysql://IP:PORT/doris_test?zeroDateTimeBehavior=convertToNull"` to the URL. + +4. Why do loading failures happen when reading MySQL or other external tables? + + For example: + + ``` + failed to load driver class com.mysql.jdbc.driver in either of hikariconfig class loader + ``` + + Such errors occur because the `driver_class` has been wrongly put when creating the Resource. The problem with the above example is the letter case. It should be corrected as `"driver_class" = "com.mysql.jdbc.Driver"`. + +5. How to fix communication link failures? + + If you run into the following errors: + + ``` + ERROR 1105 (HY000): errCode = 2, detailMessage = PoolInitializationException: Failed to initialize pool: Communications link failure + + The last packet successfully received from the server was 7 milliseconds ago. The last packet sent successfully to the server was 4 milliseconds ago. + CAUSED BY: CommunicationsException: Communications link failure + + The last packet successfully received from the server was 7 milliseconds ago. The last packet sent successfully to the server was 4 milliseconds ago. + CAUSED BY: SSLHandshakeExcepti + ``` + + Please check the be.out log of BE. + + If it contains the following message: + + ``` + WARN: Establishing SSL connection without server's identity verification is not recommended. + According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. + For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. + You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification. + ``` + + You can add `?useSSL=false` to the end of the JDBC connection string when creating Catalog. For example, `"jdbc_url" = "jdbc:mysql://127.0.0.1:3306/test?useSSL=false"`. + +6. 
What to do with the `OutOfMemoryError` when querying MySQL databases? + + To reduce memory usage, Doris obtains one batch of query results at a time, and has a size limit for each batch. However, MySQL conducts one-off loading of all query results by default, which means the "loading in batches" method won't work. To solve this, you need to specify "jdbc_url"="jdbc:mysql://IP:PORT/doris_test?useCursorFetch=true" in the URL. + + 7. What to do with errors such as "CAUSED BY: SQLException OutOfMemoryError" when performing JDBC queries? + + If you have set `useCursorFetch` for MySQL, you can increase the JVM memory limit by modifying the value of `jvm_max_heap_size` in be.conf. The current default value is 1024M. diff --git a/docs/zh-CN/docs/lakehouse/external-table/jdbc.md b/docs/zh-CN/docs/lakehouse/external-table/jdbc.md index 4c53d27243..58454ab353 100644 --- a/docs/zh-CN/docs/lakehouse/external-table/jdbc.md +++ b/docs/zh-CN/docs/lakehouse/external-table/jdbc.md @@ -168,7 +168,7 @@ PROPERTIES ( 目前只测试了这一个版本其他版本测试后补充 -#### 4.ClickHouse测试 +#### 5.ClickHouse测试 | ClickHouse版本 | ClickHouse JDBC驱动版本 | |----------| ------------------- | | 22 | clickhouse-jdbc-0.3.2-patch11-all.jar | --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org