wutiangan commented on a change in pull request #3867: URL: https://github.com/apache/incubator-doris/pull/3867#discussion_r440105196
########## File path: docs/en/extending-doris/doris-on-es.md ########## @@ -26,62 +26,314 @@ under the License. # Doris On ES -Doris-On-ES combines Doris's distributed query planning capability with ES (Elastic search)'s full-text search capability to provide a more complete OLAP scenario solution: +Doris-On-ES not only take advantage of Doris's distributed query planning capability but also ES (Elastic search)'s full-text search capability, provide a more complete OLAP scenario solution: 1. Multi-index Distributed Join Query in ES 2. Joint Query of Tables in Doris and ES, More Complex Full-Text Retrieval and Filtering -3. Aggregated queries for fields of ES keyword type: suitable for frequent changes in index, tens of millions or more of single fragmented documents, and the cardinality of the field is very large This document mainly introduces the realization principle and usage of this function. -## Noun Interpretation +## Glossary + +### Noun in Doris * FE: Frontend, the front-end node of Doris. Responsible for metadata management and request access. * BE: Backend, Doris's back-end node. Responsible for query execution and data storage. -* Elastic search (ES): The most popular open source distributed search engine. + +### Noun in ES + * DataNode: The data storage and computing node of ES. * MasterNode: The Master node of ES, which manages metadata, nodes, data distribution, etc. * scroll: The built-in data set cursor feature of ES for streaming scanning and filtering of data. +* _source: contains the original JSON document body that was passed at index time +* doc_values: store the same values as the _source but in a column-oriented fashion +* keyword: string datatype in ES, but the content not analyzed by analyzer +* text: string datatype in ES, the content analyzed by analyzer -## How to use it +## How To Use -### Create appearance +### Create ES Index ``` -CREATE EXTERNAL TABLE `es_table` ( - `id` bigint(20) COMMENT "", +PUT test +{ + "settings": { + "index": { + "number_of_shards": "1", + "number_of_replicas": "0" + } + }, + "mappings": { + "doc": { // ES 7.x版本之后创建索引时不需要指定type,会有一个默认且唯一的`_doc` type + "properties": { + "k1": { + "type": "long" + }, + "k2": { + "type": "date" + }, + "k3": { + "type": "keyword" + }, + "k4": { + "type": "text", + "analyzer": "standard" + }, + "k5": { + "type": "float" + } + } + } + } +} +``` + +### Add JSON documents to ES index + +``` +POST /_bulk +{"index":{"_index":"test","_type":"doc"}} +{ "k1" : 100, "k2": "2020-01-01", "k3": "Trying out Elasticsearch", "k4": "Trying out Elasticsearch", "k5": 10.0} +{"index":{"_index":"test","_type":"doc"}} +{ "k1" : 100, "k2": "2020-01-01", "k3": "Trying out Doris", "k4": "Trying out Doris", "k5": 10.0} +{"index":{"_index":"test","_type":"doc"}} +{ "k1" : 100, "k2": "2020-01-01", "k3": "Doris On ES", "k4": "Doris On ES", "k5": 10.0} +{"index":{"_index":"test","_type":"doc"}} +{ "k1" : 100, "k2": "2020-01-01", "k3": "Doris", "k4": "Doris", "k5": 10.0} +{"index":{"_index":"test","_type":"doc"}} +{ "k1" : 100, "k2": "2020-01-01", "k3": "ES", "k4": "ES", "k5": 10.0} +``` + +### Create external ES table + +``` +CREATE EXTERNAL TABLE `test` ( + `k1` bigint(20) COMMENT "", + `k2` datetime COMMENT "", + `k3` varchar(20) COMMENT "", + `k4` varchar(100) COMMENT "", + `k5` float COMMENT "" +) ENGINE=ELASTICSEARCH // ENGINE必须是Elasticsearch +PROPERTIES ( +"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200", +"index" = "test”, +"type" = "doc", + +"user" = "root", +"password" = "root" +); +``` + +The following parameters are accepted by ES table: + +参数 | 说明 +---|--- +**hosts** | ES Cluster Connection Address, maybe one or more node, load-balance is also accepted +**index** | the related ES index name +**type** | the type for this index,If not specified, `_doc` will be used +**user** | username for ES +**password** | password for the user + +* For clusters before 7.x, please pay attention to choosing the correct type when building the table +* The authentication method only supports Http Bastic authentication, and the fully open source ES cluster user and password do not need to be specified,Need to ensure that the user has access to /_cluster/state and other path permissions and read permissions to index +* The column names in the Doris table need to exactly match the field names in the ES, and the field types should be as consistent as possible +* **ENGINE** must be: **Elasticsearch** + +##### Filter to push down + +An important ability of `Doris On ES` is the push-down of filter conditions: The filtering conditions are pushed to ES, so that only the data that really meets the conditions will be returned, which can significantly improve query performance and reduce CPU, memory, and IO utilization of Doris and ES + +The following operators (Operators) will be optimized to the following ES Query: + +| SQL syntax | ES 5.x+ syntax | +|-------|:---:| +| = | term query| +| in | terms query | +| > , < , >= , ⇐ | range query | +| and | bool.filter | +| or | bool.should | +| not | bool.must_not | +| not in | bool.must_not + terms query | +| is\_not\_null | exists query | +| is\_null | bool.must_not + exists query | +| esquery | QueryDSL in ES native json form | + +##### Data type mapping + +Doris\ES | byte | short | integer | long | float | double| keyword | text | date +------------- | ------------- | ------ | ---- | ----- | ---- | ------ | ----| --- | --- | +tinyint | √ | | | | | | | | +smallint | √ | √ | | | | | | | +int | √ | √ | √ | | | | | | +bigint | √ | √ | √ | √ | | | | | +float | | | | | √ | | | | +double | | | | | | √ | | | +char | | | | | | | √ | √ | +varchar | | | | | | | √ | √ | +date | | | | | | | | | √| +datetime | | | | | | | | | √| + + +### Enable column scan to optimize query speed(enable\_docvalue\_scan=true) + +``` +CREATE EXTERNAL TABLE `test` ( `k1` bigint(20) COMMENT "", `k2` datetime COMMENT "", `k3` varchar(20) COMMENT "", `k4` varchar(100) COMMENT "", `k5` float COMMENT "" ) ENGINE=ELASTICSEARCH -PARTITION BY RANGE(`id`) -() PROPERTIES ( "hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200", +"index" = "test”, +"type" = "doc", "user" = "root", "password" = "root", -"index" = "tindex”, -"type" = "doc" + +"enable_docvalue_scan" = "true" ); ``` -Description of parameters: +Parameter Description: -Parameter | description +Parameter | Description ---|--- -Hosts | ES Cluster Connection Address, which can specify one or more, through which Doris obtains the share distribution information of ES version number and index -User | Open the user name of the ES cluster authenticated by basic, you need to ensure that the user has access to: / cluster / state / nodes / HTTP and other path permissions and read permissions for index -Password | corresponding user's password information -The index name of the ES corresponding to the table in index | Doris can be alias -Type | Specifies the type of index, defaulting to _doc -Transport | Internal reservation, default to http +**enable\_docvalue\_scan** | whether to enable ES/Lucene column storage to get the value of the query field, the default is false + +Doris obtains data from ES following the following two principles: + +* **Best effort**: Automatically detect whether the column to be read has column storage enabled (doc_value: true).If all the fields obtained have column storage, Doris will obtain the values of all fields from the column storage(doc_values) +* **Automatic downgrade**: If the field to be obtained has one or more field that is not have doc_value, the values of all fields will be parsed from the line store `_source` + +##### Advantage: + +By default, Doris On ES will get all the required columns from the row storage, which is `_source`, and the storage of `_source` is the origin json format document,Inferior to column storage in batch read performance,Especially obvious when only a few columns are needed,When only a few columns are obtained, the performance of docvalue is about ten times that of _source -### Query +##### Tip +1. Fields of type `text` are not column-stored in ES, so if the value of the field to be obtained has a field of type `text`, it will be automatically downgraded to get from `_source` +2. In the case of too many fields obtained (`>= 25`), the performance of getting field values from `docvalue` will be basically the same as getting field values from `_source` -#### Basic Conditions Filtration + +### Detect keyword type field(enable\_keyword\_sniff=true) + +``` +CREATE EXTERNAL TABLE `test` ( + `k1` bigint(20) COMMENT "", + `k2` datetime COMMENT "", + `k3` varchar(20) COMMENT "", + `k4` varchar(100) COMMENT "", + `k5` float COMMENT "" +) ENGINE=ELASTICSEARCH +PROPERTIES ( +"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200", +"index" = "test”, +"type" = "doc", +"user" = "root", +"password" = "root", + +"enable_keyword_sniff" = "true" +); +``` + +参数说明: + Review comment: change it to english ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org