wutiangan commented on a change in pull request #3867:
URL: https://github.com/apache/incubator-doris/pull/3867#discussion_r440105196



##########
File path: docs/en/extending-doris/doris-on-es.md
##########
@@ -26,62 +26,314 @@ under the License.
 
 # Doris On ES
 
-Doris-On-ES combines Doris's distributed query planning capability with ES 
(Elastic search)'s full-text search capability to provide a more complete OLAP 
scenario solution:
+Doris-On-ES not only take advantage of Doris's distributed query planning 
capability but also ES (Elastic search)'s full-text search capability, provide 
a more complete OLAP scenario solution:
 
 1. Multi-index Distributed Join Query in ES
 2. Joint Query of Tables in Doris and ES, More Complex Full-Text Retrieval and 
Filtering
-3. Aggregated queries for fields of ES keyword type: suitable for frequent 
changes in index, tens of millions or more of single fragmented documents, and 
the cardinality of the field is very large
 
 This document mainly introduces the realization principle and usage of this 
function.
 
-## Noun Interpretation
+## Glossary
+
+### Noun in Doris
 
 * FE: Frontend, the front-end node of Doris. Responsible for metadata 
management and request access.
 * BE: Backend, Doris's back-end node. Responsible for query execution and data 
storage.
-* Elastic search (ES): The most popular open source distributed search engine.
+
+### Noun in ES
+
 * DataNode: The data storage and computing node of ES.
 * MasterNode: The Master node of ES, which manages metadata, nodes, data 
distribution, etc.
 * scroll: The built-in data set cursor feature of ES for streaming scanning 
and filtering of data.
+* _source: contains the original JSON document body that was passed at index 
time
+* doc_values: store the same values as the _source but in a column-oriented 
fashion
+* keyword: string datatype in ES, but the content not analyzed by analyzer
+* text: string datatype in ES, the content analyzed by analyzer
 
 
-## How to use it
+## How To Use
 
-### Create appearance
+### Create ES Index
 
 ```
-CREATE EXTERNAL TABLE `es_table` (
-  `id` bigint(20) COMMENT "",
+PUT test
+{
+   "settings": {
+      "index": {
+         "number_of_shards": "1",
+         "number_of_replicas": "0"
+      }
+   },
+   "mappings": {
+      "doc": { // ES 7.x版本之后创建索引时不需要指定type,会有一个默认且唯一的`_doc` type
+         "properties": {
+            "k1": {
+               "type": "long"
+            },
+            "k2": {
+               "type": "date"
+            },
+            "k3": {
+               "type": "keyword"
+            },
+            "k4": {
+               "type": "text",
+               "analyzer": "standard"
+            },
+            "k5": {
+               "type": "float"
+            }
+         }
+      }
+   }
+}
+```
+
+### Add JSON documents to ES index
+
+```
+POST /_bulk
+{"index":{"_index":"test","_type":"doc"}}
+{ "k1" : 100, "k2": "2020-01-01", "k3": "Trying out Elasticsearch", "k4": 
"Trying out Elasticsearch", "k5": 10.0}
+{"index":{"_index":"test","_type":"doc"}}
+{ "k1" : 100, "k2": "2020-01-01", "k3": "Trying out Doris", "k4": "Trying out 
Doris", "k5": 10.0}
+{"index":{"_index":"test","_type":"doc"}}
+{ "k1" : 100, "k2": "2020-01-01", "k3": "Doris On ES", "k4": "Doris On ES", 
"k5": 10.0}
+{"index":{"_index":"test","_type":"doc"}}
+{ "k1" : 100, "k2": "2020-01-01", "k3": "Doris", "k4": "Doris", "k5": 10.0}
+{"index":{"_index":"test","_type":"doc"}}
+{ "k1" : 100, "k2": "2020-01-01", "k3": "ES", "k4": "ES", "k5": 10.0}
+```
+
+### Create external ES table
+
+```
+CREATE EXTERNAL TABLE `test` (
+  `k1` bigint(20) COMMENT "",
+  `k2` datetime COMMENT "",
+  `k3` varchar(20) COMMENT "",
+  `k4` varchar(100) COMMENT "",
+  `k5` float COMMENT ""
+) ENGINE=ELASTICSEARCH // ENGINE必须是Elasticsearch
+PROPERTIES (
+"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200";,
+"index" = "test”,
+"type" = "doc",
+
+"user" = "root",
+"password" = "root"
+);
+```
+
+The following parameters are accepted by ES table:
+
+参数 | 说明
+---|---
+**hosts** | ES Cluster Connection Address, maybe one or more node, 
load-balance is also accepted
+**index** | the related ES index name
+**type** | the type for this index,If not specified, `_doc` will be used
+**user** | username for ES
+**password** | password for the user
+
+* For clusters before 7.x, please pay attention to choosing the correct type 
when building the table
+* The authentication method only supports Http Bastic authentication, and the 
fully open source ES cluster user and password do not need to be specified,Need 
to ensure that the user has access to /_cluster/state and other path 
permissions and read permissions to index
+* The column names in the Doris table need to exactly match the field names in 
the ES, and the field types should be as consistent as possible
+*  **ENGINE** must be: **Elasticsearch**
+
+##### Filter to push down
+
+An important ability of `Doris On ES` is the push-down of filter conditions: 
The filtering conditions are pushed to ES, so that only the data that really 
meets the conditions will be returned, which can significantly improve query 
performance and reduce CPU, memory, and IO utilization of Doris and ES
+
+The following operators (Operators) will be optimized to the following ES 
Query:
+
+| SQL syntax  | ES 5.x+ syntax | 
+|-------|:---:|
+| =   | term query|
+| in  | terms query   |
+| > , < , >= , ⇐  | range query |
+| and  | bool.filter   |
+| or  | bool.should   |
+| not  | bool.must_not   |
+| not in  | bool.must_not + terms query |
+| is\_not\_null  | exists query |
+| is\_null  | bool.must_not + exists query |
+| esquery  | QueryDSL in ES native json form   |
+
+##### Data type mapping
+
+Doris\ES  |  byte | short | integer | long | float | double| keyword | text | 
date
+------------- | ------------- | ------  | ---- | ----- | ----  | ------ | 
----| --- | --- |
+tinyint  | &radic; |  |  |  |   |   |   |   |  
+smallint | &radic; | &radic; |  | |   |   |   |   |  
+int | &radic; |  &radic; | &radic; | |   |   |   |   |  
+bigint | &radic;  | &radic;  | &radic;  | &radic; |   |   |   |   | 
+float |   |   |   |   | &radic; |   |   |   |  
+double |   |   |   |   |   | &radic; |   |   |  
+char |   |   |   |   |   |   | &radic; | &radic; |  
+varchar |  |   |   |   |   |   | &radic; | &radic; |   
+date |   |   |   |   |   |   |   |   | &radic;|  
+datetime |   |   |   |   |   |   |   |   | &radic;|  
+
+
+### Enable column scan to optimize query speed(enable\_docvalue\_scan=true)
+
+```
+CREATE EXTERNAL TABLE `test` (
   `k1` bigint(20) COMMENT "",
   `k2` datetime COMMENT "",
   `k3` varchar(20) COMMENT "",
   `k4` varchar(100) COMMENT "",
   `k5` float COMMENT ""
 ) ENGINE=ELASTICSEARCH
-PARTITION BY RANGE(`id`)
-()
 PROPERTIES (
 "hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200";,
+"index" = "test”,
+"type" = "doc",
 "user" = "root",
 "password" = "root",
-"index" = "tindex”,
-"type" = "doc"
+
+"enable_docvalue_scan" = "true"
 );
 ```
 
-Description of parameters:
+Parameter Description:
 
-Parameter | description
+Parameter | Description
 ---|---
-Hosts | ES Cluster Connection Address, which can specify one or more, through 
which Doris obtains the share distribution information of ES version number and 
index
-User | Open the user name of the ES cluster authenticated by basic, you need 
to ensure that the user has access to: / cluster / state / nodes / HTTP and 
other path permissions and read permissions for index
-Password | corresponding user's password information
-The index name of the ES corresponding to the table in index | Doris can be 
alias
-Type | Specifies the type of index, defaulting to _doc
-Transport | Internal reservation, default to http
+**enable\_docvalue\_scan** | whether to enable ES/Lucene column storage to get 
the value of the query field, the default is false
+
+Doris obtains data from ES following the following two principles:
+
+* **Best effort**: Automatically detect whether the column to be read has 
column storage enabled (doc_value: true).If all the fields obtained have column 
storage, Doris will obtain the values ​​of all fields from the column 
storage(doc_values)
+* **Automatic downgrade**: If the field to be obtained has one or more field 
that is not have doc_value, the values ​​of all fields will be parsed from the 
line store `_source`
+
+##### Advantage:
+
+By default, Doris On ES will get all the required columns from the row 
storage, which is `_source`, and the storage of `_source` is the origin json 
format document,Inferior to column storage in batch read performance,Especially 
obvious when only a few columns are needed,When only a few columns are 
obtained, the performance of docvalue is about ten times that of _source
 
-### Query
+##### Tip
+1. Fields of type `text` are not column-stored in ES, so if the value of the 
field to be obtained has a field of type `text`, it will be automatically 
downgraded to get from `_source`
+2. In the case of too many fields obtained (`>= 25`), the performance of 
getting field values ​​from `docvalue` will be basically the same as getting 
field values ​​from `_source`
 
-#### Basic Conditions Filtration
+
+### Detect keyword type field(enable\_keyword\_sniff=true)
+
+```
+CREATE EXTERNAL TABLE `test` (
+  `k1` bigint(20) COMMENT "",
+  `k2` datetime COMMENT "",
+  `k3` varchar(20) COMMENT "",
+  `k4` varchar(100) COMMENT "",
+  `k5` float COMMENT ""
+) ENGINE=ELASTICSEARCH
+PROPERTIES (
+"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200";,
+"index" = "test”,
+"type" = "doc",
+"user" = "root",
+"password" = "root",
+
+"enable_keyword_sniff" = "true"
+);
+```
+
+参数说明:
+

Review comment:
       change it to english




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org

Reply via email to