[incubator-doris] branch master updated: [Doris On ES][Docs] refator documentation for doe (#3867)

kangkaisen Tue, 16 Jun 2020 19:55:17 -0700

This is an automated email from the ASF dual-hosted git repository.

kangkaisen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-doris.git



The following commit(s) were added to refs/heads/master by this push:
     new c6f2b5e  [Doris On ES][Docs] refator documentation for doe (#3867)
c6f2b5e is described below

commit c6f2b5ef0df61902360a3f3fcdfc00a6fff86dfe
Author: Yunfeng,Wu <wuyunfen...@baidu.com>
AuthorDate: Wed Jun 17 10:54:28 2020 +0800

    [Doris On ES][Docs] refator documentation for doe (#3867)
---
 docs/en/extending-doris/doris-on-es.md    | 331 +++++++++++++++++++++++++----
 docs/zh-CN/extending-doris/doris-on-es.md | 341 +++++++++++++++++++++++++-----
 2 files changed, 577 insertions(+), 95 deletions(-)

diff --git a/docs/en/extending-doris/doris-on-es.md 
b/docs/en/extending-doris/doris-on-es.md
index cf66cab..04fdb84 100644
--- a/docs/en/extending-doris/doris-on-es.md
+++ b/docs/en/extending-doris/doris-on-es.md
@@ -26,76 +26,327 @@ under the License.
 
 # Doris On ES
 
-Doris-On-ES combines Doris's distributed query planning capability with ES 
(Elastic search)'s full-text search capability to provide a more complete OLAP 
scenario solution:
+Doris-On-ES not only take advantage of Doris's distributed query planning 
capability but also ES (Elastic search)'s full-text search capability, provide 
a more complete OLAP scenario solution:
 
 1. Multi-index Distributed Join Query in ES
 2. Joint Query of Tables in Doris and ES, More Complex Full-Text Retrieval and 
Filtering
-3. Aggregated queries for fields of ES keyword type: suitable for frequent 
changes in index, tens of millions or more of single fragmented documents, and 
the cardinality of the field is very large
 
 This document mainly introduces the realization principle and usage of this 
function.
 
-## Noun Interpretation
+## Glossary
+
+### Noun in Doris
 
 * FE: Frontend, the front-end node of Doris. Responsible for metadata 
management and request access.
 * BE: Backend, Doris's back-end node. Responsible for query execution and data 
storage.
-* Elastic search (ES): The most popular open source distributed search engine.
+
+### Noun in ES
+
 * DataNode: The data storage and computing node of ES.
 * MasterNode: The Master node of ES, which manages metadata, nodes, data 
distribution, etc.
 * scroll: The built-in data set cursor feature of ES for streaming scanning 
and filtering of data.
+* _source: contains the original JSON document body that was passed at index 
time
+* doc_values: store the same values as the _source but in a column-oriented 
fashion
+* keyword: string datatype in ES, but the content not analyzed by analyzer
+* text: string datatype in ES, the content analyzed by analyzer
 
 
-## How to use it
+## How To Use
 
-### Create appearance
+### Create ES Index
 
 ```
-CREATE EXTERNAL TABLE `es_table` (
-  `id` bigint(20) COMMENT "",
+PUT test
+{
+   "settings": {
+      "index": {
+         "number_of_shards": "1",
+         "number_of_replicas": "0"
+      }
+   },
+   "mappings": {
+      "doc": { // There is no need to specify the type when creating indexes 
after ES7.x version, there is one and only type of `_doc`
+         "properties": {
+            "k1": {
+               "type": "long"
+            },
+            "k2": {
+               "type": "date"
+            },
+            "k3": {
+               "type": "keyword"
+            },
+            "k4": {
+               "type": "text",
+               "analyzer": "standard"
+            },
+            "k5": {
+               "type": "float"
+            }
+         }
+      }
+   }
+}
+```
+
+### Add JSON documents to ES index
+
+```
+POST /_bulk
+{"index":{"_index":"test","_type":"doc"}}
+{ "k1" : 100, "k2": "2020-01-01", "k3": "Trying out Elasticsearch", "k4": 
"Trying out Elasticsearch", "k5": 10.0}
+{"index":{"_index":"test","_type":"doc"}}
+{ "k1" : 100, "k2": "2020-01-01", "k3": "Trying out Doris", "k4": "Trying out 
Doris", "k5": 10.0}
+{"index":{"_index":"test","_type":"doc"}}
+{ "k1" : 100, "k2": "2020-01-01", "k3": "Doris On ES", "k4": "Doris On ES", 
"k5": 10.0}
+{"index":{"_index":"test","_type":"doc"}}
+{ "k1" : 100, "k2": "2020-01-01", "k3": "Doris", "k4": "Doris", "k5": 10.0}
+{"index":{"_index":"test","_type":"doc"}}
+{ "k1" : 100, "k2": "2020-01-01", "k3": "ES", "k4": "ES", "k5": 10.0}
+```
+
+### Create external ES table
+
+```
+CREATE EXTERNAL TABLE `test` (
+  `k1` bigint(20) COMMENT "",
+  `k2` datetime COMMENT "",
+  `k3` varchar(20) COMMENT "",
+  `k4` varchar(100) COMMENT "",
+  `k5` float COMMENT ""
+) ENGINE=ELASTICSEARCH // ENGINE must be Elasticsearch
+PROPERTIES (
+"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200";,
+"index" = "test”,
+"type" = "doc",
+
+"user" = "root",
+"password" = "root"
+);
+```
+
+The following parameters are accepted by ES table:
+
+Parameter | Description
+---|---
+**hosts** | ES Cluster Connection Address, maybe one or more node, 
load-balance is also accepted
+**index** | the related ES index name
+**type** | the type for this index，If not specified, `_doc` will be used
+**user** | username for ES
+**password** | password for the user
+
+* For clusters before 7.x, please pay attention to choosing the correct type 
when building the table
+* The authentication method only supports Http Bastic authentication, need to 
ensure that this user has access to: /\_cluster/state/, \_nodes/http and other 
paths and index read permissions;The cluster has not turned on security 
authentication, and the user name and password do not need to be set
+* The column names in the Doris table need to exactly match the field names in 
the ES, and the field types should be as consistent as possible
+*  **ENGINE** must be: **Elasticsearch**
+
+##### Filter to push down
+
+An important ability of `Doris On ES` is the push-down of filter conditions: 
The filtering conditions are pushed to ES, so that only the data that really 
meets the conditions will be returned, which can significantly improve query 
performance and reduce CPU, memory, and IO utilization of Doris and ES
+
+The following operators (Operators) will be optimized to the following ES 
Query:
+
+| SQL syntax  | ES 5.x+ syntax | 
+|-------|:---:|
+| =   | term query|
+| in  | terms query   |
+| > , < , >= , ⇐  | range query |
+| and  | bool.filter   |
+| or  | bool.should   |
+| not  | bool.must_not   |
+| not in  | bool.must_not + terms query |
+| is\_not\_null  | exists query |
+| is\_null  | bool.must_not + exists query |
+| esquery  | QueryDSL in ES native json form   |
+
+##### Data type mapping
+
+Doris\ES  |  byte | short | integer | long | float | double| keyword | text | 
date
+------------- | ------------- | ------  | ---- | ----- | ----  | ------ | 
----| --- | --- |
+tinyint  | &radic; |  |  |  |   |   |   |   |  
+smallint | &radic; | &radic; |  | |   |   |   |   |  
+int | &radic; |  &radic; | &radic; | |   |   |   |   |  
+bigint | &radic;  | &radic;  | &radic;  | &radic; |   |   |   |   | 
+float |   |   |   |   | &radic; |   |   |   |  
+double |   |   |   |   |   | &radic; |   |   |  
+char |   |   |   |   |   |   | &radic; | &radic; |  
+varchar |  |   |   |   |   |   | &radic; | &radic; |   
+date |   |   |   |   |   |   |   |   | &radic;|  
+datetime |   |   |   |   |   |   |   |   | &radic;|  
+
+
+### Enable column scan to optimize query speed(enable\_docvalue\_scan=true)
+
+```
+CREATE EXTERNAL TABLE `test` (
   `k1` bigint(20) COMMENT "",
   `k2` datetime COMMENT "",
   `k3` varchar(20) COMMENT "",
   `k4` varchar(100) COMMENT "",
   `k5` float COMMENT ""
 ) ENGINE=ELASTICSEARCH
-PARTITION BY RANGE(`id`)
-()
 PROPERTIES (
 "hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200";,
+"index" = "test”,
+"type" = "doc",
 "user" = "root",
 "password" = "root",
-"index" = "tindex”,
-"type" = "doc"
+
+"enable_docvalue_scan" = "true"
 );
 ```
 
-Description of parameters:
+Parameter Description:
 
-Parameter | description
+Parameter | Description
 ---|---
-Hosts | ES Cluster Connection Address, which can specify one or more, through 
which Doris obtains the share distribution information of ES version number and 
index
-User | Open the user name of the ES cluster authenticated by basic, you need 
to ensure that the user has access to: / cluster / state / nodes / HTTP and 
other path permissions and read permissions for index
-Password | corresponding user's password information
-The index name of the ES corresponding to the table in index | Doris can be 
alias
-Type | Specifies the type of index, defaulting to _doc
-Transport | Internal reservation, default to http
+**enable\_docvalue\_scan** | whether to enable ES/Lucene column storage to get 
the value of the query field, the default is false
+
+Doris obtains data from ES following the following two principles:
+
+* **Best effort**: Automatically detect whether the column to be read has 
column storage enabled (doc_value: true).If all the fields obtained have column 
storage, Doris will obtain the values of all fields from the column 
storage(doc_values)
+* **Automatic downgrade**: If the field to be obtained has one or more field 
that is not have doc_value, the values of all fields will be parsed from the 
line store `_source`
+
+##### Advantage：
+
+By default, Doris On ES will get all the required columns from the row 
storage, which is `_source`, and the storage of `_source` is the origin json 
format document，Inferior to column storage in batch read performance，Especially 
obvious when only a few columns are needed，When only a few columns are 
obtained, the performance of docvalue is about ten times that of _source
 
-### Query
+##### Tip
+1. Fields of type `text` are not column-stored in ES, so if the value of the 
field to be obtained has a field of type `text`, it will be automatically 
downgraded to get from `_source`
+2. In the case of too many fields obtained (`>= 25`), the performance of 
getting field values from `docvalue` will be basically the same as getting 
field values from `_source`
 
-#### Basic Conditions Filtration
+
+### Detect keyword type field(enable\_keyword\_sniff=true)
 
 ```
-select * from es_table where k1 > 1000 and k3 ='term' or k4 like 'fu*z_'
+CREATE EXTERNAL TABLE `test` (
+  `k1` bigint(20) COMMENT "",
+  `k2` datetime COMMENT "",
+  `k3` varchar(20) COMMENT "",
+  `k4` varchar(100) COMMENT "",
+  `k5` float COMMENT ""
+) ENGINE=ELASTICSEARCH
+PROPERTIES (
+"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200";,
+"index" = "test”,
+"type" = "doc",
+"user" = "root",
+"password" = "root",
+
+"enable_keyword_sniff" = "true"
+);
 ```
 
-#### Extended esquery SQL grammar
-The first column name parameter of `esquery` is used to associate `index`, the 
second parameter is the JSON expression of the basic `Query DSL`, and the curly 
bracket `{}` is used to include `root` of json. There is and can only be one 
key of json, such as mat. Ch, geo_shape, bool, etc.
+Parameter Description：
+
+Parameter | Description
+---|---
+**enable\_keyword\_sniff** | Whether to detect the string type (**text**) 
`fields` in ES to obtain additional not analyzed (**keyword**) field 
name(multi-fields mechanism)
+
+You can directly import data without creating an index. At this time, ES will 
automatically create a new index in ES, For a field of type string, a field of 
type `text` and field of type `keyword` will be created meantime, This is the 
multi-fields feature of ES, mapping is as follows：
 
+```
+"k4": {
+   "type": "text",
+   "fields": {
+      "keyword": {   
+         "type": "keyword",
+         "ignore_above": 256
+      }
+   }
+}
+```
+When performing conditional filtering on k4, for example =，Doris On ES will 
convert the query to ES's TermQuery
+
+SQL filter：
+
+```
+k4 = "Doris On ES"
+```
+
+The query DSL converted into ES is：
+
+```
+"term" : {
+    "k4": "Doris On ES"
+
+}
+```
+
+Because the first field type of k4 is `text`, when data is imported, it will 
perform word segmentation processing according to the word segmentator set by 
k4 (if it is not set, it is the standard word segmenter) to get three Term of 
doris, on, and es, as follows ES analyze API analysis：
+
+```
+POST /_analyze
+{
+  "analyzer": "standard",
+  "text": "Doris On ES"
+}
+```
+The result of analyzed is：
+
+```
+{
+   "tokens": [
+      {
+         "token": "doris",
+         "start_offset": 0,
+         "end_offset": 5,
+         "type": "<ALPHANUM>",
+         "position": 0
+      },
+      {
+         "token": "on",
+         "start_offset": 6,
+         "end_offset": 8,
+         "type": "<ALPHANUM>",
+         "position": 1
+      },
+      {
+         "token": "es",
+         "start_offset": 9,
+         "end_offset": 11,
+         "type": "<ALPHANUM>",
+         "position": 2
+      }
+   ]
+}
+```
+The query uses：
+
+```
+"term" : {
+    "k4": "Doris On ES"
+}
+```
+This term does not match any term in the dictionary，and will not return any 
results，enable `enable_keyword_sniff: true` will automatically convert `k4 = 
"Doris On ES"` into `k4.keyword = "Doris On ES"`to exactly match SQL 
semantics，The converted ES query DSL is:
+
+```
+"term" : {
+    "k4.keyword": "Doris On ES"
+}
+```
+
+The type of `k4.keyword` is `keyword`, and writing data into ES is a complete 
term, so it can be matched
+
+
+### Query usage
+
+After create the ES external table in Doris, there is no difference except 
that the data model (rollup, pre-aggregation, materialized view, etc.) with 
other table in Doris
+
+#### Basic usage
+
+```
+select * from es_table where k1 > 1000 and k3 ='term' or k4 like 'fu*z_'
+```
+
+#### Extended esquery(field, QueryDSL)
+Through the `esquery(field, QueryDSL)` function, some queries that cannot be 
expressed in sql, such as match_phrase, geoshape, etc., are pushed down to the 
ES for filtering. The first column name parameter of `esquery` is used to 
associate the `index`, the second This parameter is the basic JSON expression 
of ES's `Query DSL`, which is contained in curly braces `{}`, and there can be 
only one root key of json, such as match_phrase, geo_shape, bool, etc.
 Match query:
 
 ```
 select * from es_table where esquery(k4, '{
         "match": {
-           "k4": "doris on elasticsearch"
+           "k4": "doris on es"
         }
     }');
 ```
@@ -193,7 +444,7 @@ select * from es_table where esquery(k4, ' {
 
 ```
 
-1. After the ES appearance is created, FE requests the host specified by the 
table to obtain HTTP port information of all nodes and share distribution 
information of index. If the request fails, it will traverse the host list 
sequentially until it succeeds or fails completely.
+1. FE requests the hosts specified by the table to obtain node‘s HTTP port, 
shards location of the index. If the request fails, it will traverse the host 
list sequentially until it succeeds or fails completely.
 
 2. When querying, the query plan will be generated and sent to the 
corresponding BE node according to some node information obtained by FE and 
metadata information of index.
 
@@ -201,24 +452,7 @@ select * from es_table where esquery(k4, ' {
 
 4. After calculating the result, return it to client
 
-## Push-Down operations
-An important function of `Doris On Elastic` search is to push down filtering 
conditions: push ES under filtering conditions, so that only data that really 
meets the conditions can be returned, which can significantly improve query 
performance and reduce the CPU, memory and IO utilization of Doris and Elastic 
search.
-
-The following operators are optimized to push down filters as follows:
-
-| SQL syntax  | ES 5.x+ syntax | 
-|-------|:---:|
-| =   | term query|
-| in  | terms query   |
-| > , < , >= , ⇐  | range   |
-| and  | bool.filter   |
-| or  | bool.should   |
-| not  | bool.must_not   |
-| not in  | bool.must_not + terms  |
-| esquery  | ES Query DSL   |
-
-
-## Other notes
+## Q&A
 
 1. ES Version Requirements
 
@@ -231,3 +465,10 @@ The following operators are optimized to push down filters 
as follows:
 3. Some queries are much slower than requesting ES
 
        Yes, for example, query related to _count, etc., the ES internal will 
directly read the number of documents that meet the requirements of the 
relevant metadata, without the need to filter the real data.
+       
+4. Whether the aggregation operation can be pushed down
+
+   At present, Doris On ES does not support push-down operations such as sum, 
avg, min/max, etc., all documents satisfying the conditions are obtained from 
the ES in batch flow, and then calculated in Doris
+   
+5. Filters for date type fields cannot be pushed down
+   Due to the time format problem, the date type field will not be pushed down 
in most cases; the date type filtering can be in the form of a string, and the 
date format needs to be completely consistent with ES
diff --git a/docs/zh-CN/extending-doris/doris-on-es.md 
b/docs/zh-CN/extending-doris/doris-on-es.md
index d40fe89..9b58243 100644
--- a/docs/zh-CN/extending-doris/doris-on-es.md
+++ b/docs/zh-CN/extending-doris/doris-on-es.md
@@ -30,41 +30,166 @@ Doris-On-ES将Doris的分布式查询规划能力和ES(Elasticsearch)的全文
  
  1. ES中的多index分布式Join查询
  2. Doris和ES中的表联合查询，更复杂的全文检索过滤
- 3. ES keyword类型字段的聚合查询：适用于index 频繁发生变化、单个分片文档数量千万级以上且该字段基数(cardinality)非常大
 
 本文档主要介绍该功能的实现原理、使用方式等。
 
 ## 名词解释
 
-* FE：Frontend，Doris 的前端节点。负责元数据管理和请求接入。
-* BE：Backend，Doris 的后端节点。负责查询执行和数据存储。
-* Elasticsearch(ES)：目前最流行的开源分布式搜索引擎。
-* DataNode：ES的数据存储与计算节点。
-* MasterNode：ES的Master节点，管理元数据、节点、数据分布等。
-* scroll：ES内置的数据集游标特性，用来对数据进行流式扫描和过滤。
+### Doirs相关
+* FE：Frontend，Doris 的前端节点,负责元数据管理和请求接入
+* BE：Backend，Doris 的后端节点,负责查询执行和数据存储
 
+### ES相关
+* DataNode：ES的数据存储与计算节点
+* MasterNode：ES的Master节点，管理元数据、节点、数据分布等
+* scroll：ES内置的数据集游标特性，用来对数据进行流式扫描和过滤
+* _source: 导入时传入的原始JSON格式文档内容
+* doc_values: ES/Lucene 中字段的列式存储定义
+* keyword: 字符串类型字段，ES/Lucene不会对文本内容进行分词处理
+* text: 字符串类型字段，ES/Lucene会对文本内容进行分词处理，分词器需要用户指定，默认为standard英文分词器
 
-## 如何使用
 
-### 创建外表
+## 使用方法
+
+### 创建ES索引
 
 ```
-CREATE EXTERNAL TABLE `es_table` (
-  `id` bigint(20) COMMENT "",
+PUT test
+{
+   "settings": {
+      "index": {
+         "number_of_shards": "1",
+         "number_of_replicas": "0"
+      }
+   },
+   "mappings": {
+      "doc": { // ES 7.x版本之后创建索引时不需要指定type，会有一个默认且唯一的`_doc` type
+         "properties": {
+            "k1": {
+               "type": "long"
+            },
+            "k2": {
+               "type": "date"
+            },
+            "k3": {
+               "type": "keyword"
+            },
+            "k4": {
+               "type": "text",
+               "analyzer": "standard"
+            },
+            "k5": {
+               "type": "float"
+            }
+         }
+      }
+   }
+}
+```
+
+### ES索引导入数据
+
+```
+POST /_bulk
+{"index":{"_index":"test","_type":"doc"}}
+{ "k1" : 100, "k2": "2020-01-01", "k3": "Trying out Elasticsearch", "k4": 
"Trying out Elasticsearch", "k5": 10.0}
+{"index":{"_index":"test","_type":"doc"}}
+{ "k1" : 100, "k2": "2020-01-01", "k3": "Trying out Doris", "k4": "Trying out 
Doris", "k5": 10.0}
+{"index":{"_index":"test","_type":"doc"}}
+{ "k1" : 100, "k2": "2020-01-01", "k3": "Doris On ES", "k4": "Doris On ES", 
"k5": 10.0}
+{"index":{"_index":"test","_type":"doc"}}
+{ "k1" : 100, "k2": "2020-01-01", "k3": "Doris", "k4": "Doris", "k5": 10.0}
+{"index":{"_index":"test","_type":"doc"}}
+{ "k1" : 100, "k2": "2020-01-01", "k3": "ES", "k4": "ES", "k5": 10.0}
+```
+
+### Doris中创建ES外表
+
+```
+CREATE EXTERNAL TABLE `test` (
+  `k1` bigint(20) COMMENT "",
+  `k2` datetime COMMENT "",
+  `k3` varchar(20) COMMENT "",
+  `k4` varchar(100) COMMENT "",
+  `k5` float COMMENT ""
+) ENGINE=ELASTICSEARCH // ENGINE必须是Elasticsearch
+PROPERTIES (
+"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200";,
+"index" = "test”,
+"type" = "doc",
+
+"user" = "root",
+"password" = "root"
+);
+```
+
+参数说明：
+
+参数 | 说明
+---|---
+**hosts** | ES集群地址，可以是一个或多个，也可以是ES前端的负载均衡地址
+**index** | 对应的ES的index名字
+**type** | index的type，不指定的情况会使用_doc
+**user** | ES集群用户名
+**password** | 对应用户的密码信息
+
+* ES 7.x之前的集群请注意在建表的时候选择正确的**索引类型type**
+* 认证方式目前仅支持Http Bastic认证，并且需要确保该用户有访问: 
/\_cluster/state/、\_nodes/http等路径和index的读权限; 集群未开启安全认证，用户名和密码不需要设置
+* Doris表中的列名需要和ES中的字段名完全匹配，字段类型应该保持一致
+*  **ENGINE**必须是 **Elasticsearch**
+
+##### 过滤条件下推
+`Doris On ES`一个重要的功能就是过滤条件的下推: 
过滤条件下推给ES，这样只有真正满足条件的数据才会被返回，能够显著的提高查询性能和降低Doris和Elasticsearch的CPU、memory、IO利用率
+
+下面的操作符(Operators)会被优化成如下ES Query:
+
+| SQL syntax  | ES 5.x+ syntax | 
+|-------|:---:|
+| =   | term query|
+| in  | terms query   |
+| > , < , >= , ⇐  | range query |
+| and  | bool.filter   |
+| or  | bool.should   |
+| not  | bool.must_not   |
+| not in  | bool.must_not + terms query |
+| is\_not\_null  | exists query |
+| is\_null  | bool.must_not + exists query |
+| esquery  | ES原生json形式的QueryDSL   |
+
+##### 数据类型映射
+
+Doris\ES  |  byte | short | integer | long | float | double| keyword | text | 
date
+------------- | ------------- | ------  | ---- | ----- | ----  | ------ | 
----| --- | --- |
+tinyint  | &radic; |  |  |  |   |   |   |   |  
+smallint | &radic; | &radic; |  | |   |   |   |   |  
+int | &radic; |  &radic; | &radic; | |   |   |   |   |  
+bigint | &radic;  | &radic;  | &radic;  | &radic; |   |   |   |   | 
+float |   |   |   |   | &radic; |   |   |   |  
+double |   |   |   |   |   | &radic; |   |   |  
+char |   |   |   |   |   |   | &radic; | &radic; |  
+varchar |  |   |   |   |   |   | &radic; | &radic; |   
+date |   |   |   |   |   |   |   |   | &radic;|  
+datetime |   |   |   |   |   |   |   |   | &radic;|  
+
+
+### 启用列式扫描优化查询速度(enable\_docvalue\_scan=true)
+
+```
+CREATE EXTERNAL TABLE `test` (
   `k1` bigint(20) COMMENT "",
   `k2` datetime COMMENT "",
   `k3` varchar(20) COMMENT "",
   `k4` varchar(100) COMMENT "",
   `k5` float COMMENT ""
 ) ENGINE=ELASTICSEARCH
-PARTITION BY RANGE(`id`)
-()
 PROPERTIES (
 "hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200";,
+"index" = "test”,
+"type" = "doc",
 "user" = "root",
 "password" = "root",
-"index" = "tindex”,
-"type" = "doc"
+
+"enable_docvalue_scan" = "true"
 );
 ```
 
@@ -72,30 +197,153 @@ PROPERTIES (
 
 参数 | 说明
 ---|---
-hosts | ES集群连接地址，可指定一个或多个，Doris通过这个地址获取到ES版本号、index的shard分布信息
-user | 开启basic认证的ES集群的用户名，需要确保该用户有访问: 
/\_cluster/state/\_nodes/http等路径权限和对index的读权限
-password | 对应用户的密码信息
-index | Doris中的表对应的ES的index名字，可以是alias
-type | 指定index的type，默认是_doc
-transport | 内部保留，默认为http
+**enable\_docvalue\_scan** | 是否开启通过ES/Lucene列式存储获取查询字段的值，默认为false
+
+开启后Doris从ES中获取数据会遵循以下两个原则：
+
+* **尽力而为**: 自动探测要读取的字段是否开启列式存储(doc_value: 
true)，如果获取的字段全部有列存，Doris会从列式存储中获取所有字段的值
+* **自动降级**: 如果要获取的字段只要有一个字段没有列存，所有字段的值都会从行存`_source`中解析获取
+
+##### 优势：
+
+默认情况下，Doris On 
ES会从行存也就是`_source`中获取所需的所有列，`_source`的存储采用的行式+json的形式存储，在批量读取性能上要劣于列式存储，尤其在只需要少数列的情况下尤为明显，只获取少数列的情况下，docvalue的性能大约是_source性能的十几倍
+
+##### 注意
+1. `text`类型的字段在ES中是没有列式存储，因此如果要获取的字段值有`text`类型字段会自动降级为从`_source`中获取
+2. 在获取的字段数量过多的情况下(`>= 25`)，从`docvalue`中获取字段值的性能会和从`_source`中获取字段值基本一样
 
-### 查询
 
-#### 基本条件过滤
+### 探测keyword类型字段(enable\_keyword\_sniff=true)
+
+```
+CREATE EXTERNAL TABLE `test` (
+  `k1` bigint(20) COMMENT "",
+  `k2` datetime COMMENT "",
+  `k3` varchar(20) COMMENT "",
+  `k4` varchar(100) COMMENT "",
+  `k5` float COMMENT ""
+) ENGINE=ELASTICSEARCH
+PROPERTIES (
+"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200";,
+"index" = "test”,
+"type" = "doc",
+"user" = "root",
+"password" = "root",
+
+"enable_keyword_sniff" = "true"
+);
+```
+
+参数说明：
+
+参数 | 说明
+---|---
+**enable\_keyword\_sniff** | 是否对ES中字符串类型分词类型(**text**) `fields` 
进行探测，获取额外的未分词(**keyword**)字段名(multi-fields机制)
+
+在ES中可以不建立index直接进行数据导入，这时候ES会自动创建一个新的索引，针对字符串类型的字段ES会创建一个既有`text`类型的字段又有`keyword`类型的字段，这就是ES的multi
 fields特性，mapping如下：
+
+```
+"k4": {
+   "type": "text",
+   "fields": {
+      "keyword": {   
+         "type": "keyword",
+         "ignore_above": 256
+      }
+   }
+}
+```
+对k4进行条件过滤时比如=，Doris On ES会将查询转换为ES的TermQuery
+
+SQL过滤条件：
+
+```
+k4 = "Doris On ES"
+```
+
+转换成ES的query DSL为：
+
+```
+"term" : {
+    "k4": "Doris On ES"
+
+}
+```
+
+因为k4的第一字段类型为`text`，在数据导入的时候就会根据k4设置的分词器(如果没有设置，就是standard分词器)进行分词处理得到doris、on、es三个Term，如下ES
 analyze API分析：
+
+```
+POST /_analyze
+{
+  "analyzer": "standard",
+  "text": "Doris On ES"
+}
+```
+分词的结果是：
+
+```
+{
+   "tokens": [
+      {
+         "token": "doris",
+         "start_offset": 0,
+         "end_offset": 5,
+         "type": "<ALPHANUM>",
+         "position": 0
+      },
+      {
+         "token": "on",
+         "start_offset": 6,
+         "end_offset": 8,
+         "type": "<ALPHANUM>",
+         "position": 1
+      },
+      {
+         "token": "es",
+         "start_offset": 9,
+         "end_offset": 11,
+         "type": "<ALPHANUM>",
+         "position": 2
+      }
+   ]
+}
+```
+查询时使用的是：
+
+```
+"term" : {
+    "k4": "Doris On ES"
+}
+```
+`Doris On ES`这个term匹配不到词典中的任何term，不会返回任何结果，而启用`enable_keyword_sniff: 
true`会自动将`k4 = "Doris On ES"`转换成`k4.keyword = "Doris On ES"`来完全匹配SQL语义，转换后的ES 
query DSL为:
+
+```
+"term" : {
+    "k4.keyword": "Doris On ES"
+}
+```
+
+`k4.keyword` 的类型是`keyword`，数据写入ES中是一个完整的term，所以可以匹配
+
+### 查询用法
+
+完成在Doris中建立ES外表后，除了无法使用Doris中的数据模型(rollup、预聚合、物化视图等)外并无区别
+
+#### 基本查询
 
 ```
 select * from es_table where k1 > 1000 and k3 ='term' or k4 like 'fu*z_'
 ```
 
-#### 扩展的esquery sql语法
-通过`esquery`函数将一些无法用sql表述的ES 
query如match、geoshape等下推给ES进行过滤处理，`esquery`的第一个列名参数用于关联`index`，第二个参数是ES的基本`Query 
DSL`的json表述，使用花括号`{}`包含，json的`root key`有且只能有一个，如match、geo_shape、bool等
+#### 扩展的esquery(field, QueryDSL)
+通过`esquery(field, 
QueryDSL)`函数将一些无法用sql表述的query如match_phrase、geoshape等下推给ES进行过滤处理，`esquery`的第一个列名参数用于关联`index`，第二个参数是ES的基本`Query
 DSL`的json表述，使用花括号`{}`包含，json的`root key`有且只能有一个，如match_phrase、geo_shape、bool等
 
-match查询：
+match_phrase查询：
 
 ```
 select * from es_table where esquery(k4, '{
-        "match": {
-           "k4": "doris on elasticsearch"
+        "match_phrase": {
+           "k4": "doris on es"
         }
     }');
 ```
@@ -195,37 +443,30 @@ select * from es_table where esquery(k4, ' {
 
 1. 
创建ES外表后，FE会请求建表指定的主机，获取所有节点的HTTP端口信息以及index的shard分布信息等，如果请求失败会顺序遍历host列表直至成功或完全失败
 
-2. 查询时，会根据FE得到的一些节点信息和index的元数据信息，生成查询计划并发给对应的BE节点
+2. 查询时会根据FE得到的一些节点信息和index的元数据信息，生成查询计划并发给对应的BE节点
 
-3. BE节点会根据`就近原则`即优先请求本地部署的ES节点，BE通过`HTTP Scroll`方式流式的从ES index的每个分片中并发的获取数据
+3. BE节点会根据`就近原则`即优先请求本地部署的ES节点，BE通过`HTTP Scroll`方式流式的从ES 
index的每个分片中并发的从`_source`或`docvalue`中获取数据
 
-4. 计算完结果后，返回给client端
-
-## Push-Down operations
-`Doris On Elasticsearch`一个重要的功能就是过滤条件的下推: 
过滤条件下推给ES，这样只有真正满足条件的数据才会被返回，能够显著的提高查询性能和降低Doris和Elasticsearch的CPU、memory、IO利用率
-
-下面的操作符(Operators)会被优化成如下下推filters:
-
-| SQL syntax  | ES 5.x+ syntax | 
-|-------|:---:|
-| =   | term query|
-| in  | terms query   |
-| > , < , >= , ⇐  | range   |
-| and  | bool.filter   |
-| or  | bool.should   |
-| not  | bool.must_not   |
-| not in  | bool.must_not + terms  |
-| esquery  | ES Query DSL   |
+4. Doris计算完结果后，返回给用户
 
 
-## 其他说明
+## Q&A
 
-1. ES的版本要求
+1. Doris On ES对ES的版本要求
 
-    ES主版本大于5，ES在2.x之前和5.x之后数据的扫描方式不同，目前支持5.x之后的
+    ES主版本大于5，ES在2.x之前和5.x之后数据的扫描方式不同，目前支持仅5.x之后的
+    
 2. 是否支持X-Pack认证的ES集群
 
     支持所有使用HTTP Basic认证方式的ES集群
 3. 一些查询比请求ES慢很多
 
     是，比如_count相关的query等，ES内部会直接读取满足条件的文档个数相关的元数据，不需要对真实的数据进行过滤
+
+4. 聚合操作是否可以下推
+
+   目前Doris On ES不支持聚合操作如sum, avg, min/max 
等下推，计算方式是批量流式的从ES获取所有满足条件的文档，然后在Doris中进行计算
+   
+5. 日期类型字段的过滤条件无法下推
+   
+   日期类型的字段因为时间格式的问题，大多数情况下都不会下推；对于日期类型的过滤可以采用字符串形式，日期格式需要和ES中保持完全一致
\ No newline at end of file


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org

[incubator-doris] branch master updated: [Doris On ES][Docs] refator documentation for doe (#3867)

Reply via email to