This is an automated email from the ASF dual-hosted git repository.

jianliangqi pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git
The following commit(s) were added to refs/heads/master by this push:
     new 4dbec854e4 [Docs](inverted index) add tokenize function doc (#23518)
4dbec854e4 is described below

commit 4dbec854e43871e9ac0e0375cf052e2ef291912f
Author: airborne12 <airborn...@gmail.com>
AuthorDate: Mon Aug 28 10:19:03 2023 +0800

    [Docs](inverted index) add tokenize function doc (#23518)
---
 docs/en/docs/data-table/index/inverted-index.md    | 45 ++++++++++++++++++++++
 docs/zh-CN/docs/data-table/index/inverted-index.md | 45 ++++++++++++++++++++++
 2 files changed, 90 insertions(+)

diff --git a/docs/en/docs/data-table/index/inverted-index.md b/docs/en/docs/data-table/index/inverted-index.md
index e3ee40c64b..c711b1fca4 100644
--- a/docs/en/docs/data-table/index/inverted-index.md
+++ b/docs/en/docs/data-table/index/inverted-index.md
@@ -168,6 +168,51 @@ SELECT * FROM table_name WHERE ts > '2023-01-01 00:00:00';
 SELECT * FROM table_name WHERE op_type IN ('add', 'delete');
 ```
 
+- Tokenization Function
+
+To evaluate the actual effect of tokenization, or to tokenize a block of text, you can use the `tokenize` function.
+```sql
+mysql> SELECT TOKENIZE('武汉长江大桥','"parser"="chinese","parser_mode"="fine_grained"');
++-----------------------------------------------------------------------------------+
+| tokenize('武汉长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') |
++-----------------------------------------------------------------------------------+
+| ["武汉", "武汉长江大桥", "长江", "长江大桥", "大桥"] |
++-----------------------------------------------------------------------------------+
+1 row in set (0.02 sec)
+
+mysql> SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="fine_grained"');
++--------------------------------------------------------------------------------------+
+| tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') |
++--------------------------------------------------------------------------------------+
+| ["武汉", "武汉市", "市长", "长江", "长江大桥", "大桥"] |
++--------------------------------------------------------------------------------------+
+1 row in set (0.02 sec)
+
+mysql> SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="coarse_grained"');
++----------------------------------------------------------------------------------------+
+| tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="coarse_grained"') |
++----------------------------------------------------------------------------------------+
+| ["武汉市", "长江大桥"] |
++----------------------------------------------------------------------------------------+
+1 row in set (0.02 sec)
+
+mysql> SELECT TOKENIZE('I love CHINA','"parser"="english"');
++------------------------------------------------+
+| tokenize('I love CHINA', '"parser"="english"') |
++------------------------------------------------+
+| ["i", "love", "china"] |
++------------------------------------------------+
+1 row in set (0.02 sec)
+
+mysql> SELECT TOKENIZE('I love CHINA 我爱我的祖国','"parser"="unicode"');
++-------------------------------------------------------------------+
+| tokenize('I love CHINA 我爱我的祖国', '"parser"="unicode"') |
++-------------------------------------------------------------------+
+| ["i", "love", "china", "我", "爱", "我", "的", "祖", "国"] |
++-------------------------------------------------------------------+
+1 row in set (0.02 sec)
+```
+
 ## Examples
 
 This example will demonstrate inverted index creation, fulltext query, and normal query using a hackernews dataset with 1 million rows. The performance comparison between queries with and without an inverted index will also be shown.
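Note that the second argument of `TOKENIZE` uses the same `"key"="value"` property string as the `PROPERTIES` of an inverted index definition, so the function can be used to preview how a planned index would split text. The sketch below is an editorial illustration rather than part of the committed diff; the table and column names are hypothetical:

```sql
-- Minimal illustrative sketch (hypothetical table, not from this commit):
-- an inverted index defined with the same parser settings that the
-- TOKENIZE examples above preview.
CREATE TABLE example_tbl (
    `id` BIGINT,
    `content` STRING,
    INDEX idx_content (`content`) USING INVERTED
        PROPERTIES("parser" = "chinese", "parser_mode" = "fine_grained")
) ENGINE = OLAP
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1
PROPERTIES("replication_num" = "1");

-- Check how that parser configuration tokenizes a value before loading data;
-- if the output looks right, the same property string can be reused in the
-- index definition unchanged.
SELECT TOKENIZE('武汉长江大桥', '"parser"="chinese","parser_mode"="fine_grained"');
```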
diff --git a/docs/zh-CN/docs/data-table/index/inverted-index.md b/docs/zh-CN/docs/data-table/index/inverted-index.md
index 25633f0913..01ee5d1ea9 100644
--- a/docs/zh-CN/docs/data-table/index/inverted-index.md
+++ b/docs/zh-CN/docs/data-table/index/inverted-index.md
@@ -167,6 +167,51 @@ SELECT * FROM table_name WHERE ts > '2023-01-01 00:00:00';
 SELECT * FROM table_name WHERE op_type IN ('add', 'delete');
 ```
 
+- Tokenization Function
+
+To check the actual effect of tokenization, or to tokenize a piece of text, you can use the tokenize function.
+```sql
+mysql> SELECT TOKENIZE('武汉长江大桥','"parser"="chinese","parser_mode"="fine_grained"');
++-----------------------------------------------------------------------------------+
+| tokenize('武汉长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') |
++-----------------------------------------------------------------------------------+
+| ["武汉", "武汉长江大桥", "长江", "长江大桥", "大桥"] |
++-----------------------------------------------------------------------------------+
+1 row in set (0.02 sec)
+
+mysql> SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="fine_grained"');
++--------------------------------------------------------------------------------------+
+| tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') |
++--------------------------------------------------------------------------------------+
+| ["武汉", "武汉市", "市长", "长江", "长江大桥", "大桥"] |
++--------------------------------------------------------------------------------------+
+1 row in set (0.02 sec)
+
+mysql> SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="coarse_grained"');
++----------------------------------------------------------------------------------------+
+| tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="coarse_grained"') |
++----------------------------------------------------------------------------------------+
+| ["武汉市", "长江大桥"] |
++----------------------------------------------------------------------------------------+
+1 row in set (0.02 sec)
+
+mysql> SELECT TOKENIZE('I love CHINA','"parser"="english"');
++------------------------------------------------+
+| tokenize('I love CHINA', '"parser"="english"') |
++------------------------------------------------+
+| ["i", "love", "china"] |
++------------------------------------------------+
+1 row in set (0.02 sec)
+
+mysql> SELECT TOKENIZE('I love CHINA 我爱我的祖国','"parser"="unicode"');
++-------------------------------------------------------------------+
+| tokenize('I love CHINA 我爱我的祖国', '"parser"="unicode"') |
++-------------------------------------------------------------------+
+| ["i", "love", "china", "我", "爱", "我", "的", "祖", "国"] |
++-------------------------------------------------------------------+
+1 row in set (0.02 sec)
+```
+
 ## Usage Examples
 
 This example uses 1 million rows of hackernews data to demonstrate inverted index creation, fulltext query, and normal query, including a simple performance comparison with queries that use no index.