This is an automated email from the ASF dual-hosted git repository.

jianliangqi pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git
The following commit(s) were added to refs/heads/master by this push:
     new 4dbec854e4 [Docs](inverted index) add tokenize function doc (#23518)
4dbec854e4 is described below

commit 4dbec854e43871e9ac0e0375cf052e2ef291912f
Author: airborne12 <airborn...@gmail.com>
AuthorDate: Mon Aug 28 10:19:03 2023 +0800

    [Docs](inverted index) add tokenize function doc (#23518)
---
 docs/en/docs/data-table/index/inverted-index.md    | 45 ++++++++++++++++++++++
 docs/zh-CN/docs/data-table/index/inverted-index.md | 45 ++++++++++++++++++++++
 2 files changed, 90 insertions(+)

diff --git a/docs/en/docs/data-table/index/inverted-index.md b/docs/en/docs/data-table/index/inverted-index.md
index e3ee40c64b..c711b1fca4 100644
--- a/docs/en/docs/data-table/index/inverted-index.md
+++ b/docs/en/docs/data-table/index/inverted-index.md
@@ -168,6 +168,51 @@ SELECT * FROM table_name WHERE ts > '2023-01-01 00:00:00';
 SELECT * FROM table_name WHERE op_type IN ('add', 'delete');
 ```
 
+- Tokenization Function
+
+To evaluate the actual effect of tokenization, or to tokenize a block of text, you can use the `tokenize` function.
+```sql
+mysql> SELECT TOKENIZE('武汉长江大桥','"parser"="chinese","parser_mode"="fine_grained"');
++-----------------------------------------------------------------------------------+
+| tokenize('武汉长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') |
++-----------------------------------------------------------------------------------+
+| ["武汉", "武汉长江大桥", "长江", "长江大桥", "大桥"] |
++-----------------------------------------------------------------------------------+
+1 row in set (0.02 sec)
+
+mysql> SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="fine_grained"');
++--------------------------------------------------------------------------------------+
+| tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') |
++--------------------------------------------------------------------------------------+
+| ["武汉", "武汉市", "市长", "长江", "长江大桥", "大桥"] |
++--------------------------------------------------------------------------------------+
+1 row in set (0.02 sec)
+
+mysql> SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="coarse_grained"');
++----------------------------------------------------------------------------------------+
+| tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="coarse_grained"') |
++----------------------------------------------------------------------------------------+
+| ["武汉市", "长江大桥"] |
++----------------------------------------------------------------------------------------+
+1 row in set (0.02 sec)
+
+mysql> SELECT TOKENIZE('I love CHINA','"parser"="english"');
++------------------------------------------------+
+| tokenize('I love CHINA', '"parser"="english"') |
++------------------------------------------------+
+| ["i", "love", "china"] |
++------------------------------------------------+
+1 row in set (0.02 sec)
+
+mysql> SELECT TOKENIZE('I love CHINA 我爱我的祖国','"parser"="unicode"');
++-------------------------------------------------------------------+
+| tokenize('I love CHINA 我爱我的祖国', '"parser"="unicode"') |
++-------------------------------------------------------------------+
+| ["i", "love", "china", "我", "爱", "我", "的", "祖", "国"] |
++-------------------------------------------------------------------+
+1 row in set (0.02 sec)
+```
+
 ## Examples
 
 This example will demonstrate inverted index creation, fulltext query, and normal query using a hackernews dataset with 1 million rows. The performance comparison between queries with and without an inverted index will also be shown.
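Note that the second argument of `TOKENIZE` uses the same `"key"="value"` property string as the `PROPERTIES` of an inverted index definition, so the function can be used to preview how a planned index would split text. The sketch below is an editorial illustration rather than part of the committed diff; the table and column names are hypothetical:

```sql
-- Minimal illustrative sketch (hypothetical table, not from this commit):
-- an inverted index defined with the same parser settings that the
-- TOKENIZE examples above preview.
CREATE TABLE example_tbl (
    `id` BIGINT,
    `content` STRING,
    INDEX idx_content (`content`) USING INVERTED
        PROPERTIES("parser" = "chinese", "parser_mode" = "fine_grained")
) ENGINE = OLAP
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1
PROPERTIES("replication_num" = "1");

-- Check how that parser configuration tokenizes a value before loading data;
-- if the output looks right, the same property string can be reused in the
-- index definition unchanged.
SELECT TOKENIZE('武汉长江大桥', '"parser"="chinese","parser_mode"="fine_grained"');
```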
diff --git a/docs/zh-CN/docs/data-table/index/inverted-index.md b/docs/zh-CN/docs/data-table/index/inverted-index.md
index 25633f0913..01ee5d1ea9 100644
--- a/docs/zh-CN/docs/data-table/index/inverted-index.md
+++ b/docs/zh-CN/docs/data-table/index/inverted-index.md
@@ -167,6 +167,51 @@ SELECT * FROM table_name WHERE ts > '2023-01-01 00:00:00';
 SELECT * FROM table_name WHERE op_type IN ('add', 'delete');
 ```
 
+- Tokenization Function
+
+To check the actual effect of tokenization, or to tokenize a piece of text, you can use the tokenize function.
+```sql
+mysql> SELECT TOKENIZE('武汉长江大桥','"parser"="chinese","parser_mode"="fine_grained"');
++-----------------------------------------------------------------------------------+
+| tokenize('武汉长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') |
++-----------------------------------------------------------------------------------+
+| ["武汉", "武汉长江大桥", "长江", "长江大桥", "大桥"] |
++-----------------------------------------------------------------------------------+
+1 row in set (0.02 sec)
+
+mysql> SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="fine_grained"');
++--------------------------------------------------------------------------------------+
+| tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') |
++--------------------------------------------------------------------------------------+
+| ["武汉", "武汉市", "市长", "长江", "长江大桥", "大桥"] |
++--------------------------------------------------------------------------------------+
+1 row in set (0.02 sec)
+
+mysql> SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="coarse_grained"');
++----------------------------------------------------------------------------------------+
+| tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="coarse_grained"') |
++----------------------------------------------------------------------------------------+
+| ["武汉市", "长江大桥"] |
++----------------------------------------------------------------------------------------+
+1 row in set (0.02 sec)
+
+mysql> SELECT TOKENIZE('I love CHINA','"parser"="english"');
++------------------------------------------------+
+| tokenize('I love CHINA', '"parser"="english"') |
++------------------------------------------------+
+| ["i", "love", "china"] |
++------------------------------------------------+
+1 row in set (0.02 sec)
+
+mysql> SELECT TOKENIZE('I love CHINA 我爱我的祖国','"parser"="unicode"');
++-------------------------------------------------------------------+
+| tokenize('I love CHINA 我爱我的祖国', '"parser"="unicode"') |
++-------------------------------------------------------------------+
+| ["i", "love", "china", "我", "爱", "我", "的", "祖", "国"] |
++-------------------------------------------------------------------+
+1 row in set (0.02 sec)
+```
+
 ## Usage Examples
 
 This example uses 1 million rows of hackernews data to demonstrate inverted index creation, fulltext query, and normal query, including a simple performance comparison with queries that use no index.