This is an automated email from the ASF dual-hosted git repository. kassiez pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push: new 43dc913a3f4 fix : fix format of inverted index (#2129) 43dc913a3f4 is described below commit 43dc913a3f492296f7e42819eb73ad9bcbd44a03 Author: lsy3993 <110876560+lsy3...@users.noreply.github.com> AuthorDate: Wed Feb 26 15:51:21 2025 +0800 fix : fix format of inverted index (#2129) ## Versions - [x] dev - [x] 3.0 - [x] 2.1 - [ ] 2.0 ## Languages - [x] Chinese - [x] English ## Docs Checklist - [ ] Checked by AI - [ ] Test Cases Built --- docs/table-design/index/inverted-index.md | 90 ++++++--------- .../current/table-design/index/inverted-index.md | 95 ++++++---------- .../table-design/index/inverted-index.md | 124 ++++++++++----------- .../table-design/index/inverted-index.md | 103 +++++++---------- .../table-design/index/inverted-index.md | 89 ++++++--------- .../table-design/index/inverted-index.md | 90 ++++++--------- 6 files changed, 241 insertions(+), 350 deletions(-) diff --git a/docs/table-design/index/inverted-index.md b/docs/table-design/index/inverted-index.md index b4298228c39..7b12fe7c90b 100644 --- a/docs/table-design/index/inverted-index.md +++ b/docs/table-design/index/inverted-index.md @@ -342,13 +342,13 @@ To check the actual effect of tokenization or to tokenize a piece of text, you c The first parameter of the `TOKENIZE` function is the text to be tokenized, and the second parameter specifies the tokenization parameters used when creating the index. -mysql> SELECT TOKENIZE('I love Doris','"parser"="english"'); +SELECT TOKENIZE('I love Doris','"parser"="english"'); +------------------------------------------------+ | tokenize('I love Doris', '"parser"="english"') | +------------------------------------------------+ | ["i", "love", "doris"] | +------------------------------------------------+ -1 row in set (0.02 sec) + ``` @@ -422,13 +422,12 @@ curl --location-trusted -u root: -H "compress_type:gz" -T hacknernews_1m.csv.gz **Confirm Data Import Success with SQL count()** ```sql -mysql> SELECT count() FROM hackernews_1m; +SELECT count() FROM hackernews_1m; +---------+ | count() | +---------+ | 1000000 | +---------+ -1 row in set (0.02 sec) ``` ### Queries @@ -438,13 +437,12 @@ mysql> SELECT count() FROM hackernews_1m; - Using `LIKE` to match and count rows containing 'OLAP' in the `comment` column took 0.18s. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%'; +---------+ | count() | +---------+ | 34 | +---------+ - 1 row in set (0.18 sec) ``` - Using full-text search with `MATCH_ANY` based on the inverted index to count rows containing 'OLAP' in the `comment` column took 0.02s, resulting in a 9x speedup. The performance improvement would be even more significant on larger datasets. @@ -452,33 +450,31 @@ mysql> SELECT count() FROM hackernews_1m; The difference in the number of results is due to the inverted index normalizing the terms by converting them to lowercase, among other processes, hence `MATCH_ANY` yields more results than `LIKE`. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP'; + SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP'; +---------+ | count() | +---------+ | 35 | +---------+ - 1 row in set (0.02 sec) ``` - Similarly, comparing the performance for counting occurrences of 'OLTP', 0.07s vs 0.01s. Due to caching, both `LIKE` and `MATCH_ANY` improved, but the inverted index still provided a 7x speedup. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLTP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 48 | +---------+ - 1 row in set (0.07 sec) - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLTP'; + + SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLTP'; +---------+ | count() | +---------+ | 51 | +---------+ - 1 row in set (0.01 sec) ``` - Counting rows where both 'OLAP' and 'OLTP' appear took 0.13s vs 0.01s, a 13x speedup. @@ -486,21 +482,20 @@ mysql> SELECT count() FROM hackernews_1m; To require multiple terms to appear simultaneously (AND relationship), use `MATCH_ALL 'keyword1 keyword2 ...'`. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' AND comment LIKE '%OLTP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' AND comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 14 | +---------+ - 1 row in set (0.13 sec) - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ALL 'OLAP OLTP'; + + SELECT count() FROM hackernews_1m WHERE comment MATCH_ALL 'OLAP OLTP'; +---------+ | count() | +---------+ | 15 | +---------+ - 1 row in set (0.01 sec) ``` - Counting rows where either 'OLAP' or 'OLTP' appears took 0.12s vs 0.01s, a 12x speedup. @@ -508,21 +503,19 @@ mysql> SELECT count() FROM hackernews_1m; To require any one or more of multiple terms to appear (OR relationship), use `MATCH_ANY 'keyword1 keyword2 ...'`. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' OR comment LIKE '%OLTP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' OR comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 68 | +---------+ - 1 row in set (0.12 sec) - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP OLTP'; + SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP OLTP'; +---------+ | count() | +---------+ | 71 | +---------+ - 1 row in set (0.01 sec) ``` ### 02 Standard Equality and Range Queries @@ -530,13 +523,12 @@ mysql> SELECT count() FROM hackernews_1m; - Range query on a `DateTime` type column ```sql - mysql> SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; + SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; +---------+ | count() | +---------+ | 999081 | +---------+ - 1 row in set (0.03 sec) ``` - Adding an inverted index for the `timestamp` column @@ -544,71 +536,65 @@ mysql> SELECT count() FROM hackernews_1m; ```sql -- For date-time types, USING INVERTED does not require specifying a parser -- CREATE INDEX is one syntax for creating an index, another method will be shown later - mysql> CREATE INDEX idx_timestamp ON hackernews_1m(timestamp) USING INVERTED; - Query OK, 0 rows affected (0.03 sec) + CREATE INDEX idx_timestamp ON hackernews_1m(timestamp) USING INVERTED; ``` ```sql - mysql> BUILD INDEX idx_timestamp ON hackernews_1m; - Query OK, 0 rows affected (0.01 sec) + BUILD INDEX idx_timestamp ON hackernews_1m; ``` - Checking the index creation progress. From the difference between `FinishTime` and `CreateTime`, we can see that building the inverted index for 1 million rows on the `timestamp` column took only 1 second. ```sql - mysql> SHOW ALTER TABLE COLUMN; + SHOW ALTER TABLE COLUMN; +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | 10030 | hackernews_1m | 2023-02-10 19:44:12.929 | 2023-02-10 19:44:13.938 | hackernews_1m | 10031 | 10008 | 1:1994690496 | 3 | FINISHED | | NULL | 2592000 | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ - 1 row in set (0.00 sec) ``` ```sql -- If the table has no partitions, PartitionName defaults to TableName - mysql> SHOW BUILD INDEX; + SHOW BUILD INDEX; +-------+---------------+---------------+----------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | JobId | TableName | PartitionName | AlterInvertedIndexes | CreateTime | FinishTime | TransactionId | State | Msg | Progress | +-------+---------------+---------------+----------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | 10191 | hackernews_1m | hackernews_1m | [ADD INDEX idx_timestamp (`timestamp`) USING INVERTED], | 2023-06-26 15:32:33.894 | 2023-06-26 15:32:34.847 | 3 | FINISHED | | NULL | +-------+---------------+---------------+----------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ - 1 row in set (0.04 sec) ``` - After the index is created, range queries use the same query syntax. Doris will automatically recognize the index for optimization. However, due to the small dataset, the performance difference is not significant. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; + SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; +---------+ | count() | +---------+ | 999081 | +---------+ - 1 row in set (0.01 sec) ``` - Performing similar operations on a numeric column `parent` with an equality match query. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE parent = 11189; + SELECT count() FROM hackernews_1m WHERE parent = 11189; +---------+ | count() | +---------+ | 2 | +---------+ - 1 row in set (0.01 sec) -- For numeric types, USING INVERTED does not require specifying a parser -- ALTER TABLE t ADD INDEX is the second syntax for creating an index - mysql> ALTER TABLE hackernews_1m ADD INDEX idx_parent(parent) USING INVERTED; - Query OK, 0 rows affected (0.01 sec) + ALTER TABLE hackernews_1m ADD INDEX idx_parent(parent) USING INVERTED; + -- Execute BUILD INDEX to create the inverted index for existing data - mysql> BUILD INDEX idx_parent ON hackernews_1m; - Query OK, 0 rows affected (0.01 sec) + BUILD INDEX idx_parent ON hackernews_1m; - mysql> SHOW ALTER TABLE COLUMN; + + SHOW ALTER TABLE COLUMN; +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ @@ -616,46 +602,44 @@ mysql> SELECT count() FROM hackernews_1m; | 10053 | hackernews_1m | 2023-02-10 19:49:32.893 | 2023-02-10 19:49:33.982 | hackernews_1m | 10054 | 10008 | 1:378856428 | 4 | FINISHED | | NULL | 2592000 | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ - mysql> SHOW BUILD INDEX; + SHOW BUILD INDEX; +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | JobId | TableName | PartitionName | AlterInvertedIndexes | CreateTime | FinishTime | TransactionId | State | Msg | Progress | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | 11005 | hackernews_1m | hackernews_1m | [ADD INDEX idx_parent (`parent`) USING INVERTED], | 2023-06-26 16:25:10.167 | 2023-06-26 16:25:10.838 | 1002 | FINISHED | | NULL | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ - 1 row in set (0.01 sec) - mysql> SELECT count() FROM hackernews_1m WHERE parent = 11189; + + SELECT count() FROM hackernews_1m WHERE parent = 11189; +---------+ | count() | +---------+ | 2 | +---------+ - 1 row in set (0.01 sec) ``` - Creating an inverted index for the string column `author` without tokenization. Equality queries can also leverage the index for speedup. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE author = 'faster'; + SELECT count() FROM hackernews_1m WHERE author = 'faster'; +---------+ | count() | +---------+ | 20 | +---------+ - 1 row in set (0.03 sec) -- Here, USING INVERTED is used without tokenizing the `author` column, treating it as a single term - mysql> ALTER TABLE hackernews_1m ADD INDEX idx_author(author) USING INVERTED; - Query OK, 0 rows affected (0.01 sec) + ALTER TABLE hackernews_1m ADD INDEX idx_author(author) USING INVERTED; + -- Execute BUILD INDEX to add the inverted index for existing data - mysql> BUILD INDEX idx_author ON hackernews_1m; - Query OK, 0 rows affected (0.01 sec) + BUILD INDEX idx_author ON hackernews_1m; + Creating an incremental index for 1 million author records took only 1.5 seconds. ```sql -mysql> SHOW ALTER TABLE COLUMN; +SHOW ALTER TABLE COLUMN; +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ @@ -666,24 +650,22 @@ mysql> SHOW ALTER TABLE COLUMN; ``` ```sql -mysql> SHOW BUILD INDEX ORDER BY CreateTime DESC LIMIT 1; +SHOW BUILD INDEX ORDER BY CreateTime DESC LIMIT 1; +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | JobId | TableName | PartitionName | AlterInvertedIndexes | CreateTime | FinishTime | TransactionId | State | Msg | Progress | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | 13006 | hackernews_1m | hackernews_1m | [ADD INDEX idx_author (`author`) USING INVERTED], | 2023-06-26 17:23:02.610 | 2023-06-26 17:23:03.755 | 3004 | FINISHED | | NULL | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ -1 row in set (0.01 sec) ``` -- After creating the index, string equality matches also showed significant acceleration. ```sql -mysql> SELECT count() FROM hackernews_1m WHERE author = 'faster'; +SELECT count() FROM hackernews_1m WHERE author = 'faster'; +---------+ | count() | +---------+ | 20 | +---------+ -1 row in set (0.01 sec) ``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/index/inverted-index.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/index/inverted-index.md index bc3789dafe7..4e736e8ac13 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/index/inverted-index.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/index/inverted-index.md @@ -344,45 +344,40 @@ SELECT * FROM table_name WHERE op_type IN ('add', 'delete'); TOKENIZE 函数的第一个参数是待分词的文本,第二个参数是创建索引指定的分词参数。 ```sql -mysql> SELECT TOKENIZE('武汉长江大桥','"parser"="chinese","parser_mode"="fine_grained"'); +SELECT TOKENIZE('武汉长江大桥','"parser"="chinese","parser_mode"="fine_grained"'); +-----------------------------------------------------------------------------------+ | tokenize('武汉长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') | +-----------------------------------------------------------------------------------+ | ["武汉", "武汉长江大桥", "长江", "长江大桥", "大桥"] | +-----------------------------------------------------------------------------------+ -1 row in set (0.02 sec) -mysql> SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="fine_grained"'); +SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="fine_grained"'); +--------------------------------------------------------------------------------------+ | tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') | +--------------------------------------------------------------------------------------+ | ["武汉", "武汉市", "市长", "长江", "长江大桥", "大桥"] | +--------------------------------------------------------------------------------------+ -1 row in set (0.02 sec) -mysql> SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="coarse_grained"'); +SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="coarse_grained"'); +----------------------------------------------------------------------------------------+ | tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="coarse_grained"') | +----------------------------------------------------------------------------------------+ | ["武汉市", "长江大桥"] | +----------------------------------------------------------------------------------------+ -1 row in set (0.02 sec) -mysql> SELECT TOKENIZE('I love Doris','"parser"="english"'); +SELECT TOKENIZE('I love Doris','"parser"="english"'); +------------------------------------------------+ | tokenize('I love Doris', '"parser"="english"') | +------------------------------------------------+ | ["i", "love", "doris"] | +------------------------------------------------+ -1 row in set (0.02 sec) -mysql> SELECT TOKENIZE('I love CHINA 我爱我的祖国','"parser"="unicode"'); +SELECT TOKENIZE('I love CHINA 我爱我的祖国','"parser"="unicode"'); +-------------------------------------------------------------------+ | tokenize('I love CHINA 我爱我的祖国', '"parser"="unicode"') | +-------------------------------------------------------------------+ | ["i", "love", "china", "我", "爱", "我", "的", "祖", "国"] | +-------------------------------------------------------------------+ -1 row in set (0.02 sec) ``` ## 使用示例 @@ -458,13 +453,12 @@ curl --location-trusted -u root: -H "compress_type:gz" -T hacknernews_1m.csv.gz **SQL 运行 count() 确认导入数据成功** ```sql -mysql> SELECT count() FROM hackernews_1m; +SELECT count() FROM hackernews_1m; +---------+ | count() | +---------+ | 1000000 | +---------+ -1 row in set (0.02 sec) ``` ### 查询 @@ -474,13 +468,12 @@ mysql> SELECT count() FROM hackernews_1m; - 用 `LIKE` 匹配计算 comment 中含有 'OLAP' 的行数,耗时 0.18s ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%'; +---------+ | count() | +---------+ | 34 | +---------+ - 1 row in set (0.18 sec) ``` - 用基于倒排索引的全文检索 `MATCH_ANY` 计算 comment 中含有'OLAP'的行数,耗时 0.02s,加速 9 倍,在更大的数据集上效果会更加明显 @@ -489,33 +482,30 @@ mysql> SELECT count() FROM hackernews_1m; 这里结果条数的差异,是因为倒排索引对 comment 分词后,还会对词进行进行统一成小写等归一化处理,因此 `MATCH_ANY` 比 `LIKE` 的结果多一些 ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP'; + SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP'; +---------+ | count() | +---------+ | 35 | +---------+ - 1 row in set (0.02 sec) ``` - 同样的对比统计 'OLTP' 出现次数的性能,0.07s vs 0.01s,由于缓存的原因 `LIKE` 和 `MATCH_ANY` 都有提升,倒排索引仍然有 7 倍加速 ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLTP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 48 | +---------+ - 1 row in set (0.07 sec) - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLTP'; + SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLTP'; +---------+ | count() | +---------+ | 51 | +---------+ - 1 row in set (0.01 sec) ``` - 同时出现 'OLAP' 和 'OLTP' 两个词,0.13s vs 0.01s,13 倍加速 @@ -523,21 +513,19 @@ mysql> SELECT count() FROM hackernews_1m; 要求多个词同时出现时(AND 关系)使用 `MATCH_ALL` 'keyword1 keyword2 ...' ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' AND comment LIKE '%OLTP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' AND comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 14 | +---------+ - 1 row in set (0.13 sec) - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ALL 'OLAP OLTP'; + SELECT count() FROM hackernews_1m WHERE comment MATCH_ALL 'OLAP OLTP'; +---------+ | count() | +---------+ | 15 | +---------+ - 1 row in set (0.01 sec) ``` - 任意出现 'OLAP' 和 'OLTP' 其中一个词,0.12s vs 0.01s,12 倍加速 @@ -545,21 +533,19 @@ mysql> SELECT count() FROM hackernews_1m; 只要求多个词任意一个或多个出现时(OR 关系)使用 `MATCH_ANY` 'keyword1 keyword2 ...' ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' OR comment LIKE '%OLTP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' OR comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 68 | +---------+ - 1 row in set (0.12 sec) - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP OLTP'; + SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP OLTP'; +---------+ | count() | +---------+ | 71 | +---------+ - 1 row in set (0.01 sec) ``` @@ -568,13 +554,12 @@ mysql> SELECT count() FROM hackernews_1m; - DataTime 类型的列范围查询 ```sql - mysql> SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; + SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; +---------+ | count() | +---------+ | 999081 | +---------+ - 1 row in set (0.03 sec) ``` - 为 timestamp 列增加一个倒排索引 @@ -582,71 +567,63 @@ mysql> SELECT count() FROM hackernews_1m; ```sql -- 对于日期时间类型 USING INVERTED,不用指定分词 -- CREATE INDEX 是第一种建索引的语法,另外一种在后面展示 - mysql> CREATE INDEX idx_timestamp ON hackernews_1m(timestamp) USING INVERTED; - Query OK, 0 rows affected (0.03 sec) + CREATE INDEX idx_timestamp ON hackernews_1m(timestamp) USING INVERTED; ``` ```sql - mysql> BUILD INDEX idx_timestamp ON hackernews_1m; - Query OK, 0 rows affected (0.01 sec) + BUILD INDEX idx_timestamp ON hackernews_1m; ``` - 查看索引创建进度,通过 FinishTime 和 CreateTime 的差值,可以看到 100 万条数据对 timestamp 列建倒排索引只用了 1s ```sql - mysql> SHOW ALTER TABLE COLUMN; + SHOW ALTER TABLE COLUMN; +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | 10030 | hackernews_1m | 2023-02-10 19:44:12.929 | 2023-02-10 19:44:13.938 | hackernews_1m | 10031 | 10008 | 1:1994690496 | 3 | FINISHED | | NULL | 2592000 | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ - 1 row in set (0.00 sec) ``` ```sql -- 若 table 没有分区,PartitionName 默认就是 TableName - mysql> SHOW BUILD INDEX; + SHOW BUILD INDEX; +-------+---------------+---------------+----------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | JobId | TableName | PartitionName | AlterInvertedIndexes | CreateTime | FinishTime | TransactionId | State | Msg | Progress | +-------+---------------+---------------+----------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | 10191 | hackernews_1m | hackernews_1m | [ADD INDEX idx_timestamp (`timestamp`) USING INVERTED], | 2023-06-26 15:32:33.894 | 2023-06-26 15:32:34.847 | 3 | FINISHED | | NULL | +-------+---------------+---------------+----------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ - 1 row in set (0.04 sec) ``` - 索引创建后,范围查询用同样的查询方式,Doris 会自动识别索引进行优化,但是这里由于数据量小性能差别不大 ```sql - mysql> SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; + SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; +---------+ | count() | +---------+ | 999081 | +---------+ - 1 row in set (0.01 sec) ``` - 在数值类型的列 Parent 进行类似 timestamp 的操作,这里查询使用等值匹配 ```sql - mysql> SELECT count() FROM hackernews_1m WHERE parent = 11189; + SELECT count() FROM hackernews_1m WHERE parent = 11189; +---------+ | count() | +---------+ | 2 | +---------+ - 1 row in set (0.01 sec) -- 对于数值类型 USING INVERTED,不用指定分词 -- ALTER TABLE t ADD INDEX 是第二种建索引的语法 - mysql> ALTER TABLE hackernews_1m ADD INDEX idx_parent(parent) USING INVERTED; - Query OK, 0 rows affected (0.01 sec) + ALTER TABLE hackernews_1m ADD INDEX idx_parent(parent) USING INVERTED; -- 执行 BUILD INDEX 给存量数据构建倒排索引 - mysql> BUILD INDEX idx_parent ON hackernews_1m; - Query OK, 0 rows affected (0.01 sec) + BUILD INDEX idx_parent ON hackernews_1m; - mysql> SHOW ALTER TABLE COLUMN; + SHOW ALTER TABLE COLUMN; +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ @@ -654,44 +631,39 @@ mysql> SELECT count() FROM hackernews_1m; | 10053 | hackernews_1m | 2023-02-10 19:49:32.893 | 2023-02-10 19:49:33.982 | hackernews_1m | 10054 | 10008 | 1:378856428 | 4 | FINISHED | | NULL | 2592000 | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ - mysql> SHOW BUILD INDEX; + SHOW BUILD INDEX; +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | JobId | TableName | PartitionName | AlterInvertedIndexes | CreateTime | FinishTime | TransactionId | State | Msg | Progress | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | 11005 | hackernews_1m | hackernews_1m | [ADD INDEX idx_parent (`parent`) USING INVERTED], | 2023-06-26 16:25:10.167 | 2023-06-26 16:25:10.838 | 1002 | FINISHED | | NULL | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ - 1 row in set (0.01 sec) - mysql> SELECT count() FROM hackernews_1m WHERE parent = 11189; + SELECT count() FROM hackernews_1m WHERE parent = 11189; +---------+ | count() | +---------+ | 2 | +---------+ - 1 row in set (0.01 sec) ``` - 对字符串类型的 `author` 建立不分词的倒排索引,等值查询也可以利用索引加速 ```sql - mysql> SELECT count() FROM hackernews_1m WHERE author = 'faster'; + SELECT count() FROM hackernews_1m WHERE author = 'faster'; +---------+ | count() | +---------+ | 20 | +---------+ - 1 row in set (0.03 sec) -- 这里只用了 USING INVERTED,不对 author 分词,整个当做一个词处理 - mysql> ALTER TABLE hackernews_1m ADD INDEX idx_author(author) USING INVERTED; - Query OK, 0 rows affected (0.01 sec) + ALTER TABLE hackernews_1m ADD INDEX idx_author(author) USING INVERTED; -- 执行 BUILD INDEX 给存量数据加上倒排索引: - mysql> BUILD INDEX idx_author ON hackernews_1m; - Query OK, 0 rows affected (0.01 sec) + BUILD INDEX idx_author ON hackernews_1m; -- 100 万条 author 数据增量建索引仅消耗 1.5s - mysql> SHOW ALTER TABLE COLUMN; + SHOW ALTER TABLE COLUMN; +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ @@ -700,21 +672,18 @@ mysql> SELECT count() FROM hackernews_1m; | 10076 | hackernews_1m | 2023-02-10 19:54:20.046 | 2023-02-10 19:54:21.521 | hackernews_1m | 10077 | 10008 | 1:1335127701 | 5 | FINISHED | | NULL | 2592000 | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ - mysql> SHOW BUILD INDEX order by CreateTime desc limit 1; + SHOW BUILD INDEX order by CreateTime desc limit 1; +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | JobId | TableName | PartitionName | AlterInvertedIndexes | CreateTime | FinishTime | TransactionId | State | Msg | Progress | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | 13006 | hackernews_1m | hackernews_1m | [ADD INDEX idx_author (`author`) USING INVERTED], | 2023-06-26 17:23:02.610 | 2023-06-26 17:23:03.755 | 3004 | FINISHED | | NULL | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ - 1 row in set (0.01 sec) -- 创建索引后,字符串等值匹配也有明显加速 - mysql> SELECT count() FROM hackernews_1m WHERE author = 'faster'; + SELECT count() FROM hackernews_1m WHERE author = 'faster'; +---------+ | count() | +---------+ | 20 | +---------+ - 1 row in set (0.01 sec) - ``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/table-design/index/inverted-index.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/table-design/index/inverted-index.md index bc3789dafe7..638b43a9c5d 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/table-design/index/inverted-index.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/table-design/index/inverted-index.md @@ -344,45 +344,45 @@ SELECT * FROM table_name WHERE op_type IN ('add', 'delete'); TOKENIZE 函数的第一个参数是待分词的文本,第二个参数是创建索引指定的分词参数。 ```sql -mysql> SELECT TOKENIZE('武汉长江大桥','"parser"="chinese","parser_mode"="fine_grained"'); +SELECT TOKENIZE('武汉长江大桥','"parser"="chinese","parser_mode"="fine_grained"'); +-----------------------------------------------------------------------------------+ | tokenize('武汉长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') | +-----------------------------------------------------------------------------------+ | ["武汉", "武汉长江大桥", "长江", "长江大桥", "大桥"] | +-----------------------------------------------------------------------------------+ -1 row in set (0.02 sec) -mysql> SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="fine_grained"'); + +SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="fine_grained"'); +--------------------------------------------------------------------------------------+ | tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') | +--------------------------------------------------------------------------------------+ | ["武汉", "武汉市", "市长", "长江", "长江大桥", "大桥"] | +--------------------------------------------------------------------------------------+ -1 row in set (0.02 sec) -mysql> SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="coarse_grained"'); + +SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="coarse_grained"'); +----------------------------------------------------------------------------------------+ | tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="coarse_grained"') | +----------------------------------------------------------------------------------------+ | ["武汉市", "长江大桥"] | +----------------------------------------------------------------------------------------+ -1 row in set (0.02 sec) -mysql> SELECT TOKENIZE('I love Doris','"parser"="english"'); + +SELECT TOKENIZE('I love Doris','"parser"="english"'); +------------------------------------------------+ | tokenize('I love Doris', '"parser"="english"') | +------------------------------------------------+ | ["i", "love", "doris"] | +------------------------------------------------+ -1 row in set (0.02 sec) -mysql> SELECT TOKENIZE('I love CHINA 我爱我的祖国','"parser"="unicode"'); + +SELECT TOKENIZE('I love CHINA 我爱我的祖国','"parser"="unicode"'); +-------------------------------------------------------------------+ | tokenize('I love CHINA 我爱我的祖国', '"parser"="unicode"') | +-------------------------------------------------------------------+ | ["i", "love", "china", "我", "爱", "我", "的", "祖", "国"] | +-------------------------------------------------------------------+ -1 row in set (0.02 sec) + ``` ## 使用示例 @@ -458,13 +458,13 @@ curl --location-trusted -u root: -H "compress_type:gz" -T hacknernews_1m.csv.gz **SQL 运行 count() 确认导入数据成功** ```sql -mysql> SELECT count() FROM hackernews_1m; +SELECT count() FROM hackernews_1m; +---------+ | count() | +---------+ | 1000000 | +---------+ -1 row in set (0.02 sec) + ``` ### 查询 @@ -474,13 +474,13 @@ mysql> SELECT count() FROM hackernews_1m; - 用 `LIKE` 匹配计算 comment 中含有 'OLAP' 的行数,耗时 0.18s ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%'; +---------+ | count() | +---------+ | 34 | +---------+ - 1 row in set (0.18 sec) + ``` - 用基于倒排索引的全文检索 `MATCH_ANY` 计算 comment 中含有'OLAP'的行数,耗时 0.02s,加速 9 倍,在更大的数据集上效果会更加明显 @@ -489,33 +489,33 @@ mysql> SELECT count() FROM hackernews_1m; 这里结果条数的差异,是因为倒排索引对 comment 分词后,还会对词进行进行统一成小写等归一化处理,因此 `MATCH_ANY` 比 `LIKE` 的结果多一些 ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP'; + SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP'; +---------+ | count() | +---------+ | 35 | +---------+ - 1 row in set (0.02 sec) + ``` - 同样的对比统计 'OLTP' 出现次数的性能,0.07s vs 0.01s,由于缓存的原因 `LIKE` 和 `MATCH_ANY` 都有提升,倒排索引仍然有 7 倍加速 ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLTP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 48 | +---------+ - 1 row in set (0.07 sec) - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLTP'; + + SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLTP'; +---------+ | count() | +---------+ | 51 | +---------+ - 1 row in set (0.01 sec) + ``` - 同时出现 'OLAP' 和 'OLTP' 两个词,0.13s vs 0.01s,13 倍加速 @@ -523,21 +523,21 @@ mysql> SELECT count() FROM hackernews_1m; 要求多个词同时出现时(AND 关系)使用 `MATCH_ALL` 'keyword1 keyword2 ...' ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' AND comment LIKE '%OLTP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' AND comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 14 | +---------+ - 1 row in set (0.13 sec) - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ALL 'OLAP OLTP'; + + SELECT count() FROM hackernews_1m WHERE comment MATCH_ALL 'OLAP OLTP'; +---------+ | count() | +---------+ | 15 | +---------+ - 1 row in set (0.01 sec) + ``` - 任意出现 'OLAP' 和 'OLTP' 其中一个词,0.12s vs 0.01s,12 倍加速 @@ -545,21 +545,21 @@ mysql> SELECT count() FROM hackernews_1m; 只要求多个词任意一个或多个出现时(OR 关系)使用 `MATCH_ANY` 'keyword1 keyword2 ...' ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' OR comment LIKE '%OLTP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' OR comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 68 | +---------+ - 1 row in set (0.12 sec) + - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP OLTP'; + SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP OLTP'; +---------+ | count() | +---------+ | 71 | +---------+ - 1 row in set (0.01 sec) + ``` @@ -568,13 +568,13 @@ mysql> SELECT count() FROM hackernews_1m; - DataTime 类型的列范围查询 ```sql - mysql> SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; + SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; +---------+ | count() | +---------+ | 999081 | +---------+ - 1 row in set (0.03 sec) + ``` - 为 timestamp 列增加一个倒排索引 @@ -582,71 +582,71 @@ mysql> SELECT count() FROM hackernews_1m; ```sql -- 对于日期时间类型 USING INVERTED,不用指定分词 -- CREATE INDEX 是第一种建索引的语法,另外一种在后面展示 - mysql> CREATE INDEX idx_timestamp ON hackernews_1m(timestamp) USING INVERTED; - Query OK, 0 rows affected (0.03 sec) + CREATE INDEX idx_timestamp ON hackernews_1m(timestamp) USING INVERTED; + ``` ```sql - mysql> BUILD INDEX idx_timestamp ON hackernews_1m; - Query OK, 0 rows affected (0.01 sec) + BUILD INDEX idx_timestamp ON hackernews_1m; + ``` - 查看索引创建进度,通过 FinishTime 和 CreateTime 的差值,可以看到 100 万条数据对 timestamp 列建倒排索引只用了 1s ```sql - mysql> SHOW ALTER TABLE COLUMN; + SHOW ALTER TABLE COLUMN; +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | 10030 | hackernews_1m | 2023-02-10 19:44:12.929 | 2023-02-10 19:44:13.938 | hackernews_1m | 10031 | 10008 | 1:1994690496 | 3 | FINISHED | | NULL | 2592000 | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ - 1 row in set (0.00 sec) + ``` ```sql -- 若 table 没有分区,PartitionName 默认就是 TableName - mysql> SHOW BUILD INDEX; + SHOW BUILD INDEX; +-------+---------------+---------------+----------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | JobId | TableName | PartitionName | AlterInvertedIndexes | CreateTime | FinishTime | TransactionId | State | Msg | Progress | +-------+---------------+---------------+----------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | 10191 | hackernews_1m | hackernews_1m | [ADD INDEX idx_timestamp (`timestamp`) USING INVERTED], | 2023-06-26 15:32:33.894 | 2023-06-26 15:32:34.847 | 3 | FINISHED | | NULL | +-------+---------------+---------------+----------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ - 1 row in set (0.04 sec) + ``` - 索引创建后,范围查询用同样的查询方式,Doris 会自动识别索引进行优化,但是这里由于数据量小性能差别不大 ```sql - mysql> SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; + SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; +---------+ | count() | +---------+ | 999081 | +---------+ - 1 row in set (0.01 sec) + ``` - 在数值类型的列 Parent 进行类似 timestamp 的操作,这里查询使用等值匹配 ```sql - mysql> SELECT count() FROM hackernews_1m WHERE parent = 11189; + SELECT count() FROM hackernews_1m WHERE parent = 11189; +---------+ | count() | +---------+ | 2 | +---------+ - 1 row in set (0.01 sec) + -- 对于数值类型 USING INVERTED,不用指定分词 -- ALTER TABLE t ADD INDEX 是第二种建索引的语法 - mysql> ALTER TABLE hackernews_1m ADD INDEX idx_parent(parent) USING INVERTED; - Query OK, 0 rows affected (0.01 sec) + ALTER TABLE hackernews_1m ADD INDEX idx_parent(parent) USING INVERTED; + -- 执行 BUILD INDEX 给存量数据构建倒排索引 - mysql> BUILD INDEX idx_parent ON hackernews_1m; - Query OK, 0 rows affected (0.01 sec) + BUILD INDEX idx_parent ON hackernews_1m; + - mysql> SHOW ALTER TABLE COLUMN; + SHOW ALTER TABLE COLUMN; +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ @@ -654,44 +654,44 @@ mysql> SELECT count() FROM hackernews_1m; | 10053 | hackernews_1m | 2023-02-10 19:49:32.893 | 2023-02-10 19:49:33.982 | hackernews_1m | 10054 | 10008 | 1:378856428 | 4 | FINISHED | | NULL | 2592000 | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ - mysql> SHOW BUILD INDEX; + SHOW BUILD INDEX; +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | JobId | TableName | PartitionName | AlterInvertedIndexes | CreateTime | FinishTime | TransactionId | State | Msg | Progress | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | 11005 | hackernews_1m | hackernews_1m | [ADD INDEX idx_parent (`parent`) USING INVERTED], | 2023-06-26 16:25:10.167 | 2023-06-26 16:25:10.838 | 1002 | FINISHED | | NULL | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ - 1 row in set (0.01 sec) - mysql> SELECT count() FROM hackernews_1m WHERE parent = 11189; + + SELECT count() FROM hackernews_1m WHERE parent = 11189; +---------+ | count() | +---------+ | 2 | +---------+ - 1 row in set (0.01 sec) + ``` - 对字符串类型的 `author` 建立不分词的倒排索引,等值查询也可以利用索引加速 ```sql - mysql> SELECT count() FROM hackernews_1m WHERE author = 'faster'; + SELECT count() FROM hackernews_1m WHERE author = 'faster'; +---------+ | count() | +---------+ | 20 | +---------+ - 1 row in set (0.03 sec) + -- 这里只用了 USING INVERTED,不对 author 分词,整个当做一个词处理 - mysql> ALTER TABLE hackernews_1m ADD INDEX idx_author(author) USING INVERTED; - Query OK, 0 rows affected (0.01 sec) + ALTER TABLE hackernews_1m ADD INDEX idx_author(author) USING INVERTED; + -- 执行 BUILD INDEX 给存量数据加上倒排索引: - mysql> BUILD INDEX idx_author ON hackernews_1m; - Query OK, 0 rows affected (0.01 sec) + BUILD INDEX idx_author ON hackernews_1m; + -- 100 万条 author 数据增量建索引仅消耗 1.5s - mysql> SHOW ALTER TABLE COLUMN; + SHOW ALTER TABLE COLUMN; +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ @@ -700,21 +700,21 @@ mysql> SELECT count() FROM hackernews_1m; | 10076 | hackernews_1m | 2023-02-10 19:54:20.046 | 2023-02-10 19:54:21.521 | hackernews_1m | 10077 | 10008 | 1:1335127701 | 5 | FINISHED | | NULL | 2592000 | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ - mysql> SHOW BUILD INDEX order by CreateTime desc limit 1; + SHOW BUILD INDEX order by CreateTime desc limit 1; +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | JobId | TableName | PartitionName | AlterInvertedIndexes | CreateTime | FinishTime | TransactionId | State | Msg | Progress | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | 13006 | hackernews_1m | hackernews_1m | [ADD INDEX idx_author (`author`) USING INVERTED], | 2023-06-26 17:23:02.610 | 2023-06-26 17:23:03.755 | 3004 | FINISHED | | NULL | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ - 1 row in set (0.01 sec) + -- 创建索引后,字符串等值匹配也有明显加速 - mysql> SELECT count() FROM hackernews_1m WHERE author = 'faster'; + SELECT count() FROM hackernews_1m WHERE author = 'faster'; +---------+ | count() | +---------+ | 20 | +---------+ - 1 row in set (0.01 sec) + ``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/table-design/index/inverted-index.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/table-design/index/inverted-index.md index bc3789dafe7..882634aa816 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/table-design/index/inverted-index.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/table-design/index/inverted-index.md @@ -344,45 +344,40 @@ SELECT * FROM table_name WHERE op_type IN ('add', 'delete'); TOKENIZE 函数的第一个参数是待分词的文本,第二个参数是创建索引指定的分词参数。 ```sql -mysql> SELECT TOKENIZE('武汉长江大桥','"parser"="chinese","parser_mode"="fine_grained"'); +SELECT TOKENIZE('武汉长江大桥','"parser"="chinese","parser_mode"="fine_grained"'); +-----------------------------------------------------------------------------------+ | tokenize('武汉长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') | +-----------------------------------------------------------------------------------+ | ["武汉", "武汉长江大桥", "长江", "长江大桥", "大桥"] | +-----------------------------------------------------------------------------------+ -1 row in set (0.02 sec) -mysql> SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="fine_grained"'); +SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="fine_grained"'); +--------------------------------------------------------------------------------------+ | tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') | +--------------------------------------------------------------------------------------+ | ["武汉", "武汉市", "市长", "长江", "长江大桥", "大桥"] | +--------------------------------------------------------------------------------------+ -1 row in set (0.02 sec) -mysql> SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="coarse_grained"'); +SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="coarse_grained"'); +----------------------------------------------------------------------------------------+ | tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="coarse_grained"') | +----------------------------------------------------------------------------------------+ | ["武汉市", "长江大桥"] | +----------------------------------------------------------------------------------------+ -1 row in set (0.02 sec) -mysql> SELECT TOKENIZE('I love Doris','"parser"="english"'); +SELECT TOKENIZE('I love Doris','"parser"="english"'); +------------------------------------------------+ | tokenize('I love Doris', '"parser"="english"') | +------------------------------------------------+ | ["i", "love", "doris"] | +------------------------------------------------+ -1 row in set (0.02 sec) -mysql> SELECT TOKENIZE('I love CHINA 我爱我的祖国','"parser"="unicode"'); +SELECT TOKENIZE('I love CHINA 我爱我的祖国','"parser"="unicode"'); +-------------------------------------------------------------------+ | tokenize('I love CHINA 我爱我的祖国', '"parser"="unicode"') | +-------------------------------------------------------------------+ | ["i", "love", "china", "我", "爱", "我", "的", "祖", "国"] | +-------------------------------------------------------------------+ -1 row in set (0.02 sec) ``` ## 使用示例 @@ -458,13 +453,12 @@ curl --location-trusted -u root: -H "compress_type:gz" -T hacknernews_1m.csv.gz **SQL 运行 count() 确认导入数据成功** ```sql -mysql> SELECT count() FROM hackernews_1m; +SELECT count() FROM hackernews_1m; +---------+ | count() | +---------+ | 1000000 | +---------+ -1 row in set (0.02 sec) ``` ### 查询 @@ -474,13 +468,12 @@ mysql> SELECT count() FROM hackernews_1m; - 用 `LIKE` 匹配计算 comment 中含有 'OLAP' 的行数,耗时 0.18s ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%'; +---------+ | count() | +---------+ | 34 | +---------+ - 1 row in set (0.18 sec) ``` - 用基于倒排索引的全文检索 `MATCH_ANY` 计算 comment 中含有'OLAP'的行数,耗时 0.02s,加速 9 倍,在更大的数据集上效果会更加明显 @@ -489,33 +482,31 @@ mysql> SELECT count() FROM hackernews_1m; 这里结果条数的差异,是因为倒排索引对 comment 分词后,还会对词进行进行统一成小写等归一化处理,因此 `MATCH_ANY` 比 `LIKE` 的结果多一些 ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP'; + SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP'; +---------+ | count() | +---------+ | 35 | +---------+ - 1 row in set (0.02 sec) ``` - 同样的对比统计 'OLTP' 出现次数的性能,0.07s vs 0.01s,由于缓存的原因 `LIKE` 和 `MATCH_ANY` 都有提升,倒排索引仍然有 7 倍加速 ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLTP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 48 | +---------+ - 1 row in set (0.07 sec) - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLTP'; + + SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLTP'; +---------+ | count() | +---------+ | 51 | +---------+ - 1 row in set (0.01 sec) ``` - 同时出现 'OLAP' 和 'OLTP' 两个词,0.13s vs 0.01s,13 倍加速 @@ -523,21 +514,20 @@ mysql> SELECT count() FROM hackernews_1m; 要求多个词同时出现时(AND 关系)使用 `MATCH_ALL` 'keyword1 keyword2 ...' ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' AND comment LIKE '%OLTP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' AND comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 14 | +---------+ - 1 row in set (0.13 sec) - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ALL 'OLAP OLTP'; + + SELECT count() FROM hackernews_1m WHERE comment MATCH_ALL 'OLAP OLTP'; +---------+ | count() | +---------+ | 15 | +---------+ - 1 row in set (0.01 sec) ``` - 任意出现 'OLAP' 和 'OLTP' 其中一个词,0.12s vs 0.01s,12 倍加速 @@ -545,21 +535,19 @@ mysql> SELECT count() FROM hackernews_1m; 只要求多个词任意一个或多个出现时(OR 关系)使用 `MATCH_ANY` 'keyword1 keyword2 ...' ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' OR comment LIKE '%OLTP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' OR comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 68 | +---------+ - 1 row in set (0.12 sec) - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP OLTP'; + SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP OLTP'; +---------+ | count() | +---------+ | 71 | +---------+ - 1 row in set (0.01 sec) ``` @@ -568,13 +556,12 @@ mysql> SELECT count() FROM hackernews_1m; - DataTime 类型的列范围查询 ```sql - mysql> SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; + SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; +---------+ | count() | +---------+ | 999081 | +---------+ - 1 row in set (0.03 sec) ``` - 为 timestamp 列增加一个倒排索引 @@ -582,71 +569,66 @@ mysql> SELECT count() FROM hackernews_1m; ```sql -- 对于日期时间类型 USING INVERTED,不用指定分词 -- CREATE INDEX 是第一种建索引的语法,另外一种在后面展示 - mysql> CREATE INDEX idx_timestamp ON hackernews_1m(timestamp) USING INVERTED; - Query OK, 0 rows affected (0.03 sec) + CREATE INDEX idx_timestamp ON hackernews_1m(timestamp) USING INVERTED; ``` ```sql - mysql> BUILD INDEX idx_timestamp ON hackernews_1m; - Query OK, 0 rows affected (0.01 sec) + BUILD INDEX idx_timestamp ON hackernews_1m; ``` - 查看索引创建进度,通过 FinishTime 和 CreateTime 的差值,可以看到 100 万条数据对 timestamp 列建倒排索引只用了 1s ```sql - mysql> SHOW ALTER TABLE COLUMN; + SHOW ALTER TABLE COLUMN; +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | 10030 | hackernews_1m | 2023-02-10 19:44:12.929 | 2023-02-10 19:44:13.938 | hackernews_1m | 10031 | 10008 | 1:1994690496 | 3 | FINISHED | | NULL | 2592000 | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ - 1 row in set (0.00 sec) ``` ```sql -- 若 table 没有分区,PartitionName 默认就是 TableName - mysql> SHOW BUILD INDEX; + SHOW BUILD INDEX; +-------+---------------+---------------+----------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | JobId | TableName | PartitionName | AlterInvertedIndexes | CreateTime | FinishTime | TransactionId | State | Msg | Progress | +-------+---------------+---------------+----------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | 10191 | hackernews_1m | hackernews_1m | [ADD INDEX idx_timestamp (`timestamp`) USING INVERTED], | 2023-06-26 15:32:33.894 | 2023-06-26 15:32:34.847 | 3 | FINISHED | | NULL | +-------+---------------+---------------+----------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ - 1 row in set (0.04 sec) ``` - 索引创建后,范围查询用同样的查询方式,Doris 会自动识别索引进行优化,但是这里由于数据量小性能差别不大 ```sql - mysql> SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; + SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; +---------+ | count() | +---------+ | 999081 | +---------+ - 1 row in set (0.01 sec) ``` - 在数值类型的列 Parent 进行类似 timestamp 的操作,这里查询使用等值匹配 ```sql - mysql> SELECT count() FROM hackernews_1m WHERE parent = 11189; + SELECT count() FROM hackernews_1m WHERE parent = 11189; +---------+ | count() | +---------+ | 2 | +---------+ - 1 row in set (0.01 sec) + -- 对于数值类型 USING INVERTED,不用指定分词 -- ALTER TABLE t ADD INDEX 是第二种建索引的语法 - mysql> ALTER TABLE hackernews_1m ADD INDEX idx_parent(parent) USING INVERTED; - Query OK, 0 rows affected (0.01 sec) + ALTER TABLE hackernews_1m ADD INDEX idx_parent(parent) USING INVERTED; + -- 执行 BUILD INDEX 给存量数据构建倒排索引 - mysql> BUILD INDEX idx_parent ON hackernews_1m; - Query OK, 0 rows affected (0.01 sec) + BUILD INDEX idx_parent ON hackernews_1m; + - mysql> SHOW ALTER TABLE COLUMN; + SHOW ALTER TABLE COLUMN; +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ @@ -654,44 +636,42 @@ mysql> SELECT count() FROM hackernews_1m; | 10053 | hackernews_1m | 2023-02-10 19:49:32.893 | 2023-02-10 19:49:33.982 | hackernews_1m | 10054 | 10008 | 1:378856428 | 4 | FINISHED | | NULL | 2592000 | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ - mysql> SHOW BUILD INDEX; + SHOW BUILD INDEX; +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | JobId | TableName | PartitionName | AlterInvertedIndexes | CreateTime | FinishTime | TransactionId | State | Msg | Progress | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | 11005 | hackernews_1m | hackernews_1m | [ADD INDEX idx_parent (`parent`) USING INVERTED], | 2023-06-26 16:25:10.167 | 2023-06-26 16:25:10.838 | 1002 | FINISHED | | NULL | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ - 1 row in set (0.01 sec) - mysql> SELECT count() FROM hackernews_1m WHERE parent = 11189; + + SELECT count() FROM hackernews_1m WHERE parent = 11189; +---------+ | count() | +---------+ | 2 | +---------+ - 1 row in set (0.01 sec) ``` - 对字符串类型的 `author` 建立不分词的倒排索引,等值查询也可以利用索引加速 ```sql - mysql> SELECT count() FROM hackernews_1m WHERE author = 'faster'; + SELECT count() FROM hackernews_1m WHERE author = 'faster'; +---------+ | count() | +---------+ | 20 | +---------+ - 1 row in set (0.03 sec) -- 这里只用了 USING INVERTED,不对 author 分词,整个当做一个词处理 - mysql> ALTER TABLE hackernews_1m ADD INDEX idx_author(author) USING INVERTED; - Query OK, 0 rows affected (0.01 sec) + ALTER TABLE hackernews_1m ADD INDEX idx_author(author) USING INVERTED; + -- 执行 BUILD INDEX 给存量数据加上倒排索引: - mysql> BUILD INDEX idx_author ON hackernews_1m; - Query OK, 0 rows affected (0.01 sec) + BUILD INDEX idx_author ON hackernews_1m; + -- 100 万条 author 数据增量建索引仅消耗 1.5s - mysql> SHOW ALTER TABLE COLUMN; + SHOW ALTER TABLE COLUMN; +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ @@ -700,21 +680,18 @@ mysql> SELECT count() FROM hackernews_1m; | 10076 | hackernews_1m | 2023-02-10 19:54:20.046 | 2023-02-10 19:54:21.521 | hackernews_1m | 10077 | 10008 | 1:1335127701 | 5 | FINISHED | | NULL | 2592000 | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ - mysql> SHOW BUILD INDEX order by CreateTime desc limit 1; + SHOW BUILD INDEX order by CreateTime desc limit 1; +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | JobId | TableName | PartitionName | AlterInvertedIndexes | CreateTime | FinishTime | TransactionId | State | Msg | Progress | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | 13006 | hackernews_1m | hackernews_1m | [ADD INDEX idx_author (`author`) USING INVERTED], | 2023-06-26 17:23:02.610 | 2023-06-26 17:23:03.755 | 3004 | FINISHED | | NULL | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ - 1 row in set (0.01 sec) -- 创建索引后,字符串等值匹配也有明显加速 - mysql> SELECT count() FROM hackernews_1m WHERE author = 'faster'; + SELECT count() FROM hackernews_1m WHERE author = 'faster'; +---------+ | count() | +---------+ | 20 | +---------+ - 1 row in set (0.01 sec) - ``` diff --git a/versioned_docs/version-2.1/table-design/index/inverted-index.md b/versioned_docs/version-2.1/table-design/index/inverted-index.md index da8359834b8..79ae29c9211 100644 --- a/versioned_docs/version-2.1/table-design/index/inverted-index.md +++ b/versioned_docs/version-2.1/table-design/index/inverted-index.md @@ -339,13 +339,13 @@ To check the actual effect of tokenization or to tokenize a piece of text, you c The first parameter of the `TOKENIZE` function is the text to be tokenized, and the second parameter specifies the tokenization parameters used when creating the index. -mysql> SELECT TOKENIZE('I love Doris','"parser"="english"'); +SELECT TOKENIZE('I love Doris','"parser"="english"'); +------------------------------------------------+ | tokenize('I love Doris', '"parser"="english"') | +------------------------------------------------+ | ["i", "love", "doris"] | +------------------------------------------------+ -1 row in set (0.02 sec) + ``` @@ -419,13 +419,12 @@ curl --location-trusted -u root: -H "compress_type:gz" -T hacknernews_1m.csv.gz **Confirm Data Import Success with SQL count()** ```sql -mysql> SELECT count() FROM hackernews_1m; +SELECT count() FROM hackernews_1m; +---------+ | count() | +---------+ | 1000000 | +---------+ -1 row in set (0.02 sec) ``` ### Queries @@ -435,13 +434,12 @@ mysql> SELECT count() FROM hackernews_1m; - Using `LIKE` to match and count rows containing 'OLAP' in the `comment` column took 0.18s. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%'; +---------+ | count() | +---------+ | 34 | +---------+ - 1 row in set (0.18 sec) ``` - Using full-text search with `MATCH_ANY` based on the inverted index to count rows containing 'OLAP' in the `comment` column took 0.02s, resulting in a 9x speedup. The performance improvement would be even more significant on larger datasets. @@ -449,33 +447,30 @@ mysql> SELECT count() FROM hackernews_1m; The difference in the number of results is due to the inverted index normalizing the terms by converting them to lowercase, among other processes, hence `MATCH_ANY` yields more results than `LIKE`. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP'; + SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP'; +---------+ | count() | +---------+ | 35 | +---------+ - 1 row in set (0.02 sec) ``` - Similarly, comparing the performance for counting occurrences of 'OLTP', 0.07s vs 0.01s. Due to caching, both `LIKE` and `MATCH_ANY` improved, but the inverted index still provided a 7x speedup. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLTP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 48 | +---------+ - 1 row in set (0.07 sec) - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLTP'; + SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLTP'; +---------+ | count() | +---------+ | 51 | +---------+ - 1 row in set (0.01 sec) ``` - Counting rows where both 'OLAP' and 'OLTP' appear took 0.13s vs 0.01s, a 13x speedup. @@ -483,21 +478,20 @@ mysql> SELECT count() FROM hackernews_1m; To require multiple terms to appear simultaneously (AND relationship), use `MATCH_ALL 'keyword1 keyword2 ...'`. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' AND comment LIKE '%OLTP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' AND comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 14 | +---------+ - 1 row in set (0.13 sec) - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ALL 'OLAP OLTP'; + + SELECT count() FROM hackernews_1m WHERE comment MATCH_ALL 'OLAP OLTP'; +---------+ | count() | +---------+ | 15 | +---------+ - 1 row in set (0.01 sec) ``` - Counting rows where either 'OLAP' or 'OLTP' appears took 0.12s vs 0.01s, a 12x speedup. @@ -505,21 +499,19 @@ mysql> SELECT count() FROM hackernews_1m; To require any one or more of multiple terms to appear (OR relationship), use `MATCH_ANY 'keyword1 keyword2 ...'`. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' OR comment LIKE '%OLTP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' OR comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 68 | +---------+ - 1 row in set (0.12 sec) - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP OLTP'; + SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP OLTP'; +---------+ | count() | +---------+ | 71 | +---------+ - 1 row in set (0.01 sec) ``` ### 02 Standard Equality and Range Queries @@ -527,13 +519,12 @@ mysql> SELECT count() FROM hackernews_1m; - Range query on a `DateTime` type column ```sql - mysql> SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; + SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; +---------+ | count() | +---------+ | 999081 | +---------+ - 1 row in set (0.03 sec) ``` - Adding an inverted index for the `timestamp` column @@ -541,71 +532,65 @@ mysql> SELECT count() FROM hackernews_1m; ```sql -- For date-time types, USING INVERTED does not require specifying a parser -- CREATE INDEX is one syntax for creating an index, another method will be shown later - mysql> CREATE INDEX idx_timestamp ON hackernews_1m(timestamp) USING INVERTED; - Query OK, 0 rows affected (0.03 sec) + CREATE INDEX idx_timestamp ON hackernews_1m(timestamp) USING INVERTED; ``` ```sql - mysql> BUILD INDEX idx_timestamp ON hackernews_1m; - Query OK, 0 rows affected (0.01 sec) + BUILD INDEX idx_timestamp ON hackernews_1m; ``` - Checking the index creation progress. From the difference between `FinishTime` and `CreateTime`, we can see that building the inverted index for 1 million rows on the `timestamp` column took only 1 second. ```sql - mysql> SHOW ALTER TABLE COLUMN; + SHOW ALTER TABLE COLUMN; +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | 10030 | hackernews_1m | 2023-02-10 19:44:12.929 | 2023-02-10 19:44:13.938 | hackernews_1m | 10031 | 10008 | 1:1994690496 | 3 | FINISHED | | NULL | 2592000 | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ - 1 row in set (0.00 sec) ``` ```sql -- If the table has no partitions, PartitionName defaults to TableName - mysql> SHOW BUILD INDEX; + SHOW BUILD INDEX; +-------+---------------+---------------+----------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | JobId | TableName | PartitionName | AlterInvertedIndexes | CreateTime | FinishTime | TransactionId | State | Msg | Progress | +-------+---------------+---------------+----------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | 10191 | hackernews_1m | hackernews_1m | [ADD INDEX idx_timestamp (`timestamp`) USING INVERTED], | 2023-06-26 15:32:33.894 | 2023-06-26 15:32:34.847 | 3 | FINISHED | | NULL | +-------+---------------+---------------+----------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ - 1 row in set (0.04 sec) ``` - After the index is created, range queries use the same query syntax. Doris will automatically recognize the index for optimization. However, due to the small dataset, the performance difference is not significant. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; + SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; +---------+ | count() | +---------+ | 999081 | +---------+ - 1 row in set (0.01 sec) ``` - Performing similar operations on a numeric column `parent` with an equality match query. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE parent = 11189; + SELECT count() FROM hackernews_1m WHERE parent = 11189; +---------+ | count() | +---------+ | 2 | +---------+ - 1 row in set (0.01 sec) -- For numeric types, USING INVERTED does not require specifying a parser -- ALTER TABLE t ADD INDEX is the second syntax for creating an index - mysql> ALTER TABLE hackernews_1m ADD INDEX idx_parent(parent) USING INVERTED; - Query OK, 0 rows affected (0.01 sec) + ALTER TABLE hackernews_1m ADD INDEX idx_parent(parent) USING INVERTED; + -- Execute BUILD INDEX to create the inverted index for existing data - mysql> BUILD INDEX idx_parent ON hackernews_1m; - Query OK, 0 rows affected (0.01 sec) + BUILD INDEX idx_parent ON hackernews_1m; + - mysql> SHOW ALTER TABLE COLUMN; + SHOW ALTER TABLE COLUMN; +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ @@ -613,46 +598,44 @@ mysql> SELECT count() FROM hackernews_1m; | 10053 | hackernews_1m | 2023-02-10 19:49:32.893 | 2023-02-10 19:49:33.982 | hackernews_1m | 10054 | 10008 | 1:378856428 | 4 | FINISHED | | NULL | 2592000 | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ - mysql> SHOW BUILD INDEX; + SHOW BUILD INDEX; +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | JobId | TableName | PartitionName | AlterInvertedIndexes | CreateTime | FinishTime | TransactionId | State | Msg | Progress | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | 11005 | hackernews_1m | hackernews_1m | [ADD INDEX idx_parent (`parent`) USING INVERTED], | 2023-06-26 16:25:10.167 | 2023-06-26 16:25:10.838 | 1002 | FINISHED | | NULL | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ - 1 row in set (0.01 sec) - mysql> SELECT count() FROM hackernews_1m WHERE parent = 11189; + + SELECT count() FROM hackernews_1m WHERE parent = 11189; +---------+ | count() | +---------+ | 2 | +---------+ - 1 row in set (0.01 sec) ``` - Creating an inverted index for the string column `author` without tokenization. Equality queries can also leverage the index for speedup. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE author = 'faster'; + SELECT count() FROM hackernews_1m WHERE author = 'faster'; +---------+ | count() | +---------+ | 20 | +---------+ - 1 row in set (0.03 sec) -- Here, USING INVERTED is used without tokenizing the `author` column, treating it as a single term - mysql> ALTER TABLE hackernews_1m ADD INDEX idx_author(author) USING INVERTED; - Query OK, 0 rows affected (0.01 sec) + ALTER TABLE hackernews_1m ADD INDEX idx_author(author) USING INVERTED; + -- Execute BUILD INDEX to add the inverted index for existing data - mysql> BUILD INDEX idx_author ON hackernews_1m; - Query OK, 0 rows affected (0.01 sec) + BUILD INDEX idx_author ON hackernews_1m; + Creating an incremental index for 1 million author records took only 1.5 seconds. ```sql -mysql> SHOW ALTER TABLE COLUMN; +SHOW ALTER TABLE COLUMN; +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ @@ -663,24 +646,22 @@ mysql> SHOW ALTER TABLE COLUMN; ``` ```sql -mysql> SHOW BUILD INDEX ORDER BY CreateTime DESC LIMIT 1; +SHOW BUILD INDEX ORDER BY CreateTime DESC LIMIT 1; +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | JobId | TableName | PartitionName | AlterInvertedIndexes | CreateTime | FinishTime | TransactionId | State | Msg | Progress | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | 13006 | hackernews_1m | hackernews_1m | [ADD INDEX idx_author (`author`) USING INVERTED], | 2023-06-26 17:23:02.610 | 2023-06-26 17:23:03.755 | 3004 | FINISHED | | NULL | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ -1 row in set (0.01 sec) ``` -- After creating the index, string equality matches also showed significant acceleration. ```sql -mysql> SELECT count() FROM hackernews_1m WHERE author = 'faster'; +SELECT count() FROM hackernews_1m WHERE author = 'faster'; +---------+ | count() | +---------+ | 20 | +---------+ -1 row in set (0.01 sec) ``` diff --git a/versioned_docs/version-3.0/table-design/index/inverted-index.md b/versioned_docs/version-3.0/table-design/index/inverted-index.md index b4298228c39..c098ca3188a 100644 --- a/versioned_docs/version-3.0/table-design/index/inverted-index.md +++ b/versioned_docs/version-3.0/table-design/index/inverted-index.md @@ -342,13 +342,12 @@ To check the actual effect of tokenization or to tokenize a piece of text, you c The first parameter of the `TOKENIZE` function is the text to be tokenized, and the second parameter specifies the tokenization parameters used when creating the index. -mysql> SELECT TOKENIZE('I love Doris','"parser"="english"'); +SELECT TOKENIZE('I love Doris','"parser"="english"'); +------------------------------------------------+ | tokenize('I love Doris', '"parser"="english"') | +------------------------------------------------+ | ["i", "love", "doris"] | +------------------------------------------------+ -1 row in set (0.02 sec) ``` @@ -422,13 +421,12 @@ curl --location-trusted -u root: -H "compress_type:gz" -T hacknernews_1m.csv.gz **Confirm Data Import Success with SQL count()** ```sql -mysql> SELECT count() FROM hackernews_1m; +SELECT count() FROM hackernews_1m; +---------+ | count() | +---------+ | 1000000 | +---------+ -1 row in set (0.02 sec) ``` ### Queries @@ -438,13 +436,12 @@ mysql> SELECT count() FROM hackernews_1m; - Using `LIKE` to match and count rows containing 'OLAP' in the `comment` column took 0.18s. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%'; +---------+ | count() | +---------+ | 34 | +---------+ - 1 row in set (0.18 sec) ``` - Using full-text search with `MATCH_ANY` based on the inverted index to count rows containing 'OLAP' in the `comment` column took 0.02s, resulting in a 9x speedup. The performance improvement would be even more significant on larger datasets. @@ -452,33 +449,31 @@ mysql> SELECT count() FROM hackernews_1m; The difference in the number of results is due to the inverted index normalizing the terms by converting them to lowercase, among other processes, hence `MATCH_ANY` yields more results than `LIKE`. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP'; + SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP'; +---------+ | count() | +---------+ | 35 | +---------+ - 1 row in set (0.02 sec) ``` - Similarly, comparing the performance for counting occurrences of 'OLTP', 0.07s vs 0.01s. Due to caching, both `LIKE` and `MATCH_ANY` improved, but the inverted index still provided a 7x speedup. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLTP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 48 | +---------+ - 1 row in set (0.07 sec) - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLTP'; + + SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLTP'; +---------+ | count() | +---------+ | 51 | +---------+ - 1 row in set (0.01 sec) ``` - Counting rows where both 'OLAP' and 'OLTP' appear took 0.13s vs 0.01s, a 13x speedup. @@ -486,21 +481,20 @@ mysql> SELECT count() FROM hackernews_1m; To require multiple terms to appear simultaneously (AND relationship), use `MATCH_ALL 'keyword1 keyword2 ...'`. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' AND comment LIKE '%OLTP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' AND comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 14 | +---------+ - 1 row in set (0.13 sec) - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ALL 'OLAP OLTP'; + + SELECT count() FROM hackernews_1m WHERE comment MATCH_ALL 'OLAP OLTP'; +---------+ | count() | +---------+ | 15 | +---------+ - 1 row in set (0.01 sec) ``` - Counting rows where either 'OLAP' or 'OLTP' appears took 0.12s vs 0.01s, a 12x speedup. @@ -508,21 +502,19 @@ mysql> SELECT count() FROM hackernews_1m; To require any one or more of multiple terms to appear (OR relationship), use `MATCH_ANY 'keyword1 keyword2 ...'`. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' OR comment LIKE '%OLTP%'; + SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' OR comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 68 | +---------+ - 1 row in set (0.12 sec) - mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP OLTP'; + SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP OLTP'; +---------+ | count() | +---------+ | 71 | +---------+ - 1 row in set (0.01 sec) ``` ### 02 Standard Equality and Range Queries @@ -530,13 +522,12 @@ mysql> SELECT count() FROM hackernews_1m; - Range query on a `DateTime` type column ```sql - mysql> SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; + SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; +---------+ | count() | +---------+ | 999081 | +---------+ - 1 row in set (0.03 sec) ``` - Adding an inverted index for the `timestamp` column @@ -544,71 +535,65 @@ mysql> SELECT count() FROM hackernews_1m; ```sql -- For date-time types, USING INVERTED does not require specifying a parser -- CREATE INDEX is one syntax for creating an index, another method will be shown later - mysql> CREATE INDEX idx_timestamp ON hackernews_1m(timestamp) USING INVERTED; - Query OK, 0 rows affected (0.03 sec) + CREATE INDEX idx_timestamp ON hackernews_1m(timestamp) USING INVERTED; ``` ```sql - mysql> BUILD INDEX idx_timestamp ON hackernews_1m; - Query OK, 0 rows affected (0.01 sec) + BUILD INDEX idx_timestamp ON hackernews_1m; ``` - Checking the index creation progress. From the difference between `FinishTime` and `CreateTime`, we can see that building the inverted index for 1 million rows on the `timestamp` column took only 1 second. ```sql - mysql> SHOW ALTER TABLE COLUMN; + SHOW ALTER TABLE COLUMN; +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | 10030 | hackernews_1m | 2023-02-10 19:44:12.929 | 2023-02-10 19:44:13.938 | hackernews_1m | 10031 | 10008 | 1:1994690496 | 3 | FINISHED | | NULL | 2592000 | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ - 1 row in set (0.00 sec) ``` ```sql -- If the table has no partitions, PartitionName defaults to TableName - mysql> SHOW BUILD INDEX; + SHOW BUILD INDEX; +-------+---------------+---------------+----------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | JobId | TableName | PartitionName | AlterInvertedIndexes | CreateTime | FinishTime | TransactionId | State | Msg | Progress | +-------+---------------+---------------+----------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | 10191 | hackernews_1m | hackernews_1m | [ADD INDEX idx_timestamp (`timestamp`) USING INVERTED], | 2023-06-26 15:32:33.894 | 2023-06-26 15:32:34.847 | 3 | FINISHED | | NULL | +-------+---------------+---------------+----------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ - 1 row in set (0.04 sec) ``` - After the index is created, range queries use the same query syntax. Doris will automatically recognize the index for optimization. However, due to the small dataset, the performance difference is not significant. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; + SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; +---------+ | count() | +---------+ | 999081 | +---------+ - 1 row in set (0.01 sec) ``` - Performing similar operations on a numeric column `parent` with an equality match query. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE parent = 11189; + SELECT count() FROM hackernews_1m WHERE parent = 11189; +---------+ | count() | +---------+ | 2 | +---------+ - 1 row in set (0.01 sec) -- For numeric types, USING INVERTED does not require specifying a parser -- ALTER TABLE t ADD INDEX is the second syntax for creating an index - mysql> ALTER TABLE hackernews_1m ADD INDEX idx_parent(parent) USING INVERTED; - Query OK, 0 rows affected (0.01 sec) + ALTER TABLE hackernews_1m ADD INDEX idx_parent(parent) USING INVERTED; + -- Execute BUILD INDEX to create the inverted index for existing data - mysql> BUILD INDEX idx_parent ON hackernews_1m; - Query OK, 0 rows affected (0.01 sec) + BUILD INDEX idx_parent ON hackernews_1m; + - mysql> SHOW ALTER TABLE COLUMN; + SHOW ALTER TABLE COLUMN; +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ @@ -616,46 +601,45 @@ mysql> SELECT count() FROM hackernews_1m; | 10053 | hackernews_1m | 2023-02-10 19:49:32.893 | 2023-02-10 19:49:33.982 | hackernews_1m | 10054 | 10008 | 1:378856428 | 4 | FINISHED | | NULL | 2592000 | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ - mysql> SHOW BUILD INDEX; + SHOW BUILD INDEX; +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | JobId | TableName | PartitionName | AlterInvertedIndexes | CreateTime | FinishTime | TransactionId | State | Msg | Progress | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | 11005 | hackernews_1m | hackernews_1m | [ADD INDEX idx_parent (`parent`) USING INVERTED], | 2023-06-26 16:25:10.167 | 2023-06-26 16:25:10.838 | 1002 | FINISHED | | NULL | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ - 1 row in set (0.01 sec) - mysql> SELECT count() FROM hackernews_1m WHERE parent = 11189; + + SELECT count() FROM hackernews_1m WHERE parent = 11189; +---------+ | count() | +---------+ | 2 | +---------+ - 1 row in set (0.01 sec) ``` - Creating an inverted index for the string column `author` without tokenization. Equality queries can also leverage the index for speedup. ```sql - mysql> SELECT count() FROM hackernews_1m WHERE author = 'faster'; + SELECT count() FROM hackernews_1m WHERE author = 'faster'; +---------+ | count() | +---------+ | 20 | +---------+ - 1 row in set (0.03 sec) + -- Here, USING INVERTED is used without tokenizing the `author` column, treating it as a single term - mysql> ALTER TABLE hackernews_1m ADD INDEX idx_author(author) USING INVERTED; - Query OK, 0 rows affected (0.01 sec) + ALTER TABLE hackernews_1m ADD INDEX idx_author(author) USING INVERTED; + -- Execute BUILD INDEX to add the inverted index for existing data - mysql> BUILD INDEX idx_author ON hackernews_1m; - Query OK, 0 rows affected (0.01 sec) + BUILD INDEX idx_author ON hackernews_1m; + Creating an incremental index for 1 million author records took only 1.5 seconds. ```sql -mysql> SHOW ALTER TABLE COLUMN; +SHOW ALTER TABLE COLUMN; +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ @@ -666,24 +650,22 @@ mysql> SHOW ALTER TABLE COLUMN; ``` ```sql -mysql> SHOW BUILD INDEX ORDER BY CreateTime DESC LIMIT 1; +SHOW BUILD INDEX ORDER BY CreateTime DESC LIMIT 1; +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | JobId | TableName | PartitionName | AlterInvertedIndexes | CreateTime | FinishTime | TransactionId | State | Msg | Progress | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | 13006 | hackernews_1m | hackernews_1m | [ADD INDEX idx_author (`author`) USING INVERTED], | 2023-06-26 17:23:02.610 | 2023-06-26 17:23:03.755 | 3004 | FINISHED | | NULL | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ -1 row in set (0.01 sec) ``` -- After creating the index, string equality matches also showed significant acceleration. ```sql -mysql> SELECT count() FROM hackernews_1m WHERE author = 'faster'; +SELECT count() FROM hackernews_1m WHERE author = 'faster'; +---------+ | count() | +---------+ | 20 | +---------+ -1 row in set (0.01 sec) ``` --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org