This is an automated email from the ASF dual-hosted git repository.

zykkk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
     new 35f94297b42  add ngram desc (#1735)
35f94297b42 is described below

commit 35f94297b424fc4021c52ae9e6607b2357b00aa8
Author: wangtianyi2004 <376612...@qq.com>
AuthorDate: Mon Jan 13 14:55:35 2025 +0800

    add ngram desc (#1735)
---
 .../storage-compute-decoupled-deploy-manually.md     | 14 +++++++++++---
 docs/table-design/data-model/tips.md                 |  2 +-
 docs/table-design/index/ngram-bloomfilter-index.md   |  4 ++--
 docs/table-design/row-store.md                       | 11 ++++++-----
 .../storage-compute-decoupled-deploy-manually.md     | 17 +++++++++++++----
 .../table-design/index/ngram-bloomfilter-index.md    |  4 ++--
 .../current/table-design/row-store.md                |  2 +-
 .../table-design/index/ngram-bloomfilter-index.md    |  4 ++--
 .../version-2.1/table-design/row-store.md            |  2 +-
 .../storage-compute-decoupled-deploy-manually.md     | 17 +++++++++++++----
 .../table-design/index/ngram-bloomfilter-index.md    |  4 ++--
 .../version-3.0/table-design/row-store.md            |  2 +-
 .../version-2.1/table-design/data-model/tips.md      |  2 +-
 .../table-design/index/ngram-bloomfilter-index.md    |  4 ++--
 versioned_docs/version-2.1/table-design/row-store.md | 11 ++++++-----
 .../storage-compute-decoupled-deploy-manually.md     | 14 +++++++++++---
 .../version-3.0/table-design/data-model/tips.md      |  2 +-
 .../table-design/index/ngram-bloomfilter-index.md    |  4 ++--
 versioned_docs/version-3.0/table-design/row-store.md | 11 ++++++-----
 19 files changed, 84 insertions(+), 47 deletions(-)

diff --git a/docs/install/deploy-manually/storage-compute-decoupled-deploy-manually.md b/docs/install/deploy-manually/storage-compute-decoupled-deploy-manually.md
index a2e2a4fcd47..1a354de126a 100644
--- a/docs/install/deploy-manually/storage-compute-decoupled-deploy-manually.md
+++ b/docs/install/deploy-manually/storage-compute-decoupled-deploy-manually.md
@@ -255,8 +255,16 @@ To add Backend nodes to the cluster, perform the following steps for each Backen
 1. Configure `be.conf`
 
    In the `be.conf` file, you need to configure the following key parameters:
+   - deploy_mode
+     - Description: Specifies the startup mode of Doris.
+     - Format: cloud indicates the storage-compute decoupled mode; any other value indicates the storage-compute integrated mode.
+     - Example: cloud
+   - file_cache_path
+     - Description: Disk paths and related parameters used for file caching, represented as an array with one entry per disk. path specifies the disk path, and total_size limits the cache size; -1 or 0 will use the entire disk space.
+     - Format: [{"path":"/path/to/file_cache", "total_size":21474836480}, {"path":"/path/to/file_cache2", "total_size":21474836480}]
+     - Example: [{"path":"/path/to/file_cache", "total_size":21474836480}, {"path":"/path/to/file_cache2", "total_size":21474836480}]
     - Default: [{"path":"${DORIS_HOME}/file_cache"}]
 
-2. Start the BE process
+3. Start the BE process
 
    Use the following command to start the Backend:
 
@@ -264,7 +272,7 @@ To add Backend nodes to the cluster, perform the following steps for each Backen
    ```
    bin/start_be.sh --daemon
    ```
 
-3. Add BE to the cluster:
+4. Add BE to the cluster:
 
    Connect to any Frontend using MySQL client and execute:
 
@@ -278,7 +286,7 @@ To add Backend nodes to the cluster, perform the following steps for each Backen
    For more detailed usage, refer to [ADD BACKEND](../../sql-manual/sql-statements/Cluster-Management-Statements/ALTER-SYSTEM-ADD-BACKEND) and [REMOVE BACKEND](../../sql-manual/sql-statements/Cluster-Management-Statements/ALTER-SYSTEM-DROP-BACKEND).
 
-4. Verify BE status
+5. Verify BE status
 
    Check the Backend log files (`be.log`) to ensure it has successfully started and joined the cluster.
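For readers following along, the two parameters documented above combine into a minimal `be.conf` fragment along these lines (a sketch only — the path and the 20 GB cache size are placeholders, not recommendations):

```
# storage-compute decoupled ("cloud") mode
deploy_mode = cloud
# one entry per cache disk; total_size is in bytes
file_cache_path = [{"path":"/mnt/disk1/doris/file_cache","total_size":21474836480}]
```

Registering the started BE from any FE then uses the ADD BACKEND statement linked in the hunk above; host and port below are placeholders (9050 is the default BE heartbeat port):

```sql
ALTER SYSTEM ADD BACKEND "be_host:9050";
```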
diff --git a/docs/table-design/data-model/tips.md b/docs/table-design/data-model/tips.md
index b61542f13ab..fac0d2eb2b4 100644
--- a/docs/table-design/data-model/tips.md
+++ b/docs/table-design/data-model/tips.md
@@ -145,7 +145,7 @@ AGGREGATE KEY columns. Otherwise, `select sum (count) from table;` can only expr
 
 Another method is to add a `count` column with a value of 1 and an aggregation type of REPLACE. Then `select sum (count) from table;` and `select count (*) from table;` will produce the same results. Moreover, this method does not require the absence of the same AGGREGATE KEY columns in the import data.
 
-### Merge on write of unique model
+### MoW Unique Key Model
 
 The Merge on Write implementation in the Unique Model does not impose the same limitation as the Aggregate Model. In Merge on Write, the model adds a `delete bitmap` for each imported rowset to mark the data being overwritten or deleted. With the previous example, after Batch 1 is imported, the data status will be as follows:
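To make the `count`-column trick in the context above concrete, here is a minimal sketch of an AGGREGATE KEY table that uses it; the table, columns, and bucket count are hypothetical:

```sql
-- hypothetical aggregate table for illustration
CREATE TABLE site_visits (
    site_id    INT,
    visit_date DATE,
    pv         BIGINT SUM DEFAULT "0",
    `count`    BIGINT REPLACE DEFAULT "1"  -- every imported row carries count = 1
) AGGREGATE KEY(site_id, visit_date)
DISTRIBUTED BY HASH(site_id) BUCKETS 10
PROPERTIES ("replication_num" = "1");

-- REPLACE keeps `count` at 1 for every aggregated key, so these two queries agree:
SELECT SUM(`count`) FROM site_visits;
SELECT COUNT(*) FROM site_visits;
```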
diff --git a/docs/table-design/index/ngram-bloomfilter-index.md b/docs/table-design/index/ngram-bloomfilter-index.md
index be1a3a098c1..278054b30ef 100644
--- a/docs/table-design/index/ngram-bloomfilter-index.md
+++ b/docs/table-design/index/ngram-bloomfilter-index.md
@@ -27,7 +27,7 @@ under the License.
 
 ## Indexing Principles
 
-The NGram BloomFilter index, similar to the BloomFilter index, is a skip index based on BloomFilter.
+n-gram tokenization is a method of splitting a sentence or a piece of text into multiple groups of adjacent words. The NGram BloomFilter index, similar to the BloomFilter index, is a skip index based on BloomFilter.
 
 Unlike the BloomFilter index, the NGram BloomFilter index is used to accelerate text LIKE queries. Instead of storing the original text values, it tokenizes the text using NGram and stores each token in the BloomFilter. For LIKE queries, the pattern in LIKE '%pattern%' is also tokenized using NGram. Each token is checked against the BloomFilter, and if any token is not found, the corresponding data block does not meet the LIKE condition and can be skipped, reducing IO and accelerating th [...]
 
@@ -58,7 +58,7 @@ Explanation of the syntax:
 
 1. **`idx_column_name(column_name)`** is mandatory. `column_name` is the column to be indexed and must appear in the column definitions above. `idx_column_name` is the index name, which must be unique at the table level. It is recommended to name it with the prefix `idx_` followed by the column name.
 2. **`USING NGRAM_BF`** is mandatory and specifies that the index type is an NGram BloomFilter index.
 3. **`PROPERTIES`** is optional and is used to specify additional properties for the NGram BloomFilter index. The supported properties are:
-   - **gram_size**: The N in NGram, specifying the number of consecutive characters to form a token. For example, 'an ngram example' with N = 3 would be tokenized into 'an ', 'n n', ' ng', 'ngr', 'gra', 'ram' (6 tokens).
+   - **gram_size**: The N in NGram, specifying the number of consecutive units that form a token. For example, 'This is a simple ngram example' with N = 3 would be tokenized into 'This is a', 'is a simple', 'a simple ngram', 'simple ngram example' (4 tokens).
    - **bf_size**: The size of the BloomFilter in bits. bf_size determines the size of the index corresponding to each data block. The larger this value, the more storage space it occupies, but the lower the probability of hash collisions.
 
 It is recommended to set **gram_size** to the minimum length of the string in LIKE queries but not less than 2. Generally, "gram_size"="3", "bf_size"="1024" is recommended, then adjust based on the Query Profile.
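For orientation, the syntax these hunks document looks as follows in a complete statement (a sketch — the table, column, and query are hypothetical):

```sql
-- hypothetical log table for illustration
CREATE TABLE log_tbl (
    id  BIGINT,
    msg VARCHAR(1024),
    INDEX idx_msg (msg) USING NGRAM_BF
        PROPERTIES ("gram_size" = "3", "bf_size" = "1024")
) DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10
PROPERTIES ("replication_num" = "1");

-- a LIKE query the index can prune data blocks for:
SELECT * FROM log_tbl WHERE msg LIKE '%error%';
```

Per the recommendation above, gram_size (3 here) stays at or below the length of the shortest LIKE pattern and is never less than 2.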
diff --git a/docs/table-design/row-store.md b/docs/table-design/row-store.md
index acefd596ec6..171f618cad2 100644
--- a/docs/table-design/row-store.md
+++ b/docs/table-design/row-store.md
@@ -1,6 +1,6 @@
 ---
 {
-    "title": "Hybrid Storage",
+    "title": "Hybrid Row-Columnar Storage",
     "language": "en"
 }
 ---
@@ -24,11 +24,11 @@ specific language governing permissions and limitations under the License. -->
 
-## Hybrid Storage
+## Hybrid Row-Columnar Storage
 
-Doris defaults to columnar storage, where each column is stored contiguously. Columnar storage offers excellent performance for analytical scenarios (such as aggregation, filtering, sorting, etc.), as it only reads the necessary columns, reducing unnecessary IO. However, in point query scenarios (such as `SELECT *`), all columns need to be read, requiring an IO operation for each column, which can lead to IOPS becoming a bottleneck, especially for wide tables with many columns (e.g., hun [...]
+Doris uses columnar storage by default, with each column stored contiguously. Columnar storage offers excellent performance for analytical scenarios (such as aggregation, filtering, sorting, etc.), as it only reads the necessary columns, reducing unnecessary IO. However, in point query scenarios (such as `SELECT *`), all columns need to be read, requiring an IO operation for each column, which can lead to IOPS becoming a bottleneck, especially for wide tables with many columns (e.g., hun [...]
 
-To address the IOPS bottleneck in point query scenarios, starting from version 2.0.0, Doris supports hybrid storage. When users create tables, they can specify whether to enable row storage. With row storage enabled, each row only requires one IO operation for point queries (such as `SELECT *`), significantly improving performance.
+To address the IOPS bottleneck in point query scenarios, starting from version 2.0.0, Doris supports Hybrid Row-Columnar Storage. When users create tables, they can specify whether to enable row storage. With row storage enabled, each row only requires one IO operation for point queries (such as `SELECT *`), significantly improving performance.
 
 The principle of row storage is that an additional column is added during storage. This column concatenates all the columns of the corresponding row and stores them using a special binary format.
 
@@ -51,7 +51,8 @@ When creating a table, specify whether to enable row storage, which columns to e
 
 "row_store_page_size" = "16384"
 
-The page is the smallest unit of storage read/write operations, and page_size is the size of the row storage page. This means that reading one row also requires generating an IO for a page. The larger the value, the better the compression effect and the lower the storage space usage, but the higher the IO overhead for point queries (since one IO reads at least one page), and vice versa. The smaller the value, the higher the storage space, the better the point query performance. The defau [...]
+A page is the smallest unit for storage read and write operations, and `page_size` refers to the size of a row-store page. This means that reading a single row requires generating a page IO. The larger this value is, the better the compression effect and the lower the storage space usage. However, the IO overhead during point queries increases, resulting in lower performance (because each IO operation reads at least one page). Conversely, the smaller the value, the higher the storage spa [...]
+
 ## Example
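The row-store properties discussed in this diff are set at table creation time. A minimal sketch (the table and values are hypothetical; `store_row_column` is the table property for enabling row storage as I understand the Doris docs, and `row_store_page_size` is the property this diff documents):

```sql
-- hypothetical table for point-query lookups
CREATE TABLE point_lookup_tbl (
    k  BIGINT,
    v1 VARCHAR(64),
    v2 INT
) UNIQUE KEY(k)
DISTRIBUTED BY HASH(k) BUCKETS 10
PROPERTIES (
    "replication_num"     = "1",
    "store_row_column"    = "true",   -- enable row storage for the whole row
    "row_store_page_size" = "16384"   -- 16 KB row-store pages (the default discussed above)
);

-- with row storage enabled, a point query like this reads a single row-store page:
SELECT * FROM point_lookup_tbl WHERE k = 1;
```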
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/install/deploy-manually/storage-compute-decoupled-deploy-manually.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/install/deploy-manually/storage-compute-decoupled-deploy-manually.md
index a23d60a0920..be350b5b62c 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/install/deploy-manually/storage-compute-decoupled-deploy-manually.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/install/deploy-manually/storage-compute-decoupled-deploy-manually.md
@@ -254,8 +254,17 @@ ALTER SYSTEM ADD FOLLOWER "host:port";
 1. Configure be.conf
 
    In the `be.conf` file, the following key parameters need to be configured:
-
-2. Start the BE process
+   - deploy_mode
+     - Description: specifies the Doris startup mode
+     - Format: cloud indicates the storage-compute decoupled mode; any other value indicates the storage-compute integrated mode
+     - Example: cloud
+   - file_cache_path
+     - Description: disk paths and related parameters used for file caching, represented as an array with one entry per disk. path specifies the disk path, and total_size limits the cache size; -1 or 0 will use the entire disk space.
+     - Format: [{"path":"/path/to/file_cache","total_size":21474836480},{"path":"/path/to/file_cache2","total_size":21474836480}]
+     - Example: [{"path":"/path/to/file_cache","total_size":21474836480},{"path":"/path/to/file_cache2","total_size":21474836480}]
+     - Default: [{"path":"${DORIS_HOME}/file_cache"}]
+
+3. Start the BE process
 
    Use the following command to start the Backend:
 
@@ -263,7 +272,7 @@ ALTER SYSTEM ADD FOLLOWER "host:port";
    ```
    bin/start_be.sh --daemon
    ```
 
-3. Add the BE to the cluster:
+4. Add the BE to the cluster:
 
    Connect to any Frontend with a MySQL client and execute:
 
@@ -277,7 +286,7 @@ ALTER SYSTEM ADD FOLLOWER "host:port";
    For more detailed usage, see [ADD BACKEND](../../sql-manual/sql-statements/Cluster-Management-Statements/ALTER-SYSTEM-ADD-BACKEND) and [REMOVE BACKEND](../../sql-manual/sql-statements/Cluster-Management-Statements/ALTER-SYSTEM-DROP-BACKEND).
 
-4. Verify BE status
+5. Verify BE status
 
    Check the Backend log file (`be.log`) to ensure it has started successfully and joined the cluster.

diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/index/ngram-bloomfilter-index.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/index/ngram-bloomfilter-index.md
index a9564ee82db..ae9c0221295 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/index/ngram-bloomfilter-index.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/index/ngram-bloomfilter-index.md
@@ -26,7 +26,7 @@ under the License.
 
 ## Indexing Principles
 
-The NGram BloomFilter index, similar to the BloomFilter index, is a skip index based on BloomFilter.
+n-gram tokenization is a method of splitting a sentence or a piece of text into multiple groups of adjacent words. The NGram BloomFilter index, similar to the BloomFilter index, is a skip index based on BloomFilter.
 
 Unlike the BloomFilter index, the NGram BloomFilter index is used to accelerate text LIKE queries. What it stores in the BloomFilter is not the original text but the NGram tokens of that text, one token per value. For a LIKE query, the pattern in LIKE '%pattern%' is also tokenized with NGram, and each token is checked against the BloomFilter; if any token is absent, the corresponding data block cannot satisfy the LIKE condition and can be skipped, reducing IO and accelerating the query.
 
@@ -64,7 +64,7 @@ The NGram BloomFilter index can only accelerate string LIKE queries, and the LIKE pattern
 
 **3. `PROPERTIES` is optional and specifies additional properties of the NGram BloomFilter index. The currently supported properties are:**
 
-- gram_size: the N in NGram, i.e., the number of consecutive characters that form one token. For example, 'an ngram example' with N = 3 is split into 'an ', 'n n', ' ng', 'ngr', 'gra', 'ram' (6 tokens).
+- gram_size: the N in NGram, i.e., the number of consecutive units that form one token. For example, 'This is a simple ngram example' with N = 3 is split into 'This is a', 'is a simple', 'a simple ngram', 'simple ngram example' (4 tokens).
 
 - bf_size: the size of the BloomFilter in bits. bf_size determines the size of the index for each data block; the larger the value, the more storage space it occupies and the lower the probability of hash collisions.

diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/row-store.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/row-store.md
index 72ecbba19c1..dc91e1ac788 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/row-store.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/row-store.md
@@ -51,7 +51,7 @@ Doris uses columnar storage by default, with each column stored contiguously. In analytical scenarios (such as
 
 "row_store_page_size" = "16384"
 
-page is the smallest unit of storage reads and writes, and page_size is the size of a row-store page, meaning that reading a single row also incurs one page of IO. The larger the value, the better the compression and the lower the storage space usage, but the higher the IO overhead and the lower the performance for point queries (since one IO reads at least one page); conversely, with smaller values the storage space usage is somewhat higher and point query performance is better. The default 16KB is a balanced choice for most cases; if query performance matters more, configure a smaller value such as 4KB or even lower, and if storage space matters more, configure a larger value such as 64KB or even higher.
+page is the smallest unit of storage reads and writes, and page_size is the size of a row-store page, meaning that reading a single row also incurs one page of IO. The larger the value, the better the compression and the lower the storage space usage, but the higher the IO overhead and the lower the performance for point queries (since one IO reads at least one page); conversely, with smaller values the storage space usage is considerably higher and point query performance is better. The default 16KB is a balanced choice for most cases; if query performance matters more, configure a smaller value such as 4KB or even lower, and if storage space matters more, configure a larger value such as 64KB or even higher.
 
 ## Usage Example

diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/table-design/index/ngram-bloomfilter-index.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/table-design/index/ngram-bloomfilter-index.md
index a9564ee82db..ae9c0221295 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/table-design/index/ngram-bloomfilter-index.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/table-design/index/ngram-bloomfilter-index.md
@@ -26,7 +26,7 @@ under the License.
 
 ## Indexing Principles
 
-The NGram BloomFilter index, similar to the BloomFilter index, is a skip index based on BloomFilter.
+n-gram tokenization is a method of splitting a sentence or a piece of text into multiple groups of adjacent words. The NGram BloomFilter index, similar to the BloomFilter index, is a skip index based on BloomFilter.
 
 Unlike the BloomFilter index, the NGram BloomFilter index is used to accelerate text LIKE queries. What it stores in the BloomFilter is not the original text but the NGram tokens of that text, one token per value. For a LIKE query, the pattern in LIKE '%pattern%' is also tokenized with NGram, and each token is checked against the BloomFilter; if any token is absent, the corresponding data block cannot satisfy the LIKE condition and can be skipped, reducing IO and accelerating the query.
 
@@ -64,7 +64,7 @@ The NGram BloomFilter index can only accelerate string LIKE queries, and the LIKE pattern
 
 **3. `PROPERTIES` is optional and specifies additional properties of the NGram BloomFilter index. The currently supported properties are:**
 
-- gram_size: the N in NGram, i.e., the number of consecutive characters that form one token. For example, 'an ngram example' with N = 3 is split into 'an ', 'n n', ' ng', 'ngr', 'gra', 'ram' (6 tokens).
+- gram_size: the N in NGram, i.e., the number of consecutive units that form one token. For example, 'This is a simple ngram example' with N = 3 is split into 'This is a', 'is a simple', 'a simple ngram', 'simple ngram example' (4 tokens).
 
 - bf_size: the size of the BloomFilter in bits. bf_size determines the size of the index for each data block; the larger the value, the more storage space it occupies and the lower the probability of hash collisions.

diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/table-design/row-store.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/table-design/row-store.md
index 959f99cc53b..1ec00e34335 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/table-design/row-store.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/table-design/row-store.md
@@ -46,7 +46,7 @@ Doris uses columnar storage by default, with each column stored contiguously. In analytical scenarios (such as
 
 "row_store_page_size" = "16384"
 
-page is the smallest unit of storage reads and writes, and page_size is the size of a row-store page, meaning that reading a single row also incurs one page of IO. The larger the value, the better the compression and the lower the storage space usage, but the higher the IO overhead and the lower the performance for point queries (since one IO reads at least one page); conversely, with smaller values the storage space usage is somewhat higher and point query performance is better. The default 16KB is a balanced choice for most cases; if query performance matters more, configure a smaller value such as 4KB or even lower, and if storage space matters more, configure a larger value such as 64KB or even higher.
+page is the smallest unit of storage reads and writes, and page_size is the size of a row-store page, meaning that reading a single row also incurs one page of IO. The larger the value, the better the compression and the lower the storage space usage, but the higher the IO overhead and the lower the performance for point queries (since one IO reads at least one page); conversely, with smaller values the storage space usage is considerably higher and point query performance is better. The default 16KB is a balanced choice for most cases; if query performance matters more, configure a smaller value such as 4KB or even lower, and if storage space matters more, configure a larger value such as 64KB or even higher.
 
 ## Usage Example

diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/install/deploy-manually/storage-compute-decoupled-deploy-manually.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/install/deploy-manually/storage-compute-decoupled-deploy-manually.md
index a23d60a0920..be350b5b62c 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/install/deploy-manually/storage-compute-decoupled-deploy-manually.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/install/deploy-manually/storage-compute-decoupled-deploy-manually.md
@@ -254,8 +254,17 @@ ALTER SYSTEM ADD FOLLOWER "host:port";
 1. Configure be.conf
 
    In the `be.conf` file, the following key parameters need to be configured:
-
-2. Start the BE process
+   - deploy_mode
+     - Description: specifies the Doris startup mode
+     - Format: cloud indicates the storage-compute decoupled mode; any other value indicates the storage-compute integrated mode
+     - Example: cloud
+   - file_cache_path
+     - Description: disk paths and related parameters used for file caching, represented as an array with one entry per disk. path specifies the disk path, and total_size limits the cache size; -1 or 0 will use the entire disk space.
+     - Format: [{"path":"/path/to/file_cache","total_size":21474836480},{"path":"/path/to/file_cache2","total_size":21474836480}]
+     - Example: [{"path":"/path/to/file_cache","total_size":21474836480},{"path":"/path/to/file_cache2","total_size":21474836480}]
+     - Default: [{"path":"${DORIS_HOME}/file_cache"}]
+
+3. Start the BE process
 
    Use the following command to start the Backend:
 
@@ -263,7 +272,7 @@ ALTER SYSTEM ADD FOLLOWER "host:port";
    ```
    bin/start_be.sh --daemon
    ```
 
-3. Add the BE to the cluster:
+4. Add the BE to the cluster:
 
    Connect to any Frontend with a MySQL client and execute:
 
@@ -277,7 +286,7 @@ ALTER SYSTEM ADD FOLLOWER "host:port";
    For more detailed usage, see [ADD BACKEND](../../sql-manual/sql-statements/Cluster-Management-Statements/ALTER-SYSTEM-ADD-BACKEND) and [REMOVE BACKEND](../../sql-manual/sql-statements/Cluster-Management-Statements/ALTER-SYSTEM-DROP-BACKEND).
 
-4. Verify BE status
+5. Verify BE status
 
    Check the Backend log file (`be.log`) to ensure it has started successfully and joined the cluster.

diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/table-design/index/ngram-bloomfilter-index.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/table-design/index/ngram-bloomfilter-index.md
index a9564ee82db..ae9c0221295 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/table-design/index/ngram-bloomfilter-index.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/table-design/index/ngram-bloomfilter-index.md
@@ -26,7 +26,7 @@ under the License.
 
 ## Indexing Principles
 
-The NGram BloomFilter index, similar to the BloomFilter index, is a skip index based on BloomFilter.
+n-gram tokenization is a method of splitting a sentence or a piece of text into multiple groups of adjacent words. The NGram BloomFilter index, similar to the BloomFilter index, is a skip index based on BloomFilter.
 
 Unlike the BloomFilter index, the NGram BloomFilter index is used to accelerate text LIKE queries. What it stores in the BloomFilter is not the original text but the NGram tokens of that text, one token per value. For a LIKE query, the pattern in LIKE '%pattern%' is also tokenized with NGram, and each token is checked against the BloomFilter; if any token is absent, the corresponding data block cannot satisfy the LIKE condition and can be skipped, reducing IO and accelerating the query.
 
@@ -64,7 +64,7 @@ The NGram BloomFilter index can only accelerate string LIKE queries, and the LIKE pattern
 
 **3. `PROPERTIES` is optional and specifies additional properties of the NGram BloomFilter index. The currently supported properties are:**
 
-- gram_size: the N in NGram, i.e., the number of consecutive characters that form one token. For example, 'an ngram example' with N = 3 is split into 'an ', 'n n', ' ng', 'ngr', 'gra', 'ram' (6 tokens).
+- gram_size: the N in NGram, i.e., the number of consecutive units that form one token. For example, 'This is a simple ngram example' with N = 3 is split into 'This is a', 'is a simple', 'a simple ngram', 'simple ngram example' (4 tokens).
 
 - bf_size: the size of the BloomFilter in bits. bf_size determines the size of the index for each data block; the larger the value, the more storage space it occupies and the lower the probability of hash collisions.

diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/table-design/row-store.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/table-design/row-store.md
index 72ecbba19c1..dc91e1ac788 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/table-design/row-store.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/table-design/row-store.md
@@ -51,7 +51,7 @@ Doris uses columnar storage by default, with each column stored contiguously. In analytical scenarios (such as
 
 "row_store_page_size" = "16384"
 
-page is the smallest unit of storage reads and writes, and page_size is the size of a row-store page, meaning that reading a single row also incurs one page of IO. The larger the value, the better the compression and the lower the storage space usage, but the higher the IO overhead and the lower the performance for point queries (since one IO reads at least one page); conversely, with smaller values the storage space usage is somewhat higher and point query performance is better. The default 16KB is a balanced choice for most cases; if query performance matters more, configure a smaller value such as 4KB or even lower, and if storage space matters more, configure a larger value such as 64KB or even higher.
+page is the smallest unit of storage reads and writes, and page_size is the size of a row-store page, meaning that reading a single row also incurs one page of IO. The larger the value, the better the compression and the lower the storage space usage, but the higher the IO overhead and the lower the performance for point queries (since one IO reads at least one page); conversely, with smaller values the storage space usage is considerably higher and point query performance is better. The default 16KB is a balanced choice for most cases; if query performance matters more, configure a smaller value such as 4KB or even lower, and if storage space matters more, configure a larger value such as 64KB or even higher.
 
 ## Usage Example
diff --git a/versioned_docs/version-2.1/table-design/data-model/tips.md b/versioned_docs/version-2.1/table-design/data-model/tips.md
index b61542f13ab..fac0d2eb2b4 100644
--- a/versioned_docs/version-2.1/table-design/data-model/tips.md
+++ b/versioned_docs/version-2.1/table-design/data-model/tips.md
@@ -145,7 +145,7 @@ AGGREGATE KEY columns. Otherwise, `select sum (count) from table;` can only expr
 
 Another method is to add a `count` column with a value of 1 and an aggregation type of REPLACE. Then `select sum (count) from table;` and `select count (*) from table;` will produce the same results. Moreover, this method does not require the absence of the same AGGREGATE KEY columns in the import data.
 
-### Merge on write of unique model
+### MoW Unique Key Model
 
 The Merge on Write implementation in the Unique Model does not impose the same limitation as the Aggregate Model. In Merge on Write, the model adds a `delete bitmap` for each imported rowset to mark the data being overwritten or deleted. With the previous example, after Batch 1 is imported, the data status will be as follows:

diff --git a/versioned_docs/version-2.1/table-design/index/ngram-bloomfilter-index.md b/versioned_docs/version-2.1/table-design/index/ngram-bloomfilter-index.md
index be1a3a098c1..278054b30ef 100644
--- a/versioned_docs/version-2.1/table-design/index/ngram-bloomfilter-index.md
+++ b/versioned_docs/version-2.1/table-design/index/ngram-bloomfilter-index.md
@@ -27,7 +27,7 @@ under the License.
 
 ## Indexing Principles
 
-The NGram BloomFilter index, similar to the BloomFilter index, is a skip index based on BloomFilter.
+n-gram tokenization is a method of splitting a sentence or a piece of text into multiple groups of adjacent words. The NGram BloomFilter index, similar to the BloomFilter index, is a skip index based on BloomFilter.
 
 Unlike the BloomFilter index, the NGram BloomFilter index is used to accelerate text LIKE queries. Instead of storing the original text values, it tokenizes the text using NGram and stores each token in the BloomFilter. For LIKE queries, the pattern in LIKE '%pattern%' is also tokenized using NGram. Each token is checked against the BloomFilter, and if any token is not found, the corresponding data block does not meet the LIKE condition and can be skipped, reducing IO and accelerating th [...]
 
@@ -58,7 +58,7 @@ Explanation of the syntax:
 
 1. **`idx_column_name(column_name)`** is mandatory. `column_name` is the column to be indexed and must appear in the column definitions above. `idx_column_name` is the index name, which must be unique at the table level. It is recommended to name it with the prefix `idx_` followed by the column name.
 2. **`USING NGRAM_BF`** is mandatory and specifies that the index type is an NGram BloomFilter index.
 3. **`PROPERTIES`** is optional and is used to specify additional properties for the NGram BloomFilter index. The supported properties are:
-   - **gram_size**: The N in NGram, specifying the number of consecutive characters to form a token. For example, 'an ngram example' with N = 3 would be tokenized into 'an ', 'n n', ' ng', 'ngr', 'gra', 'ram' (6 tokens).
+   - **gram_size**: The N in NGram, specifying the number of consecutive units that form a token. For example, 'This is a simple ngram example' with N = 3 would be tokenized into 'This is a', 'is a simple', 'a simple ngram', 'simple ngram example' (4 tokens).
    - **bf_size**: The size of the BloomFilter in bits. bf_size determines the size of the index corresponding to each data block. The larger this value, the more storage space it occupies, but the lower the probability of hash collisions.
 
 It is recommended to set **gram_size** to the minimum length of the string in LIKE queries but not less than 2. Generally, "gram_size"="3", "bf_size"="1024" is recommended, then adjust based on the Query Profile.

diff --git a/versioned_docs/version-2.1/table-design/row-store.md b/versioned_docs/version-2.1/table-design/row-store.md
index 571432c59d0..156c9c69640 100644
--- a/versioned_docs/version-2.1/table-design/row-store.md
+++ b/versioned_docs/version-2.1/table-design/row-store.md
@@ -1,6 +1,6 @@
 ---
 {
-    "title": "Hybrid Storage",
+    "title": "Hybrid Row-Columnar Storage",
     "language": "en"
 }
 ---
@@ -24,11 +24,11 @@ specific language governing permissions and limitations under the License. -->
 
-## Hybrid Storage
+## Hybrid Row-Columnar Storage
 
-Doris defaults to columnar storage, where each column is stored contiguously. Columnar storage offers excellent performance for analytical scenarios (such as aggregation, filtering, sorting, etc.), as it only reads the necessary columns, reducing unnecessary IO. However, in point query scenarios (such as `SELECT *`), all columns need to be read, requiring an IO operation for each column, which can lead to IOPS becoming a bottleneck, especially for wide tables with many columns (e.g., hun [...]
+Doris uses columnar storage by default, with each column stored contiguously. Columnar storage offers excellent performance for analytical scenarios (such as aggregation, filtering, sorting, etc.), as it only reads the necessary columns, reducing unnecessary IO. However, in point query scenarios (such as `SELECT *`), all columns need to be read, requiring an IO operation for each column, which can lead to IOPS becoming a bottleneck, especially for wide tables with many columns (e.g., hun [...]
 
-To address the IOPS bottleneck in point query scenarios, starting from version 2.0.0, Doris supports hybrid storage. When users create tables, they can specify whether to enable row storage. With row storage enabled, each row only requires one IO operation for point queries (such as `SELECT *`), significantly improving performance.
+To address the IOPS bottleneck in point query scenarios, starting from version 2.0.0, Doris supports Hybrid Row-Columnar Storage. When users create tables, they can specify whether to enable row storage. With row storage enabled, each row only requires one IO operation for point queries (such as `SELECT *`), significantly improving performance.
 
 The principle of row storage is that an additional column is added during storage. This column concatenates all the columns of the corresponding row and stores them using a special binary format.
 
@@ -46,7 +46,8 @@ When creating a table, specify whether to enable row storage, and the storage co
 
 "row_store_page_size" = "16384"
 
-The page is the smallest unit of storage read/write operations, and page_size is the size of the row storage page. This means that reading one row also requires generating an IO for a page. The larger the value, the better the compression effect and the lower the storage space usage, but the higher the IO overhead for point queries (since one IO reads at least one page), and vice versa. The smaller the value, the higher the storage space, the better the point query performance. The defau [...]
+A page is the smallest unit for storage read and write operations, and `page_size` refers to the size of a row-store page. This means that reading a single row requires generating a page IO. The larger this value is, the better the compression effect and the lower the storage space usage. However, the IO overhead during point queries increases, resulting in lower performance (because each IO operation reads at least one page). Conversely, the smaller the value, the higher the storage spa [...]
+
 ## Example

diff --git a/versioned_docs/version-3.0/install/deploy-manually/storage-compute-decoupled-deploy-manually.md b/versioned_docs/version-3.0/install/deploy-manually/storage-compute-decoupled-deploy-manually.md
index a2e2a4fcd47..1a354de126a 100644
--- a/versioned_docs/version-3.0/install/deploy-manually/storage-compute-decoupled-deploy-manually.md
+++ b/versioned_docs/version-3.0/install/deploy-manually/storage-compute-decoupled-deploy-manually.md
@@ -255,8 +255,16 @@ To add Backend nodes to the cluster, perform the following steps for each Backen
 1. Configure `be.conf`
 
    In the `be.conf` file, you need to configure the following key parameters:
+   - deploy_mode
+     - Description: Specifies the startup mode of Doris.
+     - Format: cloud indicates the storage-compute decoupled mode; any other value indicates the storage-compute integrated mode.
+     - Example: cloud
+   - file_cache_path
+     - Description: Disk paths and related parameters used for file caching, represented as an array with one entry per disk. path specifies the disk path, and total_size limits the cache size; -1 or 0 will use the entire disk space.
+     - Format: [{"path":"/path/to/file_cache", "total_size":21474836480}, {"path":"/path/to/file_cache2", "total_size":21474836480}]
+     - Example: [{"path":"/path/to/file_cache", "total_size":21474836480}, {"path":"/path/to/file_cache2", "total_size":21474836480}]
     - Default: [{"path":"${DORIS_HOME}/file_cache"}]
 
-2. Start the BE process
+3. Start the BE process
 
    Use the following command to start the Backend:
 
@@ -264,7 +272,7 @@ To add Backend nodes to the cluster, perform the following steps for each Backen
    ```
    bin/start_be.sh --daemon
    ```
 
-3. Add BE to the cluster:
+4. Add BE to the cluster:
 
    Connect to any Frontend using MySQL client and execute:
 
@@ -278,7 +286,7 @@ To add Backend nodes to the cluster, perform the following steps for each Backen
    For more detailed usage, refer to [ADD BACKEND](../../sql-manual/sql-statements/Cluster-Management-Statements/ALTER-SYSTEM-ADD-BACKEND) and [REMOVE BACKEND](../../sql-manual/sql-statements/Cluster-Management-Statements/ALTER-SYSTEM-DROP-BACKEND).
 
-4. Verify BE status
+5. Verify BE status
 
    Check the Backend log files (`be.log`) to ensure it has successfully started and joined the cluster.

diff --git a/versioned_docs/version-3.0/table-design/data-model/tips.md b/versioned_docs/version-3.0/table-design/data-model/tips.md
index b61542f13ab..fac0d2eb2b4 100644
--- a/versioned_docs/version-3.0/table-design/data-model/tips.md
+++ b/versioned_docs/version-3.0/table-design/data-model/tips.md
@@ -145,7 +145,7 @@ AGGREGATE KEY columns. Otherwise, `select sum (count) from table;` can only expr
 
 Another method is to add a `count` column with a value of 1 and an aggregation type of REPLACE. Then `select sum (count) from table;` and `select count (*) from table;` will produce the same results. Moreover, this method does not require the absence of the same AGGREGATE KEY columns in the import data.
 
-### Merge on write of unique model
+### MoW Unique Key Model
 
 The Merge on Write implementation in the Unique Model does not impose the same limitation as the Aggregate Model. In Merge on Write, the model adds a `delete bitmap` for each imported rowset to mark the data being overwritten or deleted. With the previous example, after Batch 1 is imported, the data status will be as follows:

diff --git a/versioned_docs/version-3.0/table-design/index/ngram-bloomfilter-index.md b/versioned_docs/version-3.0/table-design/index/ngram-bloomfilter-index.md
index be1a3a098c1..278054b30ef 100644
--- a/versioned_docs/version-3.0/table-design/index/ngram-bloomfilter-index.md
+++ b/versioned_docs/version-3.0/table-design/index/ngram-bloomfilter-index.md
@@ -27,7 +27,7 @@ under the License.
 
 ## Indexing Principles
 
-The NGram BloomFilter index, similar to the BloomFilter index, is a skip index based on BloomFilter.
+n-gram tokenization is a method of splitting a sentence or a piece of text into multiple groups of adjacent words. The NGram BloomFilter index, similar to the BloomFilter index, is a skip index based on BloomFilter.
 
 Unlike the BloomFilter index, the NGram BloomFilter index is used to accelerate text LIKE queries. Instead of storing the original text values, it tokenizes the text using NGram and stores each token in the BloomFilter. For LIKE queries, the pattern in LIKE '%pattern%' is also tokenized using NGram. Each token is checked against the BloomFilter, and if any token is not found, the corresponding data block does not meet the LIKE condition and can be skipped, reducing IO and accelerating th [...]
 
@@ -58,7 +58,7 @@ Explanation of the syntax:
 
 1. **`idx_column_name(column_name)`** is mandatory. `column_name` is the column to be indexed and must appear in the column definitions above. `idx_column_name` is the index name, which must be unique at the table level. It is recommended to name it with the prefix `idx_` followed by the column name.
 2. **`USING NGRAM_BF`** is mandatory and specifies that the index type is an NGram BloomFilter index.
 3. **`PROPERTIES`** is optional and is used to specify additional properties for the NGram BloomFilter index. The supported properties are:
-   - **gram_size**: The N in NGram, specifying the number of consecutive characters to form a token. For example, 'an ngram example' with N = 3 would be tokenized into 'an ', 'n n', ' ng', 'ngr', 'gra', 'ram' (6 tokens).
+   - **gram_size**: The N in NGram, specifying the number of consecutive units that form a token. For example, 'This is a simple ngram example' with N = 3 would be tokenized into 'This is a', 'is a simple', 'a simple ngram', 'simple ngram example' (4 tokens).
    - **bf_size**: The size of the BloomFilter in bits. bf_size determines the size of the index corresponding to each data block. The larger this value, the more storage space it occupies, but the lower the probability of hash collisions.
 
 It is recommended to set **gram_size** to the minimum length of the string in LIKE queries but not less than 2. Generally, "gram_size"="3", "bf_size"="1024" is recommended, then adjust based on the Query Profile.

diff --git a/versioned_docs/version-3.0/table-design/row-store.md b/versioned_docs/version-3.0/table-design/row-store.md
index acefd596ec6..171f618cad2 100644
--- a/versioned_docs/version-3.0/table-design/row-store.md
+++ b/versioned_docs/version-3.0/table-design/row-store.md
@@ -1,6 +1,6 @@
 ---
 {
-    "title": "Hybrid Storage",
+    "title": "Hybrid Row-Columnar Storage",
     "language": "en"
 }
 ---
@@ -24,11 +24,11 @@ specific language governing permissions and limitations under the License. -->
 
-## Hybrid Storage
+## Hybrid Row-Columnar Storage
 
-Doris defaults to columnar storage, where each column is stored contiguously. Columnar storage offers excellent performance for analytical scenarios (such as aggregation, filtering, sorting, etc.), as it only reads the necessary columns, reducing unnecessary IO. However, in point query scenarios (such as `SELECT *`), all columns need to be read, requiring an IO operation for each column, which can lead to IOPS becoming a bottleneck, especially for wide tables with many columns (e.g., hun [...]
+Doris uses columnar storage by default, with each column stored contiguously. Columnar storage offers excellent performance for analytical scenarios (such as aggregation, filtering, sorting, etc.), as it only reads the necessary columns, reducing unnecessary IO. However, in point query scenarios (such as `SELECT *`), all columns need to be read, requiring an IO operation for each column, which can lead to IOPS becoming a bottleneck, especially for wide tables with many columns (e.g., hun [...]
 
-To address the IOPS bottleneck in point query scenarios, starting from version 2.0.0, Doris supports hybrid storage. When users create tables, they can specify whether to enable row storage. With row storage enabled, each row only requires one IO operation for point queries (such as `SELECT *`), significantly improving performance.
+To address the IOPS bottleneck in point query scenarios, starting from version 2.0.0, Doris supports Hybrid Row-Columnar Storage. When users create tables, they can specify whether to enable row storage. With row storage enabled, each row only requires one IO operation for point queries (such as `SELECT *`), significantly improving performance.
 
 The principle of row storage is that an additional column is added during storage. This column concatenates all the columns of the corresponding row and stores them using a special binary format.
 
@@ -51,7 +51,8 @@ When creating a table, specify whether to enable row storage, which columns to e
 
 "row_store_page_size" = "16384"
 
-The page is the smallest unit of storage read/write operations, and page_size is the size of the row storage page. This means that reading one row also requires generating an IO for a page. The larger the value, the better the compression effect and the lower the storage space usage, but the higher the IO overhead for point queries (since one IO reads at least one page), and vice versa. The smaller the value, the higher the storage space, the better the point query performance. The defau [...]
+A page is the smallest unit for storage read and write operations, and `page_size` refers to the size of a row-store page. This means that reading a single row requires generating a page IO. The larger this value is, the better the compression effect and the lower the storage space usage. However, the IO overhead during point queries increases, resulting in lower performance (because each IO operation reads at least one page). Conversely, the smaller the value, the higher the storage spa [...]
+
 ## Example

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org