ZhangYu0123 opened a new pull request, #19021: URL: https://github.com/apache/doris/pull/19021
# Proposed changes

**Support token_bf index for token search:**

1. The token_bf index is mainly intended to speed up exact token search in English text. It splits text into tokens on characters that are neither digits nor letters and builds a bloom filter over those tokens. It can accelerate searches that use `like`, `not like`, `startsWith`, `in`, `not in`, and `endsWith`. This PR supports only `like`.
2. vs. ngram_bf index, on English text:
   (1) token_bf is about 100% faster (0.08s vs. 0.17s for ngram_bf with gram_size=3 in the test below).
   (2) It does not need a `gram_size` parameter.
3. vs. inverted index: token_bf is case sensitive.
4. Limitation: for `like '%xxx%'`, the token_bf index is not used, because the bloom filter records whole tokens and cannot match part of a token. Use `like '% xxx %'` or the `hastoken(xxx)` function instead.

**Test:** 20 million rows, BUCKETS 1

```
CREATE TABLE IF NOT EXISTS hits_url4 (
    UserID int,
    url text DEFAULT '',
    url_ngram3 text DEFAULT '',
    url_ngram6 text DEFAULT '',
    url_token text DEFAULT '',
    url_inverted text DEFAULT '',
    INDEX idx_ngrambf (`url_ngram3`) USING NGRAM_BF PROPERTIES("gram_size"="3", "bf_size"="1024") COMMENT 'url_ngram ngram_bf index',
    INDEX idx_ngrambf2 (`url_ngram6`) USING NGRAM_BF PROPERTIES("gram_size"="6", "bf_size"="1024") COMMENT 'url_ngram ngram_bf index',
    INDEX url_token (`url_token`) USING TOKEN_BF PROPERTIES("bf_size"="1024") COMMENT 'url_token_bf index',
    INDEX idx_inverted (`url_inverted`) USING INVERTED PROPERTIES("parser"="english") COMMENT 'url_inverted index'
)
DUPLICATE KEY(UserID)
DISTRIBUTED BY HASH(UserID) BUCKETS 1
PROPERTIES("replication_num" = "1")
```

| index type | query time | speedup vs. none |
|--------|--------|--------|
| none | 0.76s <img width="618" alt="image" src="https://user-images.githubusercontent.com/67053339/233016348-dca7b81d-1ff8-4fb2-811a-02c09d7f8ce3.png"> | - |
| ngram_bf gram=6 | 0.56s <img width="656" alt="image" src="https://user-images.githubusercontent.com/67053339/233034418-9b304548-b1c4-429d-8321-ef8c56fdc8f1.png"> | 36% |
| ngram_bf gram=3 | 0.17s <img width="666" alt="image" src="https://user-images.githubusercontent.com/67053339/233015812-a425c8b5-cfd2-48b1-9f32-0cbe0bc34409.png"> | 347% |
| token_bf | 0.08s <img width="667" alt="image" src="https://user-images.githubusercontent.com/67053339/233014026-8f969ecf-b2ba-4c8f-9c7e-381a434a5bc6.png"> | 850% |

Issue Number: close #xxx

## Problem summary

Describe your changes.

## Checklist(Required)

* [ ] Does it affect the original behavior
* [ ] Have unit tests been added
* [ ] Has documentation been added or modified
* [ ] Does it need to update dependencies
* [ ] Does this PR support rollback (If NO, please explain WHY)

## Further comments

If this is a relatively large or complex change, kick off the discussion at [d...@doris.apache.org](mailto:d...@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

For additional commands, e-mail: commits-h...@doris.apache.org
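
The token-splitting and bloom-filter idea described under "Proposed changes" can be illustrated with a minimal sketch. This is not the code in this PR: the names `tokenize` and `TokenBloomFilter`, the SHA-256 based hashing, the number of hash functions, and the reading of `bf_size` as a size in bytes are all assumptions made for illustration. It shows why a whole token can be looked up while a fragment of a token (the `like '%xxx%'` case) cannot.

```python
import hashlib
import re

# Assumption: a token is a maximal run of ASCII letters and digits;
# every other character acts as a separator.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Split text into tokens on characters that are neither letters nor digits."""
    return TOKEN_RE.findall(text)


class TokenBloomFilter:
    """Hypothetical bloom filter over whole tokens (case sensitive)."""

    def __init__(self, size_bytes=1024, num_hashes=3):
        # size_bytes mirrors the "bf_size" property only by assumption.
        self.size_bits = size_bytes * 8
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bytes)

    def _positions(self, token):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add_text(self, text):
        # Only whole tokens are recorded, never substrings of a token.
        for token in tokenize(text):
            for pos in self._positions(token):
                self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, token):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(token)
        )


bf = TokenBloomFilter()
bf.add_text("http://example.com/search?query=doris")
print(tokenize("http://example.com/search?query=doris"))
print(bf.might_contain("doris"))  # a token that was added is always reported as possibly present
```

Because only whole tokens are hashed, a lookup for a fragment such as `dor` hashes to unrelated bit positions and will almost always be reported absent, so the filter cannot help with a substring pattern. A bloom filter can return false positives but never false negatives, so a block that passes the filter still has to be checked against the actual predicate.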