ZhangYu0123 opened a new pull request, #19021:
URL: https://github.com/apache/doris/pull/19021

   # Proposed changes
   **Support token_bf index for token search:** 
   
   1. The token_bf index is mainly intended to speed up exact token searches in English text. It splits text on characters that are neither digits nor letters and builds a Bloom filter from the resulting tokens. It can accelerate queries that use like, not like, startsWith, in, not in, and endsWith. This PR only supports like.
   2. Compared with the ngram_bf index, on English text:
      (1) The token_bf index is about 100% faster.
      (2) It does not need an ngram_size parameter.
   3. Compared with the inverted index:
       token_bf matching is case sensitive.
   4. Limitation
    The token_bf index is not used for like '%xxx%' queries, because the Bloom filter records whole tokens and cannot match part of a token. Use like '% xxx %' or the hastoken(xxx) function instead.
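
The behaviour described above can be illustrated with a minimal Python sketch. It is not the Doris implementation: the splitting rule (ASCII letters and digits only), the filter size, the hash function, and the helper names `tokenize`, `BloomFilter`, `build_filter` and `can_skip_block` are all assumptions made for illustration. It shows why a whole-token pattern can be answered from the filter while `'%xxx%'` cannot.

```python
import hashlib
import re


def tokenize(text):
    """Split text on runs of characters that are not ASCII letters or digits
    (an assumed rule for this sketch)."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]


class BloomFilter:
    """A tiny Bloom filter; size and hash count are illustrative only."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, token):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{token}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, token):
        for pos in self._positions(token):
            self.bits |= 1 << pos

    def might_contain(self, token):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(token))


def build_filter(values):
    """Add every whole token of every value to one filter."""
    bf = BloomFilter()
    for value in values:
        for token in tokenize(value):
            bf.add(token)
    return bf


def can_skip_block(bf, pattern):
    """Return True only when the filter proves no row can match the pattern."""
    # Only a pattern of the form '% token %' names a whole token. Anything
    # else, such as '%token%', may match inside a longer token, so the
    # filter cannot be consulted and the data must be scanned.
    m = re.fullmatch(r"% ([0-9A-Za-z]+) %", pattern)
    if m is None:
        return False
    return not bf.might_contain(m.group(1))


bf = build_filter(["https://example.com/search?q=doris"])
print(tokenize("https://example.com/search?q=doris"))
print(can_skip_block(bf, "%dor%"))      # filter unusable: partial token
print(can_skip_block(bf, "% doris %"))  # token present, block must be read
```

Because matching is done on the stored tokens as-is, this sketch is also case sensitive: `Doris` and `doris` hash to different positions.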
   
    **Test:**
   20 million rows, BUCKETS 1
   ```
   CREATE TABLE IF NOT EXISTS hits_url4 (
       UserID int,
       url text DEFAULT '',
       url_ngram3 text DEFAULT '',
       url_ngram6 text DEFAULT '',
       url_token text DEFAULT '',
       url_inverted text DEFAULT '',
       INDEX idx_ngrambf (`url_ngram3`) USING NGRAM_BF PROPERTIES("gram_size"="3", "bf_size"="1024") COMMENT 'url_ngram ngram_bf index',
       INDEX idx_ngrambf2 (`url_ngram6`) USING NGRAM_BF PROPERTIES("gram_size"="6", "bf_size"="1024") COMMENT 'url_ngram ngram_bf index',
       INDEX url_token (`url_token`) USING TOKEN_BF PROPERTIES("bf_size"="1024") COMMENT 'url_token_bf index',
       INDEX idx_inverted (`url_inverted`) USING INVERTED PROPERTIES("parser"="english") COMMENT 'url_inverted index'
   )
   DUPLICATE KEY(UserID)
   DISTRIBUTED BY HASH(UserID) BUCKETS 1
   PROPERTIES("replication_num" = "1");
   ```
   
   | index type | query time | speed-up vs. no index |
   |--------|--------|--------|
   | none | 0.76s <img width="618" alt="image" src="https://user-images.githubusercontent.com/67053339/233016348-dca7b81d-1ff8-4fb2-811a-02c09d7f8ce3.png"> | - |
   | ngram_bf gram=6 | 0.56s <img width="656" alt="image" src="https://user-images.githubusercontent.com/67053339/233034418-9b304548-b1c4-429d-8321-ef8c56fdc8f1.png"> | 36% |
   | ngram_bf gram=3 | 0.17s <img width="666" alt="image" src="https://user-images.githubusercontent.com/67053339/233015812-a425c8b5-cfd2-48b1-9f32-0cbe0bc34409.png"> | 347% |
   | token_bf | 0.08s <img width="667" alt="image" src="https://user-images.githubusercontent.com/67053339/233014026-8f969ecf-b2ba-4c8f-9c7e-381a434a5bc6.png"> | 850% |
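
The percentages in the last column appear to be computed as baseline time divided by indexed time, minus one. The short check below assumes that formula; the helper name `speedup_percent` is hypothetical, and the timings are the ones reported in the table.

```python
def speedup_percent(baseline_seconds, indexed_seconds):
    """Relative speed-up over the baseline, expressed as a percentage."""
    return (baseline_seconds / indexed_seconds - 1) * 100


baseline = 0.76  # query time with no index, from the table
timings = [
    ("ngram_bf gram=6", 0.56),
    ("ngram_bf gram=3", 0.17),
    ("token_bf", 0.08),
]
for name, seconds in timings:
    print(f"{name}: {speedup_percent(baseline, seconds):.0f}%")
```

Under the same formula, token_bf against ngram_bf with gram=3 (0.17s vs 0.08s) comes out at a little over 100%.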
   
   
   Issue Number: close #xxx
   
   ## Problem summary
   
   Describe your changes.
   
   ## Checklist(Required)
   
   * [ ] Does it affect the original behavior
   * [ ] Has unit tests been added
   * [ ] Has document been added or modified
   * [ ] Does it need to update dependencies
   * [ ] Does this PR support rollback (If NO, please explain WHY)
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at 
[d...@doris.apache.org](mailto:d...@doris.apache.org) by explaining why you 
chose the solution you did and what alternatives you considered, etc...
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org