compasses opened a new issue, #10733:
URL: https://github.com/apache/doris/issues/10733

   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/incubator-doris/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### Description
   
   To speed up LIKE queries, we pushed the like function down to the storage 
layer in PR #10355, which yields a 2x~3x performance gain in both vectorized 
and non-vectorized execution. We want to go the extra mile and make LIKE 
queries faster still, with less resource overhead. To that end, we are going 
to implement a new index for LIKE queries.
   
   We have researched several solutions, such as pg_trgm from PostgreSQL, 
ngrambf from ClickHouse, and FST from Elasticsearch. Since Doris already has a 
bloom filter index, and considering complexity, functional scope, and 
compatibility, we have chosen the approach ClickHouse takes with 
`ngrambf_v1(n, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed)`: 
the input column string is split into n-grams (the first parameter is the 
n-gram size), which are then stored in a bloom filter. At query time, the like 
pattern is also split into n-grams to build a bloom filter, which is compared 
against the stored one to skip granules that cannot match.
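
   As a rough illustration of the idea above, here is a minimal Python sketch 
of an n-gram bloom filter. It is not the Doris or ClickHouse implementation: 
the class name `NgramBloomFilter`, the SHA-256 based hashing, and the handling 
of `%` and `_` (escape sequences are ignored) are all assumptions made for the 
example.

```python
import hashlib
import re


def ngrams(text: str, n: int) -> set:
    """Return the set of all substrings of length n (empty if text is shorter)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class NgramBloomFilter:
    """Hypothetical n-gram bloom filter covering one page (granule) of values."""

    def __init__(self, n: int = 3, size_bytes: int = 512,
                 num_hashes: int = 2, seed: int = 0):
        self.n = n
        self.num_bits = size_bytes * 8
        self.num_hashes = num_hashes
        self.seed = seed
        self.bits = bytearray(size_bytes)

    def _positions(self, gram: str):
        # Assumption: SHA-256 stands in for the real hash strategy.
        for i in range(self.num_hashes):
            data = f"{self.seed}:{i}:{gram}".encode("utf-8")
            digest = hashlib.sha256(data).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, value: str) -> None:
        """Index one column value by setting the bits of all its n-grams."""
        for gram in ngrams(value, self.n):
            for pos in self._positions(gram):
                self.bits[pos // 8] |= 1 << (pos % 8)

    def _contains(self, gram: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(gram))

    def may_match_like(self, pattern: str) -> bool:
        """Return False only if no indexed value can match the like pattern."""
        # Split on wildcards; only literal runs of at least n chars yield n-grams.
        for segment in re.split(r"[%_]", pattern):
            for gram in ngrams(segment, self.n):
                if not self._contains(gram):
                    return False  # a required n-gram is missing: skip the page
        return True  # possibly a match (or pattern too short to prune)
```

   A page whose filter returns `False` from `may_match_like` can be skipped 
without reading it. A `True` result only means the page may contain a match, 
since bloom filters have false positives but no false negatives. Patterns 
whose literal parts are all shorter than n cannot be pruned at all.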
   
   For Doris, here are the details:
   1. Reuse the existing bloom filter index read/write process, so the storage 
layer is unaffected.
   2. Add a new kind of bloom filter index, for example: 
"ngram_bloom_filter_columns" = "(col1,n,512), (col2,n,512)", where n is the 
n-gram size and 512 is the bloom filter size in bytes. Both n and 512 are 
configurable, and both have default values, such as (3,512).
   3. Add a new algorithm type, NGRAM_BLOOM_FILTER, which extracts the n-grams 
and computes the bloom filter.
   4. For the new algorithm, the HashStrategy will follow ClickHouse's.
   5. If an ngram bloom filter exists, queries will use the index to filter 
pages for like queries, building on #10355.
   6. Support adding the index to historical data: ALTER TABLE <db.table_name> SET 
("ngram_bloom_filter_columns" = "(col1,n,512), (col2,n,512)").
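
   To make the proposed property format concrete, the sketch below parses a 
value such as "(col1,3,512), (col2,4,1024)" into per-column settings. The 
function and type names are hypothetical, and falling back to the defaults 
(3,512) when a field is omitted is an assumption about how the defaults would 
apply, not a settled design.

```python
import re
from typing import List, NamedTuple

# Defaults taken from the proposal: n-gram size 3, bloom filter size 512 bytes.
DEFAULT_GRAM_SIZE = 3
DEFAULT_BF_SIZE_BYTES = 512


class NgramBfColumn(NamedTuple):
    column: str
    gram_size: int
    bf_size_bytes: int


def parse_ngram_bf_columns(prop: str) -> List[NgramBfColumn]:
    """Parse a hypothetical "ngram_bloom_filter_columns" property value."""
    result = []
    # Each parenthesized group describes one column: (name[,n[,size]]).
    for group in re.findall(r"\(([^()]*)\)", prop):
        parts = [p.strip() for p in group.split(",")]
        if not parts[0] or len(parts) > 3:
            raise ValueError(f"invalid column spec: ({group})")
        gram = int(parts[1]) if len(parts) > 1 and parts[1] else DEFAULT_GRAM_SIZE
        size = int(parts[2]) if len(parts) > 2 and parts[2] else DEFAULT_BF_SIZE_BYTES
        if gram <= 0 or size <= 0:
            raise ValueError(f"sizes must be positive: ({group})")
        result.append(NgramBfColumn(parts[0], gram, size))
    return result
```

   For instance, `parse_ngram_bf_columns("(col1)")` would yield a single entry 
for col1 with the default n-gram size and filter size.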
   
   
![image](https://user-images.githubusercontent.com/10161171/178133582-e9266441-88b1-49ba-9ac2-241b460db404.png)
   
   That's all, thanks.
   
   
   
   ### Use case
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org
