Re: [PR] [ngram index part 1]Add realtime ngram filtering index and benchmark results. [pinot]

via GitHub Thu, 24 Jul 2025 12:03:41 -0700


ankitsultana commented on PR #16364:
URL: https://github.com/apache/pinot/pull/16364#issuecomment-3114553908


   In #16294, ES and StarRocks are mentioned, but they use n-grams in a very 
different way. ES/Lucene uses n-grams for tokenization and still relies on an 
FST (afaik). StarRocks stores the n-grams in a bloom filter and, according to 
its documentation, doesn't explicitly track each of them separately: 
https://docs.starrocks.io/docs/table_design/indexes/Ngram_Bloom_Filter_Index/
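   
   To make "tracking each n-gram separately" concrete, here is a minimal 
sketch of an explicit n-gram posting-list index with a verification step. All 
names (`ngrams`, `build_index`, `substring_search`) are hypothetical and 
chosen for illustration; this is not the PR's implementation nor any Pinot, 
Lucene, or StarRocks API.

```python
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of all contiguous n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(values, n=3):
    """Map each n-gram to the set of doc ids whose value contains it."""
    index = defaultdict(set)
    for doc_id, value in enumerate(values):
        for gram in ngrams(value, n):
            index[gram].add(doc_id)
    return index


def substring_search(index, values, query, n=3):
    """Return sorted doc ids whose value contains query as a substring."""
    grams = ngrams(query, n)
    if not grams:
        # Query is shorter than n, so the index cannot help: fall back to a scan.
        return [i for i, v in enumerate(values) if query in v]
    candidates = None
    for gram in grams:
        postings = index.get(gram, set())
        candidates = postings if candidates is None else candidates & postings
        if not candidates:
            return []
    # Sharing all n-grams is necessary but not sufficient, so verify each candidate.
    return sorted(i for i in candidates if query in values[i])
```

   The final substring check matters: a value can contain every n-gram of 
the query without containing the query itself, so the index only narrows the 
candidate set.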
   
   I don't see much detail in #16294. Some example questions that need 
more color:
   
   * Will this work for both raw-encoded columns and dict-encoded columns?
   * How are we going to handle case sensitivity? Are we going to allow 
case-insensitive matching too?
   * What's the UDF for queries going to look like? I suppose something like 
`ngram_substring_search(colName, 'xyzabc')`?
   * What are some concrete numbers around n-gram size and the nature of the 
data? E.g., would this scale for brand names from, say, a grocery catalog? 
Outside of metric tag values, what are some other use-cases where we can use 
this index?
   * If a user has a regex-based use-case, will we need to ask them to add 
n-gram predicates to their query? Russ Cox's approach is to automatically 
generate n-grams from a user-specified regex pattern.
   * Should we consider taking inspiration from StarRocks and combining n-grams 
with bloom filters? For example, within a segment we could store a 
configurable number of bloom filters built on n-grams, and during search we 
could try to match all n-grams of the input string against each of the bloom 
filters. This could allow us to prune parts of a segment and also ensure 
that the size of the index remains manageable.
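
   The bloom-filter idea in the last bullet could be sketched roughly as 
follows, assuming a segment is split into fixed-size doc ranges ("chunks") 
with one filter per chunk. The `BloomFilter` class, the chunking scheme, and 
all parameter values are assumptions made for illustration, not StarRocks's 
or Pinot's actual design.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter backed by a Python int used as a bitset."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def ngrams(text, n=3):
    """Return the set of all contiguous n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_chunk_filters(values, chunk_size, n=3):
    """Build one bloom filter per chunk of chunk_size consecutive docs."""
    filters = []
    for start in range(0, len(values), chunk_size):
        bloom = BloomFilter()
        for value in values[start:start + chunk_size]:
            for gram in ngrams(value, n):
                bloom.add(gram)
        filters.append(bloom)
    return filters


def search(values, filters, chunk_size, query, n=3):
    """Return doc ids containing query, scanning only chunks not ruled out."""
    grams = ngrams(query, n)
    matches = []
    for chunk_id, bloom in enumerate(filters):
        if grams and not all(bloom.might_contain(g) for g in grams):
            continue  # at least one n-gram is definitely absent: prune the chunk
        start = chunk_id * chunk_size
        for doc_id in range(start, min(start + chunk_size, len(values))):
            if query in values[doc_id]:
                matches.append(doc_id)
    return matches
```

   Because a bloom filter has no false negatives, pruning never drops a true 
match; false positives only cost an unnecessary chunk scan. Index size is 
bounded by the number of filters times `num_bits`, independent of how many 
distinct n-grams the data contains.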


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: commits-h...@pinot.apache.org
