chenboat commented on PR #16364:
URL: https://github.com/apache/pinot/pull/16364#issuecomment-3119985002

   > In #16294, ES and StarRocks are mentioned but they use n-gram in a very 
different way. ES/Lucene use n-grams for tokenization and they still rely on a 
FST (afaik). StarRocks stores the n-grams into a bloom filter and doesn't 
explicitly track each of them separately based on their documentation: 
https://docs.starrocks.io/docs/table_design/indexes/Ngram_Bloom_Filter_Index/
   
   The issue cited two peer systems' use of n-gram indexes and advocated that 
Pinot adopt an n-gram index too. The goal is not to follow their implementations. 
   
   > 
   > I don't see much details in #16294. Just some example questions that need 
more color:
   > 
   > * Will this work for both raw encoded columns and dict-encoded columns?
   Ngram filtering in our proposal works on string columns. Since the 
implementation uses Pinot's inverted index internally, there is a dictionary for 
the ngrams. The raw string is retained for final validation, and the column 
itself can be raw encoded or dict-encoded.
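   To make the scheme above concrete, here is a minimal in-memory sketch (not 
Pinot's actual implementation; class and method names are hypothetical): an 
inverted index from each n-gram to a posting list of doc ids, with the raw 
strings kept around for final validation of candidates.

```java
import java.util.*;

public class NgramIndexSketch {
    // Hypothetical sketch: inverted index from each n-gram to the set of
    // doc ids whose raw string contains that n-gram.
    final int n;
    final Map<String, BitSet> postings = new HashMap<>();
    final List<String> docs = new ArrayList<>();

    NgramIndexSketch(int n) { this.n = n; }

    void add(String doc) {
        int docId = docs.size();
        docs.add(doc);
        for (int i = 0; i + n <= doc.length(); i++) {
            postings.computeIfAbsent(doc.substring(i, i + n), k -> new BitSet()).set(docId);
        }
    }

    // LIKE '%needle%': intersect the posting lists of the needle's n-grams,
    // then verify each surviving candidate against the retained raw string.
    List<String> substringSearch(String needle) {
        BitSet candidates = null;
        for (int i = 0; i + n <= needle.length(); i++) {
            BitSet p = postings.getOrDefault(needle.substring(i, i + n), new BitSet());
            if (candidates == null) candidates = (BitSet) p.clone();
            else candidates.and(p);
        }
        List<String> out = new ArrayList<>();
        // Needle shorter than n: the index cannot help, a fallback scan would be needed.
        if (candidates == null) return out;
        for (int id = candidates.nextSetBit(0); id >= 0; id = candidates.nextSetBit(id + 1)) {
            if (docs.get(id).contains(needle)) out.add(docs.get(id)); // final validation
        }
        return out;
    }

    public static void main(String[] args) {
        NgramIndexSketch idx = new NgramIndexSketch(3);
        idx.add("apache pinot");
        idx.add("pint of milk");
        idx.add("star rocks");
        System.out.println(idx.substringSearch("pinot")); // prints [apache pinot]
    }
}
```

   Note the two-phase shape: the n-gram intersection only prunes candidates 
(it can produce false positives, e.g. strings containing all trigrams in a 
different order), so the raw-string check at the end is what guarantees strict 
literal matching.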
   
   > * How are we going to handle case sensitivity? Are we going to allow case 
insensitive matching too?
   We aim for strict literal string matching, so it is case sensitive.
   
   > * What's the UDF for queries going to look like. I suppose something like 
`ngram_substring_search(colName, 'xyzabc')`?
   
   We will first support SQL's LIKE operator, as mentioned in the issue, so 
there is no need to introduce a new UDF. It can be extended to support 
REGEXP_LIKE() in Pinot later.
   
   > * What are some concrete numbers around n-gram size and nature of data. 
e.g. would this scale for brand names from say a Grocery catalog? Outside of 
metric tag values, what are some other use-cases where we can use this index?
   
   Our internal test works on multiple terabytes of data with a recommended 
ngram length of 3 to 4 and supports 4-5k QPS. As another data point, [Google 
Code Search used trigrams](https://swtch.com/~rsc/regexp/regexp4.html) for code 
search and reported findings similar to ours -- an index-to-raw-data size ratio 
of about 20%. The same write-up also gives a good summary of applicability: 
i.e., _ngram filtering is good for indexing a large number of short documents_. 
"Short" here means on the order of ~100 bytes in our test case.
   
   > * If a user has a Regex based use-case, will we need to ask them to add 
n_gram predicates in their query? Russ Cox's approach is to automatically 
generate n-grams from a user specified Regex pattern.
   
   Our approach is the same as Cox's. Given a LIKE('%pinot%') predicate, we can 
parse it and generate the ngrams to perform the filtering. 
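   As an illustration of that rewrite step (a hedged sketch, not Pinot code; 
the helper name is made up, and it only handles the simple `'%literal%'` form 
without embedded wildcards or escapes):

```java
import java.util.*;

public class LikeToNgrams {
    // Hypothetical sketch: extract the n-grams that every matching row must
    // contain, from a LIKE pattern of the simple form '%literal%'.
    static List<String> requiredNgrams(String likePattern, int n) {
        String literal = likePattern.replaceAll("^%+|%+$", "");
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= literal.length(); i++) {
            grams.add(literal.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Each of these trigrams must appear in a row for it to possibly match,
        // so their posting lists can be intersected before final validation.
        System.out.println(requiredNgrams("%pinot%", 3)); // prints [pin, ino, not]
    }
}
```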
   
   > * Should we consider taking inspiration from StarRocks and combine n-grams 
with bloom filters? e.g. one idea could be that within a segment we could store 
a configurable number of bloom-filters built on n-grams, and during search we 
can try to match all n-grams of the input string against each of the bloom 
filters. This could allow us to prune parts of a segment and can also ensure 
that the size of the index remains manageable.
   
   That can be a further improvement, and testing needs to be done -- ngram 
filtering at the segment level may or may not be effective, due to the 
distribution of ngrams across a large number of strings in aggregate. Our 
version of n-gram filtering works more like a Pinot inverted index -- so you 
could say we take Russ Cox's route (thanks @ankitsultana for the reference :)).
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

