chenboat commented on PR #16364: URL: https://github.com/apache/pinot/pull/16364#issuecomment-3119985002
> In #16294, ES and StarRocks are mentioned but they use n-grams in a very different way. ES/Lucene use n-grams for tokenization and still rely on an FST (afaik). StarRocks stores the n-grams in a bloom filter and doesn't explicitly track each of them separately, based on their documentation: https://docs.starrocks.io/docs/table_design/indexes/Ngram_Bloom_Filter_Index/

The issue cited two peer libraries' uses of n-gram indexes and advocated that Pinot adopt an n-gram index too. The goal is not to follow their implementations.

> I don't see much detail in #16294. Just some example questions that need more color:
>
> * Will this work for both raw encoded columns and dict-encoded columns?

N-gram filtering in our proposal works on string columns. Since the implementation uses Pinot's inverted index internally, there is a dictionary for the n-grams. The raw string is retained for final validation; it can be raw encoded or dict-encoded.

> * How are we going to handle case sensitivity? Are we going to allow case-insensitive matching too?

We aim for strict literal string matching, so it is case sensitive.

> * What's the UDF for queries going to look like? I suppose something like `ngram_substring_search(colName, 'xyzabc')`?

We will first support SQL's LIKE function, as mentioned in the issue, so there is no need to introduce a new UDF. It can be extended to support regexp_like() in Pinot later.

> * What are some concrete numbers around n-gram size and nature of data? e.g. would this scale for brand names from, say, a grocery catalog? Outside of metric tag values, what are some other use-cases where we can use this index?

Our internal test works on multiple terabytes of data with a recommended n-gram length of 3 to 4, supporting 4-5k QPS. Another example is that Google [code search used trigrams](https://swtch.com/~rsc/regexp/regexp4.html) and reported findings similar to ours: an index-to-raw-data ratio of about 20%.
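To make the above concrete, here is a minimal sketch (not Pinot's actual implementation; all names are illustrative) of Cox-style trigram filtering: an n-gram-to-posting-list index, intersection of posting lists for the query's n-grams, then verification against the retained raw string to remove false positives.

```python
# Illustrative sketch of trigram filtering with final verification,
# in the spirit of Russ Cox's code-search trigram index.
# Not Pinot's implementation; names are hypothetical.
from collections import defaultdict

N = 3  # recommended n-gram length from the discussion: 3 to 4


def ngrams(s, n=N):
    """All length-n substrings of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def build_index(docs):
    """Map each n-gram to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for gram in ngrams(text):
            index[gram].add(doc_id)
    return index


def like_contains(index, docs, literal):
    """Evaluate LIKE '%literal%' via n-gram filtering plus a raw-string check."""
    grams = ngrams(literal)
    if not grams:  # literal shorter than N: fall back to a full scan
        return [i for i, d in enumerate(docs) if literal in d]
    # Intersect posting lists: a matching doc must contain every n-gram.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # The n-grams may appear out of order, so verify on the raw string.
    return sorted(i for i in candidates if literal in docs[i])


docs = ["apache pinot", "pino noir", "star tree index"]
idx = build_index(docs)
print(like_contains(idx, docs, "pinot"))  # -> [0]
```

The final `literal in docs[i]` check is what the answer above means by "the raw string will be retained for final validation": the n-gram intersection only prunes, it never decides a match on its own.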
The article above also gives a good summary of applicability: n-gram filtering is good for indexing a large number of short documents. "Short" can be interpreted as roughly ~100 bytes in our test case.

> * If a user has a Regex based use-case, will we need to ask them to add n_gram predicates in their query?

No. Russ Cox's approach is to automatically generate n-grams from a user-specified regex pattern, and our approach is the same. Given LIKE('%pinot%'), we can parse the pattern and generate the n-grams to perform the filtering.

> * Should we consider taking inspiration from StarRocks and combine n-grams with bloom filters? e.g. one idea could be that within a segment we could store a configurable number of bloom-filters built on n-grams, and during search we can try to match all n-grams of the input string against each of the bloom filters. This could allow us to prune parts of a segment and can also ensure that the size of the index remains manageable.

That could be a further improvement, and testing would be needed: n-gram filtering at the segment level may or may not be effective, due to the distribution of n-grams across a large number of strings in aggregate. Our version of n-gram filtering works more like a Pinot inverted index, so you could say we take Russ Cox's route (thanks @ankitsultana for the reference :)).
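For reference, the StarRocks-inspired segment-pruning idea discussed above could be sketched as follows. This is not an existing Pinot feature; the bloom filter, sizes, and function names are all hypothetical. Each segment keeps one bloom filter over the n-grams of its rows, and a segment is scanned only if the filter may contain every n-gram of the query literal (bloom filters have no false negatives, so no matching segment is ever pruned).

```python
# Illustrative sketch of segment-level n-gram bloom-filter pruning
# (the StarRocks-inspired idea; not an existing Pinot feature).
import hashlib

N = 3  # n-gram length; 3 to 4 is the recommendation from the discussion


def ngrams(s, n=N):
    return {s[i:i + n] for i in range(len(s) - n + 1)}


class BloomFilter:
    """Tiny bloom filter: k hash positions per item over a fixed bit array."""

    def __init__(self, num_bits=4096, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            h = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def build_segment_filter(rows):
    """One bloom filter holding all n-grams of every row in the segment."""
    bf = BloomFilter()
    for text in rows:
        for gram in ngrams(text):
            bf.add(gram)
    return bf


def segments_to_scan(segment_filters, literal):
    """Keep only segments whose filter may contain every query n-gram."""
    grams = ngrams(literal)
    return [i for i, bf in enumerate(segment_filters)
            if all(bf.might_contain(g) for g in grams)]


seg0 = ["apache pinot", "realtime olap"]
seg1 = ["star tree index", "bloom filter"]
filters = [build_segment_filter(seg0), build_segment_filter(seg1)]
print(segments_to_scan(filters, "pinot"))
```

As the answer above cautions, whether this prunes anything in practice depends on the n-gram distribution: if common n-grams appear in nearly every segment, most filters will pass and little work is saved.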