ankitsultana commented on PR #16364: URL: https://github.com/apache/pinot/pull/16364#issuecomment-3114553908
In #16294, ES and StarRocks are mentioned, but they use n-grams in a very different way. ES/Lucene use n-grams for tokenization and still rely on an FST (afaik). StarRocks stores the n-grams in a bloom filter and, based on their documentation, doesn't explicitly track each of them separately: https://docs.starrocks.io/docs/table_design/indexes/Ngram_Bloom_Filter_Index/

I don't see many details in #16294. Some example questions that need more color:

* Will this work for both raw-encoded columns and dict-encoded columns?
* How are we going to handle case sensitivity? Are we going to allow case-insensitive matching too?
* What is the UDF for queries going to look like? I suppose something like `ngram_substring_search(colName, 'xyzabc')`?
* What are some concrete numbers around n-gram size and the nature of the data? e.g. would this scale for brand names from, say, a grocery catalog? Outside of metric tag values, what are some other use cases where we can use this index?
* If a user has a regex-based use case, will we need to ask them to add n-gram predicates to their query? Russ Cox's approach is to automatically generate n-grams from a user-specified regex pattern.
* Should we consider taking inspiration from StarRocks and combining n-grams with bloom filters? e.g. one idea could be that within a segment we store a configurable number of bloom filters built on n-grams, and during search we try to match all n-grams of the input string against each of the bloom filters. This could allow us to prune parts of a segment and also keep the size of the index manageable.
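To make the last idea concrete, here is a rough sketch in Python of per-block bloom filters built on n-grams. It is not Pinot code and not how StarRocks implements its index; every name in it (`ngrams`, `BloomFilter`, `build_block_filters`, `substring_search`), the block layout, and the filter sizing are hypothetical and chosen only for illustration. It assumes case-sensitive matching and a fixed n-gram size of 3.

```python
import hashlib


def ngrams(text, n=3):
    """Return the set of character n-grams of text (empty if len(text) < n)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class BloomFilter:
    """Tiny bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely absent; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def build_block_filters(values, block_size, n=3):
    """Build one bloom filter per block of block_size consecutive values."""
    filters = []
    for start in range(0, len(values), block_size):
        bf = BloomFilter()
        for value in values[start:start + block_size]:
            for gram in ngrams(value, n):
                bf.add(gram)
        filters.append(bf)
    return filters


def substring_search(values, filters, block_size, needle, n=3):
    """Return (matching doc ids, number of blocks actually scanned)."""
    grams = ngrams(needle, n)
    matches = []
    scanned_blocks = 0
    for block_id, bf in enumerate(filters):
        # Prune the block if any n-gram of the needle is definitely absent.
        # A needle shorter than n has no n-grams, so nothing can be pruned.
        if grams and not all(bf.might_contain(g) for g in grams):
            continue
        scanned_blocks += 1
        start = block_id * block_size
        for doc_id in range(start, min(start + block_size, len(values))):
            # Surviving blocks still need an exact check: the filter only
            # says the n-grams may occur somewhere in the block.
            if needle in values[doc_id]:
                matches.append(doc_id)
    return matches, scanned_blocks
```

For example, `substring_search(values, build_block_filters(values, 2), 2, "usage")` scans only the blocks whose filter may contain all of `usa`, `sag`, and `age`. The index size is bounded by the number of filters times the bits per filter, independent of how many distinct n-grams the data has, at the cost of false positives and a verification scan on surviving blocks.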