chenboat commented on PR #16364: URL: https://github.com/apache/pinot/pull/16364#issuecomment-3114365597
> @chenboat @deemoliu : is there a design doc for this? There's quite a lot that can be done with n-grams so it might be best to dive deeper here. (e.g. we can use n-grams to optimize regex queries based on Russ Cox's approach). I put the first version of mini-design doc and background study in the github issue https://github.com/apache/pinot/issues/16294. I think we can have further discussion over there because that is the umbrella issue. Regarding Russ Cox's approach, are your referring to [Regular Expression Matching with a Trigram Index](Regular Expression Matching with a Trigram Index). I think his approach is the same as ours and even the parameter for ngram length of 3 (they call it trigram) is the same we tested in production and recommend the comments above. As the author summarize: `Regular expression matching over large numbers of small documents can be made fast by using an index of the trigrams present in each document. ... We can run this query against the trigram index to identify a set of candidate documents and then run the full regular expression search against only those documents.` > > Also regarding index size explosion, we can also discuss how we will prevent this in production and what kind of knobs/configs we will support. e.g. Our primary recommendation is use short ngram length (maxLength 3 recommended) to control the ngram index size. In addition, we might add string column maxlength control to make sure we do not index long strings. With this, the n-gram index size is really reasonable when we test in production. This is also consistent with Russ Cox's report as he reported `The index file contains a list of paths (for reindexing, as described above), a list of indexed files, and then the same posting lists we saw at the beginning of this article, one for each trigram. In practice, this index tends to be around 20% of the size of the files being indexed.` > > * Would we add limits on max number of unique grams that we will track? 
> * Would we allow users to configure in the table config specific gram combinations that they want to force to be tracked, etc.?
> * Should we consider adopting a sparse gram approach like the one used by GitHub in their code-search implementation?

Given the n-gram length limit, the n-gram filtering index needs to index all n-grams of that length to achieve correct filtering, so we cannot put a limit on the number of unique grams. But again, with a reasonable n-gram maxLength limit (3 recommended, or 4 in rare cases), we will not run into a scenario with too many unique grams in practice.
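To make the candidate-filtering idea from Cox's article concrete, here is a minimal, self-contained sketch (not Pinot's implementation): a toy inverted index from trigrams to document ids, where the trigrams of a required literal prune the documents before the full regex runs. The `TrigramIndex` class and the hand-supplied literal are illustrative assumptions; Cox's tool derives the trigram query from the regex automatically.

```python
import re
from collections import defaultdict

def trigrams(text):
    """The set of 3-character grams in a document."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

class TrigramIndex:
    """Toy inverted index mapping each trigram to the doc ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = []

    def add(self, text):
        doc_id = len(self.docs)
        self.docs.append(text)
        for gram in trigrams(text):
            self.postings[gram].add(doc_id)

    def search(self, literal, pattern):
        """Prune with the trigrams of a required literal, then run the regex."""
        grams = trigrams(literal)
        if not grams:
            candidates = set(range(len(self.docs)))  # no pruning possible
        else:
            # A matching doc must contain every trigram of the literal,
            # so intersect their posting lists to get the candidate set.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams))
        regex = re.compile(pattern)
        return [i for i in candidates if regex.search(self.docs[i])]

index = TrigramIndex()
index.add("pinot supports inverted indexes")
index.add("ngram index speeds up regex filtering")
index.add("unrelated document")

# Any match of r"regex filter\w*" must contain the literal "regex filter",
# so only doc 1 survives pruning and is handed to the full regex.
print(index.search("regex filter", r"regex filter\w*"))  # [1]
```

The full regex never touches documents whose posting lists exclude them, which is exactly why the index only needs to be a filtering (candidate-generating) structure rather than an exact one.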
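To see why capping the gram length keeps the number of unique grams manageable: a corpus over an alphabet Σ can contain at most |Σ|^n distinct n-grams, and real text contains far fewer. A quick sketch (the corpus here is made up for illustration):

```python
from itertools import chain

def ngrams(text, n):
    """All length-n substrings of a document."""
    return (text[i:i + n] for i in range(len(text) - n + 1))

def unique_gram_count(corpus, n):
    """Number of distinct n-grams across the whole corpus."""
    return len(set(chain.from_iterable(ngrams(doc, n) for doc in corpus)))

# Made-up corpus standing in for a string column's values.
corpus = [
    "error: connection reset by peer",
    "error: connection timed out",
    "warn: retrying connection",
]

alphabet = set("".join(corpus))
for n in (2, 3, 4):
    count = unique_gram_count(corpus, n)
    # Distinct n-grams are bounded both by |alphabet|**n and by the
    # total number of gram positions in the corpus.
    bound = min(len(alphabet) ** n, sum(len(doc) - n + 1 for doc in corpus))
    print(f"n={n}: {count} unique grams (bound {bound})")
```

For short grams the |Σ|^n bound dominates; for longer grams the count approaches the number of gram positions in the corpus, which is why a small maxLength (3, occasionally 4) is the main size knob.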