chenboat commented on PR #16364: URL: https://github.com/apache/pinot/pull/16364#issuecomment-3114365597
> @chenboat @deemoliu : is there a design doc for this? There's quite a lot that can be done with n-grams so it might be best to dive deeper here. (e.g. we can use n-grams to optimize regex queries based on Russ Cox's approach). I put the first version of mini-design doc and background study in the github issue https://github.com/apache/pinot/issues/16294. I think we can have further discussion over there because that is the umbrella issue. Regarding Russ Cox's approach, are your referring to [Regular Expression Matching with a Trigram Index](Regular Expression Matching with a Trigram Index). I think his approach is the same as ours and even the parameter for ngram length of 3 (they call it trigram) is the same we tested in production and recommend the comments above. As the author summarize: `Regular expression matching over large numbers of small documents can be made fast by using an index of the trigrams present in each document. ... We can run this query against the trigram index to identify a set of candidate documents and then run the full regular expression search against only those documents.` > > Also regarding index size explosion, we can also discuss how we will prevent this in production and what kind of knobs/configs we will support. e.g. Our primary recommendation is use short ngram length (maxLength 3 recommended) to control the ngram index size. In addition, we might add string column maxlength control to make sure we do not index long strings. With this, the n-gram index size is really reasonable when we test in production. This is also consistent with Russ Cox's report as he reported `The index file contains a list of paths (for reindexing, as described above), a list of indexed files, and then the same posting lists we saw at the beginning of this article, one for each trigram. In practice, this index tends to be around 20% of the size of the files being indexed.` > > * Would we add limits on max number of unique grams that we will track? 
> * Would we allow users to configure in the table config specific gram combinations that they want to force to be tracked, etc.?
> * Should we consider adopting a sparse gram approach like the one used by GitHub in their code-search implementation?

Given the n-gram length limit, the n-gram filtering index needs to index all n-grams of that length to achieve correct filtering, so we cannot put a limit on the number of unique grams. But again, with a reasonable n-gram maxLength limit (3 recommended, or 4 in rare cases), we will not run into a scenario with too many unique grams in practice.
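To make the candidate-filtering idea from Cox's article concrete, here is a minimal, self-contained sketch (not Pinot's implementation): a toy inverted index from trigrams to document ids, where the trigrams of a required literal prune the documents before the full regex runs. The `TrigramIndex` class and the hand-supplied literal are illustrative assumptions; Cox's tool derives the trigram query from the regex automatically.

```python
import re
from collections import defaultdict

def trigrams(text):
    """The set of 3-character grams in a document."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

class TrigramIndex:
    """Toy inverted index mapping each trigram to the doc ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = []

    def add(self, text):
        doc_id = len(self.docs)
        self.docs.append(text)
        for gram in trigrams(text):
            self.postings[gram].add(doc_id)

    def search(self, literal, pattern):
        """Prune with the trigrams of a required literal, then run the regex."""
        grams = trigrams(literal)
        if not grams:
            candidates = set(range(len(self.docs)))  # no pruning possible
        else:
            # A matching doc must contain every trigram of the literal,
            # so intersect their posting lists to get the candidate set.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams))
        regex = re.compile(pattern)
        return [i for i in candidates if regex.search(self.docs[i])]

index = TrigramIndex()
index.add("pinot supports inverted indexes")
index.add("ngram index speeds up regex filtering")
index.add("unrelated document")

# Any match of r"regex filter\w*" must contain the literal "regex filter",
# so only doc 1 survives pruning and is handed to the full regex.
print(index.search("regex filter", r"regex filter\w*"))  # [1]
```

The full regex never touches documents whose posting lists exclude them, which is exactly why the index only needs to be a filtering (candidate-generating) structure rather than an exact one.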
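To see why capping the gram length keeps the number of unique grams manageable: a corpus over an alphabet Σ can contain at most |Σ|^n distinct n-grams, and real text contains far fewer. A quick sketch (the corpus here is made up for illustration):

```python
from itertools import chain

def ngrams(text, n):
    """All length-n substrings of a document."""
    return (text[i:i + n] for i in range(len(text) - n + 1))

def unique_gram_count(corpus, n):
    """Number of distinct n-grams across the whole corpus."""
    return len(set(chain.from_iterable(ngrams(doc, n) for doc in corpus)))

# Made-up corpus standing in for a string column's values.
corpus = [
    "error: connection reset by peer",
    "error: connection timed out",
    "warn: retrying connection",
]

alphabet = set("".join(corpus))
for n in (2, 3, 4):
    count = unique_gram_count(corpus, n)
    # Distinct n-grams are bounded both by |alphabet|**n and by the
    # total number of gram positions in the corpus.
    bound = min(len(alphabet) ** n, sum(len(doc) - n + 1 for doc in corpus))
    print(f"n={n}: {count} unique grams (bound {bound})")
```

For short grams the |Σ|^n bound dominates; for longer grams the count approaches the number of gram positions in the corpus, which is why a small maxLength (3, occasionally 4) is the main size knob.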