atris commented on issue #7395: URL: https://github.com/apache/pinot/issues/7395#issuecomment-918968144
Thanks for reviewing the document, @siddharthteotia! Here are my thoughts:

Current text search infrastructure: As things stand, we simply build sidecar Lucene indices and expose a UDF that lets users specify Lucene queries. IMO, this component should ideally live outside of Pinot, since it is not tied to Pinot itself. So the eventual goal is to move text search to native Pinot indices and dictionaries, and to follow SQL standard syntax (the LIKE operator) as much as possible.

Now, coming to the FST itself. There are three reasons why a native FST makes sense:

1. Flexibility and control: Lucene is a full-fledged search library. It is built for generic text search use cases and includes capabilities such as ranked retrieval, norm storage and impact filtering, to name a few. None of these are relevant to us, since we do not perform ranking. As I mentioned before, if we are building our text search capabilities on top of Pinot data structures, then pulling in Lucene just for the FST is overkill, and it also stops us from making any changes we may want to make. Lucene's FST is a generic engine, not optimized for our use case (only dictionary IDs as output symbols, with the primary query load being prefix and suffix matches from the LIKE operator). Other improvements may or may not come later, but if we do not move to a native implementation, we rule out the possibility of any such improvements.

2. Ability to perform Pinot-specific optimizations: As stated in the point above, with Lucene's FST we cannot make targeted changes or enhancements. For example, it should be possible to short-circuit the evaluation of regular expressions that consist of a short prefix followed by a match-all, thus accelerating a common use of the LIKE operator.

3. Realtime capabilities: Lucene builds the FST during segment flush, which forces us to flush frequently. It also prevents us from doing realtime text search, which is a limitation.
With a native FST implementation, we should be able to explore this path.

Regarding TEXT_MATCH: while it is my dearest wish to deprecate the module, I understand that some users may wish to keep using it. As highlighted, both indices can coexist, with no mandate to migrate from one to the other.
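
To make the short-circuit idea from point 2 concrete, here is a rough sketch in Python. It is not Pinot code and not the proposed FST. It assumes a sorted dictionary in which a value's position is its dictionary ID, and every function name is hypothetical. A pure prefix pattern such as `ban%` is answered with two binary searches; anything else falls back to a regex scan. LIKE escape characters are not handled.

```python
import bisect
import re
from typing import List, Optional, Tuple


def like_prefix(pattern: str) -> Optional[str]:
    """Return the literal prefix if the pattern is 'literal%', else None."""
    if not pattern.endswith("%"):
        return None
    prefix = pattern[:-1]
    # Any other wildcard means this is not a pure prefix match.
    if "%" in prefix or "_" in prefix:
        return None
    return prefix


def prefix_dict_id_range(sorted_values: List[str], prefix: str) -> Tuple[int, int]:
    """Return the half-open [lo, hi) range of dictionary IDs whose value starts with prefix."""
    n = len(prefix)
    lo = bisect.bisect_left(sorted_values, prefix)
    # Find the first value whose first n characters sort after the prefix.
    left, right = lo, len(sorted_values)
    while left < right:
        mid = (left + right) // 2
        if sorted_values[mid][:n] <= prefix:
            left = mid + 1
        else:
            right = mid
    return lo, left


def like_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a LIKE pattern to a regex ('%' -> '.*', '_' -> '.')."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def matching_dict_ids(sorted_values: List[str], pattern: str) -> List[int]:
    """Return the dictionary IDs matching a LIKE pattern, short-circuiting pure prefixes."""
    prefix = like_prefix(pattern)
    if prefix is not None:
        # Short circuit: no regex or automaton evaluation is needed.
        lo, hi = prefix_dict_id_range(sorted_values, prefix)
        return list(range(lo, hi))
    # General case: evaluate the pattern against every dictionary entry.
    regex = like_to_regex(pattern)
    return [i for i, v in enumerate(sorted_values) if regex.fullmatch(v)]
```

With the dictionary `["apple", "apply", "banana", "band", "bandana", "cat"]`, the pattern `ban%` takes the binary-search path, while `%ana` takes the scan. A native FST would presumably replace the fallback scan; the sketch only illustrates why owning the evaluation path makes such special cases possible.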