atris commented on issue #7395:
URL: https://github.com/apache/pinot/issues/7395#issuecomment-918968144


   Thanks for reviewing the document, @siddharthteotia !
   
   Here are my thoughts:
   
   Current text search infrastructure: Today, we simply build sidecar Lucene 
indices and expose a UDF that lets users specify Lucene queries. IMO, this 
component should ideally live outside of Pinot, since it has no real 
connection to Pinot itself.
   
   So, the eventual goal is to move text search onto native Pinot indices and 
the dictionary, and to follow SQL standard syntax (the LIKE operator) as much 
as possible.
   
   Now, coming to the FST itself. There are three reasons why a native FST 
makes sense:
   
   1. Flexibility and Control -- Lucene is a full-fledged search library. It is 
built for generic text search use cases and includes capabilities such as 
ranked retrieval, norm storage and impact filtering, to name a few. None of 
these are relevant to us since we do not perform ranking. As I mentioned 
before, if we are building our text search capabilities on top of Pinot data 
structures, then pulling in Lucene just for the FST is overkill, and it also 
prevents us from making any changes that we may wish to make. Lucene's FST is 
a generic engine, not optimized for our use cases (only dictionary IDs as 
output symbols, with the primary query load being prefix and suffix matches 
from the LIKE operator). Other improvements may or may not come in later, but 
if we do not move to our own native implementation, we rule out the 
possibility of any such improvements.
   
   2. Ability to perform Pinot specific optimizations -- As stated in the point 
above, with Lucene's FST it is not possible for us to make Pinot-specific 
changes or enhancements. For example, it should be possible to short-circuit 
the evaluation of regular expressions that consist of a short prefix followed 
by a match-all, thus accelerating a common use case of the LIKE operator.
   
   3. Realtime Capabilities -- Lucene builds the FST during segment flush, which 
forces us to flush frequently. It also prevents us from doing real-time text 
search, which is a limitation. With a native FST implementation, we should be 
able to explore this path.
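
   To make the short-circuit idea in point 2 concrete, here is a minimal 
sketch in Python. It is only an illustration, not Pinot's or Lucene's 
implementation: a sorted list of terms stands in for the dictionary, a binary 
search stands in for the FST traversal, and the names `literal_prefix_of` and 
`matching_dict_ids` are hypothetical. The point is that a regex of the form 
"literal prefix followed by match-all" can be answered from the contiguous 
range of terms sharing that prefix, without running the general regex engine.

```python
import re
from bisect import bisect_left

# Characters that make the head of the pattern more than a plain literal.
REGEX_META = r".^$*+?{}[]\|()"


def literal_prefix_of(regex):
    """Return the literal prefix if regex looks like '<literal>.*', else None."""
    if not regex.endswith(".*"):
        return None
    head = regex[:-2]
    if any(ch in REGEX_META for ch in head):
        return None
    return head


def matching_dict_ids(sorted_terms, regex):
    """Return the dictionary IDs (positions in sorted_terms) matching regex."""
    prefix = literal_prefix_of(regex)
    if prefix is not None:
        # Fast path: in a sorted dictionary, all terms sharing a prefix
        # are contiguous, so we only need to find the start of the range
        # and walk forward while the prefix still matches.
        lo = bisect_left(sorted_terms, prefix)
        hi = lo
        while hi < len(sorted_terms) and sorted_terms[hi].startswith(prefix):
            hi += 1
        return list(range(lo, hi))
    # Slow path: evaluate the full regex against every term.
    compiled = re.compile(regex)
    return [i for i, term in enumerate(sorted_terms) if compiled.fullmatch(term)]
```

   In this sketch, a LIKE 'ban%' predicate (assumed here to be translated to 
the regex `ban.*`) takes the fast path, while a pattern such as `.*ana` falls 
back to full evaluation.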
   
   Regarding TEXT_MATCH, while it is my dearest wish to deprecate the module, I 
understand that some users may wish to keep using it. As highlighted, both 
indices can coexist, with no mandate to migrate from one to the other.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


