atris commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1156103702

   @gortiz There has been a long comment trail on this PR so I am summarising 
my responses in this comment:
   
   1. Performance -- Like any other index, FST indices also have their worst 
case. FST is an automaton, and in an extreme scenario where a high number of 
branches are traversed for a query, the performance of the index will approach 
the brute force approximation.
   
   In this case, the query you tried had an abnormally large number of 
branches, to the extent that Lucene decided to not even try and process the 
query. The average user rarely fires such queries, and in the corner cases 
where they do, they should be ok with a brute force latency, as long as their 
99th percentile queries keep performing significantly better, which is the case 
as is evident from the benchmarks that test the FST index.
   
   2. Language Semantics -- Lucene has, for long, published its regex dialect 
clearly. Engines based on Lucene, such as ElasticSearch and Apache Solr, follow 
the same dialect 
(https://help.oncrawl.com/en/articles/1685982-lucene-regex-cheat-sheet)
   
   It is arguable whether this is correct or not, but we have to keep 
historical perspective in mind, since many Pinot users migrate from one of 
these engines.
   
   Please note that, from native FST index's perspective, it is pretty simple 
to support java regex language. The reason I did not do that was because the 
existing FST index was based on Lucene and thus supported Lucene's regex 
dialect. For back compatibility, native FST index supports the same dialect.
   
   3. Parsing And Reporting Invalid Regexes: I agree that in an ideal scenario, 
we should be parsing regexes and reporting characters that we do not accept. 
However, given the complexity of an average regex, the time complexity of doing 
this will be non trivial.
   
   4. Specific To This PR: I looked into the PR and while the effort is 
appreciative, I do not see significant performance improvement. Also, the API 
surface changes required for this change are high, which makes me a bit uneasy.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

Reply via email to