atris commented on PR #8818: URL: https://github.com/apache/pinot/pull/8818#issuecomment-1156103702
@gortiz There has been a long comment trail on this PR so I am summarising my responses in this comment: 1. Performance -- Like any other index, FST indices also have their worst case. FST is an automaton, and in an extreme scenario where a high number of branches are traversed for a query, the performance of the index will approach the brute force approximation. In this case, the query you tried had an abnormally large number of branches, to the extent that Lucene decided to not even try and process the query. The average user rarely fires such queries, and in the corner cases where they do, they should be ok with a brute force latency, as long as their 99th percentile queries keep performing significantly better, which is the case as is evident from the benchmarks that test the FST index. 2. Language Semantics -- Lucene has, for long, published its regex dialect clearly. Engines based on Lucene, such as ElasticSearch and Apache Solr, follow the same dialect (https://help.oncrawl.com/en/articles/1685982-lucene-regex-cheat-sheet) It is arguable whether this is correct or not, but we have to keep historical perspective in mind, since many Pinot users migrate from one of these engines. Please note that, from native FST index's perspective, it is pretty simple to support java regex language. The reason I did not do that was because the existing FST index was based on Lucene and thus supported Lucene's regex dialect. For back compatibility, native FST index supports the same dialect. 3. Parsing And Reporting Invalid Regexes: I agree that in an ideal scenario, we should be parsing regexes and reporting characters that we do not accept. However, given the complexity of an average regex, the time complexity of doing this will be non trivial. 4. Specific To This PR: I looked into the PR and while the effort is appreciative, I do not see significant performance improvement. Also, the API surface changes required for this change are high, which makes me a bit uneasy. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org