gortiz commented on PR #8818: URL: https://github.com/apache/pinot/pull/8818#issuecomment-1152506922
My last commit includes a JMH benchmark. I found several interesting things. Please take a look, some of my conclusions are so shocking that it wouldn't surprise me if another pair of eyes discovered that I did something wrong. TL;DR: I think we should apply this optimization only when executed in columns without indexes. This may make the implementation more difficult, but the performance gain may be very large. Benchmarks are: - optimal1Like: `SELECT INT_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'domain0'` - optimal1Regex: `SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, 'domain0')` - optimal10: `SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, 'domain\d')` - increasing10Like: `SELECT INT_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain0%' or DOMAIN_NAMES LIKE '%domain1%' or ... or DOMAIN_NAMES LIKE '%domain9%'` - increasing10Regex: `SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES LIKE, '^.domain0.*$') or ... or regexp_like(DOMAIN_NAMES LIKE, '^.domain9.*$')` - increasing10Fusing: `SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')` - decreasing9Like, decreasing9Regex and decreasing9Fusing: Are like `increasing10Like`, `increasing10Regex` and `increasing10Fusing` but going from 9 to 1 - Note: they have in fact 9 elements. This a mistake. Initially queries where created dynamically and I did it wrong. But it was a lucky mistake that shows that Lucene fails when there are 10 fused expression but it doesn't fail when there are 9. Last results I´ve got were: ``` Benchmark (_fstType) Mode Cnt Score Error Units BenchmarkFuseRegexp.decreasing9Fusing LUCENE avgt 5 24.198 ± 13.439 ms/op BenchmarkFuseRegexp.decreasing9Fusing NATIVE avgt 5 0.263 ± 0.019 ms/op BenchmarkFuseRegexp.decreasing9Fusing null avgt 5 0.476 ± 0.040 ms/op BenchmarkFuseRegexp.decreasing9Like LUCENE avgt 5 0.615 ± 0.056 ms/op BenchmarkFuseRegexp.decreasing9Like NATIVE avgt 5 0.515 ± 0.111 ms/op BenchmarkFuseRegexp.decreasing9Like null avgt 5 3596.864 ± 1335.035 ms/op BenchmarkFuseRegexp.decreasing9Regex LUCENE avgt 5 0.593 ± 0.362 ms/op BenchmarkFuseRegexp.decreasing9Regex NATIVE avgt 5 0.422 ± 0.222 ms/op BenchmarkFuseRegexp.decreasing9Regex null avgt 5 1053.673 ± 379.729 ms/op BenchmarkFuseRegexp.increasing10Fusing LUCENE avgt 5 ERROR BenchmarkFuseRegexp.increasing10Fusing NATIVE avgt 5 0.326 ± 0.138 ms/op BenchmarkFuseRegexp.increasing10Fusing null avgt 5 0.226 ± 0.062 ms/op BenchmarkFuseRegexp.increasing10Like LUCENE avgt 5 0.696 ± 0.121 ms/op BenchmarkFuseRegexp.increasing10Like NATIVE avgt 5 0.540 ± 0.049 ms/op BenchmarkFuseRegexp.increasing10Like null avgt 5 3543.339 ± 183.168 ms/op BenchmarkFuseRegexp.increasing10Regex LUCENE avgt 5 0.590 ± 0.208 ms/op BenchmarkFuseRegexp.increasing10Regex NATIVE avgt 5 0.435 ± 0.159 ms/op BenchmarkFuseRegexp.increasing10Regex null avgt 5 1096.238 ± 289.928 ms/op BenchmarkFuseRegexp.optimal10 LUCENE avgt 5 0.169 ± 0.066 ms/op BenchmarkFuseRegexp.optimal10 NATIVE avgt 5 0.136 ± 0.011 ms/op BenchmarkFuseRegexp.optimal10 null avgt 5 0.149 ± 0.082 ms/op BenchmarkFuseRegexp.optimal1Like LUCENE avgt 5 0.179 ± 0.066 ms/op BenchmarkFuseRegexp.optimal1Like NATIVE avgt 5 0.134 ± 0.029 ms/op BenchmarkFuseRegexp.optimal1Like null avgt 5 470.387 ± 187.235 ms/op BenchmarkFuseRegexp.optimal1Regex LUCENE avgt 5 0.172 ± 0.029 ms/op BenchmarkFuseRegexp.optimal1Regex NATIVE avgt 5 0.141 ± 0.027 ms/op BenchmarkFuseRegexp.optimal1Regex null avgt 5 243.160 ± 59.722 ms/op ``` After analyzing it, there are some points that are quite clear: ## We cannot use this optimization with Lucene The cost is too high when the expression gets complex. It even with ` java.lang.RuntimeException: org.apache.lucene.util.automaton.TooComplexToDeterminizeException: Determinizing .*(?:^.*domain0.*$)|(?:^.*domain1.*$)|(?:^.*domain2.*$)|(?:^.*domain3.*$)|(?:^.*domain4.*$)|(?:^.*domain5.*$)|(?:^.*domain6.*$)|(?:^.*domain7.*$)|(?:^.*domain8.*$)|(?:^.*domain9.*$).* would result in more than 10000 states.` when the expression have 10 branches. ## In general `COL like 'whatever'` expression is 2x-3x worse than `regexp_like(COL, 'similar expression')` Reading the code it looks to me that Pinot automatically translate one to the other, but it that transformation is not applied when `BaseQueriesTest.getBrokerResponse` is called (which is great, because we can test the difference). I've also tried the controller UI interface and this issue is repeated there. Using the github_events dataset, ``` select * from githubEvents where regexp_like(actor_url, '^.*victor\d.*$') limit 1 ``` is quite faster than ``` select * from githubEvents where actor_url LIKE '%victor7%' limit 1 ``` ## Pinot regex execute faster wildcards than a single values It is quite clear in the optimal cases: ``` -- this is just a `where regexp_like(DOMAIN_NAMES, 'domain\d')` BenchmarkFuseRegexp.optimal10 null avgt 5 0.149 ± 0.082 ms/op -- this is just a `where regexp_like(DOMAIN_NAMES, 'domain0')` BenchmarkFuseRegexp.optimal1Regex null avgt 5 243.160 ± 59.722 ms/op ``` At the beginning I thought it was an error in my code, but it is easy to try that using the github_events dataset. ``` select * from githubEvents where regexp_like(actor_url, 'victor\d') limit 1 ``` Responds quite faster than ``` select * from githubEvents where regexp_like(actor_url, 'victor7') limit 1 ``` The difference is not the 2.000x that the benchmark shows, but is consistently between 2-3x. I'm guessing that the issue actually comes from Java regex engine, but it is just a hunch, I didn't actually test it. ## Native or Lucene indexes don't actually increase the performance of this regexp This benchmark wasn't focused on that and more studies should be done, but it seems clear that: ### When there are several regexp in the actual expression, indexes are very useful. Examples: ``` BenchmarkFuseRegexp.decreasing9Regex LUCENE avgt 5 0.593 ± 0.362 ms/op BenchmarkFuseRegexp.decreasing9Regex NATIVE avgt 5 0.422 ± 0.222 ms/op BenchmarkFuseRegexp.decreasing9Regex null avgt 5 1053.673 ± 379.729 ms/op BenchmarkFuseRegexp.increasing10Regex LUCENE avgt 5 0.590 ± 0.208 ms/op BenchmarkFuseRegexp.increasing10Regex NATIVE avgt 5 0.435 ± 0.159 ms/op BenchmarkFuseRegexp.increasing10Regex null avgt 5 1096.238 ± 289.928 ms/op ``` - When there is only one regexp in the actual expressions, indexes are not very useful. For example: ``` -- this actually has a single, quite complex regex expression BenchmarkFuseRegexp.decreasing9Fusing LUCENE avgt 5 24.198 ± 13.439 ms/op BenchmarkFuseRegexp.decreasing9Fusing NATIVE avgt 5 0.263 ± 0.019 ms/op BenchmarkFuseRegexp.decreasing9Fusing null avgt 5 0.476 ± 0.040 ms/op -- this actually has a single, quite complex regex expression BenchmarkFuseRegexp.increasing10Fusing LUCENE avgt 5 ERROR BenchmarkFuseRegexp.increasing10Fusing NATIVE avgt 5 0.326 ± 0.138 ms/op BenchmarkFuseRegexp.increasing10Fusing null avgt 5 0.226 ± 0.062 ms/op -- this is just a `where regexp_like(DOMAIN_NAMES, 'domain\d')` BenchmarkFuseRegexp.optimal10 LUCENE avgt 5 0.169 ± 0.066 ms/op BenchmarkFuseRegexp.optimal10 NATIVE avgt 5 0.136 ± 0.011 ms/op BenchmarkFuseRegexp.optimal10 null avgt 5 0.149 ± 0.082 ms/op -- this is just a `where regexp_like(DOMAIN_NAMES, 'domain0')` BenchmarkFuseRegexp.optimal1Regex LUCENE avgt 5 0.172 ± 0.029 ms/op BenchmarkFuseRegexp.optimal1Regex NATIVE avgt 5 0.141 ± 0.027 ms/op BenchmarkFuseRegexp.optimal1Regex null avgt 5 243.160 ± 59.722 ms/op ``` The last one shows a big difference, but it may be related to the issue with regex without wildcards that I discussed above. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org