gortiz commented on PR #8818: URL: https://github.com/apache/pinot/pull/8818#issuecomment-1155165270
I was able to spend more time on this topic today, and I have to say that my initial benchmark was completely useless. Apart from some typos in the queries (which may have had some impact), two problems completely fooled me:

- The implicit limit introduced by Pinot. As only the first 10 elements were returned, queries that actually found elements could finish faster.
- Something is wrong with the Lucene and Native indexes. I guess I'm not configuring them correctly, but they simply return 0 results when the regexp is a little bit complex (for example, when it has groups).

Therefore I've changed all the tests to return a count, and each iteration now verifies that the expected value is returned; if it is not, the benchmark is stopped. As a result, all fusing tests fail whenever there is an index. I hope this is caused by my test (i.e. I didn't configure the index properly) and not by a semantic discrepancy in how each index engine evaluates regexps, as that could mean that there are customers receiving false results.

These are the results I got. Note that some combinations are not shown, for example `decreasing9Fusing` with LUCENE or NATIVE. A missing combination means that it did not return what was expected (in all these particular cases, it returned 0 rows). Apart from the fusing cases, benchmarks like `optimal10`, which is `regexp_like(DOMAIN_NAMES, 'domain\d')`, also failed.

```
Benchmark                                  (_fstType)  Mode  Cnt     Score      Error  Units
BenchmarkFuseRegexp.decreasing9Fusing      null        avgt    5  1438.185 ±   73.963  ms/op
BenchmarkFuseRegexp.decreasing9Like        LUCENE      avgt    5    41.927 ±    0.984  ms/op
BenchmarkFuseRegexp.decreasing9Like        NATIVE      avgt    5    40.232 ±    5.597  ms/op
BenchmarkFuseRegexp.decreasing9Like        null        avgt    5  3957.902 ±  154.228  ms/op
BenchmarkFuseRegexp.decreasing9Regex       LUCENE      avgt    5    40.744 ±    2.059  ms/op
BenchmarkFuseRegexp.decreasing9Regex       NATIVE      avgt    5    61.443 ±    1.301  ms/op
BenchmarkFuseRegexp.decreasing9Regex       null        avgt    5  2116.747 ±  602.570  ms/op
BenchmarkFuseRegexp.increasing10Fusing     null        avgt    5  1320.869 ±   45.509  ms/op
BenchmarkFuseRegexp.increasing10Like       LUCENE      avgt    5    42.253 ±    0.695  ms/op
BenchmarkFuseRegexp.increasing10Like       NATIVE      avgt    5    39.049 ±    1.829  ms/op
BenchmarkFuseRegexp.increasing10Like       null        avgt    5  5071.207 ± 3134.887  ms/op
BenchmarkFuseRegexp.increasing10Regex      LUCENE      avgt    5    44.084 ±    2.737  ms/op
BenchmarkFuseRegexp.increasing10Regex      NATIVE      avgt    5    39.650 ±    1.567  ms/op
BenchmarkFuseRegexp.increasing10Regex      null        avgt    5  2318.479 ±   75.632  ms/op
BenchmarkFuseRegexp.optimal10              null        avgt    5   185.757 ±    3.426  ms/op
BenchmarkFuseRegexp.optimal10NotFound      LUCENE      avgt    5     0.212 ±    0.136  ms/op
BenchmarkFuseRegexp.optimal10NotFound      NATIVE      avgt    5     0.201 ±    0.140  ms/op
BenchmarkFuseRegexp.optimal10NotFound      null        avgt    5   248.525 ±   60.819  ms/op
BenchmarkFuseRegexp.optimal1Like           LUCENE      avgt    5    16.431 ±    3.533  ms/op
BenchmarkFuseRegexp.optimal1Like           NATIVE      avgt    5    16.586 ±    0.828  ms/op
BenchmarkFuseRegexp.optimal1Like           null        avgt    5   468.731 ±  185.623  ms/op
BenchmarkFuseRegexp.optimal1LikeNotFound   LUCENE      avgt    5     0.197 ±    0.062  ms/op
BenchmarkFuseRegexp.optimal1LikeNotFound   NATIVE      avgt    5     0.170 ±    0.072  ms/op
BenchmarkFuseRegexp.optimal1LikeNotFound   null        avgt    5   477.176 ±  129.445  ms/op
BenchmarkFuseRegexp.optimal1Regex          LUCENE      avgt    5    15.485 ±    0.517  ms/op
BenchmarkFuseRegexp.optimal1Regex          NATIVE      avgt    5    16.711 ±    0.857  ms/op
BenchmarkFuseRegexp.optimal1Regex          null        avgt    5   205.676 ±    6.921  ms/op
BenchmarkFuseRegexp.optimal1RegexNotFound  LUCENE      avgt    5     0.212 ±    0.123  ms/op
BenchmarkFuseRegexp.optimal1RegexNotFound  NATIVE      avgt    5     0.186 ±    0.029  ms/op
BenchmarkFuseRegexp.optimal1RegexNotFound  null        avgt    5   239.459 ±   13.846  ms/op
BenchmarkFuseRegexp.selective2Fusing       null        avgt    5   541.530 ±  168.173  ms/op
BenchmarkFuseRegexp.selective2Like         LUCENE      avgt    5    19.961 ±    0.937  ms/op
BenchmarkFuseRegexp.selective2Like         NATIVE      avgt    5    20.387 ±    1.953  ms/op
BenchmarkFuseRegexp.selective2Like         null        avgt    5   916.620 ±   96.702  ms/op
BenchmarkFuseRegexp.selective2Regex        LUCENE      avgt    5    23.366 ±    0.795  ms/op
BenchmarkFuseRegexp.selective2Regex        NATIVE      avgt    5    20.037 ±    0.601  ms/op
BenchmarkFuseRegexp.selective2Regex        null        avgt    5   389.965 ±    7.284  ms/op
```

I've also changed `RegexpPatternConverterUtils` (and its related test) so that it no longer adds a useless `^.*` at the beginning of the expression or `.*$` at the end. I repeated the benchmark with that change, and these are the results I got. As you can see, most benchmarks didn't change much, but the ones that use `LIKE` and have no index are about twice as fast, so they now achieve the same numbers as their equivalent `optimalXRegex`, as expected.

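To illustrate the conversion change described above, here is a minimal sketch. It is not the actual `RegexpPatternConverterUtils` code: the class and method names (`LikeToRegexSketch`, `likeToRegex`) are hypothetical, it assumes the resulting regex is evaluated with `Matcher.find()` (so an unanchored pattern already means "contains"), and it ignores `LIKE` escape sequences.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not Pinot code: converts a SQL LIKE pattern into a regex
// without emitting a redundant leading "^.*" or trailing ".*$".
final class LikeToRegexSketch {
  // Characters that must be escaped to be treated literally in a Java regex.
  private static final String REGEX_META = "\\.[]{}()*+?^$|";

  private LikeToRegexSketch() {}

  static String likeToRegex(String like) {
    int start = 0;
    int end = like.length();
    // Skip leading '%': with find() semantics they add nothing.
    while (start < end && like.charAt(start) == '%') {
      start++;
    }
    // A pattern made only of '%' matches everything: the empty regex always finds a match.
    if (start == end && end > 0) {
      return "";
    }
    // Skip trailing '%' for the same reason.
    while (end > start && like.charAt(end - 1) == '%') {
      end--;
    }
    // Anchor only the sides that did NOT have a wildcard.
    boolean anchorStart = start == 0;
    boolean anchorEnd = end == like.length();

    StringBuilder regex = new StringBuilder();
    if (anchorStart) {
      regex.append('^');
    }
    for (int i = start; i < end; i++) {
      char c = like.charAt(i);
      if (c == '%') {
        regex.append(".*"); // inner wildcard: any sequence of characters
      } else if (c == '_') {
        regex.append('.'); // single-character wildcard
      } else {
        if (REGEX_META.indexOf(c) >= 0) {
          regex.append('\\'); // escape regex metacharacters
        }
        regex.append(c);
      }
    }
    if (anchorEnd) {
      regex.append('$');
    }
    return regex.toString();
  }

  public static void main(String[] args) {
    for (String like : new String[] {"%domain%", "domain%", "%.com", "a%b_"}) {
      String regex = likeToRegex(like);
      boolean found = Pattern.compile(regex).matcher("www.domain1.com").find();
      System.out.println(like + " -> " + regex + " (matches www.domain1.com: " + found + ")");
    }
  }
}
```

With this approach `%domain%` becomes just `domain` instead of `^.*domain.*$`, which gives the regex engine less work to do on every value.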
```
Benchmark                                  (_fstType)  Mode  Cnt     Score      Error  Units
BenchmarkFuseRegexp.decreasing9Fusing      null        avgt    5  1365.403 ±   91.822  ms/op
BenchmarkFuseRegexp.decreasing9Like        LUCENE      avgt    5    40.553 ±    1.394  ms/op
BenchmarkFuseRegexp.decreasing9Like        NATIVE      avgt    5    39.057 ±    0.905  ms/op
BenchmarkFuseRegexp.decreasing9Like        null        avgt    5  2046.090 ±   40.428  ms/op
BenchmarkFuseRegexp.decreasing9Regex       LUCENE      avgt    5    42.584 ±    2.387  ms/op
BenchmarkFuseRegexp.decreasing9Regex       NATIVE      avgt    5    61.583 ±    1.893  ms/op
BenchmarkFuseRegexp.decreasing9Regex       null        avgt    5  2054.523 ±   75.452  ms/op
BenchmarkFuseRegexp.increasing10Fusing     null        avgt    5  1305.249 ±   65.576  ms/op
BenchmarkFuseRegexp.increasing10Like       LUCENE      avgt    5    43.328 ±    2.040  ms/op
BenchmarkFuseRegexp.increasing10Like       NATIVE      avgt    5    61.580 ±    1.149  ms/op
BenchmarkFuseRegexp.increasing10Like       null        avgt    5  2272.112 ±   99.393  ms/op
BenchmarkFuseRegexp.increasing10Regex      LUCENE      avgt    5    54.319 ±    2.048  ms/op
BenchmarkFuseRegexp.increasing10Regex      NATIVE      avgt    5    39.080 ±    1.817  ms/op
BenchmarkFuseRegexp.increasing10Regex      null        avgt    5  2267.062 ±   49.525  ms/op
BenchmarkFuseRegexp.optimal10              null        avgt    5   187.012 ±   13.054  ms/op
BenchmarkFuseRegexp.optimal10NotFound      LUCENE      avgt    5     0.210 ±    0.103  ms/op
BenchmarkFuseRegexp.optimal10NotFound      NATIVE      avgt    5     0.201 ±    0.066  ms/op
BenchmarkFuseRegexp.optimal10NotFound      null        avgt    5   242.423 ±   17.612  ms/op
BenchmarkFuseRegexp.optimal1Like           LUCENE      avgt    5    16.395 ±    0.884  ms/op
BenchmarkFuseRegexp.optimal1Like           NATIVE      avgt    5    15.506 ±    0.591  ms/op
BenchmarkFuseRegexp.optimal1Like           null        avgt    5   209.389 ±   11.456  ms/op
BenchmarkFuseRegexp.optimal1LikeNotFound   LUCENE      avgt    5     0.209 ±    0.129  ms/op
BenchmarkFuseRegexp.optimal1LikeNotFound   NATIVE      avgt    5     0.172 ±    0.021  ms/op
BenchmarkFuseRegexp.optimal1LikeNotFound   null        avgt    5   245.542 ±   35.323  ms/op
BenchmarkFuseRegexp.optimal1Regex          LUCENE      avgt    5    15.421 ±    0.804  ms/op
BenchmarkFuseRegexp.optimal1Regex          NATIVE      avgt    5    15.908 ±    0.491  ms/op
BenchmarkFuseRegexp.optimal1Regex          null        avgt    5   204.505 ±   13.134  ms/op
BenchmarkFuseRegexp.optimal1RegexNotFound  LUCENE      avgt    5     0.210 ±    0.089  ms/op
BenchmarkFuseRegexp.optimal1RegexNotFound  NATIVE      avgt    5     0.203 ±    0.107  ms/op
BenchmarkFuseRegexp.optimal1RegexNotFound  null        avgt    5   235.262 ±    8.533  ms/op
BenchmarkFuseRegexp.selective2Fusing       null        avgt    5   517.545 ±  193.977  ms/op
BenchmarkFuseRegexp.selective2Like         LUCENE      avgt    5    23.622 ±    1.178  ms/op
BenchmarkFuseRegexp.selective2Like         NATIVE      avgt    5    20.547 ±    0.778  ms/op
BenchmarkFuseRegexp.selective2Like         null        avgt    5   399.638 ±   54.746  ms/op
BenchmarkFuseRegexp.selective2Regex        LUCENE      avgt    5    23.883 ±    0.802  ms/op
BenchmarkFuseRegexp.selective2Regex        NATIVE      avgt    5    20.113 ±    0.858  ms/op
BenchmarkFuseRegexp.selective2Regex        null        avgt    5   416.987 ±   28.789  ms/op
```

With these results, the optimization introduced in this PR doesn't seem to be amazing, but to be honest I'm quite worried about what is going on with regexp and the Lucene/Native indexes. As said above, I hope the cause is that the benchmark is not configuring the indexes correctly, but even so, it seems very dangerous that an index can be configured in a way that changes the semantics of the queries.

@Jackie-Jiang @atris I would really appreciate it if either of you could take a look at the benchmark to help find out what is going on.
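
To make concrete the kind of semantic discrepancy mentioned above, here is a minimal sketch that uses only `java.util.regex`. It does not touch Pinot, Lucene or the native index; the class name `RegexSemanticsSketch`, the helper names and the sample values are made up for illustration. It only shows that the same pattern yields different counts depending on whether an engine looks for the pattern anywhere in the value (`find()`) or requires the whole value to match (`matches()`). Whether this explains the 0-row results reported above is not established here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: compares "contains" and "whole value" regex semantics.
final class RegexSemanticsSketch {
  private RegexSemanticsSketch() {}

  /** Counts values in which the pattern occurs anywhere (substring semantics). */
  static long countContains(Pattern pattern, List<String> values) {
    return values.stream().filter(v -> pattern.matcher(v).find()).count();
  }

  /** Counts values that the pattern matches in full (implicitly anchored semantics). */
  static long countWholeMatch(Pattern pattern, List<String> values) {
    return values.stream().filter(v -> pattern.matcher(v).matches()).count();
  }

  public static void main(String[] args) {
    // Made-up sample data, loosely modelled on the DOMAIN_NAMES column.
    List<String> values = Arrays.asList("www.domain1.com", "www.domain2.org", "www.example.com");
    Pattern pattern = Pattern.compile("domain\\d");

    System.out.println("contains:    " + countContains(pattern, values));
    System.out.println("whole match: " + countWholeMatch(pattern, values));
  }
}
```

A count-based check like the one the benchmark now performs would catch this kind of difference, because the two evaluation modes return different row counts for the same query.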