gortiz commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1155165270

   I was able to invest more time in this topic today, and I have to say that 
my initial benchmark was completely useless. Apart from some typos in the 
queries (which may have had some impact), there were two problems that 
completely fooled me:
   - The implicit limit introduced by Pinot. As only the first 10 rows were 
returned, queries that actually found rows could finish faster.
   - Something is wrong with the Lucene and Native indexes. I guess I'm not 
configuring them correctly, but they simply return 0 results when the regexp is 
a little bit complex (for example, when it contains groups).
   
   Therefore I've changed all the tests to return a count, and each iteration 
now verifies that the expected value is returned; if it is not, the benchmark 
is stopped. As a result, all fusing tests fail whenever there is an index. I 
hope this is caused by my test (i.e. I didn't configure the index properly) and 
not by some semantic discrepancy in how each index engine evaluates regexps, as 
that could mean that there are customers receiving false results.
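
   The verification step can be sketched roughly as follows. This is a minimal, 
plain-Java illustration and not the code in the PR: `CountCheckSketch`, 
`checkedCount` and `countMatches` are hypothetical names, and the in-memory 
list stands in for a `COUNT(*)` query against a Pinot segment.

```java
import java.util.List;
import java.util.function.LongSupplier;
import java.util.regex.Pattern;

public class CountCheckSketch {

    /** Runs the query and fails loudly when the count differs from the expected one. */
    public static long checkedCount(LongSupplier countQuery, long expected) {
        long actual = countQuery.getAsLong();
        if (actual != expected) {
            // Stopping here keeps a wrong result from being reported as a fast one.
            throw new IllegalStateException(
                "Expected " + expected + " rows but got " + actual);
        }
        return actual;
    }

    /** Hypothetical stand-in for a COUNT(*) query with a regexp filter. */
    public static long countMatches(List<String> values, String regex) {
        Pattern pattern = Pattern.compile(regex);
        return values.stream().filter(v -> pattern.matcher(v).find()).count();
    }

    public static void main(String[] args) {
        List<String> domains =
            List.of("www.domain1.com", "www.domain2.com", "www.other.org");
        long count = checkedCount(() -> countMatches(domains, "domain\\d"), 2);
        System.out.println(count);
    }
}
```

   Returning a count instead of rows also sidesteps the implicit limit, because 
every matching row has to be visited before the query can answer.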
   
   These are the results I got. Note that some combinations are not shown, for 
example decreasing9Fusing with LUCENE or NATIVE. That means those combinations 
did not return what was expected (in all these particular cases, they returned 
0 rows). Note that, apart from the fusing cases, benchmarks like `optimal10`, 
which is `regexp_like(DOMAIN_NAMES, 'domain\d')`, also failed.
   
   ```
   Benchmark                                  (_fstType)  Mode  Cnt     Score      Error  Units
   BenchmarkFuseRegexp.decreasing9Fusing            null  avgt    5  1438.185 ±   73.963  ms/op
   BenchmarkFuseRegexp.decreasing9Like            LUCENE  avgt    5    41.927 ±    0.984  ms/op
   BenchmarkFuseRegexp.decreasing9Like            NATIVE  avgt    5    40.232 ±    5.597  ms/op
   BenchmarkFuseRegexp.decreasing9Like              null  avgt    5  3957.902 ±  154.228  ms/op
   BenchmarkFuseRegexp.decreasing9Regex           LUCENE  avgt    5    40.744 ±    2.059  ms/op
   BenchmarkFuseRegexp.decreasing9Regex           NATIVE  avgt    5    61.443 ±    1.301  ms/op
   BenchmarkFuseRegexp.decreasing9Regex             null  avgt    5  2116.747 ±  602.570  ms/op
   BenchmarkFuseRegexp.increasing10Fusing           null  avgt    5  1320.869 ±   45.509  ms/op
   BenchmarkFuseRegexp.increasing10Like           LUCENE  avgt    5    42.253 ±    0.695  ms/op
   BenchmarkFuseRegexp.increasing10Like           NATIVE  avgt    5    39.049 ±    1.829  ms/op
   BenchmarkFuseRegexp.increasing10Like             null  avgt    5  5071.207 ± 3134.887  ms/op
   BenchmarkFuseRegexp.increasing10Regex          LUCENE  avgt    5    44.084 ±    2.737  ms/op
   BenchmarkFuseRegexp.increasing10Regex          NATIVE  avgt    5    39.650 ±    1.567  ms/op
   BenchmarkFuseRegexp.increasing10Regex            null  avgt    5  2318.479 ±   75.632  ms/op
   BenchmarkFuseRegexp.optimal10                    null  avgt    5   185.757 ±    3.426  ms/op
   BenchmarkFuseRegexp.optimal10NotFound          LUCENE  avgt    5     0.212 ±    0.136  ms/op
   BenchmarkFuseRegexp.optimal10NotFound          NATIVE  avgt    5     0.201 ±    0.140  ms/op
   BenchmarkFuseRegexp.optimal10NotFound            null  avgt    5   248.525 ±   60.819  ms/op
   BenchmarkFuseRegexp.optimal1Like               LUCENE  avgt    5    16.431 ±    3.533  ms/op
   BenchmarkFuseRegexp.optimal1Like               NATIVE  avgt    5    16.586 ±    0.828  ms/op
   BenchmarkFuseRegexp.optimal1Like                 null  avgt    5   468.731 ±  185.623  ms/op
   BenchmarkFuseRegexp.optimal1LikeNotFound       LUCENE  avgt    5     0.197 ±    0.062  ms/op
   BenchmarkFuseRegexp.optimal1LikeNotFound       NATIVE  avgt    5     0.170 ±    0.072  ms/op
   BenchmarkFuseRegexp.optimal1LikeNotFound         null  avgt    5   477.176 ±  129.445  ms/op
   BenchmarkFuseRegexp.optimal1Regex              LUCENE  avgt    5    15.485 ±    0.517  ms/op
   BenchmarkFuseRegexp.optimal1Regex              NATIVE  avgt    5    16.711 ±    0.857  ms/op
   BenchmarkFuseRegexp.optimal1Regex                null  avgt    5   205.676 ±    6.921  ms/op
   BenchmarkFuseRegexp.optimal1RegexNotFound      LUCENE  avgt    5     0.212 ±    0.123  ms/op
   BenchmarkFuseRegexp.optimal1RegexNotFound      NATIVE  avgt    5     0.186 ±    0.029  ms/op
   BenchmarkFuseRegexp.optimal1RegexNotFound        null  avgt    5   239.459 ±   13.846  ms/op
   BenchmarkFuseRegexp.selective2Fusing             null  avgt    5   541.530 ±  168.173  ms/op
   BenchmarkFuseRegexp.selective2Like             LUCENE  avgt    5    19.961 ±    0.937  ms/op
   BenchmarkFuseRegexp.selective2Like             NATIVE  avgt    5    20.387 ±    1.953  ms/op
   BenchmarkFuseRegexp.selective2Like               null  avgt    5   916.620 ±   96.702  ms/op
   BenchmarkFuseRegexp.selective2Regex            LUCENE  avgt    5    23.366 ±    0.795  ms/op
   BenchmarkFuseRegexp.selective2Regex            NATIVE  avgt    5    20.037 ±    0.601  ms/op
   BenchmarkFuseRegexp.selective2Regex              null  avgt    5   389.965 ±    7.284  ms/op
   ```
   
   I've also changed `RegexpPatternConverterUtils` (and its related test) so 
that it no longer introduces a useless `^.*` at the beginning of the expression 
or `.*$` at the end. I've repeated the benchmark with that change and these are 
the results I got. As you can see, most benchmarks didn't change much, but the 
ones that use LIKE and have no index are about twice as fast, so they achieve 
the same numbers as their equivalent `optimalXRegex`, as expected.
   
   ```
   Benchmark                                  (_fstType)  Mode  Cnt     Score     Error  Units
   BenchmarkFuseRegexp.decreasing9Fusing            null  avgt    5  1365.403 ±  91.822  ms/op
   BenchmarkFuseRegexp.decreasing9Like            LUCENE  avgt    5    40.553 ±   1.394  ms/op
   BenchmarkFuseRegexp.decreasing9Like            NATIVE  avgt    5    39.057 ±   0.905  ms/op
   BenchmarkFuseRegexp.decreasing9Like              null  avgt    5  2046.090 ±  40.428  ms/op
   BenchmarkFuseRegexp.decreasing9Regex           LUCENE  avgt    5    42.584 ±   2.387  ms/op
   BenchmarkFuseRegexp.decreasing9Regex           NATIVE  avgt    5    61.583 ±   1.893  ms/op
   BenchmarkFuseRegexp.decreasing9Regex             null  avgt    5  2054.523 ±  75.452  ms/op
   BenchmarkFuseRegexp.increasing10Fusing           null  avgt    5  1305.249 ±  65.576  ms/op
   BenchmarkFuseRegexp.increasing10Like           LUCENE  avgt    5    43.328 ±   2.040  ms/op
   BenchmarkFuseRegexp.increasing10Like           NATIVE  avgt    5    61.580 ±   1.149  ms/op
   BenchmarkFuseRegexp.increasing10Like             null  avgt    5  2272.112 ±  99.393  ms/op
   BenchmarkFuseRegexp.increasing10Regex          LUCENE  avgt    5    54.319 ±   2.048  ms/op
   BenchmarkFuseRegexp.increasing10Regex          NATIVE  avgt    5    39.080 ±   1.817  ms/op
   BenchmarkFuseRegexp.increasing10Regex            null  avgt    5  2267.062 ±  49.525  ms/op
   BenchmarkFuseRegexp.optimal10                    null  avgt    5   187.012 ±  13.054  ms/op
   BenchmarkFuseRegexp.optimal10NotFound          LUCENE  avgt    5     0.210 ±   0.103  ms/op
   BenchmarkFuseRegexp.optimal10NotFound          NATIVE  avgt    5     0.201 ±   0.066  ms/op
   BenchmarkFuseRegexp.optimal10NotFound            null  avgt    5   242.423 ±  17.612  ms/op
   BenchmarkFuseRegexp.optimal1Like               LUCENE  avgt    5    16.395 ±   0.884  ms/op
   BenchmarkFuseRegexp.optimal1Like               NATIVE  avgt    5    15.506 ±   0.591  ms/op
   BenchmarkFuseRegexp.optimal1Like                 null  avgt    5   209.389 ±  11.456  ms/op
   BenchmarkFuseRegexp.optimal1LikeNotFound       LUCENE  avgt    5     0.209 ±   0.129  ms/op
   BenchmarkFuseRegexp.optimal1LikeNotFound       NATIVE  avgt    5     0.172 ±   0.021  ms/op
   BenchmarkFuseRegexp.optimal1LikeNotFound         null  avgt    5   245.542 ±  35.323  ms/op
   BenchmarkFuseRegexp.optimal1Regex              LUCENE  avgt    5    15.421 ±   0.804  ms/op
   BenchmarkFuseRegexp.optimal1Regex              NATIVE  avgt    5    15.908 ±   0.491  ms/op
   BenchmarkFuseRegexp.optimal1Regex                null  avgt    5   204.505 ±  13.134  ms/op
   BenchmarkFuseRegexp.optimal1RegexNotFound      LUCENE  avgt    5     0.210 ±   0.089  ms/op
   BenchmarkFuseRegexp.optimal1RegexNotFound      NATIVE  avgt    5     0.203 ±   0.107  ms/op
   BenchmarkFuseRegexp.optimal1RegexNotFound        null  avgt    5   235.262 ±   8.533  ms/op
   BenchmarkFuseRegexp.selective2Fusing             null  avgt    5   517.545 ± 193.977  ms/op
   BenchmarkFuseRegexp.selective2Like             LUCENE  avgt    5    23.622 ±   1.178  ms/op
   BenchmarkFuseRegexp.selective2Like             NATIVE  avgt    5    20.547 ±   0.778  ms/op
   BenchmarkFuseRegexp.selective2Like               null  avgt    5   399.638 ±  54.746  ms/op
   BenchmarkFuseRegexp.selective2Regex            LUCENE  avgt    5    23.883 ±   0.802  ms/op
   BenchmarkFuseRegexp.selective2Regex            NATIVE  avgt    5    20.113 ±   0.858  ms/op
   BenchmarkFuseRegexp.selective2Regex              null  avgt    5   416.987 ±  28.789  ms/op
   ```
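
   To illustrate why the leading `^.*` and trailing `.*$` are redundant, here 
is a small sketch of a LIKE-to-regexp conversion that omits them. It is not the 
actual `RegexpPatternConverterUtils` code: `LikeToRegexSketch` and 
`likeToRegex` are hypothetical names, escape handling is left out, and it 
assumes that the regexp is evaluated with find-style (partial match) semantics 
on single-line values.

```java
import java.util.regex.Pattern;

public class LikeToRegexSketch {

    /** Converts a LIKE pattern to a regexp, using anchors only where LIKE requires them. */
    public static String likeToRegex(String like) {
        int start = 0;
        int end = like.length();
        // Leading and trailing '%' mean "anything", which find() already allows.
        while (start < end && like.charAt(start) == '%') {
            start++;
        }
        while (end > start && like.charAt(end - 1) == '%') {
            end--;
        }
        boolean openStart = start > 0;
        boolean openEnd = end < like.length() || (openStart && start == end);

        StringBuilder regex = new StringBuilder();
        if (!openStart) {
            regex.append('^');
        }
        StringBuilder literal = new StringBuilder();
        for (int i = start; i < end; i++) {
            char c = like.charAt(i);
            if (c == '%' || c == '_') {
                appendLiteral(regex, literal);
                regex.append(c == '%' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        appendLiteral(regex, literal);
        if (!openEnd) {
            regex.append('$');
        }
        return regex.toString();
    }

    /** Quotes the pending literal text so regexp metacharacters are matched verbatim. */
    private static void appendLiteral(StringBuilder regex, StringBuilder literal) {
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
            literal.setLength(0);
        }
    }

    /** Find-style evaluation: the regexp may match anywhere in the value. */
    public static boolean find(String regex, String value) {
        return Pattern.compile(regex).matcher(value).find();
    }

    public static void main(String[] args) {
        String value = "www.domain1.com";
        // Both forms accept the same values; the second gives the engine less to scan.
        System.out.println(find("^.*domain1.*$", value));
        System.out.println(find(likeToRegex("%domain1%"), value));
        // A pattern without a leading '%' still needs its anchor.
        System.out.println(find(likeToRegex("domain%"), value));
    }
}
```

   With find-style matching, `^.*X.*$` and `X` accept the same single-line 
values, so dropping the affixes changes only the amount of work the regexp 
engine does, which is consistent with the unindexed LIKE numbers above.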
   
   With these results, the optimization introduced in this PR doesn't seem to 
be amazing, but to be honest I'm quite worried about what is going on with 
regexp and the Lucene/Native indexes. As said above, I hope the cause is that 
the benchmark is not configuring the indexes correctly, but even then, it seems 
very dangerous that an index can be configured in such a way that it changes 
the semantics of the queries.
   
   @Jackie-Jiang @atris I would really appreciate it if either of you could 
take a look at the benchmark to find out what is going on.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

