gortiz opened a new pull request, #8979:
URL: https://github.com/apache/pinot/pull/8979

   This is a quite large PR that should be marked as a Draft. The improvements 
included here are important, but I also have to change several APIs and there 
may be better ways to do it. In fact it is not mandatory to change these APIs, 
but most of them try to add order in a bunch of casting and assumptions that 
were  I find very confusing the abusive usage of casting and sorting 
assumptions that were scattered in the code. I think they are now clearer, but 
I understand that it is a subjective issue. The code is not properly tested and 
that is something I would like to solve before merging this.
   
   The current order-by code have some inefficiencies that may not be trivial 
to fix, but will significantly increase the performance. I discovered that by 
reading #8837, but this PR doesn't fix the performance problem detected in that 
issue. Instead, it is focused on queries like:
   
   ```
   SELECT whatever FROM table WHERE whatever ORDER BY <some_sorted_column> ASC, 
<some_unsorted_column_or_expression> LIMIT <some_value> OFFSET 
<some_other_value>
   ```
   
   As proved by the included benchmark, the performance gain is huge:
   
   ```
   
   Benchmark                      (_numRows)  (_partialOrderBy)  
(_primaryRepetitions)                                                           
                        (_query)  (_scenario)  Mode  Cnt   Score   Error  Units
   BenchmarkOrderByQueries.query     1500000               true                 
     1  SELECT SORTED_COL FROM MyTable ORDER BY SORTED_COL, 
LOW_CARDINALITY_STRING_COL LIMIT 1052     EXP(0.5)  avgt    5   0.992 ± 0.041  
ms/op
   BenchmarkOrderByQueries.query     1500000               true                 
  1000  SELECT SORTED_COL FROM MyTable ORDER BY SORTED_COL, 
LOW_CARDINALITY_STRING_COL LIMIT 1052     EXP(0.5)  avgt    5   1.588 ± 0.128  
ms/op
   BenchmarkOrderByQueries.query     1500000              false                 
     1  SELECT SORTED_COL FROM MyTable ORDER BY SORTED_COL, 
LOW_CARDINALITY_STRING_COL LIMIT 1052     EXP(0.5)  avgt    5  62.553 ± 0.606  
ms/op
   BenchmarkOrderByQueries.query     1500000              false                 
  1000  SELECT SORTED_COL FROM MyTable ORDER BY SORTED_COL, 
LOW_CARDINALITY_STRING_COL LIMIT 1052     EXP(0.5)  avgt    5  59.222 ± 5.772  
ms/op
   ```
   
   Before this PR, this kind of queries were executed in `O(SlogL)` where S is 
the number of rows in the segment and L the effective limit (calculated as 
`min(max_limit, offeset + limit)`). That is a waste which is more evident in 
degraded scenarios like having a segment that only contains a rows that are all 
distinct in the sorting column. In that case, we the query above should be 
executed in `O(L)`, as we only need to read the first L rows that match the 
predicate and return them because they would be already sorted. Of course that 
is not valid in the general case, but we can use the fact data is already 
partially sorted by `<some_sorted_column>`, so we don't need to read the whole 
segment.
   
   What this PR does is to determine how many already sorted expressions are a 
_prefix_ in the list of order-by expressions. In other words: this optimization 
is only applied if there is a number `0 < i < orderByExpr.length : 
orderByExpr[i] is sorted`. Where `is sorted` means that the expression is 
either a constant (which is quite useless, but still) or if it is a column that 
is sorted and the order is ASC. Pinot already has an optimization when that `i` 
is exactly equal to `orderByExpr.length`. This PR adds another one when `0 < i 
< orderByExpr.length`.
   
   What this PR does is to partition the segment into P [mathematical 
partitions](https://en.wikipedia.org/wiki/Partition_of_a_set#:~:text=In%20mathematics%2C%20a%20partition%20of,included%20in%20exactly%20one%20subset.)
 where two rows are in the same partition if and only if they the value of 
their _prefix_ columns is equal.
   
   As we know that these _prefix_ of expressions is sorted, we know that the 
rows of each partition are going to be consecutives. These partitions will be 
also sorted. Which means that partitions that contains smaller elements 
(compared with the _prefix_ of sorted order by expressions) are going to be 
found before than any other partition that is higher.
   
   Therefore, we can add the results to an unbound List and stop when:
   
   - Either there are no more rows that matched the predicate
   - Or the List has more than L elements AND there are no more elements in the 
partition.
   
   As all elements in the same partition are consecutives, we can easily know 
that a partition has finished when a value that doesn't belong to that 
partition is found.
   
   Once we stop, we just need to sort (using all order-by expressions) the List 
(whose size will lower or equal to L + size of the last included partition) and 
return the first L elements there.
   
   There is another interesting property: as elements of different partitions 
are already sorted, we don't need to sort the whole List once we finish. We can 
simply sort each partition when a new one is found, reducing the cost from 
`O(SlogL)` to `O(pPlog(P))` where P is the size of the bigger partition 
included in the result and p the number of partitions included.
   
   This optimization is ideal if the number of partitions is very high (close 
to the number of rows), as the cost will tend to be `O(L)`. If the number of 
partitions is very small, the cost is tend to be `O(Slog(L))`, which is the 
same we had before.
   
   Right now the optimization is disabled by default and have to be activated 
with a new option called partial-order-by, which is used in the benchmark to 
compare results.
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

Reply via email to