gortiz opened a new pull request, #8979: URL: https://github.com/apache/pinot/pull/8979
This is a quite large PR that should be marked as a Draft. The improvements included here are important, but I also have to change several APIs and there may be better ways to do it. In fact it is not mandatory to change these APIs, but most of them try to add order in a bunch of casting and assumptions that were I find very confusing the abusive usage of casting and sorting assumptions that were scattered in the code. I think they are now clearer, but I understand that it is a subjective issue. The code is not properly tested and that is something I would like to solve before merging this. The current order-by code have some inefficiencies that may not be trivial to fix, but will significantly increase the performance. I discovered that by reading #8837, but this PR doesn't fix the performance problem detected in that issue. Instead, it is focused on queries like: ``` SELECT whatever FROM table WHERE whatever ORDER BY <some_sorted_column> ASC, <some_unsorted_column_or_expression> LIMIT <some_value> OFFSET <some_other_value> ``` As proved by the included benchmark, the performance gain is huge: ``` Benchmark (_numRows) (_partialOrderBy) (_primaryRepetitions) (_query) (_scenario) Mode Cnt Score Error Units BenchmarkOrderByQueries.query 1500000 true 1 SELECT SORTED_COL FROM MyTable ORDER BY SORTED_COL, LOW_CARDINALITY_STRING_COL LIMIT 1052 EXP(0.5) avgt 5 0.992 ± 0.041 ms/op BenchmarkOrderByQueries.query 1500000 true 1000 SELECT SORTED_COL FROM MyTable ORDER BY SORTED_COL, LOW_CARDINALITY_STRING_COL LIMIT 1052 EXP(0.5) avgt 5 1.588 ± 0.128 ms/op BenchmarkOrderByQueries.query 1500000 false 1 SELECT SORTED_COL FROM MyTable ORDER BY SORTED_COL, LOW_CARDINALITY_STRING_COL LIMIT 1052 EXP(0.5) avgt 5 62.553 ± 0.606 ms/op BenchmarkOrderByQueries.query 1500000 false 1000 SELECT SORTED_COL FROM MyTable ORDER BY SORTED_COL, LOW_CARDINALITY_STRING_COL LIMIT 1052 EXP(0.5) avgt 5 59.222 ± 5.772 ms/op ``` Before this PR, this kind of queries were executed in `O(SlogL)` where S is the number of rows in the segment and L the effective limit (calculated as `min(max_limit, offeset + limit)`). That is a waste which is more evident in degraded scenarios like having a segment that only contains a rows that are all distinct in the sorting column. In that case, we the query above should be executed in `O(L)`, as we only need to read the first L rows that match the predicate and return them because they would be already sorted. Of course that is not valid in the general case, but we can use the fact data is already partially sorted by `<some_sorted_column>`, so we don't need to read the whole segment. What this PR does is to determine how many already sorted expressions are a _prefix_ in the list of order-by expressions. In other words: this optimization is only applied if there is a number `0 < i < orderByExpr.length : orderByExpr[i] is sorted`. Where `is sorted` means that the expression is either a constant (which is quite useless, but still) or if it is a column that is sorted and the order is ASC. Pinot already has an optimization when that `i` is exactly equal to `orderByExpr.length`. This PR adds another one when `0 < i < orderByExpr.length`. What this PR does is to partition the segment into P [mathematical partitions](https://en.wikipedia.org/wiki/Partition_of_a_set#:~:text=In%20mathematics%2C%20a%20partition%20of,included%20in%20exactly%20one%20subset.) where two rows are in the same partition if and only if they the value of their _prefix_ columns is equal. As we know that these _prefix_ of expressions is sorted, we know that the rows of each partition are going to be consecutives. These partitions will be also sorted. Which means that partitions that contains smaller elements (compared with the _prefix_ of sorted order by expressions) are going to be found before than any other partition that is higher. Therefore, we can add the results to an unbound List and stop when: - Either there are no more rows that matched the predicate - Or the List has more than L elements AND there are no more elements in the partition. As all elements in the same partition are consecutives, we can easily know that a partition has finished when a value that doesn't belong to that partition is found. Once we stop, we just need to sort (using all order-by expressions) the List (whose size will lower or equal to L + size of the last included partition) and return the first L elements there. There is another interesting property: as elements of different partitions are already sorted, we don't need to sort the whole List once we finish. We can simply sort each partition when a new one is found, reducing the cost from `O(SlogL)` to `O(pPlog(P))` where P is the size of the bigger partition included in the result and p the number of partitions included. This optimization is ideal if the number of partitions is very high (close to the number of rows), as the cost will tend to be `O(L)`. If the number of partitions is very small, the cost is tend to be `O(Slog(L))`, which is the same we had before. Right now the optimization is disabled by default and have to be activated with a new option called partial-order-by, which is used in the benchmark to compare results. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org