Re: [DISCUSS] CEP-39: Cost Based Optimizer

2024-01-02 Thread Ariel Weisberg
Hi,

I am burying the lede, but it's important to keep an eye on runtime-adaptive vs 
planning time optimization as the cost/benefits vary greatly between the two 
and runtime adaptive can be a game changer. Basically CBO optimizes for query 
efficiency and startup time at the expense of not handling some queries well 
and runtime adaptive is cheap/free for expensive queries and can handle cases 
that CBO can't.

Generally speaking I am +1 on the introduction of a CBO, since it seems like 
there exist things that would benefit from it materially (and many of the 
associated refactors/cleanup) and it aligns with my north star that includes 
joins.

Do we all have the same north star that Cassandra should eventually support 
joins? Just curious if that is controversial.

I don't feel like this CEP in particular should need to really nail down 
exactly how distributed estimates work since we can start with using local 
estimates as a proxy for the entire cluster and then improve. If someone has 
bandwidth to do a separate CEP for that then sure that would be great, but this 
seems big enough in scope already.

RE testing, continuity of performance of queries is going to be really 
important. I would really like to see that we have fuzzed the space 
deterministically and via a collection of hand rolled cases, and can compare 
performance between versions to catch queries that regress. Hopefully we can 
agree on a baseline for releasing where we know what prior release to compare 
to and what acceptable changes in performance are.
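
As a minimal sketch of the version-to-version comparison described above (not 
an existing Cassandra or Harry tool): assume each release has been run against 
the same fixed set of queries and that per-query latency samples were 
collected. The function name, the dictionary layout and the 1.2x threshold are 
all assumptions made for illustration.

```python
from statistics import median
from typing import Dict, List


def find_regressions(baseline: Dict[str, List[float]],
                     candidate: Dict[str, List[float]],
                     max_slowdown: float = 1.2) -> Dict[str, float]:
    """Return {query: slowdown ratio} for queries that got slower.

    A query is flagged when its median latency in the candidate release
    exceeds max_slowdown times its median latency in the baseline release.
    The threshold is a hypothetical default, not an agreed project value.
    """
    regressions = {}
    for query, base_samples in baseline.items():
        cand_samples = candidate.get(query)
        # Skip queries that were not measured in both releases.
        if not base_samples or not cand_samples:
            continue
        base = median(base_samples)
        if base <= 0:
            continue
        ratio = median(cand_samples) / base
        if ratio > max_slowdown:
            regressions[query] = ratio
    return regressions
```

The "acceptable change" in the paragraph above maps to max_slowdown here; 
the query set itself would come from the seeded fuzzer plus the hand rolled 
cases.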

RE prepared statements - It feels to me like trying to send the plan blob back 
and forth to get more predictable, but not absolutely predictable, plans is not 
worth it? Feels like a lot for an incremental improvement over a baseline that 
doesn't exist yet, IOW it doesn't feel like something for V1. Maybe it ends up 
in YAGNI territory.

The north star of predictable behavior for queries is a *very* important one 
because it means the world to users, but CBO is going to make mistakes all over 
the place. It's simply unachievable even with accurate statistics because it's 
very hard to tell how predicates will behave on a column.

This segues nicely into the importance of adaptive execution :-) It's how you 
rescue the queries that CBO doesn't handle well for any reason such as bugs, 
bad statistics, missing features. Re-ordering predicate evaluation, switching 
indexes, and re-ordering joins can all be done on the fly.
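
To illustrate the simplest of these, re-ordering predicate evaluation, here is 
a small sketch. It is not Cassandra code: the class name, the row 
representation (a dict) and the reorder_every interval are assumptions. It 
evaluates a conjunction of predicates and periodically re-sorts them so that 
the predicate observed to reject the most rows runs first.

```python
from typing import Callable, Dict, List

Predicate = Callable[[dict], bool]


class AdaptiveFilter:
    """Evaluates AND-ed predicates, re-ordering them from observed pass rates.

    Hypothetical illustration only. Because evaluation short-circuits, later
    predicates are only observed on rows that passed the earlier ones, so the
    rates are conditional estimates rather than true selectivities.
    """

    def __init__(self, predicates: List[Predicate], reorder_every: int = 100):
        self.predicates = list(predicates)
        self.reorder_every = reorder_every
        self.seen: Dict[int, int] = {id(p): 0 for p in self.predicates}
        self.passed: Dict[int, int] = {id(p): 0 for p in self.predicates}
        self.rows = 0

    def _pass_rate(self, p: Predicate) -> float:
        seen = self.seen[id(p)]
        # A predicate with no observations yet is assumed to pass everything.
        return self.passed[id(p)] / seen if seen else 1.0

    def matches(self, row: dict) -> bool:
        self.rows += 1
        if self.rows % self.reorder_every == 0:
            # Lowest pass rate (most selective) first; the sort is stable.
            self.predicates.sort(key=self._pass_rate)
        for p in self.predicates:
            self.seen[id(p)] += 1
            if not p(row):
                return False
            self.passed[id(p)] += 1
        return True
```

No statistics are needed up front: a bad initial order only costs some wasted 
evaluations until the first re-sort.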

CBO is really a performance optimization since adaptive approaches will allow 
any query to complete with some wasted resources.

If my pager were waking me up at night and I wanted to stem the bleeding I 
would reach for runtime adaptive over CBO because I know it will catch more 
cases even if it is slower to execute up front.

What is the nature of the queries we are looking to solve right now? Are they long 
running heavy hitters, or short queries that explode if run incorrectly, or a 
mix of both?

Ariel

On Tue, Dec 12, 2023, at 8:29 AM, Benjamin Lerer wrote:
> Hi everybody,
> 
> I would like to open the discussion on the introduction of a cost based 
> optimizer to allow Cassandra to pick the best execution plan based on the 
> data distribution, thereby improving the overall query performance.
> 
> This CEP should also lay the groundwork for the future addition of features 
> like joins, subqueries, OR/NOT and index ordering.
> 
> The proposal is here: 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-39%3A+Cost+Based+Optimizer
> 
> Thank you in advance for your feedback.


Re: Harry in-tree (Forked from "Long tests, Burn tests, Simulator tests, Fuzz tests - can we clarify the diffs?")

2024-01-02 Thread Ariel Weisberg
🥳🎉

Thanks for your work on this. Excited to have an easier way to write tests that 
leverage schema and data and that also cover more.

Ariel
On Sat, Dec 23, 2023, at 9:17 AM, Alex Petrov wrote:
> Thanks everyone, Harry is now in tree! Of course, that's just a small 
> milestone, hope it'll prove as useful as I expect it to be.
> 
> https://github.com/apache/cassandra/commit/439d1b122af334bf68c159b82ef4e4879c210bd5
> 
> Happy holidays!
> --Alex
> 
> On Sat, Dec 23, 2023, at 11:10 AM, Mick Semb Wever wrote:
>>
>>   
>>> I strongly believe that bringing Harry in-tree will help to lower the 
>>> barrier for fuzz testing and simplify co-development of Cassandra and Harry. 
>>> Previously, it has been rather difficult to debug edge cases because I had 
>>> to either re-compile an in-jvm dtest jar and bring it to Harry, or 
>>> re-compile a Harry jar and bring it to Cassandra, which is both tedious and 
>>> time consuming. Moreover, I believe we have missed at the very least one RT 
>>> regression [2] because Harry was not in-tree, as its tests would've caught 
>>> the issue even with the model that existed.
>>> 
>>> For other recently found issues, I think having Harry in-tree would have 
>>> substantially lowered the turnaround time, and allowed me to share repros 
>>> with developers of corresponding features much quicker.
>> 
>> 
>> Agree, looking forward to getting to know and writing Harry tests.  Thank 
>> you Alex, happy holidays :) 
>> 
> 


Re: [DISCUSS] CEP-39: Cost Based Optimizer

2024-01-02 Thread Benedict
The CEP expressly includes an item for coordinated cardinality estimation, by 
producing whole cluster summaries. I’m not sure if you addressed this in your 
feedback; it’s not clear what you’re referring to with distributed estimates, 
but avoiding this was expressly the driver of my suggestion to instead include 
the plan as a payload (which offers users some additional facilities). 


> On 2 Jan 2024, at 21:26, Ariel Weisberg wrote:
> 
> 
> Hi,
> 
> I am burying the lede, but it's important to keep an eye on runtime-adaptive 
> vs planning time optimization as the cost/benefits vary greatly between the 
> two and runtime adaptive can be a game changer. Basically CBO optimizes for 
> query efficiency and startup time at the expense of not handling some queries 
> well and runtime adaptive is cheap/free for expensive queries and can handle 
> cases that CBO can't.
> 
> Generally speaking I am +1 on the introduction of a CBO, since it seems like 
> there exist things that would benefit from it materially (and many of the 
> associated refactors/cleanup) and it aligns with my north star that includes 
> joins.
> 
> Do we all have the same north star that Cassandra should eventually support 
> joins? Just curious if that is controversial.
> 
> I don't feel like this CEP in particular should need to really nail down 
> exactly how distributed estimates work since we can start with using local 
> estimates as a proxy for the entire cluster and then improve. If someone has 
> bandwidth to do a separate CEP for that then sure that would be great, but 
> this seems big enough in scope already.
> 
> RE testing, continuity of performance of queries is going to be really 
> important. I would really like to see that we have fuzzed the space 
> deterministically and via a collection of hand rolled cases, and can compare 
> performance between versions to catch queries that regress. Hopefully we can 
> agree on a baseline for releasing where we know what prior release to compare 
> to and what acceptable changes in performance are.
> 
> RE prepared statements - It feels to me like trying to send the plan blob 
> back and forth to get more predictable, but not absolutely predictable, plans 
> is not worth it? Feels like a lot for an incremental improvement over a 
> baseline that doesn't exist yet, IOW it doesn't feel like something for V1. 
> Maybe it ends up in YAGNI territory.
> 
> The north star of predictable behavior for queries is a *very* important one 
> because it means the world to users, but CBO is going to make mistakes all 
> over the place. It's simply unachievable even with accurate statistics 
> because it's very hard to tell how predicates will behave on a column.
> 
> This segues nicely into the importance of adaptive execution :-) It's how you 
> rescue the queries that CBO doesn't handle well for any reason such as bugs, 
> bad statistics, missing features. Re-ordering predicate evaluation, switching 
> indexes, and re-ordering joins can all be done on the fly.
> 
> CBO is really a performance optimization since adaptive approaches will allow 
> any query to complete with some wasted resources.
> 
> If my pager were waking me up at night and I wanted to stem the bleeding I 
> would reach for runtime adaptive over CBO because I know it will catch more 
> cases even if it is slower to execute up front.
> 
> What is the nature of the queries we are looking to solve right now? Are they 
> long running heavy hitters, or short queries that explode if run incorrectly, 
> or a mix of both?
> 
> Ariel
> 
>> On Tue, Dec 12, 2023, at 8:29 AM, Benjamin Lerer wrote:
>> Hi everybody,
>> 
>> I would like to open the discussion on the introduction of a cost based 
>> optimizer to allow Cassandra to pick the best execution plan based on the 
>> data distribution, thereby improving the overall query performance.
>> 
>> This CEP should also lay the groundwork for the future addition of features 
>> like joins, subqueries, OR/NOT and index ordering.
>> 
>> The proposal is here: 
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-39%3A+Cost+Based+Optimizer
>> 
>> Thank you in advance for your feedback.
> 


Re: Harry in-tree (Forked from "Long tests, Burn tests, Simulator tests, Fuzz tests - can we clarify the diffs?")

2024-01-02 Thread Lorina Poland
Is there any user-facing documentation (for developers) that should be added? I 
note that you say there is "extensive documentation"; I presume that you are 
referring to the README.md in the repo?

If there is a desire to add documentation to the website, as opposed to the MD 
files in the repo, please reach out to me.

Thanks,
Lorina

On 2023/12/21 21:22:54 Alex Petrov wrote:
> Hey folks,
> 
> I am mostly done with a patch that brings Harry in-tree [1]. I will trigger 
> one more CI run overnight, and my intention was to merge it some time soon, 
> but I wanted to give a fair warning here, since this is a relatively large 
> patch. 
> 
> The good news for everyone: 
>   a) the patch touches no production code whatsoever, only test code (namely 
> in-jvm dtest code) that was already using Harry;
>   b) the only tests that are changed are ones that used a duplicate version 
> of the placement simulator we had both for testing TCM and in Harry;
>   c) in addition, I have converted 3 existing TCM tests to the new API to have 
> some base for examples/usage.
> 
> Since we have effectively been relying on this code for a while now, and the 
> intention now is to converge on: 
>   a) fewer different generators, with a shareable version of generators 
> for everyone to use across the code base
>   b) a testing tool that can be useful for both trivial cases and complex 
> scenarios 
> many other Cassandra contributors and I have expressed the opinion that 
> bringing Harry in-tree will be highly beneficial.
> 
> I strongly believe that bringing Harry in-tree will help to lower the barrier 
> for fuzz testing and simplify co-development of Cassandra and Harry. Previously, 
> it has been rather difficult to debug edge cases because I had to either 
> re-compile an in-jvm dtest jar and bring it to Harry, or re-compile a Harry 
> jar and bring it to Cassandra, which is both tedious and time consuming. 
> Moreover, I believe we have missed at the very least one RT regression [2] 
> because Harry was not in-tree, as its tests would've caught the issue even 
> with the model that existed.
> 
> For other recently found issues, I think having Harry in-tree would have 
> substantially lowered the turnaround time, and allowed me to share repros with 
> developers of corresponding features much quicker.
> 
> I do expect a slight learning curve for Harry, but my intention is to build a 
> web of simple tests (worked on some of them yesterday after conversation with 
> David already), which can follow the in-jvm-dtest pattern of 
> find-similar-test / copy / modify. There's already copious documentation, so 
> I do not believe not having docs for Harry was ever an issue, since there 
> have been plenty. 
> 
> You all are aware of my dedication to testing and quality of Apache 
> Cassandra, and I hope you also see the benefits of having a model checker 
> in-tree.
> 
> Thank you and happy upcoming holidays,
> --Alex
> 
> [1] https://issues.apache.org/jira/browse/CASSANDRA-19210
> [2] https://issues.apache.org/jira/browse/CASSANDRA-18932
>