Re: [DISCUSS] CEP-39: Cost Based Optimizer

Josh McKenzie Fri, 22 Dec 2023 07:39:58 -0800

> I don't even think we have to think about *new* SAI features to see where it 
> will benefit from further *local* optimization...
You make good points IMO. After Caleb's reasoning it makes sense to me to start 
working on query optimization w/even just our initial SAI feature-set given 
querying across multiple indices.


On Fri, Dec 22, 2023, at 8:42 AM, J. D. Jordan wrote:
> 
> The CEP-29 “rejected alternatives” section mentions one such use case.  Being 
> able to put NOT arbitrarily in a query.  Adding an OR operator is another 
> thing we are likely to want to do in the near future that would benefit from 
> this work, those benefit from the syntax tree and reordering parts of the 
> proposal.
> 
> But I think we already have enough complexity available to us to justify a 
> query optimizer in the fact of multi index queries today. Especially when you 
> have the new ANN OF operator in use combined with index queries.  Depending 
> on what order you query the indexes in, it can dramatically change the 
> performance of the query.  We are seeing and working through such issues in 
> Astra today.
> 
> -Jeremiah
> 
> 
>> On Dec 21, 2023, at 12:00 PM, Josh McKenzie <jmcken...@apache.org> wrote:
>> 
>>> we are already late. We have several features running in production that we 
>>> chose to not open source yet because implementing phase 1 of the CEP would 
>>> have heavily simplify their designs. The cost of developing them was much 
>>> higher than what it would have been if the CEP had already been 
>>> implemented. We are also currently working on some SAI features that need 
>>> cost based optimization.
>> Are there DISCUSS threads or CEP's for any of that work? For us to have a 
>> useful discussion about whether we're at a point in the project where a 
>> query optimizer is appropriate for the project this information would be 
>> vital.
>> 
>> On Thu, Dec 21, 2023, at 12:33 PM, Benjamin Lerer wrote:
>>> Hey German,
>>> 
>>> To clarify things, we intend to push cardinalities across nodes, not costs. 
>>> It will be up to the Cost Model to estimate cost based on those 
>>> cardinalities. We will implement some functionalities to collect costs on 
>>> query execution to be able to provide them as the output of EXPLAIN ANALYZE.
>>> 
>>> We will provide more details on how we will collect and distribute 
>>> cardinalities. We will probably not go into details on how we will estimate 
>>> costs before the patch for it is ready. The main reason being that there 
>>> are a lot of different parts that you need to account for and that it will 
>>> require significant testing and experimentation.
>>> 
>>> Regarding multi-tenancy, even if you use query cost, do not forget that you 
>>> will have to account also for background tasks such as compaction, repair, 
>>> backup, ... which is not included in this CEP.      
>>> 
>>> Le jeu. 21 déc. 2023 à 00:18, German Eichberger via dev 
>>> <dev@cassandra.apache.org> a écrit :
>>>> All,
>>>> 
>>>> very much agree with Scott's reasoning. 
>>>> 
>>>> It seems expedient given the advent of ACCORD transactions to be more like 
>>>> the other distributed SQL databases and just support SQL. But just because 
>>>> it's expedient it isn't right and we should work out the relational 
>>>> features in more detail before we embark on tying us to some query 
>>>> planning design.
>>>> 
>>>> The main problem in this space is pushing cost / across nodes based on 
>>>> data density. I understand that TCM will level out data density but the 
>>>> cost based optimizer proposal does a lot of hand waving when it comes to 
>>>> collecting/estimating costs for each node. I like to see more details on 
>>>> this since otherwise it will be fairly limiting.
>>>> 
>>>> I am less tied to ALLOW FILTERING - many of my customers find allowing 
>>>> filtering beneficial for their workloads so I think removing it makes 
>>>> sense to me (and yes we try to discourage them 🙂)
>>>> 
>>>> I am also intrigued by this proposal when I think about multi tenancy and 
>>>> resource governance: We have heard from several operator who run multiple 
>>>> internal teams on the same Cassandra cluster jut to optimize costs. Having 
>>>> a way to attribute those costs more fairly by adding up the costs the 
>>>> optimizer calculates might be hugely beneficial.  There could also be a 
>>>> way to have a "cost budget" on a keyspace to minimize the noisy neighbor 
>>>> problem and do more intelligent request throttling. 
>>>> 
>>>> In summary I support the proposal with the caveats raised above.
>>>> 
>>>> Thanks,
>>>> German
>>>> 
>>>> 
>>>> *From:* C. Scott Andreas <sc...@paradoxica.net>
>>>> *Sent:* Wednesday, December 20, 2023 8:15 AM
>>>> *To:* dev@cassandra.apache.org <dev@cassandra.apache.org>
>>>> *Cc:* dev@cassandra.apache.org <dev@cassandra.apache.org>
>>>> *Subject:* [EXTERNAL] Re: [DISCUSS] CEP-39: Cost Based Optimizer
>>>>  
>>>> 
>>>> You don't often get email from sc...@paradoxica.net. Learn why this is 
>>>> important 
>>>> <https://urldefense.com/v3/__https://aka.ms/LearnAboutSenderIdentification__;!!PbtH5S7Ebw!c_MD0RyXN_XIPN5B2C2--bChBdBa13WInHO75rCG_5gAk6by8K8t8sXgskQFjr1aAFaW3mhiNYlQAurs4aEL$>
>>>> 
>>>> Thanks for this proposal and apologies for my delayed engagement during 
>>>> the Cassandra Summit last week. Benjamin, I appreciate your work on this 
>>>> and your engagement on this thread – I know it’s a lot of discussion to 
>>>> field.
>>>> 
>>>> On ALLOW FILTERING:
>>>> 
>>>> I share Chris Lohfink’s experience in operating clusters that have made 
>>>> heavy use of ALLOW FILTERING. It is a valuable guardrail for the database 
>>>> to require users specifically annotate queries that may cost 1000x+ that 
>>>> of a simple lookup for a primary key. For my own purposes, I’d actually 
>>>> like to go a step further and disable queries that require ALLOW FILTERING 
>>>> by default unless explicitly reviewed - but haven’t taken the step of 
>>>> adding such a guardrail yet.
>>>> 
>>>> CBOs, CQL, and SQL:
>>>> 
>>>> The CBO proposal cuts to the heart of one of the fundamental differences 
>>>> between SQL and CQL that I haven’t seen exercised yet.
>>>> 
>>>> SQL allows users to define schemas that provide structure to data and to 
>>>> issue queries over them based on a relational algebra. SQL’s purpose is to 
>>>> decouple the on-disk representation of data from the query language used 
>>>> to access and aggregate it. This produces a very flexible query language 
>>>> that can be used to ask a database anything - but at a cost of execution 
>>>> that may be effectively infinite (think recursive subqueries).
>>>> 
>>>> CQL is very different. While SQL is designed to decouple query language 
>>>> and on-disk representation, CQL is designed specifically to couple them. A 
>>>> CQL schema declares data placement, query routing, and disk serialization, 
>>>> and sorting to enable efficient retrieval. This is a very different design 
>>>> goal from a general-purpose query language. In time CQL may gain many 
>>>> SQL-like capabilities (and I hope it does!), but it will require careful 
>>>> work to do so without creating many footguns.
>>>> 
>>>> Feature evolution:
>>>> 
>>>> I agree that in the coming years, Cassandra is likely to gain 
>>>> semi-relational features via maturation of the byte-ordered partitioner 
>>>> (via range splitting, via TCM); the availability of SAI and its evolution 
>>>> (e.g., via new functionality enabled by Lucene libraries); potentially 
>>>> joins via BOP; and others. This is a really exciting future, and one that 
>>>> probably requires a planner and optimizer.
>>>> 
>>>> My general inclination is that a planner + optimizer seem valuable for 
>>>> Cassandra, but that the proposal feels a year or two early. The database 
>>>> doesn’t yet have a plan of record to add support for some of the 
>>>> semirelational constructs we’ve talked about, and I’m not aware of active 
>>>> CEPs that propose designs for features like these yet.
>>>> 
>>>> Like Jeff, I’d find this much easier to discuss in the context of a 
>>>> database gaining support for these features with specific designs 
>>>> available to discuss. The ALLOW FILTERING and constant folding examples 
>>>> are a little slim. Index selection is probably the best one I can think of 
>>>> right now - e.g., if we wanted to add the ability to issue 
>>>> partition-restricted queries over a base table with multiple indexes 
>>>> defined without users specifically declaring an index. I haven’t seen an 
>>>> at-scale use case that would be better served by planner-driven index 
>>>> selection vs. user-driven, but they might be out there.
>>>> 
>>>> It’s not my role to suggest changes in prioritization for work that isn’t 
>>>> mine. But I feel that the project could design better interfaces and a 
>>>> better planner/optimizer if that work were oriented toward improving 
>>>> particular features that are in wide use.
>>>> 
>>>> To summarize my thoughts:
>>>> 
>>>> – I agree that it is valuable for Apache Cassandra to gain a 
>>>> planner/optimizer.
>>>> – I disagree with removing or deprecating ALLOW FILTERING and see this as 
>>>> a necessary guardrail.
>>>> – I think the proposal surfaces the differences between the design goals 
>>>> of CQL and SQL, but I don’t feel that it quite addresses it.
>>>> – I think we could collectively build a stronger planner/optimizer once 
>>>> some of the features it’s meant to optimize are in place.
>>>> – I’m not quite sold on the need for the implementation to be bespoke 
>>>> based on discussion so far (vs. Calcite/Catalyst etc), but haven’t done 
>>>> the legwork to investigate this myself.
>>>> – I *love* the idea of capturing many of the execution and hotness 
>>>> statistics that are proposed in the CEP. It would be very valuable to 
>>>> surface query cost to users independent of a CBO. Stats like these would 
>>>> also be valuable toward retrofitting Cassandra for multitenancy by 
>>>> bounding or rate-limiting users on query cost. Tracking SSTable hotness 
>>>> would also be useful toward evaluating feasibility of tiered storage, too.
>>>> 
>>>> Thanks for this proposal and discussion so far — appreciate and enjoying 
>>>> it.
>>>> 
>>>> – Scott
>>>> 
>>>>> On Dec 20, 2023, at 7:52 AM, Benjamin Lerer <ble...@apache.org> wrote:
>>>>> 
>>>>> 
>>>>>> If we are to address that within the CEP itself then we should discuss 
>>>>>> it here, as I would like to fully understand the approach as well as how 
>>>>>> it relates to consistency of execution and the idea of triggering 
>>>>>> re-optimisation.
>>>>> 
>>>>> Sure, that was my plan.
>>>>> 
>>>>> 
>>>>>> I’m not sold on the proposed set of characteristics, and think my 
>>>>>> coupling an execution plan to a given prepared statement for clients to 
>>>>>> supply is perhaps simpler to implement and maintain, and has corollary 
>>>>>> benefits - such as providing a mechanism for users to specify their own 
>>>>>> execution plan.
>>>>>>  
>>>>>> Note, my proposal cuts across all of these elements of the CEP. There is 
>>>>>> no obvious need for a cross-cluster re-optimisation event or cross 
>>>>>> cluster statistic management.
>>>>>  
>>>>> I think that I am missing one part of your proposal. How do you plan to 
>>>>> build the initial execution plan for a prepared statement?
>>>>> 
>>>>> Le mer. 20 déc. 2023 à 14:05, Benedict <bened...@apache.org> a écrit :
>>>>>> 
>>>>>> If we are to address that within the CEP itself then we should discuss 
>>>>>> it here, as I would like to fully understand the approach as well as how 
>>>>>> it relates to consistency of execution and the idea of triggering 
>>>>>> re-optimisation. These ideas are all interrelated.
>>>>>> 
>>>>>> I’m not sold on the proposed set of characteristics, and think my 
>>>>>> coupling an execution plan to a given prepared statement for clients to 
>>>>>> supply is perhaps simpler to implement and maintain, and has corollary 
>>>>>> benefits - such as providing a mechanism for users to specify their own 
>>>>>> execution plan.
>>>>>> 
>>>>>> Note, my proposal cuts across all of these elements of the CEP. There is 
>>>>>> no obvious need for a cross-cluster re-optimisation event or cross 
>>>>>> cluster statistic management.
>>>>>> 
>>>>>> We still also need to discuss more concretely how the base statistics 
>>>>>> themselves will be derived, as there is little detail here today in the 
>>>>>> proposal.
>>>>>> 
>>>>>> 
>>>>>>> On 20 Dec 2023, at 12:58, Benjamin Lerer <b.le...@gmail.com> wrote:
>>>>>>> 
>>>>>>> After the second phase of the CEP, we will have two optimizer 
>>>>>>> implementations. One will be similar to what we have today and the 
>>>>>>> other one will be the CBO. As those implementations will be behind the 
>>>>>>> new Optimizer API interfaces they will both have support for EXPLAIN 
>>>>>>> and they will both benefit from the simplification/normalization rules. 
>>>>>>> Such as the ones that David mentioned.
>>>>>>> 
>>>>>>> Regarding functions, we are already able to determine which ones are 
>>>>>>> deterministic 
>>>>>>> (https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/functions/Function.java#L55).
>>>>>>>  We simply do not take advantage of it.
>>>>>>> 
>>>>>>> I removed the ALLOW FILTERING part and will open a discussion about it 
>>>>>>> at the beginning of next year.
>>>>>>> 
>>>>>>> Regarding the statistics management part, I would like to try to 
>>>>>>> address it within the CEP itself, if feasible. If it turns out to be 
>>>>>>> too complicated, I will separate it into its own CEP.
>>>>>>> 
>>>>>>> Le mar. 19 déc. 2023 à 22:23, David Capwell <dcapw...@apple.com> a 
>>>>>>> écrit :
>>>>>>>>> even if the only outcome of all this work were to tighten up 
>>>>>>>>> inconsistencies in our grammar and provide more robust EXPLAIN and 
>>>>>>>>> EXPLAIN ANALYZE functionality to our end users, I think that would be 
>>>>>>>>> highly valuable
>>>>>>>> 
>>>>>>>> In my mental model a no-op optimizer just becomes what we have today 
>>>>>>>> (since all new features really should be disabled by default, I would 
>>>>>>>> hope we support this), so we benefit from having a logical AST + 
>>>>>>>> ability to mutate it before we execute it and we can use this to make 
>>>>>>>> things nicer for users (as you are calling out)
>>>>>>>> 
>>>>>>>> Here is one example that stands out to me in accord
>>>>>>>> 
>>>>>>>> LET a = (select * from tbl where pk=0);
>>>>>>>> Insert into tbl2 (pk, …) values (a.pk 
>>>>>>>> <https://urldefense.com/v3/__http://a.pk/__;!!PbtH5S7Ebw!c_MD0RyXN_XIPN5B2C2--bChBdBa13WInHO75rCG_5gAk6by8K8t8sXgskQFjr1aAFaW3mhiNYlQArzRaSAP$>,
>>>>>>>>  …); — this is not allowed as we don’t know the primary key… but this 
>>>>>>>> could trivially be written to replace a.pk 
>>>>>>>> <https://urldefense.com/v3/__http://a.pk/__;!!PbtH5S7Ebw!c_MD0RyXN_XIPN5B2C2--bChBdBa13WInHO75rCG_5gAk6by8K8t8sXgskQFjr1aAFaW3mhiNYlQArzRaSAP$>
>>>>>>>>  with 0…
>>>>>>>> 
>>>>>>>> With this work we could also rethink what functions are deterministic 
>>>>>>>> and which ones are not (not trying to bike shed)… simple example is 
>>>>>>>> “now” (select now() from tbl; — each row will have a different 
>>>>>>>> timestamp), if we make this deterministic we can avoid calling it for 
>>>>>>>> each row and instead just replace it with a constant for the query… 
>>>>>>>> 
>>>>>>>> Even if the CBO is dropped in favor of no-op (what we do today), I 
>>>>>>>> still see value in this work.
>>>>>>>> 
>>>>>>>> I do think that the CBO really doesn’t solve the fact some features 
>>>>>>>> don’t work well, if anything it could just mask it until it’s too 
>>>>>>>> late….  If user builds an app using filtering and everything is going 
>>>>>>>> well in QA, but once they see a spike in traffic in prod we start 
>>>>>>>> rejecting… this is a bad user experience IMO… we KNOW you must think 
>>>>>>>> about this before you go this route, so a CBO letting you ignore it 
>>>>>>>> till you hit a wall I don’t think is the best (not saying ALLOW 
>>>>>>>> FILTERING is the solution to this… but it at least is a signal to 
>>>>>>>> users to think through their data model). 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Dec 15, 2023, at 6:38 PM, Josh McKenzie <jmcken...@apache.org> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Goals
>>>>>>>>>>  • Introduce a Cascades(2) query optimizer with rules easily 
>>>>>>>>>> extendable 
>>>>>>>>>>  • Improve query performance for most common queries
>>>>>>>>>>  • Add support for EXPLAIN and EXPLAIN ANALYZE to help with query 
>>>>>>>>>> optimization and troubleshooting
>>>>>>>>>>  • Lay the groundwork for the addition of features like joins, 
>>>>>>>>>> subqueries, OR/NOT and index ordering
>>>>>>>>>>  • Put in place some performance benchmarks to validate query 
>>>>>>>>>> optimizations
>>>>>>>>> I think these are sensible goals. We're possibly going to face a 
>>>>>>>>> chicken-or-egg problem with a feature like this that so heavily 
>>>>>>>>> intersects with other as-yet written features where much of the value 
>>>>>>>>> is in the intersection of them; if we continue down the current "one 
>>>>>>>>> heuristic to rule them all" query planning approach we have now, 
>>>>>>>>> we'll struggle to meaningfully explore or conceptualize the value of 
>>>>>>>>> potential alternatives different optimizers could present us. Flip 
>>>>>>>>> side, to Benedict's point, until SAI hits and/or some other potential 
>>>>>>>>> future things we've all talked about, this cbo would likely fall 
>>>>>>>>> directly into the same path that we effectively have hard-coded today 
>>>>>>>>> (primary index path only).
>>>>>>>>> 
>>>>>>>>> One thing I feel pretty strongly about: even if the only outcome of 
>>>>>>>>> all this work were to tighten up inconsistencies in our grammar and 
>>>>>>>>> provide more robust EXPLAIN and EXPLAIN ANALYZE functionality to our 
>>>>>>>>> end users, I think that would be highly valuable. This path of "only" 
>>>>>>>>> would be predicated on us not having successful introduction of a 
>>>>>>>>> robust secondary index implementation and a variety of other things 
>>>>>>>>> we have a lot of interest in, so I find it unlikely, but worth 
>>>>>>>>> calling out.
>>>>>>>>> 
>>>>>>>>> re: the removal of ALLOW FILTERING - is there room for compromise 
>>>>>>>>> here and instead converting it to a guardrail that defaults to being 
>>>>>>>>> enabled? That could theoretically give us a more gradual path to 
>>>>>>>>> migration to a cost-based guardrail for instance, and would preserve 
>>>>>>>>> the current robustness of the system while making it at least a touch 
>>>>>>>>> more configurable.
>>>>>>>>> 
>>>>>>>>> On Fri, Dec 15, 2023, at 11:03 AM, Chris Lohfink wrote:
>>>>>>>>>> Thanks for time in addressing concerns. At least with initial 
>>>>>>>>>> versions, as long as there is a way to replace it with noop or 
>>>>>>>>>> disable it I would be happy. This is pretty standard practice with 
>>>>>>>>>> features nowadays but I wanted to highlight it as this might require 
>>>>>>>>>> some pretty tight coupling.
>>>>>>>>>> 
>>>>>>>>>> Chris
>>>>>>>>>> 
>>>>>>>>>> On Fri, Dec 15, 2023 at 7:57 AM Benjamin Lerer <ble...@apache.org> 
>>>>>>>>>> wrote:
>>>>>>>>>>> Hey Chris,
>>>>>>>>>>> You raise some valid points.
>>>>>>>>>>> 
>>>>>>>>>>> I believe that there are 3 points that you mentioned:
>>>>>>>>>>> 1) CQL restrictions are some form of safety net and should be kept
>>>>>>>>>>> 2) A lot of Cassandra features do not scale and/or are too easy to 
>>>>>>>>>>> use in a wrong way that can make the whole system collapse. We 
>>>>>>>>>>> should not add more to that list. Especially not joins.
>>>>>>>>>>> 
>>>>>>>>>>> 3) Should we not start to fix features like secondary index rather 
>>>>>>>>>>> than adding new ones? Which is heavily linked to 2).
>>>>>>>>>>> 
>>>>>>>>>>> Feel free to correct me if I got them wrong or missed one.
>>>>>>>>>>> 
>>>>>>>>>>> Regarding 1), I believe that you refer to the "Removing unnecessary 
>>>>>>>>>>> CQL query limitations and inconsistencies" section. We are not 
>>>>>>>>>>> planning to remove any safety net here.
>>>>>>>>>>> What we want to remove is a certain amount of limitations which 
>>>>>>>>>>> make things confusing for a user trying to write a query for no 
>>>>>>>>>>> good reason. Like "why can I define a column alias but not use it 
>>>>>>>>>>> anywhere in my query?" or "Why can I not create a list with 2 bind 
>>>>>>>>>>> parameters?". While refactoring some CQL code, I kept on finding 
>>>>>>>>>>> those types of exceptions that we can easily remove while 
>>>>>>>>>>> simplifying the code at the same time.
>>>>>>>>>>> 
>>>>>>>>>>> For 2), I agree that at a certain scale or for some scenarios, some 
>>>>>>>>>>> features simply do not scale or catch users by surprise. The goal 
>>>>>>>>>>> of the CEP is to improve things in 2 ways. One is by making 
>>>>>>>>>>> Cassandra smarter in the way it chooses how to process queries, 
>>>>>>>>>>> hopefully improving its overall scalability. The other by being 
>>>>>>>>>>> transparent about how Cassandra will execute the queries through 
>>>>>>>>>>> the use of EXPLAIN. One problem of GROUP BY for example is that 
>>>>>>>>>>> most users do not realize what is actually happening under the hood 
>>>>>>>>>>> and therefore its limitations. I do not believe that EXPLAIN will 
>>>>>>>>>>> change everything but it will help people to get a better 
>>>>>>>>>>> understanding of the limitations of some features.
>>>>>>>>>>> 
>>>>>>>>>>> I do not know which features will be added in the future to C*. 
>>>>>>>>>>> That will be discussed through some future CEPs. Nevertheless, I do 
>>>>>>>>>>> not believe that it makes sense to write a CEP for a query 
>>>>>>>>>>> optimizer without taking into account that we might at some point 
>>>>>>>>>>> add some level of support for joins or subqueries. We have been too 
>>>>>>>>>>> often delivering features without looking at what could be the 
>>>>>>>>>>> possible evolutions which resulted in code where adding new 
>>>>>>>>>>> features was more complex than it should have been. I do not want 
>>>>>>>>>>> to make the same mistake. I want to create an optimizer that can be 
>>>>>>>>>>> improved easily and considering joins or other features simply help 
>>>>>>>>>>> to build things in a more generic way.
>>>>>>>>>>> 
>>>>>>>>>>> Regarding feature stabilization, I believe that it is happening. I 
>>>>>>>>>>> have heard plans of how to solve MVs, range queries, hot 
>>>>>>>>>>> partitions, ... and there was a lot of thinking behind those plans. 
>>>>>>>>>>> Secondary indexes are being worked on. We hope that the optimizer 
>>>>>>>>>>> will also help with some index queries.
>>>>>>>>>>> 
>>>>>>>>>>> It seems to me that this proposal is going toward the direction 
>>>>>>>>>>> that you want without introducing new problems for scalability.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>  
>>>>>>>>>>> 
>>>>>>>>>>> Le jeu. 14 déc. 2023 à 16:47, Chris Lohfink <clohfin...@gmail.com> 
>>>>>>>>>>> a écrit :
>>>>>>>>>>>> I don't wanna be a blocker for this CEP or anything but did want 
>>>>>>>>>>>> to put my 2 cents in. This CEP is horrifying to me.
>>>>>>>>>>>> 
>>>>>>>>>>>> I have seen thousands of clusters across multiple companies and 
>>>>>>>>>>>> helped them get working successfully. A vast majority of that 
>>>>>>>>>>>> involved blocking the use of MVs, GROUP BY, secondary indexes, and 
>>>>>>>>>>>> even just simple _range queries_. The "unncessary restrictions of 
>>>>>>>>>>>> cql" are not only necessary IMHO, more restrictions are necessary 
>>>>>>>>>>>> to be successful at scale. The idea of just opening up CQL to 
>>>>>>>>>>>> general purpose relational queries and lines like "supporting 
>>>>>>>>>>>> queries with joins in an efficient way" ... I would really like us 
>>>>>>>>>>>> to make secondary indexes be a viable option before we start 
>>>>>>>>>>>> opening up floodgates on stuff like this.
>>>>>>>>>>>> 
>>>>>>>>>>>> Chris
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, Dec 14, 2023 at 9:37 AM Benedict <bened...@apache.org> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> > So yes, this physical plan is the structure that you have in 
>>>>>>>>>>>>> > mind but the idea of sharing it is not part of the CEP.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I think it should be. This should form a major part of the API on 
>>>>>>>>>>>>> which any CBO is built.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> > It seems that there is a difference between the goal of your 
>>>>>>>>>>>>> > proposal and the one of the CEP. The goal of the CEP is first 
>>>>>>>>>>>>> > to ensure optimal performance. It is ok to change the execution 
>>>>>>>>>>>>> > plan for one that delivers better performance. What we want to 
>>>>>>>>>>>>> > minimize is having a node performing queries in an inefficient 
>>>>>>>>>>>>> > way for a long period of time.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> You have made a goal of the CEP synchronising summary statistics 
>>>>>>>>>>>>> across the whole cluster in order to achieve some degree of 
>>>>>>>>>>>>> uniformity of query plan. So this is explicitly a goal of the 
>>>>>>>>>>>>> CEP, and synchronising summary statistics is a hard problem and 
>>>>>>>>>>>>> won’t provide strong guarantees.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> > The client side proposal targets consistency for a given query 
>>>>>>>>>>>>> > on a given driver instance. In practice, it would be possible 
>>>>>>>>>>>>> > to have 2 similar queries with 2 different execution plans on 
>>>>>>>>>>>>> > the same driver
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This would only be possible if the driver permitted it. A driver 
>>>>>>>>>>>>> could (and should) enforce that it only permits one query plan 
>>>>>>>>>>>>> per query.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The opposite is true for your proposal: some queries may begin 
>>>>>>>>>>>>> degrading because they touch specific replicas that optimise the 
>>>>>>>>>>>>> query differently, and this will be hard to debug.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 14 Dec 2023, at 15:30, Benjamin Lerer <b.le...@gmail.com> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The binding of the parser output to the schema (what is today 
>>>>>>>>>>>>>> the Raw.prepare call) will create the logical plan, expressed as 
>>>>>>>>>>>>>> a tree of relational operators. Simplification and normalization 
>>>>>>>>>>>>>> will happen on that tree to produce a new equivalent logical 
>>>>>>>>>>>>>> plan. That logical plan will be used as input to the optimizer. 
>>>>>>>>>>>>>> The output will be a physical plan producing the output 
>>>>>>>>>>>>>> specified by the logical plan. A tree of physical operators 
>>>>>>>>>>>>>> specifying how the operations should be performed.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> That physical plan will be stored as part of the statements 
>>>>>>>>>>>>>> (SelectStatement, ModificationStatement, ...) in the prepared 
>>>>>>>>>>>>>> statement cache. Upon execution, variables will be bound and the 
>>>>>>>>>>>>>> RangeCommands/Mutations will be created based on the physical 
>>>>>>>>>>>>>> plan.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The string representation of a physical plan will effectively 
>>>>>>>>>>>>>> represent the output of an EXPLAIN statement but outside of that 
>>>>>>>>>>>>>> the physical plan will stay encapsulated within the statement 
>>>>>>>>>>>>>> classes.   
>>>>>>>>>>>>>> Hints will be parameters provided to the optimizer to enforce 
>>>>>>>>>>>>>> some specific choices. Like always using an Index Scan instead 
>>>>>>>>>>>>>> of a Table Scan, ignoring the cost comparison.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So yes, this physical plan is the structure that you have in 
>>>>>>>>>>>>>> mind but the idea of sharing it is not part of the CEP. I did 
>>>>>>>>>>>>>> not document it because it will simply be a tree of physical 
>>>>>>>>>>>>>> operators used internally.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> My proposal is that the execution plan of the coordinator that 
>>>>>>>>>>>>>>> prepares a query gets serialised to the client, which then 
>>>>>>>>>>>>>>> provides the execution plan to all future coordinators, and 
>>>>>>>>>>>>>>> coordinators provide it to replicas as necessary. 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This means it is not possible for any conflict to arise for a 
>>>>>>>>>>>>>>> single client. It would guarantee consistency of execution for 
>>>>>>>>>>>>>>> any single client (and avoid any drift over the client’s 
>>>>>>>>>>>>>>> sessions), without necessarily guaranteeing consistency for all 
>>>>>>>>>>>>>>> clients.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>  It seems that there is a difference between the goal of your 
>>>>>>>>>>>>>> proposal and the one of the CEP. The goal of the CEP is first to 
>>>>>>>>>>>>>> ensure optimal performance. It is ok to change the execution 
>>>>>>>>>>>>>> plan for one that delivers better performance. What we want to 
>>>>>>>>>>>>>> minimize is having a node performing queries in an inefficient 
>>>>>>>>>>>>>> way for a long period of time.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The client side proposal targets consistency for a given query 
>>>>>>>>>>>>>> on a given driver instance. In practice, it would be possible to 
>>>>>>>>>>>>>> have 2 similar queries with 2 different execution plans on the 
>>>>>>>>>>>>>> same driver making things really confusing. Identifying the 
>>>>>>>>>>>>>> source of an inefficient query will also be pretty hard.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Interestingly, having 2 nodes with 2 different execution plans 
>>>>>>>>>>>>>> might not be a serious problem. It simply means that based on 
>>>>>>>>>>>>>> cardinality at t1, the optimizer on node 1 chose plan 1 while 
>>>>>>>>>>>>>> the one on node 2 chose plan 2 at t2. In practice if the cost 
>>>>>>>>>>>>>> estimates reflect properly the actual cost those 2 plans should 
>>>>>>>>>>>>>> have pretty similar efficiency. The problem is more about the 
>>>>>>>>>>>>>> fact that you would ideally want a uniform behavior around your 
>>>>>>>>>>>>>> cluster.
>>>>>>>>>>>>>> Changes of execution plans should only occur at certain points. 
>>>>>>>>>>>>>> So the main problematic scenario is when the data distribution 
>>>>>>>>>>>>>> is around one of those points. Which is also the point where the 
>>>>>>>>>>>>>> change should have the least impact.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Le jeu. 14 déc. 2023 à 11:38, Benedict <bened...@apache.org> a 
>>>>>>>>>>>>>> écrit :
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> There surely needs to be a more succinct and abstract 
>>>>>>>>>>>>>>> representation in order to perform transformations on the query 
>>>>>>>>>>>>>>> plan? You don’t intend to manipulate the object graph directly 
>>>>>>>>>>>>>>> as you apply any transformations when performing simplification 
>>>>>>>>>>>>>>> or cost based analysis? This would also (I expect) be the form 
>>>>>>>>>>>>>>> used to support EXPLAIN functionality, and probably also HINTs 
>>>>>>>>>>>>>>> etc. This would ideally *not* be coupled to the CBO itself, and 
>>>>>>>>>>>>>>> would ideally be succinctly serialised.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I would very much expect the query plan to be represented 
>>>>>>>>>>>>>>> abstractly as part of this work, and for there to be a 
>>>>>>>>>>>>>>> mechanism that translates this abstract representation into the 
>>>>>>>>>>>>>>> object graph that executes it.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> If I’m incorrect, could you please elaborate more specifically 
>>>>>>>>>>>>>>> how you intend to go about this?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On 14 Dec 2023, at 10:33, Benjamin Lerer <b.le...@gmail.com> 
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I mean that an important part of this work - not specified in 
>>>>>>>>>>>>>>>>> the CEP (AFAICT) - should probably be to define some standard 
>>>>>>>>>>>>>>>>> execution model, that we can manipulate and serialise, for 
>>>>>>>>>>>>>>>>> use across (and without) optimisers.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I am confused because for me an execution model defines how 
>>>>>>>>>>>>>>>> operations are executed within the database in a conceptual 
>>>>>>>>>>>>>>>> way, which is not something that this CEP intends to change. 
>>>>>>>>>>>>>>>> Do you mean the physical/execution plan?
>>>>>>>>>>>>>>>> Today this plan is somehow represented for reads by the 
>>>>>>>>>>>>>>>> SelectStatement and its components (Selections, 
>>>>>>>>>>>>>>>> StatementRestrictions, ...) it is then converted at execution 
>>>>>>>>>>>>>>>> time after parameter binding into a ReadCommand which is sent 
>>>>>>>>>>>>>>>> to the replicas.
>>>>>>>>>>>>>>>> We plan to refactor SelectStatement and its components but the 
>>>>>>>>>>>>>>>> ReadCommands change should be relatively small. What you are 
>>>>>>>>>>>>>>>> proposing is not part of the scope of this CEP.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Le jeu. 14 déc. 2023 à 10:24, Benjamin Lerer 
>>>>>>>>>>>>>>>> <b.le...@gmail.com> a écrit :
>>>>>>>>>>>>>>>>>> Can you share the reasons why Apache Calcite is not suitable 
>>>>>>>>>>>>>>>>>> for this case and why it was rejected
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> My understanding is that Calcite was made for two main 
>>>>>>>>>>>>>>>>> things: to help with optimizing SQL-like languages and to let 
>>>>>>>>>>>>>>>>> people query different kinds of data sources together.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> We could think about using it for our needs, but there are 
>>>>>>>>>>>>>>>>> some big problems:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>  1. CQL is not SQL. There are significant differences between 
>>>>>>>>>>>>>>>>> the 2 languages
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>  2. Cassandra has its own specificities that will influence 
>>>>>>>>>>>>>>>>> the cost model and the way we deal with optimizations: 
>>>>>>>>>>>>>>>>> partitions, replication factors, consistency levels, LSM tree 
>>>>>>>>>>>>>>>>> storage, ...
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>  3. Every framework comes with its own limitations and 
>>>>>>>>>>>>>>>>> additional cost
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> From my view, there are too many big differences between what 
>>>>>>>>>>>>>>>>> Calcite does and what we need in Cassandra. If we used 
>>>>>>>>>>>>>>>>> Calcite, it would also mean relying a lot on another system 
>>>>>>>>>>>>>>>>> that everyone would have to learn and adjust to. The problems 
>>>>>>>>>>>>>>>>> and extra work this would bring don't seem worth the benefits 
>>>>>>>>>>>>>>>>> we might get
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>   
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Le mer. 13 déc. 2023 à 18:06, Benjamin Lerer 
>>>>>>>>>>>>>>>>> <b.le...@gmail.com> a écrit :
>>>>>>>>>>>>>>>>>> One thing that I did not mention is the fact that this CEP 
>>>>>>>>>>>>>>>>>> is only a high level proposal. There will be deeper 
>>>>>>>>>>>>>>>>>> discussions on the dev list around the different parts of 
>>>>>>>>>>>>>>>>>> this proposal when we reach those parts and have enough 
>>>>>>>>>>>>>>>>>> details to make those discussions more meaningful.    
>>>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>>>> The maintenance and distribution of summary statistics in 
>>>>>>>>>>>>>>>>>>> particular is worthy of its own CEP, and it might be 
>>>>>>>>>>>>>>>>>>> preferable to split it out.
>>>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>>> For maintaining node statistics the idea is to re-use the 
>>>>>>>>>>>>>>>>>> current Memtable/SSTable mechanism and relies on mergeable 
>>>>>>>>>>>>>>>>>> statistics. That will allow us to easily build node level 
>>>>>>>>>>>>>>>>>> statistics for a given table by merging all the statistics 
>>>>>>>>>>>>>>>>>> of its memtable and SSTables. For the distribution of these 
>>>>>>>>>>>>>>>>>> node statistics we are still exploring different options. We 
>>>>>>>>>>>>>>>>>> can come back with a precise proposal once we have hammered 
>>>>>>>>>>>>>>>>>> all the details. 
>>>>>>>>>>>>>>>>>> Is it for you a blocker for this CEP or do you just want to 
>>>>>>>>>>>>>>>>>> make sure that this part is discussed in deeper details 
>>>>>>>>>>>>>>>>>> before we implement it?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>>>> The proposal also seems to imply we are aiming for 
>>>>>>>>>>>>>>>>>>> coordinators to all make the same decision for a query, 
>>>>>>>>>>>>>>>>>>> which I think is challenging, and it would be worth 
>>>>>>>>>>>>>>>>>>> fleshing out the design here a little (perhaps just in 
>>>>>>>>>>>>>>>>>>> Jira).
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The goal is that the large majority of nodes preparing a 
>>>>>>>>>>>>>>>>>> query at a given point in time should make the same decision 
>>>>>>>>>>>>>>>>>> and that over time all nodes should converge toward the same 
>>>>>>>>>>>>>>>>>> decision. This part is dependent on the node statistics 
>>>>>>>>>>>>>>>>>> distribution, the cost model and the triggers for 
>>>>>>>>>>>>>>>>>> re-optimization (that will require some experimentation).
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> There’s also not much discussion of the execution model: I 
>>>>>>>>>>>>>>>>>>> think it would make most sense for this to be independent 
>>>>>>>>>>>>>>>>>>> of any cost and optimiser models (though they might want to 
>>>>>>>>>>>>>>>>>>> operate on them), so that EXPLAIN and hints can work across 
>>>>>>>>>>>>>>>>>>> optimisers (a suitable hint might essentially bypass the 
>>>>>>>>>>>>>>>>>>> optimiser, if the optimiser permits it, by providing a 
>>>>>>>>>>>>>>>>>>> standard execution model)
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> It is not clear to me what you mean by "a standard execution 
>>>>>>>>>>>>>>>>>> model"? Otherwise, we were not planning to have the 
>>>>>>>>>>>>>>>>>> execution model or the hints depending on the optimizer. 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I think it would be worth considering providing the 
>>>>>>>>>>>>>>>>>>> execution plan to the client as part of query preparation, 
>>>>>>>>>>>>>>>>>>> as an opaque payload to supply to coordinators on first 
>>>>>>>>>>>>>>>>>>> contact, as this might simplify the problem of ensuring 
>>>>>>>>>>>>>>>>>>> queries behave the same without adopting a lot of 
>>>>>>>>>>>>>>>>>>> complexity for synchronising statistics (which will never 
>>>>>>>>>>>>>>>>>>> provide strong guarantees). Of course, re-preparing a query 
>>>>>>>>>>>>>>>>>>> might lead to a new plan, though any coordinators with the 
>>>>>>>>>>>>>>>>>>> query in their cache should be able to retrieve it cheaply. 
>>>>>>>>>>>>>>>>>>> If the execution model is efficiently serialised this might 
>>>>>>>>>>>>>>>>>>> have the ancillary benefit of improving the occupancy of 
>>>>>>>>>>>>>>>>>>> our prepared query cache.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I am not sure that I understand your proposal. If 2 nodes 
>>>>>>>>>>>>>>>>>> build a different execution plan how do you solve that 
>>>>>>>>>>>>>>>>>> conflict?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Le mer. 13 déc. 2023 à 09:55, Benedict <bened...@apache.org> 
>>>>>>>>>>>>>>>>>> a écrit :
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> A CBO can only make worse decisions than the status quo for 
>>>>>>>>>>>>>>>>>>> what I presume are the majority of queries - i.e. those 
>>>>>>>>>>>>>>>>>>> that touch only primary indexes. In general, there are 
>>>>>>>>>>>>>>>>>>> plenty of use cases that prefer determinism. So I agree 
>>>>>>>>>>>>>>>>>>> that there should at least be a CBO implementation that 
>>>>>>>>>>>>>>>>>>> makes the same decisions as the status quo, 
>>>>>>>>>>>>>>>>>>> deterministically.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I do support the proposal, but would like to see some 
>>>>>>>>>>>>>>>>>>> elements discussed in more detail. The maintenance and 
>>>>>>>>>>>>>>>>>>> distribution of summary statistics in particular is worthy 
>>>>>>>>>>>>>>>>>>> of its own CEP, and it might be preferable to split it out. 
>>>>>>>>>>>>>>>>>>> The proposal also seems to imply we are aiming for 
>>>>>>>>>>>>>>>>>>> coordinators to all make the same decision for a query, 
>>>>>>>>>>>>>>>>>>> which I think is challenging, and it would be worth 
>>>>>>>>>>>>>>>>>>> fleshing out the design here a little (perhaps just in 
>>>>>>>>>>>>>>>>>>> Jira).
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> While I’m not a fan of ALLOW FILTERING, I’m not convinced 
>>>>>>>>>>>>>>>>>>> that this CEP deprecates it. It is a concrete qualitative 
>>>>>>>>>>>>>>>>>>> guard rail, that I expect some users will prefer to a 
>>>>>>>>>>>>>>>>>>> cost-based guard rail. Perhaps this could be left to the 
>>>>>>>>>>>>>>>>>>> CBO to decide how to treat.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> There’s also not much discussion of the execution model: I 
>>>>>>>>>>>>>>>>>>> think it would make most sense for this to be independent 
>>>>>>>>>>>>>>>>>>> of any cost and optimiser models (though they might want to 
>>>>>>>>>>>>>>>>>>> operate on them), so that EXPLAIN and hints can work across 
>>>>>>>>>>>>>>>>>>> optimisers (a suitable hint might essentially bypass the 
>>>>>>>>>>>>>>>>>>> optimiser, if the optimiser permits it, by providing a 
>>>>>>>>>>>>>>>>>>> standard execution model)
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I think it would be worth considering providing the 
>>>>>>>>>>>>>>>>>>> execution plan to the client as part of query preparation, 
>>>>>>>>>>>>>>>>>>> as an opaque payload to supply to coordinators on first 
>>>>>>>>>>>>>>>>>>> contact, as this might simplify the problem of ensuring 
>>>>>>>>>>>>>>>>>>> queries behave the same without adopting a lot of 
>>>>>>>>>>>>>>>>>>> complexity for synchronising statistics (which will never 
>>>>>>>>>>>>>>>>>>> provide strong guarantees). Of course, re-preparing a query 
>>>>>>>>>>>>>>>>>>> might lead to a new plan, though any coordinators with the 
>>>>>>>>>>>>>>>>>>> query in their cache should be able to retrieve it cheaply. 
>>>>>>>>>>>>>>>>>>> If the execution model is efficiently serialised this might 
>>>>>>>>>>>>>>>>>>> have the ancillary benefit of improving the occupancy of 
>>>>>>>>>>>>>>>>>>> our prepared query cache.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On 13 Dec 2023, at 00:44, Jon Haddad <j...@jonhaddad.com> 
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I think it makes sense to see what the actual overhead is 
>>>>>>>>>>>>>>>>>>>> of CBO before making the assumption it'll be so high that 
>>>>>>>>>>>>>>>>>>>> we need to have two code paths.  I'm happy to provide 
>>>>>>>>>>>>>>>>>>>> thorough benchmarking and analysis when it reaches a 
>>>>>>>>>>>>>>>>>>>> testing phase.  
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I'm excited to see where this goes.  I think it sounds 
>>>>>>>>>>>>>>>>>>>> very forward looking and opens up a lot of possibilities.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Jon
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Tue, Dec 12, 2023 at 4:25 PM guo Maxwell 
>>>>>>>>>>>>>>>>>>>> <cclive1...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>> Nothing expresses my thoughts better than +1
>>>>>>>>>>>>>>>>>>>>> ，It feels like it means a lot to Cassandra.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I have a question. Is it easy to turn off cbo's optimizer 
>>>>>>>>>>>>>>>>>>>>> or by pass in some way? Because some simple read and 
>>>>>>>>>>>>>>>>>>>>> write requests will have better performance without cbo, 
>>>>>>>>>>>>>>>>>>>>> which is also the advantage of Cassandra compared to some 
>>>>>>>>>>>>>>>>>>>>> rdbms.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> David Capwell <dcapw...@apple.com>于2023年12月13日 周三上午3:37写道：
>>>>>>>>>>>>>>>>>>>>>> Overall LGTM.  
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Dec 12, 2023, at 5:29 AM, Benjamin Lerer 
>>>>>>>>>>>>>>>>>>>>>>> <ble...@apache.org> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hi everybody,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I would like to open the discussion on the introduction 
>>>>>>>>>>>>>>>>>>>>>>> of a cost based optimizer to allow Cassandra to pick 
>>>>>>>>>>>>>>>>>>>>>>> the best execution plan based on the data 
>>>>>>>>>>>>>>>>>>>>>>> distribution.Therefore, improving the overall query 
>>>>>>>>>>>>>>>>>>>>>>> performance.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> This CEP should also lay the groundwork for the future 
>>>>>>>>>>>>>>>>>>>>>>> addition of features like joins, subqueries, OR/NOT and 
>>>>>>>>>>>>>>>>>>>>>>> index ordering.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> The proposal is here: 
>>>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-39%3A+Cost+Based+Optimizer
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thank you in advance for your feedback.
>>>> 
>>

Re: [DISCUSS] CEP-39: Cost Based Optimizer

Reply via email to