Re: Capabilities

2024-12-21 Thread Jordan West
I tend to lean towards Josh's perspective. Gossip was poorly tested and
implemented. I don't think it's a good parallel, or at least I hope it's not.
Taken to the extreme we shouldn't touch the database at all otherwise,
which isn't practical. That said, anything touching important subsystems
needs more care, testing, and time to bake. I think we're mostly discussing
"being careful", which I am totally on board with. I don't think Benedict
ever said "don't use TCM"; in fact he's said the opposite, but emphasized
the care that is required when we do, which is totally reasonable.

Back to capabilities, Riak built them on an eventually consistent subsystem
and they worked fine. If you have a split brain you likely don't want to
communicate agreement as is (or have already learned about agreement and
it's not an issue). That said, I don't think we have an EC layer in C* I
would want to rely on outside of distributed tables. So in the context of
what we have existing I think TCM is a better fit. I still need to dig a
little more to be convinced and plan to do that as I draft the CEP.
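
A minimal sketch of the kind of capability negotiation described above, loosely
modeled on the Riak approach: each node advertises the modes it supports, and a
node picks the first mode in its own preference order that every known peer
also supports. The function name, the mode strings and the data layout are all
assumptions for illustration, not an existing Riak or Cassandra API.

```python
from typing import Dict, List, Set


def negotiate(local_prefs: List[str],
              peer_supported: Dict[str, Set[str]],
              default: str) -> str:
    """Return the most preferred local mode that all known peers support.

    local_prefs    -- modes this node supports, most preferred first (hypothetical)
    peer_supported -- modes each known peer has advertised (hypothetical)
    default        -- fallback when no mode is supported by everyone
    """
    for mode in local_prefs:
        if all(mode in modes for modes in peer_supported.values()):
            return mode
    return default


# Mixed-version cluster: n3 has not been upgraded yet, so "v2" is not agreed.
peers = {"n2": {"v1", "v2"}, "n3": {"v1"}}
before_upgrade = negotiate(["v2", "v1"], peers, default="v1")

# Once n3 advertises "v2", the same call settles on the newer mode.
peers["n3"] = {"v1", "v2"}
after_upgrade = negotiate(["v2", "v1"], peers, default="v1")
```

The result depends only on what the node has learned so far, which is why the
consistency of the layer carrying the advertisements matters.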

Jordan

On Sat, Dec 21, 2024 at 5:51 AM Benedict  wrote:

> I’m not saying we need to tease out bugs from TCM. I’m saying every time
> someone touches something this central to correctness we introduce a risk
> of breaking it, and that we should exercise that risk judiciously. This has
> zero to do with the amount of data we’re pushing through it, and 100% to do
> with writing bad code.
>
> We treated gossip carefully in part because it was hard to work with, but
> in part because getting it wrong was particularly bad. We should retain the
> latter reason for caution.
>
> We also absolutely do not need TCM for consistency. We have consistent
> database functionality for that. TCM is special because it cannot rely on
> the database mechanisms, as it underpins them. That is the whole point of
> why we should treat it carefully.
>
> On 21 Dec 2024, at 13:43, Josh McKenzie  wrote:
>
> 
> To play the devil's advocate - the more we exercise TCM the more bugs we
> suss out. To Jon's point, the volume of information we're talking about
> here in terms of capabilities dissemination shouldn't stress TCM at all.
>
> I think a reasonable heuristic for relying on TCM for something is whether
> there's a big difference in UX on something being eventually consistent vs.
> strongly consistent. Exposing features to clients based on whether the
> entire cluster supports them seems like the kind of thing that could cause
> pain if we're in a split-brain, cluster-is-settling-on-agreement kind of
> paradigm.
>
> On Fri, Dec 20, 2024, at 3:17 PM, Benedict wrote:
>
>
> Mostly conceptual; the problem with a linearizable history is that if you
> lose some of it (eg because some logic bug prevents you from processing
> some epoch) you stop the world until an operator can step in to perform
> surgery about what the history should be.
>
> I do know of one recent bug to schema changes in cep-15 that broke TCM in
> this way. That particular avenue will be hardened, but the fewer places we
> risk this the better IMO.
>
> Of course, there are steps we could take to expose a limited API targeting
> these use cases, as well as using a separate log for ancillary
> functionality, that might better balance risk:reward. But equally I’m not
> sure it makes sense to TCM all the things, and maybe dogfooding our own
> database features and developing functionality that enables our own use
> cases could be better where it isn’t necessary 🤷‍♀️
>
>
> On 20 Dec 2024, at 19:22, Jordan West  wrote:
>
> 
> On Fri, Dec 20, 2024 at 11:06 AM Benedict  wrote:
>
>
> If TCM breaks we all have a really bad time, much worse than if any one of
> these features individually has problems. If you break TCM in the right way
> the cluster could become inoperable, or operations like topology changes
> may be prevented.
>
>
> Benedict, when you say this are you speaking hypothetically (in the sense
> that by using TCM more we increase the probability of using it "wrong" and
> hitting an unknown edge case) or are there known ways today that TCM
> "breaks"?
>
> Jordan
>
>
> This means that even a parallel log has some risk if we end up modifying
> shared functionality.
>
>
>
>
> On 20 Dec 2024, at 18:47, Štefan Miklošovič 
> wrote:
>
> 
> I stand corrected. C in TCM is "cluster" :D Anyway. Configuration is super
> reasonable to be put there.
>
> On Fri, Dec 20, 2024 at 7:42 PM Štefan Miklošovič 
> wrote:
>
> I am super hesitant to base distributed guardrails or any configuration
> for that matter on anything but TCM. Does not "C" in TCM stand for
> "configuration" anyway? So rename it to TSM like "schema" then if it is
> meant to be just for that. It seems to be quite ridiculous to code tables
> with caches on top when we have way more effective tooling thanks to CEP-21
> to deal with that with clear advantages of getting rid of all of that old
> mechanism we have in place.
>
> I have not seen a

Re: [DISCUSS] Index selection syntax for CASSANDRA-18112

2024-12-21 Thread Ekaterina Dimitrova
Naming is hard but to me providing what Caleb mentioned through something
like WITH OPTIONS sounds reasonable. Thanks for bringing it up.

On Sat, 21 Dec 2024 at 2:46, Joel Shepherd  wrote:

> WITH INDEX (or something equivalent) seems really useful.
>
> Less opinionated on the specific syntax, but I think there is a lot of
> value in the form of predictable, controllable performance, in giving
> developers more direct control over query execution, whether that's
> index selection or even lower-level decisions. If you've experienced the
> thrill of operating a database with a cost-based planner that abruptly
> selects a new, sub-optimal plan due to a change in statistics or
> configuration, you'll appreciate language features that yield some
> planning control back to you. It does increase the burden on the
> developer to understand how best to execute the query, but it makes
> their intent much more obvious, and easier to adjust as the system changes.
>
> -- Joel.
>
> On 12/20/2024 12:28 PM, Caleb Rackliffe wrote:
> > Some of you are probably familiar with work in the DS fork to improve
> > the selection of indexes for SAI queries in
> >
> https://github.com/datastax/cassandra/commit/eeb33dd62b9b74ecf818a263fd73dbe6714b0df0#diff-2830028723b7f4af5ec7450fae2c206aeefa5a2c3455eff6f4a0734a85cb5424.
>
> >
> >
> > While I'm eagerly anticipating working on that in the new year, I'm
> > also wondering whether we think some simple CQL extensions to manually
> > control index selection would be helpful. Maxwell proposed this a
> > while back in CASSANDRA-18112, and I'd like to propose a syntax:
> >
> >
> > ex. Do not use the specified index during the query.
> >
> > SELECT ... FROM ... WHERE ... WITHOUT INDEX 
> >
> > This could be helpful for intersection queries where one of the
> > provided clauses is not very selective and could simply be handled via
> > post-filtering.
> >
> > ex. Require the specified index to be used.
> >
> > SELECT ... FROM ... WHERE ... WITH INDEX 
> >
> > This could be helpful in scenarios where multiple indexes exist on a
> > column and was the primary motivation for CASSANDRA-18112.
> >
> > Thoughts?
>
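
A rough sketch of how the two proposed hints could constrain index selection.
It assumes a planner that, for each restricted column, has a list of applicable
indexes and by default takes the first one. The function name, the index names
and the error behaviour are hypothetical; the thread only says that WITHOUT
INDEX excludes an index and WITH INDEX requires one.

```python
from typing import Dict, List, Optional, Tuple


def select_indexes(candidates: Dict[str, List[str]],
                   with_index: Optional[str] = None,
                   without_index: Optional[str] = None
                   ) -> Tuple[Dict[str, str], List[str]]:
    """Return (index chosen per column, columns left to post-filtering)."""
    known = {name for names in candidates.values() for name in names}
    for hint in (with_index, without_index):
        if hint is not None and hint not in known:
            raise ValueError(f"index {hint!r} does not apply to this query")
    if with_index is not None and with_index == without_index:
        raise ValueError("the same index cannot be both required and excluded")

    chosen: Dict[str, str] = {}
    post_filtered: List[str] = []
    for column, names in candidates.items():
        usable = [n for n in names if n != without_index]
        if with_index in usable:
            chosen[column] = with_index      # WITH INDEX: required index wins
        elif usable:
            chosen[column] = usable[0]       # default: first applicable index
        else:
            post_filtered.append(column)     # WITHOUT INDEX removed the only one
    return chosen, post_filtered


# Two indexes on column "a", one on column "b" (all names are made up).
candidates = {"a": ["a_sai", "a_legacy"], "b": ["b_sai"]}
default_plan = select_indexes(candidates)
forced_plan = select_indexes(candidates, with_index="a_legacy")
filtered_plan = select_indexes(candidates, without_index="b_sai")
```

In this sketch excluding the only index on a column pushes that clause to
post-filtering, which matches the intersection-query case described above.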


Re: Capabilities

2024-12-21 Thread Benedict
I’m not saying we need to tease out bugs from TCM. I’m saying every time
someone touches something this central to correctness we introduce a risk
of breaking it, and that we should exercise that risk judiciously. This has
zero to do with the amount of data we’re pushing through it, and 100% to do
with writing bad code.

We treated gossip carefully in part because it was hard to work with, but
in part because getting it wrong was particularly bad. We should retain the
latter reason for caution.

We also absolutely do not need TCM for consistency. We have consistent
database functionality for that. TCM is special because it cannot rely on
the database mechanisms, as it underpins them. That is the whole point of
why we should treat it carefully.

On 21 Dec 2024, at 13:43, Josh McKenzie wrote:

> To play the devil's advocate - the more we exercise TCM the more bugs we
> suss out. To Jon's point, the volume of information we're talking about
> here in terms of capabilities dissemination shouldn't stress TCM at all.
>
> I think a reasonable heuristic for relying on TCM for something is whether
> there's a big difference in UX on something being eventually consistent vs.
> strongly consistent. Exposing features to clients based on whether the
> entire cluster supports them seems like the kind of thing that could cause
> pain if we're in a split-brain, cluster-is-settling-on-agreement kind of
> paradigm.
>
> On Fri, Dec 20, 2024, at 3:17 PM, Benedict wrote:
>
> Mostly conceptual; the problem with a linearizable history is that if you
> lose some of it (eg because some logic bug prevents you from processing
> some epoch) you stop the world until an operator can step in to perform
> surgery about what the history should be.
>
> I do know of one recent bug to schema changes in cep-15 that broke TCM in
> this way. That particular avenue will be hardened, but the fewer places we
> risk this the better IMO.
>
> Of course, there are steps we could take to expose a limited API targeting
> these use cases, as well as using a separate log for ancillary
> functionality, that might better balance risk:reward. But equally I’m not
> sure it makes sense to TCM all the things, and maybe dogfooding our own
> database features and developing functionality that enables our own use
> cases could be better where it isn’t necessary 🤷‍♀️
>
> On 20 Dec 2024, at 19:22, Jordan West wrote:
>
> On Fri, Dec 20, 2024 at 11:06 AM Benedict wrote:
>
> If TCM breaks we all have a really bad time, much worse than if any one of
> these features individually has problems. If you break TCM in the right way
> the cluster could become inoperable, or operations like topology changes
> may be prevented.
>
> Benedict, when you say this are you speaking hypothetically (in the sense
> that by using TCM more we increase the probability of using it "wrong" and
> hitting an unknown edge case) or are there known ways today that TCM
> "breaks"?
>
> Jordan
>
> This means that even a parallel log has some risk if we end up modifying
> shared functionality.
>
> On 20 Dec 2024, at 18:47, Štefan Miklošovič wrote:
>
> I stand corrected. C in TCM is "cluster" :D Anyway. Configuration is super
> reasonable to be put there.
>
> On Fri, Dec 20, 2024 at 7:42 PM Štefan Miklošovič wrote:
>
> I am super hesitant to base distributed guardrails or any configuration
> for that matter on anything but TCM. Does not "C" in TCM stand for
> "configuration" anyway? So rename it to TSM like "schema" then if it is
> meant to be just for that. It seems to be quite ridiculous to code tables
> with caches on top when we have way more effective tooling thanks to CEP-21
> to deal with that with clear advantages of getting rid of all of that old
> mechanism we have in place.
>
> I have not seen any concrete examples of risks why using TCM should be
> just for what it is currently for. Why not put the configuration meant to
> be cluster-wide into that?
>
> What is it ... performance? What does even the term "additional
> complexity" mean? Complex in what? Do you think that putting there 3
> types of transformations in case of guardrails which flip some booleans
> and numbers would suddenly make TCM way more complex? Come on ...
>
> This has nothing to do with what Jordan is trying to introduce. I think
> we all agree he knows what he is doing and if he evaluates that TCM is
> too much for his use case (or it is not a good fit) that is perfectly
> fine.
>
> On Fri, Dec 20, 2024 at 7:22 PM Paulo Motta wrote:
>
> > It should be possible to use distributed system tables just fine for
> > capabilities, config and guardrails.
>
> I have been thinking about this recently and I agree we should be wary
> about introducing new TCM states and create additional complexity that
> can be serviced by existing data dissemination mechanisms (gossip/system
> tables). I would prefer that we take a more phased and incremental
> approach to introduce new TCM states.
>
> As a way to accomplish that, I have thought about introducing a new
> generic TCM state "In Maintenance", where schema or membership changes
> are "frozen/

Re: Capabilities

2024-12-21 Thread Josh McKenzie
To play the devil's advocate - the more we exercise TCM the more bugs we suss 
out. To Jon's point, the volume of information we're talking about here in 
terms of capabilities dissemination shouldn't stress TCM at all.

I think a reasonable heuristic for relying on TCM for something is whether 
there's a big difference in UX on something being eventually consistent vs. 
strongly consistent. Exposing features to clients based on whether the entire 
cluster supports them seems like the kind of thing that could cause pain if 
we're in a split-brain, cluster-is-settling-on-agreement kind of paradigm.
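
To make the pain point concrete, here is a small sketch of two nodes answering
"does the entire cluster support this feature?" from different membership
views, as could happen while an eventually consistent layer is still settling.
The node names, the feature name and the helper function are invented for
illustration only.

```python
from typing import Dict, Set

View = Dict[str, Set[str]]  # node name -> features that node has advertised


def supported_everywhere(feature: str, view: View) -> bool:
    """True only if every node in this view has advertised the feature."""
    return all(feature in features for features in view.values())


# Full picture: n3 is an older node that does not support "feature_x".
full_view: View = {"n1": {"feature_x"}, "n2": {"feature_x"}, "n3": set()}

# n1 has not yet heard from n3, so its view of the cluster is incomplete.
stale_view: View = {k: v for k, v in full_view.items() if k != "n3"}

answer_on_n1 = supported_everywhere("feature_x", stale_view)
answer_on_n2 = supported_everywhere("feature_x", full_view)
```

Here n1 would expose the feature to clients while n2 would not, until the two
views converge.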

On Fri, Dec 20, 2024, at 3:17 PM, Benedict wrote:
> 
> Mostly conceptual; the problem with a linearizable history is that if you 
> lose some of it (eg because some logic bug prevents you from processing some 
> epoch) you stop the world until an operator can step in to perform surgery 
> about what the history should be.
> 
> I do know of one recent bug to schema changes in cep-15 that broke TCM in 
> this way. That particular avenue will be hardened, but the fewer places we 
> risk this the better IMO. 
> 
> Of course, there are steps we could take to expose a limited API targeting 
> these use cases, as well as using a separate log for ancillary functionality, 
> that might better balance risk:reward. But equally I’m not sure it makes 
> sense to TCM all the things, and maybe dogfooding our own database features 
> and developing functionality that enables our own use cases could be better 
> where it isn’t necessary 🤷‍♀️
> 
> 
>> On 20 Dec 2024, at 19:22, Jordan West  wrote:
>> 
>> On Fri, Dec 20, 2024 at 11:06 AM Benedict  wrote:
>>> 
>>> If TCM breaks we all have a really bad time, much worse than if any one of 
>>> these features individually has problems. If you break TCM in the right way 
>>> the cluster could become inoperable, or operations like topology changes 
>>> may be prevented. 
>> 
>> Benedict, when you say this are you speaking hypothetically (in the sense 
>> that by using TCM more we increase the probability of using it "wrong" and 
>> hitting an unknown edge case) or are there known ways today that TCM 
>> "breaks"?  
>> 
>> Jordan
>>  
>>> This means that even a parallel log has some risk if we end up modifying 
>>> shared functionality.
>>> 
>>> 
>>> 
>>> 
 On 20 Dec 2024, at 18:47, Štefan Miklošovič  wrote:
 
 I stand corrected. C in TCM is "cluster" :D Anyway. Configuration is super 
 reasonable to be put there.
 
 On Fri, Dec 20, 2024 at 7:42 PM Štefan Miklošovič  
 wrote:
> I am super hesitant to base distributed guardrails or any configuration 
> for that matter on anything but TCM. Does not "C" in TCM stand for 
> "configuration" anyway? So rename it to TSM like "schema" then if it is 
> meant to be just for that. It seems to be quite ridiculous to code tables 
> with caches on top when we have way more effective tooling thanks to 
> CEP-21 to deal with that with clear advantages of getting rid of all of 
> that old mechanism we have in place.
> 
> I have not seen any concrete examples of risks why using TCM should be 
> just for what it is currently for. Why not put the configuration meant to 
> be cluster-wide into that? 
> 
> What is it ... performance? What does even the term "additional 
> complexity" mean? Complex in what? Do you think that putting there 3 
> types of transformations in case of guardrails which flip some booleans 
> and numbers would suddenly make TCM way more complex? Come on ...
> 
> This has nothing to do with what Jordan is trying to introduce. I think 
> we all agree he knows what he is doing and if he evaluates that TCM is 
> too much for his use case (or it is not a good fit) that is perfectly 
> fine. 
> 
> On Fri, Dec 20, 2024 at 7:22 PM Paulo Motta  wrote:
>> > It should be possible to use distributed system tables just fine for 
>> > capabilities, config and guardrails.
>> 
>> I have been thinking about this recently and I agree we should be wary 
>> about introducing new TCM states and create additional complexity that 
>> can be serviced by existing data dissemination mechanisms (gossip/system 
>> tables). I would prefer that we take a more phased and incremental 
>> approach to introduce new TCM states.
>> 
>> As a way to accomplish that, I have thought about introducing a new 
>> generic TCM state "In Maintenance", where schema or membership changes 
>> are "frozen/disallowed" while an external operation is taking place. 
>> This "external operation" could mean many things:
>> - Upgrade
>> - Downgrade
>> - Migration
>> - Capability Enablement/Disablement
>> 
>> These could be sub-states of the "Maintenance" TCM state, that could be 
>> managed externally (via cache/gossip/system tables/sidecar). Once these 
>> sub-states are validated thourou