Re: Capabilities

2025-01-30 Thread Štefan Miklošovič
On Wed, Jan 29, 2025 at 8:15 PM David Capwell wrote: > One motivating case for TCM vs non-TCM… When accord processes the user > request, we can make sure to use the configs as they were at the execution > epoch… by splitting this out it makes all configs non-deterministic from > Accord’s point of

Re: Capabilities

2025-01-29 Thread Benedict
This is a real problem that David points to. Anything that affects query execution needs to be deterministic for Accord to produce the same outcome on all nodes. If one node refuses to execute a write and another doesn’t, they will behave inconsistently. There are other ways around this though, such

Re: Capabilities

2025-01-29 Thread Paulo Motta
> Simple example, lets say you add a global config that says you can’t write more than X bytes, when this is outside of TCM accord can have multiple different values while executing the query (assuming a user changed it)… Couldn’t we have an eventually consistent barrier, so that a new configurati

Re: Capabilities

2025-01-29 Thread David Capwell
One motivating case for TCM vs non-TCM… When Accord processes the user request, we can make sure to use the configs as they were at the execution epoch… by splitting this out it makes all configs non-deterministic from Accord’s point of view… Simple example, let’s say you add a global config th
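To make the determinism point concrete, here is a minimal sketch of reading a config "as of" an execution epoch. It is not Cassandra, TCM, or Accord code: `EpochConfigLog`, `accept_write`, and the `max_write_bytes` key are hypothetical names invented for illustration, and the only assumption is that config changes are recorded against strictly increasing epochs.

```python
from bisect import bisect_right


class EpochConfigLog:
    """Hypothetical history of one config key, recorded against epochs."""

    def __init__(self, initial_value):
        # Epoch 0 holds the initial value; later changes are appended in order.
        self._epochs = [0]
        self._values = [initial_value]

    def set(self, epoch, value):
        if epoch <= self._epochs[-1]:
            raise ValueError("epochs must be strictly increasing")
        self._epochs.append(epoch)
        self._values.append(value)

    def value_at(self, epoch):
        # Find the latest change whose epoch is <= the requested epoch.
        idx = bisect_right(self._epochs, epoch) - 1
        if idx < 0:
            raise ValueError("no value recorded at or before this epoch")
        return self._values[idx]


def accept_write(log, execution_epoch, payload_bytes):
    """Every node evaluating the same epoch reaches the same decision."""
    return payload_bytes <= log.value_at(execution_epoch)


max_write_bytes = EpochConfigLog(initial_value=1024)
max_write_bytes.set(epoch=5, value=256)

decision_before = accept_write(max_write_bytes, execution_epoch=4, payload_bytes=512)
decision_after = accept_write(max_write_bytes, execution_epoch=5, payload_bytes=512)
```

Because the decision depends only on the execution epoch and the recorded history, two nodes that apply the config change at different wall-clock times still agree on whether a given write is accepted.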

Re: Capabilities

2025-01-29 Thread Paulo Motta
> Using TCM to distribute this information across the cluster vs. using some other LWT-ish distributed CP solution higher in the stack should effectively have the same UX guarantees to us and our users right? So I think it's still quite viable, even if we're just LWT'ing things into distributed ta

Re: Capabilities

2025-01-29 Thread David Capwell
To be explicit about my concerns in the previous comments… TCM vs new table, I don’t care too much. I prefer TCM over a new table, but it’s a preference. My comments before were more about the UX of global configs. As long as we “could” (maybe per config, not every config likely needs this) allow

Re: Capabilities

2025-01-29 Thread Josh McKenzie
Using TCM to distribute this information across the cluster vs. using some other LWT-ish distributed CP solution higher in the stack should effectively have the same UX guarantees to us and our users right? So I think it's still quite viable, even if we're just LWT'ing things into distributed ta

Re: Capabilities

2025-01-29 Thread Štefan Miklošovič
I want to ask about this ticket in particular. I know I am somewhat hijacking this thread, but taking the recent discussion into account, where we kind of rejected the idea of using the TCM log for storing configuration, what does this mean for tickets like this? Is this still viable or do we need to completely

Re: Capabilities

2025-01-07 Thread Štefan Miklošovič
It would be cool if it was acting like this, then the whole plugin would become irrelevant when it comes to the migrations. https://github.com/instaclustr/cassandra-everywhere-strategy https://github.com/instaclustr/cassandra-everywhere-strategy?tab=readme-ov-file#motivation On Mon, Jan 6, 2025 a

Re: Capabilities

2025-01-07 Thread Štefan Miklošovič
One more point about JMX and overriding the local configuration ... Isn't it true that we have the system_views.settings vtable? I think there were already some ideas on how to make this table mutable so that, when updated, it would set that configuration on the respective node at runtime. I think Maxim is/wa

Re: Capabilities

2025-01-06 Thread Štefan Miklošovič
Very well written, David. Good points. I am happy we are talking about all of this stuff and where the discussion is going. TCM or not, I find it important we finally go through it all, regardless of what we eventually end up with. I would like to add a few points, especially to your last two parag

Re: Capabilities

2025-01-06 Thread Jon Haddad
Netflix solved the heterogeneous configuration issue really well. Definitely the best I've seen. With Spinnaker, you could set a config and override it at a DC, AZ or node level. It would generate the C* yaml for you and plop it on the box, do all the restarts etc. It was really convenient for t

Re: Capabilities

2025-01-06 Thread David Capwell
> Stefan, global configuration and capabilities do have some overlap but not > full overlap. For example, you may want to set globally that a cluster > enables feature X or control the threshold for a guardrail but you still need > to know if all nodes support feature X or have that guardrail, t

Re: Capabilities

2025-01-06 Thread Jon Haddad
What about finally adding a much desired EverywhereStrategy? It wouldn't just be useful for config - system_auth bites a lot of people today. As much as I don't like to suggest row cache, it might be a good fit here as well. We could remove the custom code around auth cache in the process. Jon

Re: Capabilities

2025-01-06 Thread Benedict Elliott Smith
The more we talk about this, the more my position crystallises against this approach. The feature we’re discussing here should be easy to implement on top of user facing functionality; we aren’t the only people who want functionality like this. We should be dogfooding our own UX for this kind of

Re: Capabilities

2025-01-06 Thread Blake Eggleston
TCM was designed with a couple of very specific correctness-critical use cases in mind, not as a generic mechanism for everyone to extend. Its initial scope was for those use cases, but its potential for enabling more sophisticated functionality was one of its selling points and is l

Re: Capabilities

2025-01-06 Thread Aleksey Yeshchenko
> Would you mind elaborating on what makes it unsuitable? I don’t have a good > mental model on its properties, so i assumed that it could be used to > disseminate arbitrary key value pairs like config fairly easily. It’s more than *capable* of disseminating arbitrary-ish key-value pairs - it

Re: Capabilities

2025-01-06 Thread Aleksey Yeshchenko
I agree that this would be useful, yes. An LWT/Accord variant plus a plain writes eventually consistent variant. A generic-by-design internal-only per-table mechanism with optional caching + optional write notifications issued to non-replicas. > On 6 Jan 2025, at 14:26, Josh McKenzie wrote: >

Re: Capabilities

2025-01-06 Thread Josh McKenzie
> I think if we go down the route of pushing configs around with LWT + caching > instead, we should have that be a generic system that is designed for > everyone to use. Agreed. Otherwise we end up with the same problem Aleksey's speaking about above, where we build something for a specific pur

Re: Capabilities

2025-01-06 Thread Jon Haddad
Would you mind elaborating on what makes it unsuitable? I don’t have a good mental model on its properties, so I assumed that it could be used to disseminate arbitrary key value pairs like config fairly easily. Somewhat humorously, I think that same assumption was made when putting sai metadata in

Re: Capabilities

2025-01-06 Thread Aleksey Yeshchenko
TCM was designed with a couple of very specific correctness-critical use cases in mind, not as a generic mechanism for everyone to extend. It might be *convenient* to employ TCM for some other features, which makes it tempting to abuse TCM for an unintended purpose, but we shouldn’t do what's c

Re: Capabilities

2024-12-21 Thread Jordan West
I tend to lean towards Josh's perspective. Gossip was poorly tested and implemented. I don't think it's a good parallel or at least I hope it's not. Taken to the extreme, we shouldn't touch the database at all otherwise, which isn't practical. That said, anything touching important subsystems needs m

Re: Capabilities

2024-12-21 Thread Benedict
I’m not saying we need to tease out bugs from TCM. I’m saying every time someone touches something this central to correctness we introduce a risk of breaking it, and that we should exercise that risk judiciously. This has zero to do with the amount of data we’re pushing through it, and 100% to do

Re: Capabilities

2024-12-21 Thread Josh McKenzie
To play the devil's advocate - the more we exercise TCM the more bugs we suss out. To Jon's point, the volume of information we're talking about here in terms of capabilities dissemination shouldn't stress TCM at all. I think a reasonable heuristic for relying on TCM for something is whether th

Re: Capabilities

2024-12-20 Thread Benedict
Mostly conceptual; the problem with a linearizable history is that if you lose some of it (e.g. because some logic bug prevents you from processing some epoch) you stop the world until an operator can step in to perform surgery on what the history should be. I do know of one recent bug to schema ch

Re: Capabilities

2024-12-20 Thread Jordan West
On Fri, Dec 20, 2024 at 11:06 AM Benedict wrote: > If TCM breaks we all have a really bad time, much worse than if any one of > these features individually has problems. If you break TCM in the right way > the cluster could become inoperable, or operations like topology changes > may be prevented

Re: Capabilities

2024-12-20 Thread Benedict
If TCM breaks we all have a really bad time, much worse than if any one of these features individually has problems. If you break TCM in the right way the cluster could become inoperable, or operations like topology changes may be prevented. So, we want to keep its responsibilities scoped sensibly,

Re: Capabilities

2024-12-20 Thread Jon Haddad
I don’t know the details and limits of TCM well enough to comment on what it can do, but I think it’s fair to say that if we can’t put a few hundred configuration options in, taking up maybe a few MB, there’s a fundamental problem with it, and we need to seriously reconsider if it’s ready for product

Re: Capabilities

2024-12-20 Thread Paulo Motta
Apologies I missed the forked thread "Re: Capabilities" before commenting on this. I think the TCM-lite suggestion there is not incompatible with the generic "In Maintenance" TCM state that I am proposing, since while in this state each individual feature could also have their

Re: Capabilities

2024-12-20 Thread Štefan Miklošovič
I stand corrected. C in TCM is "cluster" :D Anyway. Configuration is super reasonable to be put there. On Fri, Dec 20, 2024 at 7:42 PM Štefan Miklošovič wrote: > I am super hesitant to base distributed guardrails or any configuration > for that matter on anything but TCM. Does not "C" in TCM sta

Re: Capabilities

2024-12-20 Thread Štefan Miklošovič
I am super hesitant to base distributed guardrails or any configuration for that matter on anything but TCM. Does not "C" in TCM stand for "configuration" anyway? So rename it to TSM like "schema" then if it is meant to be just for that. It seems to be quite ridiculous to code tables with caches on

Re: Capabilities

2024-12-20 Thread Paulo Motta
> It should be possible to use distributed system tables just fine for capabilities, config and guardrails. I have been thinking about this recently and I agree we should be wary about introducing new TCM states and create additional complexity that can be serviced by existing data dissemination m

Re: Capabilities

2024-12-20 Thread Jordan West
One minor clarification: ETS is entirely in memory (unless you explicitly dump it to disk or use DETS) so the equivalence to a local system table is only partially accurate but I think the parallel is fine in the case of what I was describing. Jordan On Fri, Dec 20, 2024 at 09:07 Jordan West wr

Re: Capabilities

2024-12-20 Thread Jordan West
Benedict, I agree with you that TCM might be overkill for capabilities. It’s truly something that’s fine to be eventually consistent. Riak’s implementation used a local ETS table (ETS is built into Erlang; the equivalent for us would be a local-only system table) and an efficient and reliable gossip protocol.

Re: Capabilities

2024-12-20 Thread Štefan Miklošovič
Having a parallel and feature focused TCM log as you suggested seems perfectly reasonable to me. On Fri, Dec 20, 2024 at 11:33 AM Benedict wrote: > Guardrails are broadly the same as Auth which works this way, but with > less criticality. It’s fine if guardrails are updated slowly. > > But, agai

Re: Capabilities

2024-12-20 Thread Benedict
Guardrails are broadly the same as Auth, which works this way, but with less criticality. It’s fine if guardrails are updated slowly. But, again, TCM is a fine target for this. It would, however, be nice to have an in-between capability, TCM-lite if you will, for these features. Perhaps even jus

Re: Capabilities

2024-12-20 Thread Štefan Miklošovič
What do you mean by a distributed table? You mean these in system_distributed keyspace? If so, imagine we introduce a table system_distributed.guardrails where each row would hold what a guardrail would be set to, hence on guardrails evaluation in runtime (and there are a bunch of them to consider

Re: Capabilities

2024-12-20 Thread Benedict
If you perform a read from a distributed table on startup you will find the latest information. What catchup are you thinking of? I don’t think any of the features we talked about need a log, only the latest information. We can (and should) probably introduce event listeners for distributed tables,
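The "read the latest on startup, then listen for changes" idea can be sketched as follows. This is an illustration only: `ConfigTable` is an in-memory stand-in for a distributed table, and `NodeConfigCache` and the listener callback shape are hypothetical, not an existing Cassandra API.

```python
class ConfigTable:
    """In-memory stand-in for a distributed table that keeps only latest values."""

    def __init__(self):
        self._rows = {}
        self._listeners = []

    def subscribe(self, listener):
        # listener is called as listener(key, value) after every write.
        self._listeners.append(listener)

    def write(self, key, value):
        self._rows[key] = value
        for listener in list(self._listeners):
            listener(key, value)

    def read_all(self):
        # Return a copy so callers cannot mutate the table's state.
        return dict(self._rows)


class NodeConfigCache:
    """Hypothetical per-node cache: one read at startup, then event-driven updates."""

    def __init__(self, table):
        # No log replay: the startup read already returns the latest values.
        self._cache = table.read_all()
        table.subscribe(self._on_update)

    def _on_update(self, key, value):
        self._cache[key] = value

    def get(self, key, default=None):
        return self._cache.get(key, default)


table = ConfigTable()
table.write("max_write_bytes", 1024)

early_node = NodeConfigCache(table)   # starts after the first write
table.write("max_write_bytes", 256)   # delivered to early_node via the listener
late_node = NodeConfigCache(table)    # starts later and simply reads the latest
```

Both nodes end up with the same latest value without replaying any history, which is the property the message argues is sufficient for these features.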

Re: Capabilities

2024-12-20 Thread Štefan Miklošovič
I find TCM way more comfortable to work with. The capability of the log being replayed on restart and catching up with everything else automatically is a godsend. If we had that on "good old distributed tables", then is it not true that we would need to take extra care of that, e.g. we would need to rep

Re: Capabilities

2024-12-20 Thread Benedict
TCM is a perfectly valid basis for this, but TCM is only really *necessary* to solve meta config problems where we can’t rely on the rest of the database working. Particularly placement issues, which is why schema and membership need to live there. It should be possible to use distributed system tab

Re: Capabilities

2024-12-20 Thread Štefan Miklošovič
Jordan, I also think that having it on TCM would be ideal and we should explore this path first before doing anything custom. Regarding my idea about the guardrails in TCM, when I prototyped that and wanted to make it happen, there was a little bit of a pushback (1) (even though super reasonable

Re: Capabilities

2024-12-19 Thread Jordan West
Firstly, glad to see the support and enthusiasm here and in the recent Slack discussion. I think there is enough for me to start drafting a CEP. Stefan, global configuration and capabilities do have some overlap but not full overlap. For example, you may want to set globally that a cluster enables
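The distinction between "enabled globally" and "supported by every node" can be sketched as below. The function names, node names, and feature names are hypothetical and chosen only for illustration; this is not how Riak or Cassandra implements capabilities, only one simple way to model the rule that a feature should be used only when the operator has enabled it and every node advertises support for it.

```python
def cluster_capabilities(supported_by_node):
    """Return the set of features that every node advertises."""
    node_sets = [set(features) for features in supported_by_node.values()]
    if not node_sets:
        return set()
    return set.intersection(*node_sets)


def feature_active(feature, enabled_globally, supported_by_node):
    """A feature is active only if it is enabled and supported cluster-wide."""
    return (
        feature in enabled_globally
        and feature in cluster_capabilities(supported_by_node)
    )


# Mid-upgrade cluster: node3 has not been upgraded yet.
supported = {
    "node1": ["feature_x", "guardrail_y"],
    "node2": ["feature_x", "guardrail_y"],
    "node3": ["guardrail_y"],
}
enabled = {"feature_x", "guardrail_y"}

x_active_mid_upgrade = feature_active("feature_x", enabled, supported)
y_active_mid_upgrade = feature_active("guardrail_y", enabled, supported)

# After node3 is upgraded, feature_x becomes usable without a config change.
supported["node3"] = ["feature_x", "guardrail_y"]
x_active_after_upgrade = feature_active("feature_x", enabled, supported)
```

Keeping the two inputs separate is the point of the example: the global setting expresses operator intent, while the per-node advertisement expresses what the running software can actually do.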

Re: Capabilities

2024-12-19 Thread Štefan Miklošovič
Hi Jordan, what would this look like from the implementation perspective? I was experimenting with transactional guardrails where an operator would control the content of a virtual table which would be backed by TCM so whatever guardrail we would change, this would be automatically and transparent

Re: Capabilities

2024-12-19 Thread Francisco Guerrero
+1 and happy to help as well on this effort On 2024/12/19 20:21:08 Doug Rohrer wrote: > +1 (nb) and will be happy to help, especially providing input from the > Analytics side. > > Thanks Jordan! > > > On Dec 19, 2024, at 12:00 PM, Paulo Motta wrote: > > > > Nice stuff! I support this proposa

Re: Capabilities

2024-12-19 Thread Doug Rohrer
+1 (nb) and will be happy to help, especially providing input from the Analytics side. Thanks Jordan! > On Dec 19, 2024, at 12:00 PM, Paulo Motta wrote: > > Nice stuff! I support this proposal and would be happy to help on this. > > On Wed, Dec 18, 2024 at 6:00 PM Jordan West

Re: Capabilities

2024-12-19 Thread Paulo Motta
Nice stuff! I support this proposal and would be happy to help on this. On Wed, Dec 18, 2024 at 6:00 PM Jordan West wrote: > In a recent discussion on the pains of upgrading one topic that came up is > a feature that Riak had called Capabilities [1]. A major pain with upgrades > is that each nod

Re: Capabilities

2024-12-19 Thread Jeremy Hanna
+1 (nb) to improving the upgrade experience for Cassandra including opting into features and making rollbacks easier. At a higher level, I know of multiple users of Cassandra with large clusters that have run into these awkward situations where you're half upgraded and there are a variety of

Re: Capabilities

2024-12-19 Thread Jon Haddad
Love it. Big +1 On Thu, Dec 19, 2024 at 8:41 AM Bernardo Botella < conta...@bernardobotella.com> wrote: > +1 to the positive sentiment of such a feature. Huge benefit towards > reducing risks. > > > On Dec 19, 2024, at 8:31 AM, Patrick McFadin wrote: > > > > Thanks for bringing this back, Jord

Re: Capabilities

2024-12-19 Thread Bernardo Botella
+1 to the positive sentiment of such a feature. Huge benefit towards reducing risks. > On Dec 19, 2024, at 8:31 AM, Patrick McFadin wrote: > > Thanks for bringing this back, Jordan. I had completely forgotten > about Riak's Capabilities support. That was a fan favorite for > operators, along wi

Re: Capabilities

2024-12-19 Thread Patrick McFadin
Thanks for bringing this back, Jordan. I had completely forgotten about Riak's Capabilities support. That was a fan favorite for operators, along with a couple other interesting ways to control the upgrade process. +1 on a CEP from me. On Thu, Dec 19, 2024 at 7:38 AM Josh McKenzie wrote: > > Str

Re: Capabilities

2024-12-19 Thread Josh McKenzie
Strong +1. Much like having repair scheduling built in to the ecosystem, this feels like table stakes for having a self-contained, usable distributed database. On Wed, Dec 18, 2024, at 6:11 PM, Dinesh Joshi wrote: > Hi Jordan, > > Thank you for starting this thread. This is a great idea. From a

Re: Capabilities

2024-12-18 Thread Dinesh Joshi
Hi Jordan, Thank you for starting this thread. This is a great idea. From an ecosystem perspective this is absolutely critical. I'm a big +1 on working towards building this into Cassandra and the surrounding ecosystem. This would be a step in the right direction to derisk upgrades. Dinesh On Wed,