Re: Summary of 4.0 Large Features/Breaking Changes (Was: Rough roadmap for 4.0)

Dave Brosius Sun, 20 Nov 2016 09:03:07 -0800

>> We fully intend to "engineer and test the snot out of" the changes we are working on as the whole point of us working on them is so we *can* run them in production, at our scale.

I'm not sure how the Apache team does this. Perhaps individual engineers can run some modern version at a company of theirs, although that seems unlikely, but as an Apache org, I just don't see how that happens.

To me it seems like the Apache Cassandra infrastructure itself needs to stand up a multinode live instance running some 'real-world' example that is getting pounded, so that we can stage feature branches to really test them.

Otherwise we will forever be basing versions on the poor test saps who decide they are willing to risk all to upgrade to the cutting edge, and that is why everyone believes in the adage: don't upgrade until at least .6.


--dave


On 11/20/2016 09:50 AM, Jason Brown wrote:

Hey all,

One of the goals on my team, when working on large patches, is to get
community feedback on these initiatives before throwing them into prod.
This gets us a wider net of feedback (see Sylvain's continuing excellent
rounds of feedback to my work on CASSANDRA-8457), as well as making sure we
don't go too far off the deep end in terms of straying from the community
version. The latter point is crucial because if we make too many
incompatible changes to, for example, the internode messaging protocol or
the CQL protocol or the sstable file format, and deploy that, it may be
very difficult, if not impossible, to rectify with future, in-development
versions of cassandra.

We fully intend to "engineer and test the snot out of" the changes we are
working on as the whole point of us working on them is so we *can* run them
in production, at our scale. We aren't expecting others in the community to
dog food it for us. There will be a delay between committing something
upstream, and us backporting it to a current version we run in production
and actually deploying it. However, you can be sure that any bugs we find
will be fixed ASAP; we have many users counting on it.

Thanks for listening,

-Jason


On Sat, Nov 19, 2016 at 11:04 AM, Blake Eggleston <[email protected]>
wrote:

I think Ed's just using gossip 2.0 as a hypothetical example. His point is
that we should only commit things when we have a high degree of confidence
that they work correctly, not with the expectation that they don't.


On November 19, 2016 at 10:52:38 AM, Michael Kjellman (
[email protected]) wrote:

Jason has asked for review and feedback many times. Maybe be constructive
and review his code instead of just complaining (once again)?

Sent from my iPhone

On Nov 19, 2016, at 1:49 PM, Edward Capriolo <[email protected]> wrote:

I would say start with a mindset like 'people will run this in production',
not like 'why would you expect this to work'.

Now how does this logic affect feature development? Maybe use gossip 2.0 as
an example.

I will play my given Debbie Downer role. I could imagine 1 or 2 dtests and
the logic of 'don't expect it to work': unleash 4.0 onto hordes of newbies
with a Twitter announcement of the release, and let the bugs trickle in.

One could also do something comprehensive, like test on clusters of 2 to
1000 nodes. Test with Jepsen to see what happens during partitions, inject
things like JVM pauses and account for the behavior. Log convergence times
after given events.

Take a stand and say, look: "We engineered and beat the crap out of this
feature. I deployed this release feature at my company and eat my dogfood.
You are not my crash test dummy."

On Saturday, November 19, 2016, Jeff Jirsa <[email protected]> wrote:

Any proposal to solve the problem you describe?

--
Jeff Jirsa

On Nov 19, 2016, at 8:50 AM, Edward Capriolo <[email protected]> wrote:

This is especially relevant if people wish to focus on removing things.

For example, gossip 2.0 sounds great, but seems geared toward huge clusters,
which is not likely a majority of users. For those with a 20 node cluster,
are the indirect benefits worth it?

Also there seems to be a first push to remove things like compact storage
or thrift. Fine, great. But what is the realistic update path for someone?
If the big players are running 2.1 and maintaining backports, the average
shop without a dedicated team is going to be stuck saying (great features
in 4.0 that improve performance, I would probably switch but it's not stable
and we have that one compact storage CF and who knows what is going to
happen performance-wise when)

We really need to lose this "release won't be stable for 6 minor versions"
concept.

On Saturday, November 19, 2016, Edward Capriolo <[email protected]> wrote:

On Friday, November 18, 2016, Jeff Jirsa <[email protected]> wrote:

We should assume that we’re ditching tick/tock. I’ll post a thread on
4.0-and-beyond here in a few minutes.

The advantage of a prod release every 6 months is less incentive to push
unfinished work into a release.
The disadvantage of a prod release every 6 months is that we then either have
a very short lifespan per release, or we have to maintain lots of active
releases.

2.1 has been out for over 2 years, and a lot of people (including us) are
running it in prod – if we have a release every 6 months, that means we’d
be supporting 4+ releases at a time, just to keep parity with what we have
now? Maybe that’s ok, if we’re very selective about ‘support’ for 2+ year
old branches.


On 11/18/16, 3:10 PM, "[email protected] on behalf of Blake Eggleston" <[email protected]> wrote:

While stability is important if we push back large "core" changes
until later we're just setting ourselves up to face the same issues later on

In theory, yes. In practice, when incomplete features are earmarked for
a certain release, those features are often rushed out, and not always
fully baked.

In any case, I don’t think it makes sense to spend too much time
planning what goes into 4.0, and what goes into the next major release, with
so many release-strategy decisions still up in the air. Are we
going to ditch tick-tock? If so, what will its replacement look like?
Specifically, when will the next “production” release happen? Without
knowing that, it's hard to say if something should go in 4.0, or 4.5, or
5.0, or whatever.

The reason I suggested a production release every 6 months is that
(in my mind) it’s frequent enough that people won’t be tempted to rush
features to hit a given release, but not so frequent that it’s not
practical to support. It wouldn’t be the end of the world if some of these
tickets didn’t make it into 4.0, because 4.5 would be fine.

On November 18, 2016 at 1:57:21 PM, kurt Greaves ([email protected]) wrote:

On 18 November 2016 at 18:25, Jason Brown <[email protected]> wrote:

#11559 (enhanced node representation) - decided it's *not* something we
need wrt #7544 storage port configurable per node, so we are punting on

#12344 - Forward writes to replacement node with same address during replace
depends on #11559. To be honest I'd say #12344 is pretty important,
otherwise it makes it difficult to replace nodes without potentially
requiring client code/configuration changes. It would be nice to get #12344
in for 4.0. It's marked as an improvement but I'd consider it a bug and
thus think it could be included in a later minor release.

Introducing all of these in a single release seems pretty risky. I think it
would be safer to spread these out over a few 4.x releases (as they’re
finished) and give them time to stabilize before including them in an LTS
release. The downside would be having to maintain backwards compatibility
across the 4.x versions, but that seems preferable to delaying the release
of 4.0 to include these, and having another big bang release.


I don't think anyone expects 4.0.0 to be stable. It's a major version
change with lots of new features; in the production world people don't
normally move to a new major version until it has been out for quite some
time and several minor releases have passed. Really, most people are only
migrating to 3.0.x now. While stability is important, if we push back large
"core" changes until later we're just setting ourselves up to face the same
issues later on. There should be enough uptake on the early releases of 4.0
from new users to help test and get it to a production-ready state.


Kurt Greaves
[email protected]

I don't think anyone expects 4.0.0 to be stable

Someone previously described 3.0 as the "break everything release".

We know that many people are still on 2.1 and 3.0. Cassandra will always be
maintaining 3 or 4 active branches and have adoption issues if releases are
not stable and usable.

Being that Cassandra was 1.0 years ago, I expect things to be stable.
Half working features, or "added this, broke that", are not appealing to me.



--
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


