On Tue, Nov 29, 2011 at 6:16 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:
> I'd like to start a discussion about ideas to improve release quality for
> Cassandra. Specifically I wonder if the community can do more to help the
> project as a whole become more solid. Cassandra has an active and vibrant
> community using Cassandra for a variety of things. If we all pitch in a
> little bit, it seems like we can make a difference here.
>
> Release quality is difficult, especially for a distributed system like
> Cassandra. The core devs have done an amazing job with this considering
> how complicated it is. Currently, there are several things in place to
> make sure that a release is generally usable:
> - review-then-commit
> - 72 hour voting period
> - at least 3 binding +1 votes
> - unit tests
> - integration tests
> Then there is the personal responsibility aspect - testing a release in a
> staging environment before pushing it to production.
>
> I wonder if more could be done here to give more confidence in releases.
> I wanted to see if there might be ways that the community could help out
> without being too burdensome on either the core devs or the community.
>
> Some ideas:
> More automation: run YCSB and stress with various setups. Maybe people
> can rotate donating cloud instances (or simply money for them) but have a
> common set of scripts to do this in the source.
>
> Dedicated distributed test suite: I know there has been work done on
> various distributed test suites (which is great!) but none have really
> caught on so far.
>
> I know what the apache guidelines say, but what if the community could
> help out with the testing effort in a more formal way. For example, for
> each release to be finalized, what if there needed to be 3 community
> members that needed to try it out in their own environment?
>
> What if there was a post release +1 vote for the community to sign off on
> - sort of a "works for me" kind of thing to reassure others that it's safe
> to try.
> So when the release email gets posted to the user list, start a
> tradition of people saying +1 in reply if they've tested it out and it
> works for them. That's happening informally now when there are problems,
> but it might be nice to see a vote of confidence. Just another idea.
>
> Any other ideas or variations?

I am no software engineering guru, but whenever I +1 a Hive release I actually do check out the code and run a couple of queries. Mostly I do this because there are just so many things that are not unit testable, like those gosh darn bash scripts that launch Java applications. There have been times when, even after multiple patch revisions and passing unit tests, something just does not work in the real world. So I never +1 a binary release I have not spent an hour with, and if possible I try twisting the knobs on any new feature, or at least try the basics.

Hive is aiming for something like quarterly releases, so it is possibly better to have Cassandra do time-based releases too. It does not have to be quarterly, but if people want bleeding-edge features (something committed 2 days ago), they really should go out and build from trunk.

It seems like the Cassandra devs have the voting and releasing down to a science, but from my world the types of bugs I worry about are data file corruption and any weird bug that would result in data faults, like read_repair not working, writes not going to the right nodes, or bloom filters giving a faulty result. New features are great and I love seeing them, but I can wait for those.

Updates now, even trivial ones, get political; you just never want to be the guy who champions an update and then has it not go well :)

Most users of Cassandra are going to have large clusters, and really the project should not outstrip the common user's ability to stay up to date. You have to figure that for a large cluster, say 20 nodes with maybe 200GB of data per node, doing a rolling restart without degrading performance is going to take some time.
This is more than 'yum update cassandra' and '/etc/init.d/cassandra restart', and with the risk of something going wrong, people need time to QA and time for ops. This type of person does not like to fall many releases behind, and likewise cannot be updating too often either. I have never had to roll back a release, but I usually wait a month before running one to make sure there is not another one following soon.
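
To put a rough number on the rolling restart point above, here is a back-of-envelope sketch in Python. Only the 20-node cluster size comes from the discussion. The function name and the per-node minute values (drain, restart, settle) are assumptions made up for illustration, not measurements from any real cluster.

```python
# Back-of-envelope estimate for a rolling restart done one node at a time.
# ASSUMPTION: every per-node timing here is an illustrative guess, not a measurement.


def rolling_restart_minutes(nodes, drain_min, restart_min, settle_min):
    """Total wall-clock minutes when nodes are restarted strictly one at a time.

    drain_min:   minutes to flush and stop a node cleanly (assumed)
    restart_min: minutes to upgrade the package and start the node (assumed)
    settle_min:  minutes to wait before touching the next node (assumed)
    """
    per_node = drain_min + restart_min + settle_min
    return nodes * per_node


if __name__ == "__main__":
    # 20 nodes is the cluster size mentioned above; the minute values are guesses.
    total = rolling_restart_minutes(nodes=20, drain_min=5, restart_min=10, settle_min=15)
    print(f"{total} minutes, roughly {total / 60:.1f} hours")
```

Even with optimistic guesses, a strictly one-at-a-time pass over the cluster adds up to a working day of ops attention, before counting QA time in staging.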