20200217 4.0 Status Update

2020-02-17 Thread Jon Meredith
My turn to give an update on 4.0 status. The 4.0 board created by Josh can
be found at


https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355.


We have 94 unresolved tickets marked against the 4.0 release. [1]


Things seem to have settled into a phase of working to resolve issues, with
few new issues added.


2 new tickets opened (that are marked against 4.0)

11 tickets closed (including one of the newly opened ones)

39 tickets received JIRA updates of some kind in the last week


Cumulative flow over the last couple of weeks shows "to do" decreasing and
"done" increasing, as it should as we continue to close out work for the
release.


https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355&projectKey=CASSANDRA&view=reporting&chart=cumulativeFlowDiagram&swimlane=939&swimlane=936&swimlane=931&column=1505&column=1506&column=1514&column=1509&column=1512&column=1507&days=14


Notables

 - Python 3 support for cqlsh has been committed (thank you all who
persevered on this)

 - Some activity on Windows support - perhaps not dead yet.

 - Lots of movement on documentation

 - Lots of activity on flaky tests.

 - The 'oldest ticket with a patch' award goes to CASSANDRA-2848


There are 18 tickets marked as patch available (easy access from the
Dashboard [2], apologies if they're already picked up for review)


CASSANDRA-15567 Allow EXTRA_CLASSPATH to work in tarball/source
installations

CASSANDRA-15553 Preview repair should include sstables from finalized
incremental repair sessions

CASSANDRA-15550 Fix flaky test
org.apache.cassandra.streaming.StreamTransferTaskTest
testFailSessionDuringTransferShouldNotReleaseReferences

CASSANDRA-15488/CASSANDRA-15353 Configuration file

CASSANDRA-15484/CASSANDRA-15353 Read Repair

CASSANDRA-15482/CASSANDRA-15353 Guarantees

CASSANDRA-15481/CASSANDRA-15353 Data Modeling

CASSANDRA-15393/CASSANDRA-15387 Add byte array backed cells

CASSANDRA-15391/CASSANDRA-15387 Reduce heap footprint of commonly allocated
objects

CASSANDRA-15367 Memtable memory allocations may deadlock

CASSANDRA-15308 Fix flakey testAcquireReleaseOutbound -
org.apache.cassandra.net.ConnectionTest

CASSANDRA-15305 Fix multi DC nodetool status output

CASSANDRA-14973 Bring v5 driver out of beta, introduce v6 before 4.0
release is cut

CASSANDRA-14939 fix some operational holes in incremental repair

CASSANDRA-14904 SSTableloader doesn't understand listening for CQL
connections on multiple ports

CASSANDRA-14842 SSL connection problems when upgrading to 4.0 when
upgrading from 3.0.x

CASSANDRA-14761 Rename speculative_retry to match additional_write_policy

CASSANDRA-2848 Make the Client API support passing down timeouts


*LHF / Failing Tests*: We have 7 unassigned test failures that are all
great candidates to pick up and get involved in:

https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355&projectKey=CASSANDRA&quickFilter=1660&quickFilter=1661&quickFilter=1658


Thanks again to everybody for all the contributions. It's really good to
see the open issue count start dropping.


Feedback on whether this information is useful and how it can be improved
is both welcome and appreciated.


Cheers, Jon


[1] Unresolved 4.0 tickets
https://issues.apache.org/jira/browse/CASSANDRA-15567?filter=12347782&jql=project%20%3D%20cassandra%20AND%20fixversion%20in%20(4.0%2C%204.0.0%2C%204.0-alpha%2C%204.0-beta)%20AND%20status%20!%3D%20Resolved
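
For readability, the JQL in that filter decodes to:

    project = cassandra AND fixversion in (4.0, 4.0.0, 4.0-alpha, 4.0-beta)
        AND status != Resolved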

[2] Patch Available
https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12334910


Re: 20200217 4.0 Status Update

2020-02-17 Thread Jeff Jirsa



Hard to see an argument for CASSANDRA-2848 being in scope for 4.0 (beyond the 
client proto change being painful for anything other than major releases).






Re: [Discuss] num_tokens default in Cassandra 4.0

2020-02-17 Thread Jeremy Hanna
I just wanted to close the loop on this if possible.  After some discussion
in Slack about various topics, I would like to see if people are okay with
num_tokens=8 by default (as it's not much different operationally from
16).  Joey brought up a few small changes that I can put on the ticket.  It
also requires some documentation for things like decommission order and
skew.
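
For concreteness, the change itself would be a one-line default in
cassandra.yaml; a minimal sketch (illustrative only; pairing it with a
non-random allocation setting such as
allocate_tokens_for_local_replication_factor is my assumption here and
would be settled on the ticket):

    num_tokens: 8
    # assumed companion setting for better-than-random token allocation:
    # allocate_tokens_for_local_replication_factor: 3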

Are people okay with this change moving forward like this?  If so, I'll
comment on the ticket and we can move forward.

Thanks,

Jeremy

On Tue, Feb 4, 2020 at 8:45 AM Jon Haddad  wrote:

> I think it's a good idea to take a step back and get a high level view of
> the problem we're trying to solve.
>
> First, high token counts result in decreased availability, as each node has
> data overlap with more nodes in the cluster.  Specifically, a node can
> share data with up to (RF-1) * 2 * num_tokens nodes.  So a 256 token
> cluster at RF=3 is going to almost always share data with every other node
> in the cluster that isn't in the same rack, unless you're doing something
> wild like using more than a thousand nodes in a cluster.  We advertise
>
> With 16 tokens, that is vastly improved, but you still have up to 64 nodes
> each node needs to query against, so you're again, hitting every node
> unless you go above ~96 nodes in the cluster (assuming 3 racks / AZs).  I
> wouldn't use 16 here, and I doubt any of you would either.  I've advocated
> for 4 tokens because you'd have overlap with only 16 nodes, which works
> well for small clusters as well as large.  Assuming I was creating a new
> cluster for myself (in a hypothetical brand new application I'm building) I
> would put this in production.  I have worked with several teams where I
> helped them put 4 token clusters in prod and it has worked very well.  We
> didn't see any wild imbalance issues.
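>
> To make that arithmetic concrete, a rough back-of-the-envelope sketch (a
> hypothetical illustration, not code from the tree):
>
>     # Upper bound on how many other nodes a node can share data with,
>     # per the formula above; actual overlap is capped by cluster size.
>     def max_overlap(rf, num_tokens):
>         return 2 * (rf - 1) * num_tokens
>
>     for tokens in (4, 16, 256):
>         print(tokens, max_overlap(3, tokens))  # RF=3 -> 16, 64, 1024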
>
> As Mick's pointed out, our current method of using random token assignment
> for the default number is problematic for 4 tokens.  I fully agree with
> this, and I think if we were to try to use 4 tokens, we'd want to address
> this in tandem.  We can discuss how to better allocate tokens by default
> (something more predictable than random), but I'd like to avoid the
> specifics of that for the sake of this email.
>
> To Alex's point, repairs are problematic with lower token counts due to
> over streaming.  I think this is a pretty serious issue and I think we'd have to
> address it before going all the way down to 4.  This, in my opinion, is a
> more complex problem to solve and I think trying to fix it here could make
> shipping 4.0 take even longer, something none of us want.
>
> For the sake of shipping 4.0 without adding extra overhead and time, I'm ok
> with moving to 16 tokens, and in the process adding extensive documentation
> outlining what we recommend for production use.  I think we should also try
> to figure out something better than random as the default to fix the data
> imbalance issues.  I've got a few ideas here I've been noodling on.
>
> As long as folks are fine with potentially changing the default again in C*
> 5.0 (after another discussion / debate), 16 is enough of an improvement
> that I'm OK with the change, and willing to author the docs to help people
> set up their first cluster.  For folks that go into production with the
> defaults, we're at least not setting them up for total failure once their
> clusters get large like we are now.
>
> In future versions, we'll probably want to address the issue of data
> imbalance by building something in that shifts individual tokens around.  I
> don't think we should try to do this in 4.0 either.
>
> Jon
>
>
>
> On Fri, Jan 31, 2020 at 2:04 PM Jeremy Hanna 
> wrote:
>
> > I think Mick and Anthony make some valid operational and skew points for
> > smaller/starting clusters with 4 num_tokens. There’s an arbitrary line
> > between small and large clusters but I think most would agree that most
> > clusters are on the small to medium side. (A small nuance is afaict the
> > probabilities have to do with quorum on a full token range, ie it has to do
> > with the size of a datacenter, not the full cluster.)
> >
> > As I read this discussion I’m personally more inclined to go with 16 for
> > now. It’s true that if we could fix the skew and topology gotchas for
> those
> > starting things up, 4 would be ideal from an availability perspective.
> > However we’re still in the brainstorming stage for how to address those
> > challenges. I think we should create tickets for those issues and go with
> > 16 for 4.0.
> >
> > This is about an out of the box experience. It balances availability,
> > operations (such as skew and general bootstrap friendliness and
> > streaming/repair), and cluster sizing. Balancing all of those, I think
> for
> > now I’m more comfortable with 16 as the default with docs on
> considerations
> > and tickets to unblock 4 as the default for all users.
> >

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-02-17 Thread Erick Ramirez
+1 on 8 tokens. I'd personally like us to be able to move this along pretty
quickly as it's confusing for users looking for direction. Cheers!



Re: 20200217 4.0 Status Update

2020-02-17 Thread Dinesh Joshi
> On Feb 17, 2020, at 12:52 PM, Jeff Jirsa  wrote:
> 
> Hard to see an argument for CASSANDRA-2848 being in scope for 4.0 (beyond the 
> client proto change being painful for anything other than major releases).
> 

Even if it doesn't affect the v4 protocol?

Dinesh





Re: [Discuss] num_tokens default in Cassandra 4.0

2020-02-17 Thread Rahul Singh
+1 on 8

rahul.xavier.si...@gmail.com

http://cassandra.link
The Apache Cassandra Knowledge Base.


Re: [Discuss] num_tokens default in Cassandra 4.0

2020-02-17 Thread Mick Semb Wever
-1

Discussions here and on slack have brought up a number of important
concerns. I think those concerns need to be summarised here before any
informal vote.

It was my understanding that some of those concerns may even be blockers to
a move to 16. That is, we have to presume the worst-case scenario where all
tokens get randomly generated.

Can we ask for some analysis and data on the risks that different
num_tokens choices present? We shouldn't rush into a new default, and such
background information and data is valuable to operators. Maybe I missed
info/experiments that have already happened?
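
As a starting point, even a quick simulation would give us numbers to argue
from. A rough Monte Carlo sketch (my own illustrative code, not an existing
experiment) of ownership imbalance under fully random token assignment:

    import random

    def worst_imbalance(nodes, num_tokens, trials=200):
        # Place nodes * num_tokens random tokens on a unit ring and report
        # the worst ratio of a node's ownership to its fair share (1/nodes).
        worst = 0.0
        for _ in range(trials):
            ring = sorted((random.random(), n)
                          for n in range(nodes) for _ in range(num_tokens))
            own = [0.0] * nodes
            for i, (tok, node) in enumerate(ring):
                prev = ring[i - 1][0] if i else ring[-1][0] - 1.0
                own[node] += tok - prev  # a token owns the range preceding it
            worst = max(worst, max(own) * nodes)
        return worst

    for t in (4, 8, 16, 256):
        print(t, round(worst_imbalance(100, t), 2))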


