About mergeColumnFamilies when creating tables
Hi,

Recently I have been checking the performance of creating tables. I found that the MigrationStage thread spends most of its time in the "mergeColumnFamilies(oldColumnFamilies, newColumnFamilies)" method. Can someone explain the purpose of merging the old and new column families to compute a diff when Cassandra already knows which table should be created? Is this expected behaviour? If Cassandra applies the new table's CFMetaData directly, without the merge step, table creation becomes significantly faster.

Machine: OS X, Intel i5, 12 GB RAM, SSD, one Cassandra node.
- 2.1.14, creating 1500 tables: 21 min.
- 2.1.14, creating 1500 tables without the CF merge: 9 min.

As the number of tables grows, the time difference gets larger. The attachment shows the CPU time spent by MigrationStage in "mergeColumnFamilies(oldColumnFamilies, newColumnFamilies)".

Thank you.
Jasonstack

[image: cpu_time.png]
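For anyone who wants to reproduce this kind of measurement, a minimal client-side sketch is below. It assumes the DataStax Java driver and a single local node; the keyspace and table names are invented, and it is not the exact harness behind the numbers above, just the general shape of the test.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class CreateTableBench {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();
            session.execute("CREATE KEYSPACE IF NOT EXISTS bench WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 1}");

            long start = System.nanoTime();
            // Each CREATE TABLE is a separate schema change handled on the server's
            // MigrationStage thread, which is where the profile above points.
            for (int i = 0; i < 1500; i++) {
                session.execute("CREATE TABLE bench.t" + i + " (id int PRIMARY KEY, val text)");
            }
            long elapsedSec = (System.nanoTime() - start) / 1_000_000_000L;
            System.out.println("Created 1500 tables in " + elapsedSec + " s");

            cluster.close();
        }
    }

Since every CREATE TABLE is its own schema change, whatever the merge costs per change is paid 1500 times here, which matches the observation that the gap grows with the table count.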
Support Multi-Tenant in Cassandra
Hi,

May I ask whether there is any plan to extend the functionality related to multi-tenancy?

Our current approach is to define an extra partition key column called "tenant_id". In my use cases, all tenants have the same table schemas.

* For security isolation: we customized the GRANT statement so that user queries can be restricted to their "tenant_id" partition.
* For getting all data of a single tenant: we customized the SELECT statement to support ALLOW FILTERING on the "tenant_id" partition key column.
* For server resource isolation: I have no idea how to do this.
* For per-tenant backup and restore: I tried a tenant_base_compaction_strategy that splits SSTables by tenant_id, but it turned out to be very inefficient.

What is the community's opinion about submitting those patches to Cassandra? It would be great if you could share the ideal multi-tenant architecture for Cassandra.

jasonstack
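To make the layout concrete, here is a rough sketch of the shared-schema approach described above, using the DataStax Java driver. The keyspace, table, and column names are invented for illustration, and the GRANT customization is our own patch, so it is not shown.

    import java.util.UUID;

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class TenantIdLayoutSketch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();
            session.execute("CREATE KEYSPACE IF NOT EXISTS app WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 1}");

            // tenant_id is an extra component of the composite partition key, so the
            // normal access path always supplies the full key.
            session.execute("CREATE TABLE IF NOT EXISTS app.orders ("
                    + "tenant_id text, order_id uuid, total decimal, "
                    + "PRIMARY KEY ((tenant_id, order_id)))");

            // Per-row access: full partition key, no filtering involved.
            session.execute("SELECT * FROM app.orders WHERE tenant_id = ? AND order_id = ?",
                    "acme", UUID.randomUUID());

            // "All data of a single tenant" restricts only part of the partition key,
            // so it needs ALLOW FILTERING and has to touch every replica. Stock
            // Cassandra rejects this form of the query, which is why the customized
            // SELECT mentioned above is needed at all.
            session.execute("SELECT * FROM app.orders WHERE tenant_id = 'acme' ALLOW FILTERING");

            cluster.close();
        }
    }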
Re: Support Multi-Tenant in Cassandra
We considered splitting by keyspace or by table before, but a Cassandra table is a costly structure (more CPU, flushing, memory, ...). In our use case we expect more than 50 tenants on the same cluster.

> As it was already mentioned in the ticket itself, filtering is a highly
> inefficient operation.

I totally agree, but it is still better to have the data filtered on the server side rather than on the client side.

How about adding a logical tenant concept to Cassandra? All logical tenants would share the same table schemas, but their queries/storage would be separated.

Oleksandr Petrov wrote on Friday, July 15, 2016 at 4:28 PM:

> There's a ticket on filtering (#11031), although I would not count on
> filtering in production.
>
> As it was already mentioned in the ticket itself, filtering is a highly
> inefficient operation. It was thought of as an aid for people who are
> exploring data and/or can structure the query in such a way that it will at
> least be local (for example, with an IN or EQ query on the partition key
> and filtering out results from the small partition). However, filtering on
> the partition key means that _every_ replica has to be queried for the
> results, as we do not know which partitions are going to be holding the
> data. Having every query in your system rely on filtering, a big amount of
> data, and high load will eventually have a substantial negative impact on
> performance.
>
> I'm not sure what amount of tenants you're working with, but I've seen
> setups where tenancy was solved by using multiple keyspaces, which helps to
> completely isolate the data and avoid filtering. Given that you've tried
> splitting sstables on tenant_id, that might be solved by using multiple
> keyspaces. This will also help with server resource isolation and most of
> the issues you've raised.
>
> On Fri, Jul 15, 2016 at 10:10 AM, Romain Hardouin wrote:
>
> > I don't use C* in such a context but out of curiosity did you set
> > the request_scheduler to RoundRobin or did you implement your own
> > scheduler?
> > Romain
>
> --
> Alex Petrov
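For comparison, the keyspace-per-tenant approach suggested in the quoted reply would look roughly like the sketch below (the tenant names and the schema are invented). The per-table overhead mentioned at the top of this mail is exactly what makes it expensive once the tenant count passes a few dozen.

    import java.util.Arrays;
    import java.util.List;

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class KeyspacePerTenantSketch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();

            // One keyspace per tenant, each holding an identical set of tables.
            // Data is fully isolated, queries never need a tenant_id column or
            // filtering, and backup/restore can be done per keyspace.
            List<String> tenants = Arrays.asList("acme", "globex", "initech");
            for (String tenant : tenants) {
                session.execute("CREATE KEYSPACE IF NOT EXISTS tenant_" + tenant
                        + " WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
                // Every table is duplicated per tenant, which is where the per-table
                // cost (memtables, flushes, metadata) multiplies once the cluster
                // hosts 50+ tenants.
                session.execute("CREATE TABLE IF NOT EXISTS tenant_" + tenant
                        + ".orders (order_id uuid PRIMARY KEY, total decimal)");
            }

            cluster.close();
        }
    }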
Re: Support Multi-Tenant in Cassandra
Hi Romain,

Thanks for the reply.

> request_scheduler

It is a legacy feature that only works with the Thrift API. It would be great to have some sort of scheduling per user/role, but scheduling at the request level only provides limited isolation: if the JVM crashes because of one tenant's invalid request (e.g. inserting a blob into a collection column), it will be awful.

Thank you.

jason zhao yang wrote on Saturday, August 6, 2016 at 12:33 PM:

> We considered splitting by keyspace or by table before, but a Cassandra
> table is a costly structure (more CPU, flushing, memory, ...).
>
> In our use case we expect more than 50 tenants on the same cluster.
>
> > As it was already mentioned in the ticket itself, filtering is a highly
> > inefficient operation.
>
> I totally agree, but it is still better to have the data filtered on the
> server side rather than on the client side.
>
> How about adding a logical tenant concept to Cassandra? All logical
> tenants would share the same table schemas, but their queries/storage
> would be separated.
Re: Rough roadmap for 4.0
Hi,

Will we still use tick-tock releases for 4.x and 4.0.x?

Stefan Podkowinski wrote on Wednesday, November 16, 2016 at 4:52 PM:

> From my understanding, this will also affect the EOL dates of other
> branches:
>
> "We will maintain the 2.2 stability series until 4.0 is released, and 3.0
> for six months after that."
>
> On Wed, Nov 16, 2016 at 5:34 AM, Nate McCall wrote:
>
> > Agreed. As long as we have a goal I don't see why we have to adhere to an
> > arbitrary date for 4.0.
> >
> > On Nov 16, 2016 1:45 PM, "Aleksey Yeschenko" wrote:
> >
> > > I'll comment on the broader issue, but right now I want to elaborate on
> > > the 3.11/January/arbitrary cutoff date.
> > >
> > > It doesn't matter what the original plan was. We should continue with
> > > 3.X until all the 4.0 blockers have been committed - and there are
> > > quite a few of them remaining yet.
> > >
> > > So given all the holidays, and the tickets remaining, I'll personally
> > > be surprised if 4.0 comes out before February/March and 3.13/3.14. Nor
> > > do I think that's an issue.
> > >
> > > --
> > > AY
> > >
> > > On 16 November 2016 at 00:39:03, Mick Semb Wever (m...@thelastpickle.com)
> > > wrote:
> > >
> > > On 4 November 2016 at 13:47, Nate McCall wrote:
> > >
> > > > Specifically, this should be "new stuff that could/will break
> > > > things", given we are upping the major version.
> > >
> > > How does this coordinate with the tick-tock versioning¹ leading up to
> > > the 4.0 release?
> > >
> > > To just stop tick-tock and then say yeehaa, let's jam in all the
> > > breaking changes we really want, seems to be throwing away some of the
> > > learnt wisdom, and not doing a very sane transition from tick-tock to
> > > features/testing/stable². I really hope all this is done in a way that
> > > continues us down the path towards a stable master.
> > >
> > > For example, are we fixing the release of 4.0 to November? Or
> > > continuing tick-tocks until we complete the 4.0 roadmap? Or starting
> > > the features/testing/stable branching approach with 3.11?
> > >
> > > Background:
> > > ¹) Sylvain wrote in an earlier thread titled "A Home for 4.0":
> > >
> > > > And as 4.0 was initially supposed to come after 3.11, which is
> > > > coming, it's probably time to have a home for those tickets.
> > >
> > > ²) The new versioning scheme slated for 4.0, per the "Proposal - 3.5.1"
> > > thread:
> > >
> > > > three branch plan with "features", "testing", and "stable" starting
> > > > with 4.0?
> > >
> > > Mick
Re: [VOTE] Ask Infra to move github notification emails to pr@
+1

On Tue, 21 Mar 2017 at 9:36 AM, Jonathan Haddad wrote:

> +1
>
> On Mon, Mar 20, 2017 at 6:33 PM, Jason Brown wrote:
>
> > +1
> >
> > On Mon, Mar 20, 2017 at 18:21, Anthony Grasso wrote:
> >
> > > +1
> > >
> > > On 21 March 2017 at 09:32, Jeff Jirsa wrote:
> > >
> > > > There's no reason for the dev list to get spammed every time there's
> > > > a github PR. We know most of the time we prefer JIRAs for real code
> > > > PRs, but with docs being in tree and a low barrier to entry, we may
> > > > want to accept docs through PRs
> > > > (see https://issues.apache.org/jira/browse/CASSANDRA-13256, and
> > > > comment on it if you disagree).
> > > >
> > > > To make that viable, we should make it not spam dev@ with every
> > > > comment. Therefore I propose we move github PR comments/actions to
> > > > pr@ so as not to clutter the dev@ list.
> > > >
> > > > Voting to remain open for 72 hours.
> > > >
> > > > - Jeff