About MergeColumnFamilies when creating table

2016-05-07 Thread jason zhao yang
Hi,

Recently I have been checking the performance of creating tables.

I found that the MigrationStage thread spends most of its time in the
"mergeColumnFamilies(oldColumnFamilies, newColumnFamilies)" method.

Can someone explain the purpose of merging the old CFs and new CFs to find the
diff between old and new, when C* already knows which table should be created?

Is this expected behaviour?

If C* directly applies the new table's CFMetaData without the merge step, the
performance of creating tables improves significantly.
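
To make the concern concrete, here is a purely hypothetical sketch (made-up
types and method names, not the actual CFMetaData/Schema code) of the
difference between a merge-style diff and directly applying the one definition
we already know is new:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SchemaMergeSketch
{
    // O(total number of tables) per schema change: every existing table is visited
    static List<String> mergeStyleDiff(Map<String, String> oldTables, Map<String, String> newTables)
    {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, String> e : newTables.entrySet())
        {
            String previous = oldTables.get(e.getKey());
            if (previous == null || !previous.equals(e.getValue()))
                changed.add(e.getKey()); // created or altered table
        }
        return changed;
    }

    // O(1) per CREATE TABLE: apply only the definition we already know is new
    static void directApply(Map<String, String> tables, String name, String definition)
    {
        tables.put(name, definition);
    }
}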

Machine: OS X, Intel i5, 12GB RAM, SSD, one Cassandra node.

2.1.14, creating 1,500 tables: 21 min.
2.1.14, creating 1,500 tables without the CF merge: 9 min.

As the number of tables grows, the time difference gets larger.
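
For reference, this is roughly how I measured it (a minimal sketch assuming the
DataStax Java driver 3.x and a local single-node cluster; the keyspace and
table names are just placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CreateTableBench
{
    public static void main(String[] args)
    {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();
        session.execute("CREATE KEYSPACE IF NOT EXISTS bench WITH replication = "
                        + "{'class': 'SimpleStrategy', 'replication_factor': 1}");

        long start = System.nanoTime();
        for (int i = 0; i < 1500; i++)
        {
            // each CREATE TABLE is a separate schema change handled on MigrationStage
            session.execute("CREATE TABLE bench.t" + i + " (id text PRIMARY KEY, value text)");
        }
        long seconds = (System.nanoTime() - start) / 1_000_000_000L;
        System.out.println("created 1500 tables in " + seconds + " s");

        cluster.close();
    }
}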

The attachment shows the CPU time spent by MigrationStage in
"mergeColumnFamilies(oldColumnFamilies, newColumnFamilies)".

Thank you.
Jasonstack

[image: cpu_time.png]


Support Multi-Tenant in Cassandra

2016-07-14 Thread jason zhao yang
Hi,

May I ask whether there is any plan to extend the functionality related to
multi-tenancy?

Our current approach is to define an extra partition key column called
"tenant_id". In my use cases, all tenants have the same table schemas.

* For security isolation: we customized the GRANT statement so that it can
restrict a user's queries to their "tenant_id" partition.

* For getting all data of a single tenant: we customized the SELECT statement
to support ALLOW FILTERING on the "tenant_id" partition key (see the sketch
after this list).

* For server resource isolation: I have no idea how to achieve this.

* For per-tenant backup and restore: I tried a tenant-based compaction
strategy to split SSTables by tenant_id, but it turned out to be very
inefficient.
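
A minimal sketch of how this looks from the client side (DataStax Java driver
assumed; the keyspace, table, and column names are only illustrative):

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

class TenantAccess
{
    static void createSharedTable(Session session)
    {
        // One table shared by all tenants; tenant_id is part of the composite partition key.
        session.execute("CREATE TABLE IF NOT EXISTS app.orders ("
                        + " tenant_id text,"
                        + " order_id uuid,"
                        + " amount decimal,"
                        + " PRIMARY KEY ((tenant_id, order_id)))");
    }

    static ResultSet orderForTenant(Session session, String tenantId, java.util.UUID orderId)
    {
        // Normal per-tenant access always pins tenant_id, so it stays a single-partition read.
        return session.execute("SELECT * FROM app.orders WHERE tenant_id = ? AND order_id = ?",
                               tenantId, orderId);
    }

    static ResultSet allDataForTenant(Session session, String tenantId)
    {
        // "All data of a single tenant" restricts only part of the partition key; this is
        // the query shape our customized SELECT / ALLOW FILTERING supports, and it has to
        // be answered by every replica.
        return session.execute("SELECT * FROM app.orders WHERE tenant_id = ? ALLOW FILTERING",
                               tenantId);
    }
}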

What is the community's opinion on submitting those patches to Cassandra? It
would be great if you could share the ideal multi-tenant architecture for
Cassandra.

jasonstack


Re: Support Multi-Tenant in Cassandra

2016-08-05 Thread jason zhao yang
We considered splitting by keyspace or table before, but a Cassandra table is
a costly structure (more CPU, flushing, memory, etc.).

In our use case, we expect to have more than 50 tenants on the same cluster.

> As it was already mentioned in the ticket itself, filtering is a highly
> inefficient operation.
I totally agree, but it is still good to have data filtered on the server
side rather than on the client side.

How about adding a logical tenant concept in Cassandra? All logical tenants
would share the same table schemas, but their queries and storage would be
separated.


Oleksandr Petrov wrote on Fri, 15 Jul 2016 at 16:28:

> There's a ticket on filtering (#11031), although I would not count on
> filtering in production.
>
> As it was already mentioned in the ticket itself, filtering is a highly
> inefficient operation. It was thought of as an aid for people who are
> exploring data and/or can structure the query in such a way that it will at
> least be local (for example, an IN or EQ query on the partition key, with
> filtering applied to results from a small partition). However, filtering on
> the partition key means that _every_ replica has to be queried for the
> results, as we do not know which partitions hold the data. Having every
> query in your system rely on filtering, with a big amount of data and high
> load, will eventually have a substantial negative impact on performance.
>
> I'm not sure how many tenants you're working with, but I've seen setups
> where tenancy was solved by using multiple keyspaces, which helps to
> completely isolate the data and avoid filtering. Given that you've tried
> splitting sstables on tenant_id, that might also be solved with multiple
> keyspaces. This would also help with server resource isolation and most of
> the issues you've raised.
>
>
> On Fri, Jul 15, 2016 at 10:10 AM Romain Hardouin
>  wrote:
>
> > I don't use C* in such a context but out of curiosity did you set
> > the request_scheduler to RoundRobin or did you implement your own
> scheduler?
> > Romain
> > On Friday, 15 July 2016 at 8:39, jason zhao yang <
> > zhaoyangsingap...@gmail.com> wrote:
> >
> >
> >  Hi,
> >
> > May I ask is there any plan of extending functionalities related to
> > Multi-Tenant?
> >
> > Our current approach is to define an extra PartitionKey called
> "tenant_id".
> > In my use cases, all tenants will have the same table schemas.
> >
> > * For security isolation: we customized GRANT statement to be able to
> > restrict user query based on the "tenant_id" partition.
> >
> > * For getting all data of single tenant, we customized SELECT statement
> to
> > support allow filtering on "tenant_id" partition key.
> >
> > * For server resource isolation, I have no idea how to.
> >
> > * For per-tenant backup restore, I was trying a
> > tenant_base_compaction_strategy to split sstables based on tenant_id. it
> > turned out to be very inefficient.
> >
> > What's community's opinion about submitting those patches to Cassandra?
> It
> > will be great if you guys can share the ideal Multi-Tenant architecture
> for
> > Cassandra?
> >
> > jasonstack
> >
> >
> >
>
> --
> Alex Petrov
>


Re: Support Multi-Tenant in Cassandra

2016-09-09 Thread jason zhao yang
Hi Romain,

Thanks for the reply.

> request_scheduler

It is a legacy feature which only works with the Thrift API.

It would be great to have some sort of scheduling per user/role, but
scheduling at the request level will only provide limited isolation. If the
JVM crashes because of one tenant's invalid request (e.g. inserting a blob
into a collection column), it will be awful.


Thank you.

jason zhao yang wrote on Sat, 6 Aug 2016 at 12:33:

> We consider splitting by Keypspace or tables before, but Cassandra's table
> is a costly structure(more cpu, flush, memory..).
>
> In our use case, it's expected to have more than 50 tenants on same
> cluster.
>
> > As it was already mentioned in the ticket itself, filtering is a highly 
> > inefficient
> operation.
> I totally agree, but it's to good to have data filtered on server sider,
> rather than client side..
>
> How about adding a logical tenant concept in Cassandra?  all logical
> tenants will share the same table schemas, but queries/storage are
> separated?


Re: Rough roadmap for 4.0

2016-11-17 Thread jason zhao yang
Hi,

Will we still use the tick-tock release process for 4.x and 4.0.x?

Stefan Podkowinski wrote on Wed, 16 Nov 2016 at 16:52:

> From my understanding, this will also affect EOL dates of other branches.
>
> "We will maintain the 2.2 stability series until 4.0 is released, and 3.0
> for six months after that.".
>
>
> On Wed, Nov 16, 2016 at 5:34 AM, Nate McCall  wrote:
>
> > Agreed. As long as we have a goal I don't see why we have to adhere to
> > arbitrary date for 4.0.
> >
> > On Nov 16, 2016 1:45 PM, "Aleksey Yeschenko" 
> wrote:
> >
> > > I’ll comment on the broader issue, but right now I want to elaborate on
> > > 3.11/January/arbitrary cutoff date.
> > >
> > > Doesn’t matter what the original plan was. We should continue with 3.X
> > > until all the 4.0 blockers have been
> > > committed - and there are quite a few of them remaining yet.
> > >
> > > So given all the holidays, and the tickets remaining, I’ll personally
> be
> > > surprised if 4.0 comes out before
> > > February/March and 3.13/3.14. Nor do I think it’s an issue.
> > >
> > > —
> > > AY
> > >
> > > On 16 November 2016 at 00:39:03, Mick Semb Wever (
> m...@thelastpickle.com
> > )
> > > wrote:
> > >
> > > On 4 November 2016 at 13:47, Nate McCall  wrote:
> > >
> > > > Specifically, this should be "new stuff that could/will break things"
> > > > given we are upping
> > > > the major version.
> > > >
> > >
> > >
> > > How does this co-ordinate with the tick-tock versioning¹ leading up to
> > the
> > > 4.0 release?
> > >
> > > To just stop tick-tock and then say yeehaa let's jam in all the
> breaking
> > > changes we really want seems to be throwing away some of the learnt
> > wisdom,
> > > and not doing a very sane transition from tick-tock to
> > > features/testing/stable². I really hope all this is done in a way that
> > > continues us down the path towards a stable-master.
> > >
> > > For example, are we fixing the release of 4.0 to November? or
> continuing
> > > tick-tocks until we complete the 4.0 roadmap? or starting the
> > > features/testing/stable branching approach with 3.11?
> > >
> > >
> > > Background:
> > > ¹) Sylvain wrote in an earlier thread titled "A Home for 4.0"
> > >
> > > > And as 4.0 was initially supposed to come after 3.11, which is
> coming,
> > > it's probably time to have a home for those tickets.
> > >
> > > ²) The new versioning scheme slated for 4.0, per the "Proposal - 3.5.1"
> > > thread
> > >
> > > > three branch plan with “features”, “testing”, and “stable” starting
> > with
> > > 4.0?
> > >
> > >
> > > Mick
> > >
> >
>


Re: [VOTE] Ask Infra to move github notification emails to pr@

2017-03-20 Thread jason zhao yang
+1
On Tue, 21 Mar 2017 at 9:36 AM, Jonathan Haddad  wrote:

> +1
> On Mon, Mar 20, 2017 at 6:33 PM Jason Brown  wrote:
>
> > +1
> > On Mon, Mar 20, 2017 at 18:21 Anthony Grasso 
> > wrote:
> >
> > > +1
> > >
> > > On 21 March 2017 at 09:32, Jeff Jirsa  wrote:
> > >
> > > > There's no reason for the dev list to get spammed every time there's a
> > > > github PR. We know most of the time we prefer JIRAs for real code
> PRs,
> > > but
> > > > with docs being in tree and low barrier to entry, we may want to
> accept
> > > > docs through PRs ( see https://issues.apache.org/
> > > > jira/browse/CASSANDRA-13256
> > > > , and comment on it if you disagree).
> > > >
> > > > To make that viable, we should make it not spam dev@ with every
> > comment.
> > > > Therefore I propose we move github PR comments/actions to pr@ so as
> > > > not to clutter the dev@ list.
> > > >
> > > > Voting to remain open for 72 hours.
> > > >
> > > > - Jeff
> > > >
> > >
> >
>