RE: Cassandra Mutation object decoding

2016-11-23 Thread Jacques-Henri Berthemet
I worked on a custom PerRowSecondaryIndex to synchronize data to another system 
and there was no way to know the difference between an insert and an update at 
this level (I'm using Cassandra 2.2). The solution for me was simply to use an
upsert operation in the target system.

There is also the QueryHandler interface that would allow you to intercept the 
queries, but I found it difficult to use because you have to write your own
query interpreter.

However, with Cassandra 3.x the Index interface was refactored and you now have
insert and update methods; you even get both the old and new rows in the update!
Check the SASI index implementation: 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/SASIIndex.java
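As an illustration of those 3.x-style hooks, here is a minimal sketch from an implementer's point of view. The Row and Indexer types below are simplified stand-ins invented for the example, not the real org.apache.cassandra.index.Index API (which is considerably richer):

```java
import java.util.ArrayList;
import java.util.List;

public class SyncIndexSketch {
    // Simplified stand-in for a Cassandra row: just a key and one value.
    static final class Row {
        final String key, value;
        Row(String key, String value) { this.key = key; this.value = value; }
    }

    // Hypothetical reduction of the 3.x Index.Indexer callbacks: unlike the
    // 2.2 PerRowSecondaryIndex, the update callback sees old AND new rows.
    interface Indexer {
        void insertRow(Row row);
        void updateRow(Row oldRow, Row newRow);
        void removeRow(Row row);
    }

    // An indexer that mirrors changes into a target system as upserts/deletes.
    static class ReplicatingIndexer implements Indexer {
        final List<String> targetOps = new ArrayList<>();

        public void insertRow(Row row) {
            targetOps.add("UPSERT " + row.key + "=" + row.value);
        }
        public void updateRow(Row oldRow, Row newRow) {
            // The previous value is available here, which 2.2 never exposed.
            targetOps.add("UPSERT " + newRow.key + "=" + newRow.value
                          + " (was " + oldRow.value + ")");
        }
        public void removeRow(Row row) {
            targetOps.add("DELETE " + row.key);
        }
    }
}
```

With the 2.2 PerRowSecondaryIndex there is effectively a single indexing callback, which is why the insert/update distinction had to collapse into an upsert on the target system.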

If I'm not mistaken, Mutations are only used with Thrift and don't carry the
notion of inserts or updates for column families. CQL queries currently use the
native protocol:
https://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=doc/native_protocol_v3.spec.
If you want to stay at the protocol level you need to check the native protocol
implementation and how to decode it.


--
Jacques-Henri Berthemet

-Original Message-
From: J. D. Jordan [mailto:jeremiah.jor...@gmail.com] 
Sent: mercredi 23 novembre 2016 07:13
To: dev@cassandra.apache.org
Subject: Re: Cassandra Mutation object decoding

You may also want to look at the triggers interface.

> On Nov 22, 2016, at 7:45 PM, Chris Lohfink  wrote:
> 
> There are different kinds of tombstones: a partition tombstone is held in
> the MutableDeletionInfo of the PartitionUpdate, which you can get from the
> deletionInfo() method (it returns the private deletionInfo field from the
> holder). There are also row and cell deletions, so you have to handle each of
> those. It can be a little non-trivial to work backwards, but all the
> information is there in the Mutation (if you have full access to cfmetadata
> etc); it may be easier to go directly to whatever output you're looking for.
> 
> A lot of the metadata is driven from the system tables though, so you either
> need to be on a Cassandra node or rebuild them with the same cfids. In
> sstable-tools we rebuild the tables from the
> source cluster to parse the Mutations for replaying/viewing hints and
> commitlogs. But that had to be a bit (massively) hackier since it's for 3.7,
> before CASSANDRA-8844.
> It's definitely possible but not easy (which is probably why it hasn't been
> added in yet).
> 
> Chris
> 
> On Tue, Nov 22, 2016 at 6:59 PM, Sanal Vasudevan 
> wrote:
> 
>> Hi Benjamin,
>> 
>> Nice to hear from you.
>> 
>> My goal is to reconstruct the CQL operation from the Mutation object.
>> So that I can trigger the same action on another NoSQL target like MongoDB.
>> 
>> Please let me know if you have any ideas.
>> 
>> Many thanks.
>> Sanal
>> 
>> 
>> On Tue, Nov 22, 2016 at 7:28 PM, Benjamin Lerer <
>> benjamin.le...@datastax.com
>>> wrote:
>> 
>>> Hi Sanal,
>>> 
>>> What you want to do is not easy, and it might break with new major
>>> releases.
>>> 
>>> My question would be: why do you want to do that? There might be another
>>> way to reach the same goal.
>>> 
>>> Benjamin
>>> 
>>> On Mon, Nov 21, 2016 at 7:14 PM, Sanal Vasudevan 
>>> wrote:
>>> 
 Thank you Vladimir.
 Anyone else has any other ideas as to how this can be done?
 
 
 Many thanks,
 Sanal
 
 
 On Sun, Nov 20, 2016 at 4:46 AM, Vladimir Yudovin <
>> vla...@winguzone.com>
 wrote:
 
> Hi Sanal,
> 
> 
> 
> > do we have metadata inside the Mutation object to decode whether the CQL
> > was an INSERT or UPDATE operation?
> 
> I'm not sure it's possible to distinguish them - both of them just add
> data to an SSTable.
> 
> 
> 
> 
> 
> Best regards, Vladimir Yudovin,
> 
> Winguzone - Hosted Cloud Cassandra
> Launch your cluster in minutes.
> 
> 
> 
> 
> 
> On Fri, 18 Nov 2016 15:55:00 -0500, Sanal Vasudevan
> <get2sa...@gmail.com> wrote:
> 
> 
> 
> 
> Hi there,
> 
> 
> 
> I am trying to read the commit logs to decode the original CQL which was
> used.
> 
> I have got to the point where an implementation of CommitLogReadHandler is
> able to push back Mutation objects from the commit logs.
> 
> 
> 
> Questions:
> 
> 1) CQL: delete from myks.mytable where key1 = 1;
> 
> For the above CQL, the Mutation object has zero objects of
> org.apache.cassandra.db.rows.Row inside the PartitionUpdate object.
> 
> Is this the only way to detect a DELETE operation, or do we have any other
> metadata to indicate a DELETE operation?
> 
>     mutation.getPartitionUpdates().forEach(rows -> {
>         if (rows.isEmpty())
>             System.out.println("May be a DELETE operation");
>     });

Re: Cassandra Mutation object decoding

2016-11-23 Thread Benjamin Lerer
>
> My goal is to reconstruct the CQL operation from the Mutation object.
> So that I can trigger the same action on another NoSQL target like MongoDB.
>

There are different ways of keeping your 2 databases in sync. Unfortunately,
they all have some trade-offs (as always ;-))


   1. If you have control on the client side, you could wrap the driver
   and add some code that converts the query and writes it to the other
   database at the same time. The main problem with that approach is that a
   write can succeed on one of the databases but not on the other, which
   means that you will need a mechanism to resolve those conflicts.
   2. On the Cassandra side you could, as Nate suggested, extend the
   QueryProcessor in order to log the mutations to a log file. As the
   QueryProcessor has access to the prepared statement cache and to the bind
   parameters, you should be able to extract the information you need. Some
   of the problems of that approach are:
      1. You cannot reprocess already inserted data.
      2. You will probably have to use a replication log to deal with the
      cases where the other database is unreachable.
      3. It might slow down your query processing and take some of your
      bandwidth at critical times (heavy writes).
   3. Use a fake index as Jacques-Henri suggested. It will allow you to
   easily reprocess already inserted data, so you will not need a
   replication log (at the same time, having to rebuild the index might slow
   down your database). The main issues with that solution are:
      1. All the tables that you want to replicate will have to have that
      index, and you cannot automatically update the schemas on your other
      database.
      2. It might slow down your query processing and take some of your
      bandwidth at critical times (heavy writes).
   4. Read the commitlogs to recreate the mutation statements (your initial
   approach). The main problem is that it is simply not easy to do and might
   break with new major releases. You will also have to make sure that the
   files do not disappear before you have processed them.
   5. Try a Data Warehouse/ETL approach to synchronize your data.
   CASSANDRA-8844 added support for CDC (Change Data Capture), which might
   help you there. Unfortunately, I have not really worked on it so I cannot
   help you much there.

There might be some other approaches worth considering, but they did not come
to mind.
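Option 1 above (wrapping the driver) can be sketched roughly as below. Database is a hypothetical stand-in for a real driver session, and the retry queue stands in for the conflict-resolution mechanism mentioned in the list; this is a sketch of the idea, not a production pattern:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class DualWriteSketch {
    // Hypothetical stand-in for a driver session to either store.
    interface Database { void execute(String statement) throws Exception; }

    // Writes to the primary first, then best-effort to the secondary.
    // Failed secondary writes are queued for later reconciliation; this is
    // exactly the trade-off noted above: the two stores can diverge until
    // the queue is replayed.
    static class DualWriter {
        final Database primary, secondary;
        final Deque<String> retryQueue = new ArrayDeque<>();

        DualWriter(Database primary, Database secondary) {
            this.primary = primary;
            this.secondary = secondary;
        }

        void execute(String statement) throws Exception {
            primary.execute(statement);      // a primary failure propagates
            try {
                secondary.execute(statement);
            } catch (Exception e) {
                retryQueue.add(statement);   // resolve later, out of band
            }
        }
    }
}
```

The essential problem is visible in execute(): the primary write can succeed while the secondary one fails, so some out-of-band replay of the queue is unavoidable.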

Hope it helps

Benjamin

PS: MongoDB ... Seriously ??? ;-)


Re: data not replicated on new node

2016-11-23 Thread Oleksandr Shulgin
On Tue, Nov 22, 2016 at 5:23 PM, Bertrand Brelier <
bertrand.brel...@gmail.com> wrote:

> Hello Shalom.
>
> No I really went from 3.1.1 to 3.0.9 .
>
So you've just installed the 3.0.9 version and re-started with it?  I
wonder if it's really supported?

Regards,
--
Alex


Re: data not replicated on new node

2016-11-23 Thread Malte Pickhan
Not sure if it's really related, but we experienced something similar last
Friday. I summarized it in the following issue:

https://issues.apache.org/jira/browse/CASSANDRA-12947

Best,

Malte
2016-11-23 10:21 GMT+01:00 Oleksandr Shulgin :

> On Tue, Nov 22, 2016 at 5:23 PM, Bertrand Brelier <
> bertrand.brel...@gmail.com> wrote:
>
>> Hello Shalom.
>>
>> No I really went from 3.1.1 to 3.0.9 .
>>
> So you've just installed the 3.0.9 version and re-started with it?  I
> wonder if it's really supported?
>
> Regards,
> --
> Alex
>
>


Wiki Contributor

2016-11-23 Thread Thomas Brown
Hi,

Could I please be added as a Wiki Contributor

Username: Thomas Brown

Thank you!

-- 
*Thomas Brown*
*Digital Marketing Associate *
m: +61427864972






STCS in L0 behaviour

2016-11-23 Thread Marcus Olsson

Hi everyone,

TL;DR
Should LCS be changed to always prefer an STCS compaction in L0 if it's 
falling behind? Assuming that STCS in L0 is enabled.
Currently LCS seems to check if there is a possible L0->L1 compaction 
before checking if it's falling behind, which in our case used between 
15-30% of the compaction thread CPU.

TL;DR

So first some background:
We have an Apache Cassandra 2.2 cluster running with a high load. In that 
cluster there is a table with a moderate amount of writes per second 
that is using LeveledCompactionStrategy. The test was to run repair on 
that table while we monitored the cluster through JMC and with Flight 
Recordings enabled. This resulted in a large amount of sstables for that 
table, which I assume others have experienced as well. In this case I 
think it was between 15-20k.


From the Flight Recording one thing we saw was that 15-30% of the CPU 
time in each of the compaction threads was spent on 
"getNextBackgroundTask()" which retrieves the next compaction job. With 
some further investigation this seems to mostly be when it's checking 
for overlap in L0 sstables before performing an L0->L1 compaction. There 
is a JIRA which seems to be related to this 
https://issues.apache.org/jira/browse/CASSANDRA-11571 which we 
backported to 2.2 and tested. In our testing it seemed to improve the 
situation but it was still using noticeable CPU.


My interpretation of the current logic of LCS is (if STCS in L0 is enabled):
1. Check each level (L1+)
 - If a L1+ compaction is needed check if L0 is behind and do STCS if 
that's the case, otherwise do the L1+ compaction.
2. Check L0 -> L1 compactions and if none is needed/possible check for 
STCS in L0.


My proposal is to change this behavior to always check if L0 is far 
behind first and do a STCS compaction in that case. This would avoid the 
overlap check for L0 -> L1 compactions when L0 is behind and I think it 
makes sense since we already prefer STCS to L1+ compactions. This would 
not solve the repair situation, but it would lower some of the impact 
that repair has on LCS.


For what version this could get in I think trunk would be enough since 
compaction is pluggable.
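For illustration, the proposed reordering can be sketched as a small decision function. The threshold value and the type names here are invented for the example and do not match LeveledManifest's actual fields:

```java
public class CompactionOrderSketch {
    // Assumed backlog threshold, NOT Cassandra's real constant.
    static final int L0_BEHIND_THRESHOLD = 32;

    enum Choice { STCS_IN_L0, L0_TO_L1, LEVELED, NONE }

    // Current (simplified) order runs the expensive L0 overlap check before
    // asking whether L0 is far behind, even when STCS would win anyway.
    // Proposed order: check the L0 backlog first and short-circuit to STCS,
    // skipping the L0->L1 overlap scan entirely when L0 is behind.
    static Choice nextTask(int l0SstableCount, boolean l1PlusCompactionNeeded) {
        if (l0SstableCount > L0_BEHIND_THRESHOLD)
            return Choice.STCS_IN_L0;        // L0 is behind: do STCS in L0
        if (l1PlusCompactionNeeded)
            return Choice.LEVELED;           // normal L1+ leveled compaction
        if (l0SstableCount > 0)
            return Choice.L0_TO_L1;          // promote L0 into L1
        return Choice.NONE;
    }
}
```

With 15-20k sstables in L0 after repair, the first branch would fire on every call, which is the point of the proposal.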



--

*MARCUS OLSSON *
Software Developer

*Ericsson*
Sweden
marcus.ols...@ericsson.com 
www.ericsson.com 



Re: STCS in L0 behaviour

2016-11-23 Thread Jeff Jirsa
Without yet reading the code, what you describe sounds like a reasonable 
optimization / fix, suitable for 3.0+ (probably not 2.2, definitely not 2.1)

-- 
Jeff Jirsa


> On Nov 23, 2016, at 7:52 AM, Marcus Olsson  wrote:
> 
> Hi everyone,
> 
> TL;DR
> Should LCS be changed to always prefer an STCS compaction in L0 if it's 
> falling behind? Assuming that STCS in L0 is enabled.
> Currently LCS seems to check if there is a possible L0->L1 compaction before 
> checking if it's falling behind, which in our case used between 15-30% of the 
> compaction thread CPU.
> TL;DR
> 
> So first some background:
> We have a Apache Cassandra 2.2 cluster running with a high load. In that 
> cluster there is a table with a moderate amount of writes per second that is 
> using LeveledCompactionStrategy. The test was to run repair on that table 
> while we monitored the cluster through JMC and with Flight Recordings 
> enabled. This resulted in a large amount of sstables for that table, which I 
> assume others have experienced as well. In this case I think it was between 
> 15-20k.
> 
> From the Flight Recording one thing we saw was that 15-30% of the CPU time in 
> each of the compaction threads was spent on "getNextBackgroundTask()" which 
> retrieves the next compaction job. With some further investigation this seems 
> to mostly be when it's checking for overlap in L0 sstables before performing 
> an L0->L1 compaction. There is a JIRA which seems to be related to this 
> https://issues.apache.org/jira/browse/CASSANDRA-11571 which we backported to 
> 2.2 and tested. In our testing it seemed to improve the situation but it was 
> still using noticeable CPU.
> 
> My interpretation of the current logic of LCS is (if STCS in L0 is enabled):
> 1. Check each level (L1+)
>  - If a L1+ compaction is needed check if L0 is behind and do STCS if that's 
> the case, otherwise do the L1+ compaction.
> 2. Check L0 -> L1 compactions and if none is needed/possible check for STCS 
> in L0.
> 
> My proposal is to change this behavior to always check if L0 is far behind 
> first and do a STCS compaction in that case. This would avoid the overlap 
> check for L0 -> L1 compactions when L0 is behind and I think it makes sense 
> since we already prefer STCS to L1+ compactions. This would not solve the 
> repair situation, but it would lower some of the impact that repair has on 
> LCS.
> 
> For what version this could get in I think trunk would be enough since 
> compaction is pluggable.
> 
> 
> -- 
>  
> 
> 
> MARCUS OLSSON 
> Software Developer
> 
> Ericsson
> Sweden
> marcus.ols...@ericsson.com
> www.ericsson.com


Re: STCS in L0 behaviour

2016-11-23 Thread Jeff Jirsa
What you’re describing seems very close to what’s discussed in  
https://issues.apache.org/jira/browse/CASSANDRA-10979 - worth reading that 
ticket a bit. 

 

There does seem to be a check for STCS in L0 before it tries higher levels: 

https://github.com/apache/cassandra/blob/cassandra-2.2/src/java/org/apache/cassandra/db/compaction/LeveledManifest.java#L324-L326

 

Why it’s doing that within the for loop 
(https://github.com/apache/cassandra/blob/cassandra-2.2/src/java/org/apache/cassandra/db/compaction/LeveledManifest.java#L310
 ) is unexpected to me, though – Carl / Marcus, any insight into why it’s 
within the loop instead of before it? 

 

 

From: Marcus Olsson 
Organization: Ericsson AB
Reply-To: "dev@cassandra.apache.org" 
Date: Wednesday, November 23, 2016 at 7:52 AM
To: "dev@cassandra.apache.org" 
Subject: STCS in L0 behaviour

 

Hi everyone,


TL;DR
Should LCS be changed to always prefer an STCS compaction in L0 if it's falling 
behind? Assuming that STCS in L0 is enabled.
Currently LCS seems to check if there is a possible L0->L1 compaction before 
checking if it's falling behind, which in our case used between 15-30% of the 
compaction thread CPU.
TL;DR

So first some background:
We have a Apache Cassandra 2.2 cluster running with a high load. In that 
cluster there is a table with a moderate amount of writes per second that is 
using LeveledCompactionStrategy. The test was to run repair on that table while 
we monitored the cluster through JMC and with Flight Recordings enabled. This 
resulted in a large amount of sstables for that table, which I assume others 
have experienced as well. In this case I think it was between 15-20k.

From the Flight Recording one thing we saw was that 15-30% of the CPU time in 
each of the compaction threads was spent on "getNextBackgroundTask()" which 
retrieves the next compaction job. With some further investigation this seems 
to mostly be when it's checking for overlap in L0 sstables before performing 
an L0->L1 compaction. There is a JIRA which seems to be related to this 
https://issues.apache.org/jira/browse/CASSANDRA-11571 which we backported to 
2.2 and tested. In our testing it seemed to improve the situation but it was 
still using noticeable CPU.

My interpretation of the current logic of LCS is (if STCS in L0 is enabled):
1. Check each level (L1+)
 - If a L1+ compaction is needed check if L0 is behind and do STCS if that's 
the case, otherwise do the L1+ compaction.
2. Check L0 -> L1 compactions and if none is needed/possible check for STCS in 
L0.

My proposal is to change this behavior to always check if L0 is far behind 
first and do a STCS compaction in that case. This would avoid the overlap check 
for L0 -> L1 compactions when L0 is behind and I think it makes sense since we 
already prefer STCS to L1+ compactions. This would not solve the repair 
situation, but it would lower some of the impact that repair has on LCS.

For what version this could get in I think trunk would be enough since 
compaction is pluggable.

-- 

  

MARCUS OLSSON 
Software Developer

Ericsson
Sweden
marcus.ols...@ericsson.com
www.ericsson.com 





Re: Wiki Contributor

2016-11-23 Thread Dave Brosius

try now


On 11/22/2016 06:38 PM, Thomas Brown wrote:

Hi,

Could I please be added as a Wiki Contributor

Username: Thomas Brown

Thank you!





FOSDEM 2017 HPC, Bigdata and Data Science DevRoom CFP is closing soon

2016-11-23 Thread Roman Shaposhnik
Hi!

apologies for the extra wide distribution (this exhausts my once
a year ASF mail-to-all-bigdata-projects quota ;-)) but I wanted
to suggest that all of you should consider submitting talks
to FOSDEM 2017 HPC, Bigdata and Data Science DevRoom:
https://hpc-bigdata-fosdem17.github.io/

It was a great success this year and we hope to make it an even
bigger success in 2017.

Besides -- FOSDEM is the biggest gathering of open source
developers on the face of the earth -- don't miss it!

Thanks,
Roman.

P.S. If you have any questions -- please email me directly and
see you all in Brussels!


Re: Cassandra Mutation object decoding

2016-11-23 Thread Sanal Vasudevan
I must say that it is really encouraging to get your thoughts.
Thanks a ton Benjamin, Jacques-Henri, Jordan, Nate and Chris.

I do not have access on the client side where the CQL is executed.
One of my requirements is that my app should not affect the performance of the
Cassandra cluster, or should have only minimal overhead.
I am given access to the commit logs and CDC logs (under
//data/cdc_raw). I can access the database to query the metadata.

I understand that using the Mutation object is risky due to changes in
newer releases. Considering no (or minimal) load on C* cluster and
performance of my app, I am leaning more towards Mutation.
CASSANDRA-8844 suggests use of CommitLogReader and implementing a
CommitLogReadHandler interface which pushes the
Mutation object.
Do you know how we could use the CDC feature without decoding the
Mutation?
Just want to make sure I am not missing some functionality available in the
CDC feature and I am using the CDC feature in the expected fashion.

Just looking at the state of the Mutation object, this is what I get (for each
partitionUpdate in mutation.getPartitionUpdates()):
DELETE CQL: partitionUpdate.columns().isEmpty() : true
INSERT/UPDATE CQL: partitionUpdate.columns().isEmpty() : false
I am checking internally with my team whether I can live with INSERT/UPDATE
classified as upsert (as Jacques-Henri did earlier).
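That empty-columns heuristic can be written up as a tiny classifier. PartitionUpdateView is a hypothetical stand-in for the real org.apache.cassandra.db.partitions.PartitionUpdate, reduced to the single property the heuristic needs:

```java
public class MutationClassifierSketch {
    enum Op { DELETE, UPSERT }

    // Hypothetical minimal view of a partition update: just whether any
    // regular columns carry data.
    interface PartitionUpdateView { boolean columnsEmpty(); }

    // Heuristic from this thread: an update carrying no columns is treated
    // as a partition-level DELETE; otherwise INSERT and UPDATE are treated
    // uniformly as an upsert, since the commit log does not preserve which
    // CQL verb produced the mutation.
    static Op classify(PartitionUpdateView update) {
        return update.columnsEmpty() ? Op.DELETE : Op.UPSERT;
    }
}
```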

I am able to decode partition key, ksName, cfName, ColumnData and Column
definition from the Mutation object.

Thanks folks, great help from this community.

Best regards,
Sanal

On Wed, Nov 23, 2016 at 8:36 PM, Benjamin Lerer  wrote:

> > My goal is to reconstruct the CQL operation from the Mutation object.
> > So that I can trigger the same action on another NoSQL target like
> > MongoDB.
>
> There are different ways of keeping your 2 databases in sync. Unfortunately,
> they all have some trade-offs (as always ;-))
>
>    1. If you have control on the client side, you could wrap the driver
>    and add some code that converts the query and writes it to the other
>    database at the same time. The main problem with that approach is that
>    a write can succeed on one of the databases but not on the other, which
>    means that you will need a mechanism to resolve those conflicts.
>    2. On the Cassandra side you could, as Nate suggested, extend the
>    QueryProcessor in order to log the mutations to a log file. As the
>    QueryProcessor has access to the prepared statement cache and to the
>    bind parameters, you should be able to extract the information you
>    need. Some of the problems of that approach are:
>       1. You cannot reprocess already inserted data.
>       2. You will probably have to use a replication log to deal with the
>       cases where the other database is unreachable.
>       3. It might slow down your query processing and take some of your
>       bandwidth at critical times (heavy writes).
>    3. Use a fake index as Jacques-Henri suggested. It will allow you to
>    easily reprocess already inserted data, so you will not need a
>    replication log (at the same time, having to rebuild the index might
>    slow down your database). The main issues with that solution are:
>       1. All the tables that you want to replicate will have to have that
>       index, and you cannot automatically update the schemas on your
>       other database.
>       2. It might slow down your query processing and take some of your
>       bandwidth at critical times (heavy writes).
>    4. Read the commitlogs to recreate the mutation statements (your
>    initial approach). The main problem is that it is simply not easy to
>    do and might break with new major releases. You will also have to make
>    sure that the files do not disappear before you have processed them.
>    5. Try a Data Warehouse/ETL approach to synchronize your data.
>    CASSANDRA-8844 added support for CDC (Change Data Capture), which
>    might help you there. Unfortunately, I have not really worked on it so
>    I cannot help you much there.
>
> There might be some other approaches worth considering, but they did not
> come to mind.
>
> Hope it helps
>
> Benjamin
>
> PS: MongoDB ... Seriously ??? ;-)
>



-- 
Sanal Vasudevan Nair


Re: Cassandra Mutation object decoding

2016-11-23 Thread Nate McCall
> I must say that it is really encouraging to get your thoughts.
> Thanks a ton Benjamin, Jacques-Henri, Jordan Nate and Chris.
>
> I do not have access on the client side where the CQL is executed.

QueryHandler (I called it QueryProcessor incorrectly in my initial
reply) is server side:
https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/cql3/QueryHandler.java

implemented by the QueryProcessor (which you would most likely want to
extend and override):
https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/cql3/QueryProcessor.java

Nothing is stopping you from continuing to process mutations somewhere else
though; I just think the above would be a good place to start, and it is a
supported-ish API.
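The shape of that interception point can be sketched as below, with hypothetical stand-in types (the real QueryHandler methods take QueryState, QueryOptions and so on, and return a ResultMessage):

```java
import java.util.ArrayList;
import java.util.List;

public class QueryHandlerSketch {
    // Hypothetical reduction of the QueryHandler contract to one method.
    interface Handler { String process(String cql); }

    // Delegating handler in the spirit of extending QueryProcessor: capture
    // every statement for replication, then let the default handler execute
    // it as usual.
    static class LoggingHandler implements Handler {
        final Handler delegate;
        final List<String> replicationLog = new ArrayList<>();

        LoggingHandler(Handler delegate) { this.delegate = delegate; }

        public String process(String cql) {
            replicationLog.add(cql);        // record before execution
            return delegate.process(cql);   // normal processing continues
        }
    }
}
```

The delegate pattern is the key point: the custom handler does not reimplement query processing, it only observes statements on their way through.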


Re: Cassandra Mutation object decoding

2016-11-23 Thread Sanal Vasudevan
Hi Nate,

Thank you.

I can give it a try.
Are there any examples you can point me to that use the QueryProcessor to read
operations from the commit logs?

Best regards,
Sanal

On Thu, Nov 24, 2016 at 1:22 PM, Nate McCall  wrote:

> > I must say that it is really encouraging to get your thoughts.
> > Thanks a ton Benjamin, Jacques-Henri, Jordan Nate and Chris.
> >
> > I do not have access on the client side where the CQL is executed.
>
> QueryHandler (I called it QueryProcessor incorrectly in my initial
> reply) is server side:
> https://github.com/apache/cassandra/blob/cassandra-3.0/
> src/java/org/apache/cassandra/cql3/QueryHandler.java
>
> implemented by the QueryProcessor (which you would most likely want to
> extend and override):
> https://github.com/apache/cassandra/blob/cassandra-3.0/
> src/java/org/apache/cassandra/cql3/QueryProcessor.java
>
> Nothing is stopping you from continuing with processing mutations
> somewhere else though, I just think the above would be a good place to
> start and is supported-ish API.
>



-- 
Sanal Vasudevan Nair