RE: An extremely fast cassandra table full scan utility

2016-11-01 Thread SEAN_R_DURITY
I undertook a similar effort a while ago. https://issues.apache.org/jira/browse/CASSANDRA-7014 Other than the fact that it was closed with no comments, I can tell you that other efforts I had to embed things in Cassandra did not go swimmingly. Although at the time

Re: An extremely fast cassandra table full scan utility

2016-10-03 Thread Edward Capriolo
I undertook a similar effort a while ago. https://issues.apache.org/jira/browse/CASSANDRA-7014 Other than the fact that it was closed with no comments, I can tell you that other efforts I had to embed things in Cassandra did not go swimmingly. Although at the time ideas were rejected like groovy

Re: An extremely fast cassandra table full scan utility

2016-10-03 Thread Bhuvan Rawal
Hi Jonathan, If a full scan is a regular requirement, then setting up a spark cluster in locality with the Cassandra nodes makes perfect sense. But supposing that it is a one-off requirement, say a weekly or a fortnightly task, a spark cluster could be an added overhead with additional capacity, resource

Re: An extremely fast cassandra table full scan utility

2016-10-03 Thread siddharth verma
Hi Jon, It wasn't allowed. Moreover, someone who isn't familiar with spark, and might be new to map, filter, reduce etc. operations, could also use the utility for some simple operations, assuming a sequential scan of the cassandra table. Regards Siddharth Verma On Tue, Oct 4, 2016 at 1:32 AM, Jon

Re: An extremely fast cassandra table full scan utility

2016-10-03 Thread Jonathan Haddad
Couldn't set up as couldn't get it working, or it's not allowed? On Mon, Oct 3, 2016 at 3:23 PM Siddharth Verma wrote: > Hi Jon, > We couldn't set up a spark cluster. > > For some use case, a spark cluster was required, but for some reason we > couldn't create a spark cluster. Hence, one may use this

Re: An extremely fast cassandra table full scan utility

2016-10-03 Thread Siddharth Verma
Hi Jon, We couldn't set up a spark cluster. For some use case, a spark cluster was required, but for some reason we couldn't create a spark cluster. Hence, one may use this utility to iterate through the entire table at very high speed. Had to find a workaround that would be faster than paging on r

Re: An extremely fast cassandra table full scan utility

2016-10-03 Thread Jonathan Haddad
It almost sounds like you're duplicating all the work of both spark and the connector. May I ask why you decided to not use the existing tools? On Mon, Oct 3, 2016 at 2:21 PM siddharth verma wrote: > Hi DuyHai, > Thanks for your reply. > A few more features planned in the next one(if there is on

Re: An extremely fast cassandra table full scan utility

2016-10-03 Thread Bhuvan Rawal
It will be interesting to have a comparison with spark here for basic use cases. From a naive observation it appears that this could be slower than spark as a lot of data is streamed over network. On the other hand in this approach we have seen that Young GC takes nearly full CPU (possibly becau

Re: An extremely fast cassandra table full scan utility

2016-10-03 Thread siddharth verma
Hi DuyHai, Thanks for your reply. A few more features are planned for the next one (if there is one), like a custom policy that keeps in mind the replication of token ranges on specific nodes, fine graining the token range (for more speedup), and a few more. I think, as fine graining a token range, If one toke

Re: An extremely fast cassandra table full scan utility

2016-10-03 Thread DuyHai Doan
Hello Siddharth, I just took a look at the architecture diagram. The idea of using multiple threads, one for each token range, is great. It helps max out parallelism. With https://issues.apache.org/jira/browse/CASSANDRA-11521 it would be even faster. On Mon, Oct 3, 2016 at 7:51 PM, siddharth v
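
The "one thread per token range" idea mentioned above can be sketched roughly as follows. This is only an illustration, not the utility's actual code: it assumes the Murmur3Partitioner token bounds, splits the ring evenly (the real tool may instead use the cluster's own token ranges), and takes a hypothetical `fetch` callable that stands in for whatever driver call executes a CQL query and returns a row count. The names `split_token_ring`, `range_query` and `parallel_scan` are made up for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

# Token bounds of Murmur3Partitioner (signed 64-bit); assumed partitioner.
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def split_token_ring(num_splits):
    """Split the whole token ring into num_splits contiguous (start, end) pairs."""
    total = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (total * i) // num_splits for i in range(num_splits)]
    bounds.append(MAX_TOKEN)
    return list(zip(bounds[:-1], bounds[1:]))


def range_query(keyspace, table, partition_key, start, end):
    """Build a CQL query for one token sub-range.

    The first sub-range uses >= so that the lowest token is not skipped;
    all others use > so that adjacent sub-ranges do not overlap.
    """
    lower_op = ">=" if start == MIN_TOKEN else ">"
    return (
        f"SELECT * FROM {keyspace}.{table} "
        f"WHERE token({partition_key}) {lower_op} {start} "
        f"AND token({partition_key}) <= {end}"
    )


def parallel_scan(queries, fetch, max_workers=8):
    """Run fetch(query) for every query on a thread pool and sum the row counts.

    fetch is a hypothetical callable supplied by the caller, e.g. a wrapper
    around a driver session that pages through the results of one query.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(fetch, queries))
```

A caller would build one query per sub-range, for example `[range_query("ks", "tbl", "id", s, e) for s, e in split_token_ring(64)]`, and pass the list to `parallel_scan` together with its own `fetch` function. The keyspace, table and column names here are placeholders.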

An extremely fast cassandra table full scan utility

2016-10-03 Thread siddharth verma
Hi, I was working on a utility which can be used for cassandra full table scan, at a tremendously high velocity, cassandra fast full table scan. How fast? The script dumped ~229 million rows in 116 seconds, on a cluster of 6 nodes. Data transfer rates of up to 25MBps were observed on cassan