Re: Heavy one-off writes best practices

2018-02-06 Thread Romain Hardouin
We use Spark2Cassandra (this fork works with C* 3.0: https://github.com/leoromanovsky/Spark2Cassandra). SSTables are streamed to Cassandra by Spark2Cassandra, so you need to open port 7000 accordingly. During the benchmark we used 25 EMR nodes, but in production we use fewer nodes to be more gentle with

Re: Heavy one-off writes best practices

2018-02-06 Thread Julien Moumne
This does look like a very viable solution, thanks. Could you give us some pointers/documentation on:
- how can we build such SSTables using Spark jobs, maybe https://github.com/Netflix/sstable-adaptor ?
- how do we send these tables to Cassandra? Does a simple SCP work?
- what is the recommended

Re: Heavy one-off writes best practices

2018-02-05 Thread Romain Hardouin
Hi Julien, We have such a use case on some clusters. If you want to insert big batches at a fast pace, the only viable solution is to generate SSTables on the Spark side and stream them to C*. The last time we benchmarked such a job, we achieved 1.3 million partitions inserted per second on a 3 C* nodes

Re: Heavy one-off writes best practices

2018-02-04 Thread kurt greaves
> Would you know if there is evidence that inserting skinny rows in sorted order (no batching) helps C*?

This won't have any effect, as each insert will be handled separately by the coordinator (or a different coordinator, even). Sorting is also very unlikely to help even if you did batch. Al

Re: Heavy one-off writes best practices

2018-02-04 Thread Julien Moumne
Thank you all for your inputs. Before trying advanced ideas, we'll first try reducing the number of Spark executors and see if export times are still acceptable. Two additional questions: Would you know if there is evidence that inserting skinny rows in sorted order (no batching) helps C*? Also

Re: Heavy one-off writes best practices

2018-01-30 Thread Jeff Jirsa
Two other options, both of which will be faster (and less likely to impact read latencies) but require some app-side programming, if you're willing to generate the SSTables programmatically with CQLSSTableWriter or similar. Once you do that, you can: 1) stream them in with the sstableloader (wh
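For readers following along, the first option can be sketched in Scala with Cassandra's CQLSSTableWriter. This is only an illustrative sketch: the keyspace/table schema, the output directory, and the row values are assumptions, not details from the thread.

```scala
import java.io.File
import java.util.UUID
import org.apache.cassandra.io.sstable.CQLSSTableWriter

// Hypothetical schema and insert statement for the table to bulk-load.
val schema = "CREATE TABLE ks.events (id uuid PRIMARY KEY, payload text)"
val insert = "INSERT INTO ks.events (id, payload) VALUES (?, ?)"

// Build a writer that produces SSTable files in a local directory.
val writer = CQLSSTableWriter.builder()
  .inDirectory(new File("/tmp/sstables/ks/events"))
  .forTable(schema)
  .using(insert)
  .build()

// Add rows (bound to the '?' placeholders of the insert statement).
writer.addRow(UUID.randomUUID(), "payload-1")
writer.close()
```

The resulting files can then be streamed into the cluster with something like `sstableloader -d <cassandra-host> /tmp/sstables/ks/events`, which uses the streaming port rather than the client protocol.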

Re: Heavy one-off writes best practices

2018-01-30 Thread Lucas Benevides
Hello Julien, After reading the excellent post and video by Alain Rodriguez, maybe you should read the paper "Performance Tuning of Big Data Platform: Cassandra Case Study" by Sathvik Katam. In the results he sets new values

Re: Heavy one-off writes best practices

2018-01-30 Thread Alain RODRIGUEZ
Hi Julien,

> Whether skinny rows or wide rows, data for a partition key is always completely updated / overwritten, ie. every command is an insert.

Inserts and updates are kind of the same thing in Cassandra for standard data types, as Cassandra appends the operation and does not actually update an

Re: Heavy one-off writes best practices

2018-01-30 Thread Alain RODRIGUEZ
I noticed I did not give credit to Eric Lubow from SimpleReach. The video mentioned above is a talk he gave at the Cassandra Summit 2016 :-).

Heavy one-off writes best practices

2018-01-30 Thread Julien Moumne
Hello, I am looking for best practices for the following use case: once a day, we insert 10 full tables at the same time (several 100 GiB each) using the Spark C* driver, without batching, with CL set to ALL. Whether skinny rows or wide rows, data for a partition key is always completely updated / overwritten
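A write job along the lines described above might look like the following spark-cassandra-connector sketch. The connection host, source path, keyspace/table names, and the throttle value are placeholders; CL=ALL mirrors the setup in the question, though a lower level such as LOCAL_QUORUM would be gentler on the cluster.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("daily-full-table-load")
  // Placeholder seed node.
  .config("spark.cassandra.connection.host", "cassandra-seed-1")
  // CL=ALL, as in the original setup.
  .config("spark.cassandra.output.consistency.level", "ALL")
  // Optional throttle to limit the impact on read latencies.
  .config("spark.cassandra.output.throughput_mb_per_sec", "50")
  .getOrCreate()

// Hypothetical source of the daily export.
val df = spark.read.parquet("/data/daily_export/table1")

// Plain per-row writes through the connector, no batching logic of our own.
df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "table1"))
  .mode("append")
  .save()
```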