Hi,

On Thursday, 2 August 2012 at 11:47, Owen Davies wrote:
> We want to store a large number of columns in a single row (up to about
> 100,000,000), where each value is roughly 10 bytes.
>
> We also need to be able to get slices of columns from any point in the row.
>
> We haven't found a problem with smaller amounts of data so far, but can
> anyone think of any reason if this is a bad idea, or would cause large
> performance problems?

My experience with wide rows & Cassandra is not positive. We used to have rows of a few hundred megabytes each, to be read during Map Reduce computation, and that caused many issues, especially timeouts when reading the rows (with Cassandra under a medium write load) and OutOfMemory exceptions.

The solution in our case was to "shard" (time-bucket) the rows into smaller pieces (a few megabytes each).

The situation might have changed with Cassandra 1.1.0, which claims to have some "wide row" support, but I haven't been able to test that.

> If breaking up the row is something we should do, what is the maximum number
> of columns we should have?
>
> We are not too worried if there is only a small performance decrease, adding
> more nodes to the cluster would be an option to help make code simpler.

I don't have a precise figure, but I'd limit row size to less than 100MB… much less, if possible. In general, my experience is that hundreds of millions of small rows don't cause issues, but having just a few very wide rows will cause timeouts and, in the worst cases, OOM.

--
Filippo Diotalevi
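
P.S. To make the time-bucketing idea concrete, here is a minimal sketch. The one-hour bucket width, the "<entityId>:<bucketStart>" key format and all class/method names are illustrative assumptions, not the exact scheme we ran. The idea is just to derive the row key from the entity id plus the start of the time bucket a column's timestamp falls into, so no single row grows without bound; a time-range read then becomes one slice query per bucket key.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of time-bucketed row keys; bucket width and key
// format are assumptions, not the exact scheme described above.
public final class TimeBucketedKey {

    // Arbitrary bucket width of one hour, in milliseconds.
    private static final long BUCKET_MILLIS = 60L * 60L * 1000L;

    // Row key for a single write: "<entityId>:<bucketStart>".
    static String rowKey(String entityId, long timestampMillis) {
        long bucketStart = (timestampMillis / BUCKET_MILLIS) * BUCKET_MILLIS;
        return entityId + ":" + bucketStart;
    }

    // For a time-range read, list every bucket key the range touches and
    // issue one slice query per key instead of one huge slice on one row.
    static List<String> rowKeysForRange(String entityId,
                                        long fromMillis,
                                        long toMillis) {
        List<String> keys = new ArrayList<>();
        long start = (fromMillis / BUCKET_MILLIS) * BUCKET_MILLIS;
        for (long bucket = start; bucket <= toMillis; bucket += BUCKET_MILLIS) {
            keys.add(entityId + ":" + bucket);
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(rowKey("sensor-42", 1343908020000L));
        System.out.println(rowKeysForRange("sensor-42",
                1343904000000L, 1343911200000L));
    }
}

Whether you bucket by hour, day or something else depends on your write rate; the point is just to keep each physical row down to a few megabytes.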