I've been reading everything I can get my hands on about Cassandra, and it sounds like it could be a very good fit for our data needs. I'm about to take the plunge and do some prototyping, but I thought I'd first ask for a reality check here on whether it makes sense.
Our schema should be fairly simple. We may keep only our original data in Cassandra, with the rollups and analyzed results in a relational db (although this is still open for discussion).

We have fairly small records: 120-150 bytes, in maybe 18 columns. Data is additive only; we would rarely, if ever, be deleting data. Our core data set will accumulate at somewhere between 14 and 27 million rows per day. We'll be starting with about a year and a half of data (7.5 - 15 billion rows) and eventually would like to keep 5 years online (25 to 50 billion rows). (So that's maybe 1.3TB or so per year, raw data only. I'm not sure about the storage overhead yet.)

Ideally we'd also like to have a cluster with our complete data set, which is maybe 38 billion rows per year (we could live with less than 5 years of that).

I haven't really thought through what the schema is going to be. Our primary key is an entity's ID plus a timestamp, but there are 2 or 3 other retrieval paths we'll need to support as well.

Thoughts? Pitfalls? Gotchas? Are we completely whacked?

Thanks,
-- dwh
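P.S. In case anyone wants to check my arithmetic, here is a rough Python sketch of the sizing, using only the figures above. The constant and function names are just mine for illustration, and it deliberately leaves out replication and on-disk overhead, since I don't have numbers for those yet.

```python
# Back-of-the-envelope sizing from the figures in this post.
# Raw data only: no replication, indexes, or on-disk overhead included.
ROWS_PER_DAY = (14_000_000, 27_000_000)   # low and high daily row counts
BYTES_PER_ROW = (120, 150)                # low and high record sizes
DAYS_PER_YEAR = 365


def rows(years, per_day):
    """Rows accumulated over `years` at `per_day` rows per day."""
    return int(years * DAYS_PER_YEAR * per_day)


def raw_terabytes(row_count, bytes_per_row):
    """Raw payload size in decimal terabytes (1 TB = 1e12 bytes)."""
    return row_count * bytes_per_row / 1e12


for years in (1, 1.5, 5):
    low_rows = rows(years, ROWS_PER_DAY[0])
    high_rows = rows(years, ROWS_PER_DAY[1])
    low_tb = raw_terabytes(low_rows, BYTES_PER_ROW[0])
    high_tb = raw_terabytes(high_rows, BYTES_PER_ROW[1])
    print(f"{years} yr: {low_rows / 1e9:.1f}-{high_rows / 1e9:.1f} billion rows, "
          f"{low_tb:.2f}-{high_tb:.2f} TB raw")
```

For one year this comes out to roughly 0.6 to 1.5 TB of raw data, which brackets my 1.3TB guess.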