Re: Digg's data model

2010-03-23 Thread Ned Wolpert
I'm curious why you are storing the backups (SSTables and commit logs) in HDFS instead of something like Lustre. Are your backups using Hadoop's map/reduce somehow? Or is it for convenience? On Sat, Mar 20, 2010 at 8:40 AM, Chris Goffinet wrote: > > 5. Backups : If there is a 4 or 5 TB cassandra…

Re: Digg's data model

2010-03-20 Thread Chris Goffinet
On Mar 20, 2010, at 9:10 AM, Jeremy Dunck wrote: > On Sat, Mar 20, 2010 at 10:40 AM, Chris Goffinet wrote: >>> 5. Backups : If there is a 4 or 5 TB cassandra cluster what do you >>> recommend the backup scenarios could be? >> >> Worst case scenario (total failure) we opted to do global snapshots…

Re: Digg's data model

2010-03-20 Thread Jeremy Dunck
On Sat, Mar 20, 2010 at 10:40 AM, Chris Goffinet wrote: >> 5. Backups : If there is a 4 or 5 TB cassandra cluster what do you >> recommend the backup scenarios could be? > > Worst case scenario (total failure) we opted to do global snapshots every 24 > hours. This creates hard links to SSTables…

Re: Digg's data model

2010-03-20 Thread Chris Goffinet
> 5. Backups : If there is a 4 or 5 TB cassandra cluster what do you recommend > the backup scenarios could be? Worst case scenario (total failure), we opted to do global snapshots every 24 hours. This creates hard links to SSTables on each node. We copy those SSTables to HDFS on a daily basis.
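The snapshot scheme above works because SSTables are immutable once written: a "snapshot" is just a directory of hard links, created instantly and consuming no extra space until compaction removes the originals. A minimal Python sketch of that mechanism (file and directory names are illustrative, not Cassandra's actual layout):

```python
import os
import tempfile

# Simulate one immutable SSTable in a node's data directory.
data_dir = tempfile.mkdtemp(prefix="data-")
snap_dir = tempfile.mkdtemp(prefix="snapshot-")

sstable = os.path.join(data_dir, "Users-1-Data.db")
with open(sstable, "wb") as f:
    f.write(b"immutable sstable contents")

# Taking a snapshot hard-links the SSTable instead of copying it.
link = os.path.join(snap_dir, "Users-1-Data.db")
os.link(sstable, link)

# Both names point at the same inode, so the snapshot costs no data copy.
assert os.path.samefile(sstable, link)
assert os.stat(sstable).st_nlink == 2

# Removing the original (as compaction eventually would) leaves the
# snapshot's copy of the bytes intact; it can now be shipped to HDFS.
os.remove(sstable)
with open(link, "rb") as f:
    assert f.read() == b"immutable sstable contents"
```

The daily copy to HDFS then only has to read these linked files; nothing on the live node is blocked while the transfer runs.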

Re: Digg's data model

2010-03-20 Thread Chris Goffinet
> Also, does cassandra support counters? Digg's article said they are going to > contribute their work to open source; any idea when that would be? > All of the custom work has been pushed upstream from Digg and continues. We have a few operational tools we will be releasing that will go into co…

Re: Digg's data model

2010-03-20 Thread Joe Stump
On Mar 20, 2010, at 2:53 AM, Lenin Gali wrote: > 1. Eventual consistency: Given a volume of 5K writes/sec, of which roughly 1,500 > are updates per sec while the rest are inserts, what kind of latency > can be expected in eventual consistency? Depending on the size of the cluster you're not…

Re: Digg's data model

2010-03-20 Thread Lenin Gali
Hi, I have several questions. I hope some of you can share your experiences with each or all of the following. I will be curious about Twitter's and Digg's experience, as they might be processing similar volumes. 1. Eventual consistency: Given a volume of 5K writes/sec, of which roughly 1,500 are updates per sec, what kind of latency can be expected in eventual consistency?…

Re: Digg's data model

2010-03-19 Thread Jonathan Ellis
Jeff Hodsdon edited the new link in: http://about.digg.com/blog/looking-future-cassandra On Fri, Mar 19, 2010 at 2:49 PM, Nathan McCall wrote: > Gary, > Did you see this article linked from the Cassandra wiki? > http://about.digg.com/node/564 > > See http://wiki.apache.org/cassandra/ArticlesAndPresentations…

Re: Digg's data model

2010-03-19 Thread Nathan McCall
Gary, did you see this article linked from the Cassandra wiki? http://about.digg.com/node/564 See http://wiki.apache.org/cassandra/ArticlesAndPresentations for more examples like the above. In general, you structure your data according to how it will be queried. This can lead to duplication, but…

Re: Digg's data model

2010-03-19 Thread David Strauss
On 2010-03-19 19:16, Gary wrote: > I am a newbie to the BigTable-like model and have a question as follows. > Take Digg as an example: I want to find a list of users who dug a URL and > also want to find a list of URLs a user dug. What should the data model > look like for the queries to be efficient? If I…

Re: Digg's data model

2010-03-19 Thread Joe Stump
On Mar 19, 2010, at 1:16 PM, Gary wrote: > I am a newbie to the BigTable-like model and have a question as follows. Take > Digg as an example: I want to find a list of users who dug a URL and also want > to find a list of URLs a user dug. What should the data model look like for > the queries to be efficient?…

Digg's data model

2010-03-19 Thread Gary
I am a newbie to the BigTable-like model and have a question as follows. Take Digg as an example: I want to find a list of users who dug a URL and also want to find a list of URLs a user dug. What should the data model look like for the queries to be efficient? If I use the username and the URL for two row…