Re: RFC: Cassandra Virtual Nodes
On 22 March 2012 05:48, Zhu Han wrote:
> I second it.
>
> Is there some goals we missed which can not be achieved by assigning
> multiple tokens to a single node?

This is exactly the proposed solution. The discussion is about how to
implement this, and the methods of choosing tokens and replication
strategy.

Richard.
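[Editor's note: for concreteness, here is a minimal sketch, mine rather than anything from the thread or from Cassandra itself, of what "multiple tokens per node" could look like. The per-node token count and the RandomPartitioner-style 2^127 ring size are assumptions for illustration.]

    import java.math.BigInteger;
    import java.security.SecureRandom;
    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch only: one node draws many random tokens, so it
    // owns many small ring segments instead of one contiguous range.
    public class VnodeTokens {
        static final int NUM_TOKENS = 256; // assumed per-node token count

        static List<BigInteger> randomTokens(SecureRandom rng) {
            List<BigInteger> tokens = new ArrayList<>();
            for (int i = 0; i < NUM_TOKENS; i++) {
                // Uniform token in [0, 2^127), a RandomPartitioner-style ring.
                tokens.add(new BigInteger(127, rng));
            }
            return tokens;
        }

        public static void main(String[] args) {
            System.out.println(randomTokens(new SecureRandom()));
        }
    }

With many random tokens, a joining node takes many small slices from many existing nodes, instead of splitting a single neighbour's range.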
Re: RFC: Cassandra Virtual Nodes
On Thu, Mar 22, 2012 at 6:20 PM, Richard Low wrote:
> On 22 March 2012 05:48, Zhu Han wrote:
>
> > I second it.
> >
> > Is there some goals we missed which can not be achieved by assigning
> > multiple tokens to a single node?
>
> This is exactly the proposed solution. The discussion is about how to
> implement this, and the methods of choosing tokens and replication
> strategy.

Does the new scheme still require the node to re-iterate all sstables to
build the merkle tree or stream data for partition-level repair and move?

The disk IO triggered by the above steps could be very time-consuming if
the dataset on a single node is very large. It could be much more costly
than the network IO, especially when concurrent repair tasks hit the
same node.

Are there any good ideas on it?

> Richard.
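[Editor's note: to put rough numbers on this concern, here is a back-of-envelope sketch with figures I have assumed myself, not taken from the thread: a validation scan over a whole node's dataset is bounded by sequential disk throughput.]

    // Back-of-envelope sketch with assumed figures: 1 TiB per node,
    // 100 MiB/s sequential reads. The scan alone costs hours, before
    // any contention from concurrent repairs on the same disks.
    public class RepairScanCost {
        public static void main(String[] args) {
            double datasetMiB = 1024.0 * 1024; // assumed 1 TiB of data
            double diskMiBPerSec = 100.0;      // assumed sequential throughput
            double hours = datasetMiB / diskMiBPerSec / 3600;
            System.out.printf("one full-dataset scan: ~%.1f hours%n", hours);
            // prints ~2.9 hours; streaming only the differing ranges over
            // the network would typically cost far less than this.
        }
    }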
Re: RFC: Cassandra Virtual Nodes
On Thu, Mar 22, 2012 at 5:46 AM, Zhu Han wrote:
> Does the new scheme still require the node to re-iterate all sstables to
> build the merkle tree or stream data for partition-level repair and move?

You would have to iterate through all sstables on the system to repair
one vnode, yes: but building the tree for just one range of the data
means that huge portions of the sstable files can be skipped. It should
scale down linearly as the number of vnodes increases (i.e., with 100
vnodes, it will take 1/100th the time to repair one vnode).
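[Editor's note: a sketch of that skipping logic follows. The SSTable and MerkleTree interfaces are hypothetical stand-ins, not Cassandra's actual repair code: because each sstable is sorted by token, an index seek bounds the slice that overlaps one vnode's range, and the rest of the file is never read.]

    import java.math.BigInteger;
    import java.util.List;

    // Hypothetical sketch of range-restricted Merkle tree building; the
    // interfaces are stand-ins, not Cassandra's real classes.
    class RangeRestrictedValidation {
        interface SSTable {
            BigInteger minToken();
            BigInteger maxToken();
            long offsetOf(BigInteger token);                  // index seek
            Iterable<byte[]> rowHashes(long start, long end); // scan one slice
        }

        interface MerkleTree {
            void add(byte[] rowHash);
        }

        // Build the tree for one vnode's range (left, right] only.
        static void buildTree(List<SSTable> sstables, BigInteger left,
                              BigInteger right, MerkleTree tree) {
            for (SSTable t : sstables) {
                // Skip sstables that don't intersect the range at all.
                if (t.maxToken().compareTo(left) <= 0
                        || t.minToken().compareTo(right) > 0)
                    continue;
                // Read only the overlapping slice, located via the index.
                long start = t.offsetOf(left);
                long end = t.offsetOf(right);
                for (byte[] h : t.rowHashes(start, end))
                    tree.add(h);
            }
        }
    }

With v vnodes, each range covers roughly 1/v of the ring, so the bytes actually read per repair shrink proportionally, which is the linear scaling described above.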
Re: RFC: Cassandra Virtual Nodes
> You would have to iterate through all sstables on the system to repair
> one vnode, yes: but building the tree for just one range of the data
> means that huge portions of the sstable files can be skipped. It should
> scale down linearly as the number of vnodes increases (i.e., with 100
> vnodes, it will take 1/100th the time to repair one vnode).

The story is less good for "nodetool cleanup", however, which still has
to truck over the entire dataset.

(The partitions/buckets in my crush-inspired scheme address this by
allowing each ring segment, in vnode terminology, to be stored
separately in the file system.)

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
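[Editor's note: a sketch of why that layout helps. The directory naming and ownership lookup are invented for illustration: if each ring segment's sstables live under their own directory, cleanup becomes deleting the directories for segments the node no longer owns, rather than rewriting the whole dataset.]

    import java.io.File;
    import java.util.Set;

    // Illustrative only: per-segment directories make cleanup a directory
    // delete instead of a full-dataset compaction pass.
    class SegmentCleanup {
        static void cleanup(File dataDir, Set<String> ownedSegments) {
            File[] segments = dataDir.listFiles(File::isDirectory);
            if (segments == null) return;
            for (File seg : segments) {
                if (!ownedSegments.contains(seg.getName()))
                    deleteRecursively(seg); // drop data for a range we lost
            }
        }

        static void deleteRecursively(File f) {
            File[] children = f.listFiles();
            if (children != null)
                for (File c : children) deleteRecursively(c);
            f.delete();
        }
    }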
Re: RFC: Cassandra Virtual Nodes
On Fri, Mar 23, 2012 at 6:54 AM, Peter Schuller wrote:
> > You would have to iterate through all sstables on the system to repair
> > one vnode, yes: but building the tree for just one range of the data
> > means that huge portions of the sstable files can be skipped. It should
> > scale down linearly as the number of vnodes increases (i.e., with 100
> > vnodes, it will take 1/100th the time to repair one vnode).

The SSTable indices would still have to be scanned under size-tiered
compaction, though. Am I missing anything here?

> The story is less good for "nodetool cleanup", however, which still has
> to truck over the entire dataset.
>
> (The partitions/buckets in my crush-inspired scheme address this by
> allowing each ring segment, in vnode terminology, to be stored
> separately in the file system.)

But the number of files can be a big problem if there are hundreds of
vnodes and millions of sstables on the same physical node. We would need
a way to pin sstable inodes to memory; otherwise, the average number of
disk IOs to access a row in an sstable could be five or more.
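[Editor's note: rough arithmetic behind the file-count worry, with figures I have assumed myself rather than taken from the thread.]

    // Back-of-envelope sketch with assumed figures: per-segment storage
    // multiplies file counts by the vnode count, and once the kernel's
    // inode/dentry caches no longer cover them, opening the right files
    // adds extra seeks to every cold read.
    public class FileCount {
        public static void main(String[] args) {
            long vnodes = 256;            // assumed vnodes per physical node
            long sstablesPerVnode = 1000; // assumed, e.g. many small flushes
            long filesPerSstable = 4;     // data, index, filter, stats files
            long total = vnodes * sstablesPerVnode * filesPerSstable;
            System.out.println("files on one node: " + total); // ~1 million
        }
    }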