Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Richard Low
On 22 March 2012 05:48, Zhu Han  wrote:

> I second it.
>
> Are there any goals we missed which cannot be achieved by assigning
> multiple tokens to a single node?

This is exactly the proposed solution.  The discussion is about how to
implement this, and the methods of choosing tokens and replication
strategy.
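
As a purely illustrative sketch of the simplest variant under discussion
(random token selection), assuming a signed 64-bit token space and a made-up
NUM_TOKENS constant rather than anything in the actual Cassandra code:

    import java.util.Random;
    import java.util.TreeSet;

    // Illustrative only: a node picks NUM_TOKENS random positions on the
    // ring instead of a single token. The constant and the use of the full
    // signed 64-bit range are assumptions, not Cassandra's actual code.
    public class RandomTokenSketch {
        static final int NUM_TOKENS = 256;

        public static void main(String[] args) {
            Random rng = new Random();
            TreeSet<Long> tokens = new TreeSet<>();
            while (tokens.size() < NUM_TOKENS) {
                tokens.add(rng.nextLong());      // one vnode per token
            }
            System.out.println("node owns " + tokens.size() + " ring positions");
        }
    }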

Richard.


Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Zhu Han
On Thu, Mar 22, 2012 at 6:20 PM, Richard Low  wrote:

> On 22 March 2012 05:48, Zhu Han  wrote:
>
> > I second it.
> >
> > Are there any goals we missed which cannot be achieved by assigning
> > multiple tokens to a single node?
>
> This is exactly the proposed solution.  The discussion is about how to
> implement this, and the methods of choosing tokens and replication
> strategy.
>

Does the new scheme still require the node to iterate over all sstables to
build the merkle tree or stream data for partition-level
repair and move?

The disk IO triggered by the above steps could be very time-consuming if the
dataset on a single node is very large.  It could be much more costly than
the network IO, especially when concurrent repair tasks hit the same
node.
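
As a rough illustration of that cost (the numbers are assumptions, not
measurements): with 1 TB of data on a node and roughly 100 MB/s of sequential
disk throughput, a full pass over the sstables is on the order of
1 TB / 100 MB/s = 10,000 seconds, close to three hours of disk time per
repair, before any contention from concurrent repairs is counted.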

Are there any good ideas on this?


> Richard.
>


Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Stu Hood
>
> Does the new scheme still require the node to iterate over all sstables to
> build the merkle tree or stream data for partition-level
> repair and move?

You would have to iterate through all sstables on the system to repair one
vnode, yes: but building the tree for just one range of the data means that
huge portions of the sstable files can be skipped. It should scale down
linearly as the number of vnodes increases (i.e., with 100 vnodes, it will
take 1/100th the time to repair one vnode).
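
As a minimal sketch of why the per-range tree is cheap (the class and field
names are hypothetical, not Cassandra's actual API; it also assumes a
non-wrapping range and per-sstable min/max token metadata):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: only sstables whose token bounds overlap the
    // vnode's range need to be read when building its merkle tree.
    public class RangeValidationSketch {
        static class SSTable {
            final long minToken, maxToken;       // per-file bounds from metadata
            SSTable(long min, long max) { minToken = min; maxToken = max; }
        }

        // Assumes a non-wrapping range [rangeStart, rangeEnd].
        static List<SSTable> candidatesFor(List<SSTable> all,
                                           long rangeStart, long rangeEnd) {
            List<SSTable> hits = new ArrayList<>();
            for (SSTable s : all) {
                if (s.maxToken >= rangeStart && s.minToken <= rangeEnd) {
                    hits.add(s);                 // must be read for this vnode
                }
            }
            return hits;                         // everything else is skipped
        }
    }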

On Thu, Mar 22, 2012 at 5:46 AM, Zhu Han  wrote:

> On Thu, Mar 22, 2012 at 6:20 PM, Richard Low  wrote:
>
> > On 22 March 2012 05:48, Zhu Han  wrote:
> >
> > > I second it.
> > >
> > > Are there any goals we missed which cannot be achieved by assigning
> > > multiple tokens to a single node?
> >
> > This is exactly the proposed solution.  The discussion is about how to
> > implement this, and the methods of choosing tokens and replication
> > strategy.
> >
>
> Does the new scheme still require the node to iterate over all sstables to
> build the merkle tree or stream data for partition-level
> repair and move?
>
> The disk IO triggered by the above steps could be very time-consuming if the
> dataset on a single node is very large.  It could be much more costly than
> the network IO, especially when concurrent repair tasks hit the same
> node.
>
> Are there any good ideas on this?
>
>
> > Richard.
> >
>


Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Peter Schuller
> You would have to iterate through all sstables on the system to repair one
> vnode, yes: but building the tree for just one range of the data means that
> huge portions of the sstable files can be skipped. It should scale down
> linearly as the number of vnodes increases (i.e., with 100 vnodes, it will
> take 1/100th the time to repair one vnode).

The story is less good for "nodetool cleanup", however, which still has
to truck over the entire dataset.

(The partitions/buckets in my CRUSH-inspired scheme address this by
allowing each ring segment, in vnode terminology, to be stored
separately in the file system.)
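
As an illustration of that point (the directory layout and names below are
hypothetical, not an actual Cassandra layout): if each segment's sstables live
under their own directory, say data/<ks>/<cf>/segment-<id>/, then cleaning up
a segment the node no longer owns becomes a recursive delete rather than a
rewrite of every sstable:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Comparator;
    import java.util.stream.Stream;

    // Hypothetical sketch: with per-segment directories, "cleanup" of an
    // unowned segment is a directory drop, not a full-dataset rewrite.
    public class SegmentCleanupSketch {
        static void dropSegment(Path dataDir, String ks, String cf,
                                int segmentId) throws IOException {
            Path segmentDir = dataDir.resolve(ks).resolve(cf)
                                     .resolve("segment-" + segmentId);
            if (!Files.exists(segmentDir)) {
                return;                          // nothing stored locally
            }
            try (Stream<Path> walk = Files.walk(segmentDir)) {
                walk.sorted(Comparator.reverseOrder()) // children before parents
                    .forEach(p -> p.toFile().delete());
            }
        }
    }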

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Zhu Han
On Fri, Mar 23, 2012 at 6:54 AM, Peter Schuller  wrote:

> > You would have to iterate through all sstables on the system to repair
> > one vnode, yes: but building the tree for just one range of the data
> > means that huge portions of the sstable files can be skipped. It should
> > scale down linearly as the number of vnodes increases (i.e., with 100
> > vnodes, it will take 1/100th the time to repair one vnode).
>

The SSTable indices would still need to be scanned under size-tiered
compaction, though.  Am I missing anything here?


> The story is less good for "nodetool cleanup", however, which still has
> to truck over the entire dataset.
>
> (The partitions/buckets in my CRUSH-inspired scheme address this by
> allowing each ring segment, in vnode terminology, to be stored
> separately in the file system.)
>

But the number of files could be a big problem if there are hundreds of
vnodes and millions of sstables on the same physical node.

We would need a way to pin sstable inodes in memory.  Otherwise, the
average number of disk IOs needed to access a row in an sstable could
be five or more.
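
As a purely illustrative breakdown of where such a number could come from on a
cold cache (assumptions, not measurements): a couple of IOs to resolve the
sstable's dentry and inode, one to page in the bloom filter or index summary,
one for the partition index block, and one for the data block itself already
adds up to roughly five seeks before a single row is returned.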


>
> --
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
>