Hi Shao-Chuan,

That sounds like a good idea, thanks for your response.  I think I may have
missed the e-mail from Tyler that you reference --- I'll go back and look.

FWIW the code that I have written so far is here:

    https://github.com/wibiclint/cassandra2-hadoop2

It is in rough shape right now; I really just needed to get it to a level
where I could run unit tests with the rest of the Kiji framework as we
added C* support.  In the next few weeks, however, we are going to have a
couple of folks here at my company (WibiData) working on it.

I saw a thread a while back about how to support both Hadoop 1 and Hadoop
2.  For the Kiji platform, we have used "platform bridges," described
here: https://github.com/kijiproject/wiki/wiki/Platform-bridges, to select
the proper code to load at run time.  The bridges for the different Hadoop
versions live in different JARs, which get loaded dynamically.  We could
probably apply a similar approach to the Cassandra / Hadoop integration if
folks are interested in that.
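
To illustrate the idea (with hypothetical names, not the actual Kiji
bridge API), the run-time selection could look roughly like this:

    import java.util.ServiceLoader;

    // Each Hadoop-version-specific JAR ships an implementation of this
    // interface plus a META-INF/services entry; we pick one at run time.
    public interface HadoopBridge {
      /** Returns true if this bridge matches the Hadoop version found on
       *  the classpath. */
      boolean isCompatible();
      // ... version-specific operations go here ...
    }

    public final class BridgeLoader {
      public static HadoopBridge load() {
        for (HadoopBridge bridge : ServiceLoader.load(HadoopBridge.class)) {
          if (bridge.isCompatible()) {
            return bridge;
          }
        }
        throw new IllegalStateException(
            "No compatible Hadoop bridge found on the classpath");
      }
    }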

Thanks again for your help!

Best regards,
Clint


On Mon, Mar 24, 2014 at 4:13 PM, Shao-Chuan Wang <
shaochuan.w...@bloomreach.com> wrote:

> Tyler mentioned that client.describe_ring(myKeyspace) can be replaced by
> a query of the system.peers table, which has the ring information.  The
> challenge here is describe_splits_ex, which needs to estimate the number
> of rows in each sub token range (as you mentioned).
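>
> For reference, fetching the ring information from the system tables with
> the Java driver looks roughly like this (an untested sketch; column
> names as in the 2.0-era system tables):
>
>     import com.datastax.driver.core.Cluster;
>     import com.datastax.driver.core.Row;
>     import com.datastax.driver.core.Session;
>     import java.net.InetAddress;
>     import java.util.Set;
>
>     // Sketch: read token ownership from the system tables instead of
>     // calling thrift's describe_ring.
>     Cluster cluster =
>         Cluster.builder().addContactPoint("127.0.0.1").build();
>     Session session = cluster.connect();
>
>     // Tokens owned by the node we are connected to.
>     Row local = session.execute("SELECT tokens FROM system.local").one();
>     Set<String> localTokens = local.getSet("tokens", String.class);
>
>     // Tokens owned by every other node in the ring.
>     for (Row peer :
>         session.execute("SELECT peer, tokens FROM system.peers")) {
>       InetAddress host = peer.getInet("peer");
>       Set<String> tokens = peer.getSet("tokens", String.class);
>       // ... accumulate the host -> tokens mapping here ...
>     }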
>
> From what I understand, and from my trial and error so far, I don't
> think the DataStax Java driver can do describe_splits_ex via a simple
> API call.  If you look at the implementation of
> CassandraServer.describe_splits_ex() and
> StorageService.instance.getSplits(), what they do is split a token range
> into several sub token ranges, with an estimated row count for each sub
> token range.  Inside the StorageService.instance.getSplits() call, the
> split count is also adjusted based on an estimated row count.
> StorageService.instance.getSplits() is only exported publicly via
> thrift, and it would be non-trivial to re-build the same logic on the
> client side.
>
> That said, it looks like we could implement the splits logic in
> AbstractColumnFamilyInputFormat.getSubSplits by querying
> system.schema_columnfamilies and using CFMetaData.fromSchema to
> construct a CFMetaData.  CFMetaData exposes the indexInterval, which can
> be used to estimate the row count; the next step is to mimic the logic
> in StorageService.instance.getSplits() to divide each token range into
> several sub token ranges, using the TokenFactory (obtained from the
> partitioner) to construct the sub token ranges in
> AbstractColumnFamilyInputFormat.getSubSplits.  Basically, this moves the
> splitting code from the server side to the client side; a rough sketch
> is below.
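>
> As a sketch of that client-side splitting (this assumes the
> Murmur3Partitioner, whose tokens are longs, and ignores wrap-around
> ranges for clarity; estimatedRows would come from the indexInterval
> estimate above):
>
>     import java.math.BigInteger;
>     import java.util.ArrayList;
>     import java.util.List;
>
>     public final class ClientSideSplitter {
>       /**
>        * Divides a (start, end] token range into sub-ranges of roughly
>        * splitSize rows each, entirely on the client side.
>        */
>       public static List<long[]> splitRange(
>           long start, long end, long estimatedRows, int splitSize) {
>         int numSplits = Math.max(1, (int) (estimatedRows / splitSize));
>         // BigInteger avoids overflow: the full Murmur3 token range
>         // spans nearly 2^64.
>         BigInteger step = BigInteger.valueOf(end)
>             .subtract(BigInteger.valueOf(start))
>             .divide(BigInteger.valueOf(numSplits));
>         List<long[]> subRanges = new ArrayList<long[]>();
>         long subStart = start;
>         for (int i = 0; i < numSplits; i++) {
>           long subEnd = (i == numSplits - 1)
>               ? end
>               : BigInteger.valueOf(subStart).add(step).longValue();
>           subRanges.add(new long[] {subStart, subEnd});
>           subStart = subEnd;
>         }
>         return subRanges;
>       }
>     }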
>
> Any thoughts?
>
> Shao-Chuan
>
>
> On Mon, Mar 24, 2014 at 11:54 AM, Clint Kelly <clint.ke...@gmail.com>
> wrote:
>
> > I just saw this question about thrift in the Hadoop / Cassandra
> > integration in the discussion on the user list about freezing thrift.
> > I have been working on a project to integrate Hadoop 2 and Cassandra 2
> > and have been trying to move all of the way over to the Java driver
> > and away from thrift.
> >
> > I have finished most of the driver.  It is still pretty rough, but I
> > have been using it for testing a prototype of the Kiji platform
> > (www.kiji.org) that uses Cassandra instead of HBase.
> >
> > One thing I have not been able to figure out is how to calculate input
> > splits without thrift.  I am currently doing the following:
> >
> >       map = client.describe_ring(myKeyspace);
> >
> > (where client is of type Cassandra.Client).
> >
> > This call returns a list of token ranges (max and min token values) for
> > different nodes in the cluster.  We then use this information, along with
> > another thrift call,
> >
> >     client.describe_splits_ex(cfName, range.start_token,
> >         range.end_token, splitSize);
> >
> > to estimate the number of rows in each token range, etc.
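> >
> > Putting the two calls together, the thrift-based flow looks roughly
> > like this (a simplified sketch, not the actual input-format code):
> >
> >     // Get the token ranges owned by the nodes in the cluster.
> >     List<TokenRange> ring = client.describe_ring(myKeyspace);
> >     for (TokenRange range : ring) {
> >       // Each CfSplit carries a sub token range plus an estimated
> >       // row count.
> >       List<CfSplit> subSplits = client.describe_splits_ex(
> >           cfName, range.getStart_token(), range.getEnd_token(),
> >           splitSize);
> >       for (CfSplit split : subSplits) {
> >         // Build one Hadoop InputSplit per sub token range here.
> >       }
> >     }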
> >
> > I have looked all over the Java driver documentation and pinged the
> > user list, but have not gotten any proposals that work for the Java
> > driver.  Does anyone here have any suggestions?
> >
> > Thanks!
> >
> > Best regards,
> > Clint
> >
> >
> > On Tue, Mar 11, 2014 at 12:41 PM, Shao-Chuan Wang <
> > shaochuan.w...@bloomreach.com> wrote:
> >
> > > Hi,
> > >
> > > I just received an email from Jonathan on the dev mailing list
> > > regarding the deprecation of thrift in 2.1.
> > >
> > > In fact, we migrated from the thrift client to the native one several
> > > months ago; however, in Cassandra.hadoop there are still a lot of
> > > dependencies on the thrift interface, for example describe_splits_ex
> > > in org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.
> > >
> > > Therefore, we had to keep both thrift and native in our server, but
> > > the CRUD queries mainly go through the native protocol.  However,
> > > Jonathan says "I don't know of any use cases for Thrift that can't
> > > be done in CQL."  This statement makes me wonder whether there is
> > > something I don't know about the native protocol yet.
> > >
> > > So, does anyone know how to do "describing the splits" and
> > > "describing the local rings" using the native protocol?
> > >
> > > Also, cqlsh uses a Python client, which talks via the thrift protocol
> > > too.  Does that mean it will be migrated to the native protocol soon
> > > as well?
> > >
> > > Comments, pointers, suggestions are much appreciated.
> > >
> > > Many thanks,
> > >
> > > Shao-Chuan
> > >
> >
>
