Re: ColumnFamilyRecordWriter

2011-03-01 Thread Mayank Mishra
Thanks Jeremy. It makes sense to abstract out CFOF and CFRW (right now they are tightly bound to Avro), so that one can plug in a custom serializer (Avro, Thrift and, going forward, I guess maybe CQL). I will create a JIRA and submit a patch with the needed changes. Surely, I will ping you if I

Re: ColumnFamilyRecordWriter

2011-02-28 Thread Jeremy Hanna
One thing that could be done is the CFRW could be abstracted more so that it's easier to extend and only the serialization mechanism is required to extend it. That is, all of the core functionality relating to Cassandra would be in an abstract class or something like that. Then the avro based
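A minimal sketch of the kind of split described above, under stated assumptions: the class names (RecordWriterSketch, StringRecordWriterSketch, RecordWriterDemo) and the serialize() hook are hypothetical and are not the Cassandra API. The shared logic sits in an abstract base class, and a subclass supplies only the serialization step. A plain String stands in for an Avro or Thrift mutation.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class: the shared (format-independent) logic lives here.
abstract class RecordWriterSketch<M> {
    private final List<byte[]> queued = new ArrayList<>();

    // The only extension point: turn a format-specific mutation into bytes.
    protected abstract byte[] serialize(M mutation);

    // Shared behaviour: serialize, then queue the result for sending.
    public final void write(M mutation) {
        queued.add(serialize(mutation));
    }

    public final int queuedCount() {
        return queued.size();
    }
}

// Hypothetical subclass; a String stands in for an Avro or Thrift mutation.
class StringRecordWriterSketch extends RecordWriterSketch<String> {
    @Override
    protected byte[] serialize(String mutation) {
        return mutation.getBytes(StandardCharsets.UTF_8);
    }
}

public class RecordWriterDemo {
    public static void main(String[] args) {
        RecordWriterSketch<String> writer = new StringRecordWriterSketch();
        writer.write("delete:url-a");
        System.out.println(writer.queuedCount());
    }
}
```

With this shape, supporting another format would mean adding one more subclass that overrides serialize(), while the queueing code stays untouched.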

Re: ColumnFamilyRecordWriter

2011-02-28 Thread Jeremy Hanna
There certainly could be a thrift based record writer. However, (if I remember correctly) to enable Hadoop output streaming, it was easier to go with Avro for doing the records as the schema is included. There could also have been a thrift version of the record writer, but it's simpler to just

ColumnFamilyRecordWriter

2011-02-28 Thread Mayank Mishra
Hi all, As I was integrating Hadoop with Cassandra, I wanted to serialize mutations, hence I used Thrift mutations in M/R jobs. Along the way, I learned that CFRW accepts only Avro mutations. Can someone please explain to me why only the Avro transport is supported by CFRW? Why not, bo

Re: [mapreduce] ColumnFamilyRecordWriter hidden reuse

2011-01-26 Thread Jonathan Ellis
On Tue, Jan 25, 2011 at 12:09 PM, Mick Semb Wever wrote: > Well your key is a mutable Text object, so i can see some possibility > depending on how hadoop uses these objects. Yes, that's it exactly. We recently fixed a bug in the demo word_count program for this. Now we do ByteBuffer.wrap(Arrays
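The hazard discussed in this thread can be reproduced without Hadoop or Cassandra. The sketch below is an illustration only: a plain byte array stands in for the backing storage of a reused key object, and the names (KeyReuseSketch, wrapShared, wrapCopy) are hypothetical. The fix quoted in the message above is truncated in this listing, so the copy shown here (Arrays.copyOf before wrapping) is an assumption about the general technique, not necessarily the exact call used in word_count.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KeyReuseSketch {
    // Wraps the caller's array directly: the buffer shares storage with it.
    static ByteBuffer wrapShared(byte[] backing, int length) {
        return ByteBuffer.wrap(backing, 0, length);
    }

    // Copies the valid bytes first, so later writes to `backing` are not visible.
    static ByteBuffer wrapCopy(byte[] backing, int length) {
        return ByteBuffer.wrap(Arrays.copyOf(backing, length));
    }

    // Decodes without disturbing the buffer's own position.
    static String decode(ByteBuffer buf) {
        return StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
    }

    public static void main(String[] args) {
        byte[] reused = new byte[16]; // stands in for a reused key's backing array
        List<ByteBuffer> sharedQueue = new ArrayList<>();
        List<ByteBuffer> copiedQueue = new ArrayList<>();

        for (String key : new String[] {"url-a", "url-b"}) {
            byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
            // The framework overwrites the same array for every new key.
            System.arraycopy(bytes, 0, reused, 0, bytes.length);
            sharedQueue.add(wrapShared(reused, bytes.length));
            copiedQueue.add(wrapCopy(reused, bytes.length));
        }

        // The first queued entry was written for "url-a".
        System.out.println(decode(sharedQueue.get(0))); // prints "url-b"
        System.out.println(decode(copiedQueue.get(0))); // prints "url-a"
    }
}
```

The shared buffer silently changes after it has been queued, which matches the wrong-key symptom reported in the thread. The copy costs one allocation per write.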

Re: [mapreduce] ColumnFamilyRecordWriter hidden reuse

2011-01-26 Thread Mck
On Wed, 2011-01-26 at 12:13 +0100, Patrik Modesto wrote: > BTW how to get current time in microseconds in Java? I'm using HFactory.clock() (from hector). > > As far as moving the clone(..) into ColumnFamilyRecordWriter.write(..) > > won't this hurt performance? > > The size of the queue is comp
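On the microseconds question, two standard-library options can be sketched as follows. The class and method names (MicrosClock, microsFromMillis, microsSinceEpoch) are hypothetical. The first option only changes the unit and keeps millisecond resolution. The second uses java.time, which requires Java 8 or later and so was not available when this thread was written.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class MicrosClock {
    // Millisecond clock value scaled to microsecond units (resolution stays at 1 ms).
    static long microsFromMillis(long millis) {
        return millis * 1000L;
    }

    // Whole microseconds between the Unix epoch and the given instant.
    static long microsSinceEpoch(Instant now) {
        return ChronoUnit.MICROS.between(Instant.EPOCH, now);
    }

    public static void main(String[] args) {
        System.out.println(microsFromMillis(System.currentTimeMillis()));
        System.out.println(microsSinceEpoch(Instant.now()));
    }
}
```

Whether the second form actually yields sub-millisecond precision depends on the JVM and the platform clock.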

Re: [mapreduce] ColumnFamilyRecordWriter hidden reuse

2011-01-26 Thread Patrik Modesto
On Wed, Jan 26, 2011 at 08:58, Mck wrote: >> You are correct that microseconds would be better but for the test it >> doesn't matter that much. > > Have you tried? I'm very new to cassandra as well, and always uncertain > as to what to expect... IMHO it's a matter of use case. In my use case there

Re: [mapreduce] ColumnFamilyRecordWriter hidden reuse

2011-01-25 Thread Mck
> > is "d.timestamp = System.currentTimeMillis();" ok? > > You are correct that microseconds would be better but for the test it > doesn't matter that much. Have you tried. I'm very new to cassandra as well, and always uncertain as to what to expect... > ByteBuffer bbKey = ByteBufferUtil.clo

Re: [mapreduce] ColumnFamilyRecordWriter hidden reuse

2011-01-25 Thread Patrik Modesto
On Tue, Jan 25, 2011 at 19:09, Mick Semb Wever wrote: > In fact i have another problem (trying to write an empty byte[], or > something, as a key, which put one whole row out of whack, ((one row in > 25 million...))). > > But i'm debugging along the same code. > > I don't quite understand how the

Re: [mapreduce] ColumnFamilyRecordWriter hidden reuse

2011-01-25 Thread Mick Semb Wever
On Tue, 2011-01-25 at 14:16 +0100, Patrik Modesto wrote: > The attached file contains the working version with cloned key in > reduce() method. My other approach was: > > > context.write(ByteBuffer.wrap(key.getBytes(), 0, key.getLength()), > > Collections.singletonList(getMutation(key))); > > Wh

Re: [mapreduce] ColumnFamilyRecordWriter hidden reuse

2011-01-25 Thread Patrik Modesto
Hi Mick, attached is the very simple MR job that deletes expired URLs from my test Cassandra DB. The keyspace looks like this: Keyspace: Test: Replication Strategy: org.apache.cassandra.locator.SimpleStrategy Replication Factor: 2 Column Families: ColumnFamily: Url2 Columns sort

Re: [mapreduce] ColumnFamilyRecordWriter hidden reuse

2011-01-25 Thread Mick Semb Wever
On Tue, 2011-01-25 at 09:37 +0100, Patrik Modesto wrote: > While developing a really simple MR task, I've found that a > combination of Hadoop optimization and Cassandra > ColumnFamilyRecordWriter queue creates wrong keys to send to > batch_mutate(). I've seen similar beha

[mapreduce] ColumnFamilyRecordWriter hidden reuse

2011-01-25 Thread Patrik Modesto
Hi, I am playing with Cassandra 0.7.0 and Hadoop, developing simple MapReduce tasks. While developing a really simple MR task, I've found that a combination of Hadoop optimization and Cassandra ColumnFamilyRecordWriter queue creates wrong keys to send to batch_mutate(). The problem is in the reduce