Re: Load Balancing Mapper Tasks

2010-05-17 Thread Jonathan Ellis
That means they are blocked waiting for something to be added to the task queue. On Mon, May 17, 2010 at 9:42 AM, Joost Ouwerkerk wrote: > At any given moment at least half of those threads are in the following > state; what does it represent? > Name: ROW-READ-STAGE:6 > State: WAITING on > java.util.conc
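
To illustrate the reply above: a worker thread that calls take() on an empty java.util.concurrent.LinkedBlockingQueue parks on the queue's internal condition and reports the WAITING state, which is what an idle pool thread looks like in a thread dump. The sketch below is a standalone demonstration, not Cassandra's stage executor; the class name IdleWorkerDemo and the thread name DEMO-STAGE:1 are made up for illustration.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class IdleWorkerDemo {
    /**
     * Starts a worker that takes tasks from an empty queue and returns the
     * thread state observed once the worker has parked (or after a timeout).
     */
    static Thread.State observeIdleWorkerState() {
        final BlockingQueue<Runnable> tasks = new LinkedBlockingQueue<Runnable>();
        Thread worker = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        // Blocks here while the queue is empty.
                        tasks.take().run();
                    }
                } catch (InterruptedException e) {
                    // Interrupted by the caller: let the thread exit.
                }
            }
        }, "DEMO-STAGE:1"); // hypothetical name, for illustration only
        worker.setDaemon(true);
        worker.start();

        // Poll for up to five seconds until the worker has parked.
        Thread.State state = worker.getState();
        long deadline = System.nanoTime() + 5000000000L;
        try {
            while (state != Thread.State.WAITING && System.nanoTime() < deadline) {
                Thread.sleep(10);
                state = worker.getState();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        worker.interrupt();
        return state;
    }

    public static void main(String[] args) {
        System.out.println("Idle worker state: " + observeIdleWorkerState());
    }
}
```

A thread in this state is idle rather than stuck: it consumes no CPU and resumes as soon as a task is offered to the queue.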

Re: Load Balancing Mapper Tasks

2010-05-17 Thread Joost Ouwerkerk
At any given moment at least half of those threads are in the following state; what does it represent? Name: ROW-READ-STAGE:6 State: WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@fea6030 Total blocked: 44 Total waited: 479 Stack trace: sun.misc.Unsafe.park(Nati

Re: Load Balancing Mapper Tasks

2010-05-16 Thread Jonathan Ellis
On Sun, May 16, 2010 at 2:52 PM, Joost Ouwerkerk wrote: > Meanwhile, I'm still getting TimedOutException errors when mapping this > 30-million row table, even when retrieving no data at all. It looks like it > is related to disk activity on "hot" nodes (when the same cassandra node has > to handl

Re: Load Balancing Mapper Tasks

2010-05-16 Thread Joost Ouwerkerk
Hadoop doesn't make any assumptions about how input source data is distributed. It can't 'know' that the data for the first 30 splits emitted by the InputFormat are all stored on the same cassandra node. The new case with the patch is CASSANDRA-1096. Meanwhile, I'm still getting TimedOutException

Re: Load Balancing Mapper Tasks

2010-05-15 Thread Jonathan Ellis
Oh, very interesting. I assumed Hadoop would be smart enough to load-balance the jobs it sends out. Guess not. Can you submit a patch? On Wed, May 12, 2010 at 12:32 PM, Joost Ouwerkerk wrote: > I've been trying to improve the time it takes to map 30 million rows using a > hadoop / cassandra cl

Load Balancing Mapper Tasks

2010-05-12 Thread Joost Ouwerkerk
I've been trying to improve the time it takes to map 30 million rows using a hadoop / cassandra cluster with 30 nodes. I discovered that since CassandraInputFormat returns an ordered list of splits, when there are many splits (e.g. hundreds or more) the load on cassandra is horribly unbalanced. e
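
The imbalance described above can be sketched in a few lines. The snippet is cut off before any fix is described, so the following is only one plausible mitigation and an assumption on my part, not necessarily what the author's patch does: randomize the order of the split list before handing it to Hadoop, so that consecutive map tasks are unlikely to hit the same node. Split, orderedSplits, shuffled, and distinctNodesInPrefix are hypothetical names used for illustration and are not part of the Hadoop or Cassandra APIs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class SplitShuffleSketch {
    /** Hypothetical stand-in for an input split: records only which node holds it. */
    static final class Split {
        final String node;
        final int index;

        Split(String node, int index) {
            this.node = node;
            this.index = index;
        }
    }

    /** Builds splits in node order, mimicking an ordered list of splits. */
    static List<Split> orderedSplits(int nodes, int splitsPerNode) {
        List<Split> splits = new ArrayList<Split>();
        for (int n = 0; n < nodes; n++) {
            for (int s = 0; s < splitsPerNode; s++) {
                splits.add(new Split("node-" + n, s));
            }
        }
        return splits;
    }

    /** Returns a shuffled copy; the input list is left untouched. */
    static List<Split> shuffled(List<Split> ordered, long seed) {
        List<Split> copy = new ArrayList<Split>(ordered);
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }

    /** Counts the distinct nodes touched by the first n splits. */
    static int distinctNodesInPrefix(List<Split> splits, int n) {
        Set<String> nodes = new HashSet<String>();
        for (Split split : splits.subList(0, Math.min(n, splits.size()))) {
            nodes.add(split.node);
        }
        return nodes.size();
    }

    public static void main(String[] args) {
        List<Split> ordered = orderedSplits(30, 10);
        List<Split> mixed = shuffled(ordered, 42L);
        System.out.println("Nodes hit by first 30 ordered splits:  "
                + distinctNodesInPrefix(ordered, 30));
        System.out.println("Nodes hit by first 30 shuffled splits: "
                + distinctNodesInPrefix(mixed, 30));
    }
}
```

With 30 nodes and 10 splits per node, the first 30 ordered splits all land on just three nodes, while a shuffled list spreads the same number of splits across many more. Shuffling only evens out the load statistically; it does not account for data locality or for nodes that are already busy.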