>> Why do you need to know this? Were you trying to do a percentage of rows per region?

Yes, that's exactly why I want to know this. Calculating a percentage is the best way of distributing keys evenly across regions, methinks. Everything else is just approximation.
>> Otherwise just have a member variable of your reducer class and increment it on each call to reduce().

Tried it, but it didn't work. Basically, we should be able to find out in a reducer how many rows were created by all the Mappers. I am a bit surprised the MR framework doesn't provide this.

>> I think you'll be better off finding a way to do it not using percentage if possible.

Yes, if everything else fails, approximation is the way to go.

>> Try calculating the size of the data instead perhaps.

Correct, but as I said in my previous email, I would rather not run another MR job just to calculate COUNT(*). There are over 8 billion rows, and sorting them is an expensive operation.

>> we tallied up the KeyValue.getLength() for each KeyValue in a row until the size reached a certain limit.

The keyword in this line is "certain". How do you come up with that "certain" number? By approximation, right? If everything else fails, that's what I will do.

Thanks for your help.

On Sun, May 13, 2012 at 9:35 AM, Bryan Beaudreault <[email protected]> wrote:

> Why do you need to know this? Were you trying to do a percentage of rows
> per region? Otherwise just have a member variable of your reducer class
> and increment it on each call to reduce(). I think you'll be better off
> finding a way to do it not using percentage if possible. Try calculating
> the size of the data instead perhaps. You should have that available since
> you are trying to bulkload anyway (which requires Put or KeyValue values,
> both of which you can get the size from).
>
> On Sun, May 13, 2012 at 2:11 AM, Something Something <
> [email protected]> wrote:
>
> > Is there no way to find out inside a single reducer how many records
> > were created by all the Mappers? I tried several ways but nothing
> > works. For example, I tried this:
> >
> > reporter.getCounter(Task.Counter.REDUCE_INPUT_RECORDS).getValue();
> >
> > It's not working for me. Should this have worked? Am I just doing
> > something dumb?
> > I would rather not create another MR job just to count the # of lines.
> >
> > On Sat, May 12, 2012 at 7:07 PM, Bryan Beaudreault <
> > [email protected]> wrote:
> >
> > > I did a very similar approach and it worked fine for me. Just spot
> > > check the regions after to make sure they look lexicographically
> > > sorted. I used ImmutableBytesWritable as my key, and the default
> > > hadoop sorting for that turned out to sort lexicographically as
> > > required. Our hbase rows varied in size, so instead of doing a count
> > > of the number of rows, we tallied up the KeyValue.getLength() for
> > > each KeyValue in a row until the size reached a certain limit.
> > >
> > > On Sat, May 12, 2012 at 7:21 PM, Something Something <
> > > [email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > This is really a MapReduce question, but the output from this will
> > > > be used to create regions for an HBase table. Here's what I want
> > > > to do:
> > > >
> > > > Take an input file that contains data about users.
> > > > Sort this file by a key (which consists of a few fields from the row).
> > > > After every x # of rows, write the key.
> > > >
> > > > Here's how I was going to structure my MapReduce:
> > > >
> > > > public Splitter {
> > > >
> > > >     static int counter;
> > > >
> > > >     private Mapper {
> > > >         map() {
> > > >             Build key by concatenating fields
> > > >             Write key
> > > >             increment counter;
> > > >         }
> > > >     }
> > > >
> > > >     // # of reducers will be set to 1. My understanding is that
> > > >     // this will send the lines to the reducer in sorted order one
> > > >     // at a time - is this a correct assumption?
> > > >     private Reducer {
> > > >         static long i;
> > > >         reduce() {
> > > >             static long splitSize = counter / 300;  // 300 is the # of regions
> > > >             if (i == 0 || i == splitSize) {
> > > >                 Write key;  // this will be used as a 'startkey'.
> > > >                 i = 0;
> > > >             }
> > > >             i++;
> > > >         }
> > > >     }
> > > > }
> > > >
> > > > To summarize, there are 2 questions:
> > > >
> > > > 1) I am passing the # of rows processed by the Mappers to the
> > > > Reducer via a static counter. Would this work? Is there a better
> > > > way?
> > > > 2) If I set the # of reducers to 1, would the lines be sent to the
> > > > reducer in sorted order one at a time?
> > > >
> > > > Thanks in advance for the help.
> > >
> >
>
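[Editor's note on question 1 in the thread above: a static counter will not carry the Mappers' count into the Reducer, because map and reduce tasks run in separate JVMs (often on separate machines); the reducer sees its own zero-initialized copy. The total has to arrive out-of-band, e.g. from a job counter read after a first pass. Once the total is known, the every-Nth-key logic from the pseudocode can be sketched in plain, Hadoop-free Java. CountBasedSplitter and countBasedSplits are illustrative names, not part of any Hadoop or HBase API.]

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the count-based split logic from the pseudocode above:
// walk the keys in sorted order and emit every (total / numRegions)-th key
// as a region start key. The first key is emitted too, mirroring the
// original 'i == 0' branch.
public class CountBasedSplitter {

    public static List<String> countBasedSplits(List<String> sortedKeys, int numRegions) {
        // Guard against splitSize == 0 when there are fewer keys than regions.
        long splitSize = Math.max(1, sortedKeys.size() / numRegions);
        List<String> splits = new ArrayList<>();
        long i = 0;
        for (String key : sortedKeys) {
            if (i % splitSize == 0) {
                splits.add(key); // start key of the next region
            }
            i++;
        }
        return splits;
    }
}
```

[With six keys "a".."f" and numRegions = 3, splitSize is 2 and the start keys come out as "a", "c", "e". Whether the very first key should be emitted depends on how the table is created; HBase treats the first region's start key as implicit.]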

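[Editor's note on the size-tallying idea Bryan describes above: since row sizes vary, accumulating bytes per row (the sum of KeyValue.getLength() over the row's KeyValues, in the real job) and cutting a region whenever a byte budget is reached gives more evenly sized regions than a row count. A hedged, Hadoop-free sketch of that loop, under the assumption that the rows arrive already sorted by key; SizeBasedSplitter, RowSize, and the 'limit' parameter are illustrative names, not part of any Hadoop or HBase API.]

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of size-based split selection: tally each row's byte size in
// sorted key order and emit the current key as a region start key whenever
// the running total reaches the byte budget, then start tallying afresh.
public class SizeBasedSplitter {

    public static final class RowSize {
        final String key;  // row key, already in sorted order
        final long bytes;  // total bytes of all KeyValues in this row
        public RowSize(String key, long bytes) { this.key = key; this.bytes = bytes; }
    }

    public static List<String> sizeBasedSplits(List<RowSize> sortedRows, long limit) {
        List<String> splits = new ArrayList<>();
        long tally = 0;
        for (RowSize row : sortedRows) {
            tally += row.bytes;
            if (tally >= limit) {
                splits.add(row.key); // next region starts at this key
                tally = 0;           // reset the tally for the next region
            }
        }
        return splits;
    }
}
```

[The "certain limit" is still a tuning knob, as the thread notes: a reasonable starting point is total data size divided by the desired region count, refined by inspecting the resulting regions.]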