Could you explain the problem more clearly?
2012/5/5 Jim Donofrio <[email protected]>
> I am trying to use a map-side join to merge the output of multiple
> map-side joins. This fails because of the code below, from
> JobClient.writeOldSplits, which reorders the splits from largest to
> smallest. Why is that done? Is it so that the largest split, which will
> take the longest, gets processed first?
>
> As a result, each map-side join fails to name its part-* files with the
> same number as the incoming partition. Files named part-00000 that go
> into one first-level map-side join are written to part-00010, while
> another first-level map-side join sends files named part-00000 to
> part-00005. The second-level map-side join then does not get its input
> splits in partitioner order from each first-level map-side join output
> directory.
>
> I can think of only two fixes: add a conf property that allows turning
> off the sorting below, OR extend FileOutputCommitter to rename the outputs
> of the first-level map-side join to merge_part-<original partition number>.
> Are there any other solutions?
>
> // sort the splits into order based on size, so that the biggest
> // go first
> Arrays.sort(splits, new Comparator<org.apache.hadoop.mapred.InputSplit>() {
>   public int compare(org.apache.hadoop.mapred.InputSplit a,
>                      org.apache.hadoop.mapred.InputSplit b) {
>     try {
>       long left = a.getLength();
>       long right = b.getLength();
>       if (left == right) {
>         return 0;
>       } else if (left < right) {
>         return 1;
>       } else {
>         return -1;
>       }
>
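
If I read the description correctly, the effect is something like the sketch below. This is plain Java, not Hadoop code: SplitOrderDemo, partitionsInTaskOrder and partName are hypothetical names, the split lengths are made-up numbers, and it assumes that the map task index (and hence the part-NNNNN number) follows the position of the split after the size sort.

```java
import java.util.Arrays;
import java.util.Comparator;

class SplitOrderDemo {

    // Returns the original partition numbers in the order they end up
    // after sorting the splits by length, largest first.
    // Assumption: position i in the result is the map task index.
    static int[] partitionsInTaskOrder(final long[] splitLengths) {
        Integer[] idx = new Integer[splitLengths.length];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = i;
        }
        // Same ordering as the quoted comparator: bigger splits go first.
        Arrays.sort(idx, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return Long.compare(splitLengths[b], splitLengths[a]);
            }
        });
        int[] order = new int[idx.length];
        for (int i = 0; i < idx.length; i++) {
            order[i] = idx[i];
        }
        return order;
    }

    // Formats an index the way part files are usually named.
    static String partName(int index) {
        return String.format("part-%05d", index);
    }

    public static void main(String[] args) {
        // Hypothetical lengths of incoming part-00000, part-00001, part-00002.
        long[] lengths = {100L, 300L, 200L};
        int[] order = partitionsInTaskOrder(lengths);
        for (int task = 0; task < order.length; task++) {
            System.out.println("input " + partName(order[task])
                    + " -> output " + partName(task));
        }
    }
}
```

With these lengths the smallest input, part-00000, is sorted last and would be written by the last task, so its output number no longer matches its partition number. Is that the mismatch you are seeing?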
--
Regards
Junyong