cannot use a map side join to merge the output of multiple map side joins
Jim Donofrio 2012-05-05, 15:50
I am trying to use a map side join to merge the output of multiple map side joins. This fails because of the code below in JobClient.writeOldSplits, which reorders the splits from largest to smallest. Why is that done? Is it so that the largest split, which will take the longest, gets processed first?
As a result, each map side join fails to name its part-* files with the same number as the incoming partition. Files named part-00000 that go into one first level map side join are written to part-00010, while another first level map side join sends its part-00000 files to part-00005. The second level map side join then does not get the input splits in partitioner order from each first level map side join output directory.
I can think of only two fixes: add a conf property that allows turning off the sorting below, or extend FileOutputCommitter to rename the outputs of the first level map side joins to merge_part-<original partition number>. Any other solutions?
    // sort the splits into order based on size, so that the biggest
    // go first
    Arrays.sort(splits, new Comparator<org.apache.hadoop.mapred.InputSplit>() {
      public int compare(org.apache.hadoop.mapred.InputSplit a,
                         org.apache.hadoop.mapred.InputSplit b) {
        try {
          long left = a.getLength();
          long right = b.getLength();
          if (left == right) {
            return 0;
          } else if (left < right) {
            return 1;
          } else {
            return -1;
          }
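To make the effect of that sort concrete, here is a minimal, self-contained sketch that does not use Hadoop at all. The `Split` class, the file names and the lengths are hypothetical stand-ins; only the largest-first ordering mirrors the comparator quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SplitOrderDemo {
    /** Hypothetical stand-in for an input split: a partition file name plus its length. */
    static final class Split {
        final String name;
        final long length;

        Split(String name, long length) {
            this.name = name;
            this.length = length;
        }
    }

    /** Returns the split names in the order produced by a largest-first sort. */
    static List<String> largestFirst(Split[] splits) {
        Split[] sorted = splits.clone();
        // Same ordering as the quoted comparator: bigger lengths sort first.
        Arrays.sort(sorted, new Comparator<Split>() {
            public int compare(Split a, Split b) {
                return Long.compare(b.length, a.length);
            }
        });
        List<String> names = new ArrayList<String>();
        for (Split s : sorted) {
            names.add(s.name);
        }
        return names;
    }

    public static void main(String[] args) {
        // Assumed sizes: partition 1 happens to be the largest.
        Split[] splits = {
            new Split("part-00000", 100L),
            new Split("part-00001", 300L),
            new Split("part-00002", 200L)
        };
        System.out.println(largestFirst(splits));
    }
}
```

With these assumed sizes the order becomes part-00001, part-00002, part-00000. If the output files are numbered in that order, as described above, the data from part-00001 ends up in the first output file, and a second job with different sizes can end up with a different mapping.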
-
Re: cannot use a map side join to merge the output of multiple map side joins
JunYong Li 2012-05-07, 06:54
Could you explain the problem more clearly?
-- Regards Junyong
-
Re: cannot use a map side join to merge the output of multiple map side joins
Robert Evans 2012-05-07, 15:19
I believe you are correct about the split processing. The splits are ordered by size so that the largest ones are processed first, which lets the smaller splits potentially fill in the gaps. As far as a fix is concerned, I think overriding the file name in the file output committer is a much more straightforward solution to the issue.
--Bobby Evans
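The "fill in the gaps" point can be illustrated with a toy model. This is only a sketch under loud assumptions: task time is taken to be proportional to split length, there is a fixed number of slots, and each task goes to whichever slot frees up first. It is not Hadoop's scheduler, and `makespan` is a hypothetical helper.

```java
import java.util.PriorityQueue;

public class LargestFirstDemo {
    /**
     * Hands out tasks in the given order to whichever slot becomes free first
     * and returns the time at which the last task finishes.
     */
    static long makespan(long[] durations, int slots) {
        // Each queue entry is the time at which one slot becomes free.
        PriorityQueue<Long> freeAt = new PriorityQueue<Long>();
        for (int i = 0; i < slots; i++) {
            freeAt.add(0L);
        }
        long finish = 0L;
        for (long d : durations) {
            long end = freeAt.poll() + d;
            freeAt.add(end);
            finish = Math.max(finish, end);
        }
        return finish;
    }

    public static void main(String[] args) {
        long[] largestFirst = {8, 3, 3, 2};
        long[] smallestFirst = {2, 3, 3, 8};
        System.out.println("largest first:  " + makespan(largestFirst, 2));
        System.out.println("smallest first: " + makespan(smallestFirst, 2));
    }
}
```

With two slots, the largest-first order finishes at time 8 because the three small tasks run alongside the big one, while the smallest-first order finishes at time 11 because the big task starts last.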
-
Re: cannot use a map side join to merge the output of multiple map side joins
Jim Donofrio 2012-05-08, 01:00
I ended up just using a multi named output from MultipleOutputs, with the dynamic part of the output name set to the partition number from one of the FileSplits inside the CompositeInputSplit.
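A rough sketch of the name handling this implies, using only the Java standard library. The class and both helpers are hypothetical, and the file name pattern and five-digit padding are assumptions based on the part-00000 style names mentioned earlier in the thread. In a real job the file name would presumably come from one of the FileSplits in the CompositeInputSplit (for example via its path), which is left out here because it needs the Hadoop libraries.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PartitionNames {
    // Assumed naming: part-00010, or part-m-00010 / part-r-00010.
    private static final Pattern PART = Pattern.compile("part-(?:[mr]-)?(\\d+)");

    /** Extracts the partition number from a part file name. */
    static int partitionOf(String fileName) {
        Matcher m = PART.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a part file: " + fileName);
        }
        return Integer.parseInt(m.group(1));
    }

    /** Formats a partition number as a zero-padded, digits-only name component. */
    static String multiName(int partition) {
        return String.format("%05d", partition);
    }

    public static void main(String[] args) {
        int partition = partitionOf("part-00010");
        System.out.println(partition + " -> " + multiName(partition));
    }
}
```

Keying the output name on the incoming partition number, rather than on the map task number, is what keeps the first level outputs aligned for the second level join regardless of how the splits were sorted.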