MapReduce, mail # user - Problem when using MultipleOutputs with many files


Re: Problem when using MultipleOutputs with many files
David Rosenstrauch 2011-09-02, 15:09
On 09/02/2011 09:14 AM, Panagiotis Antonopoulos wrote:
>
> Hello guys,
>
> I am using hadoop-0.20.2-cdh3u0 and I use MultipleOutputs to divide the HFiles (which are the output of my MR job) so that each file can fit into one region of the table where I am going to bulk load them.
>
> Therefore I have one MultipleOutput per region and, as a result, 280 different outputs.
> I just realized that using so many outputs makes my job a lot slower than when I have just one output.
>
> Do you know what goes wrong? Has anyone noticed the same?
>
> Thank you!
> Panagiotis
You're probably running into this bug, which crushes the performance of
MultipleOutputs:

https://issues.apache.org/jira/browse/MAPREDUCE-1853

Apparently it's fixed in v0.21, so try to upgrade if you can.

I wasn't able to upgrade for our code, however (we were also using
Cloudera CDH, which, as you can see, is based on 0.20).  The workaround
I eventually settled on was to use our own local copy of the
MultipleOutputs class (I called it BugFixMultipleOutputs_0_20), which I
manually patched with the fix.
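
For illustration only, here is a minimal sketch of the kind of change such a patch makes, assuming the problem in MAPREDUCE-1853 is that a new task context gets built on every write instead of being cached per named output. It does not use the Hadoop API: ContextCacheSketch, OutputContext, getContextUncached and getContextCached are hypothetical names standing in for the real class internals, and the actual patch attached to the JIRA is the authority on what needs to change.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextCacheSketch {

    /** Hypothetical stand-in for a per-output context that is costly to build. */
    static final class OutputContext {
        final String namedOutput;

        OutputContext(String namedOutput) {
            this.namedOutput = namedOutput;
        }
    }

    private final Map<String, OutputContext> cache = new HashMap<String, OutputContext>();
    private int contextsBuilt = 0;

    /** Assumed unpatched behaviour: a fresh context is built on every call. */
    OutputContext getContextUncached(String namedOutput) {
        contextsBuilt++;
        return new OutputContext(namedOutput);
    }

    /** Patched behaviour: build once per named output, then reuse. */
    OutputContext getContextCached(String namedOutput) {
        OutputContext ctx = cache.get(namedOutput);
        if (ctx == null) {
            contextsBuilt++;
            ctx = new OutputContext(namedOutput);
            cache.put(namedOutput, ctx);
        }
        return ctx;
    }

    int contextsBuilt() {
        return contextsBuilt;
    }

    public static void main(String[] args) {
        int outputs = 280;    // number of named outputs, as in the question
        int writes = 100000;  // arbitrary record count for the demo

        ContextCacheSketch uncached = new ContextCacheSketch();
        ContextCacheSketch cached = new ContextCacheSketch();
        for (int i = 0; i < writes; i++) {
            String name = "region" + (i % outputs);
            uncached.getContextUncached(name);
            cached.getContextCached(name);
        }
        System.out.println("uncached contexts built: " + uncached.contextsBuilt());
        System.out.println("cached contexts built:   " + cached.contextsBuilt());
    }
}
```

With the cached lookup, the number of contexts built is bounded by the number of named outputs (280 here) rather than by the number of records written, which is why the per-write overhead goes away.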

HTH,

DR