Re: Problem when using MultipleOutputs with many files
On 09/02/2011 09:14 AM, Panagiotis Antonopoulos wrote:
>
> Hello guys,
>
> I am using hadoop-0.20.2-cdh3u0 and I use MultipleOutputs to divide the HFiles (which are the output of my MR job) so that each file can fit into one region of the table where I am going to bulk load them.
>
> Therefore I have one MultipleOutput per region, and as a result I have 280 different outputs.
> I just realized that using so many outputs makes my job a lot slower than it is when I have just one output.
>
> Do you know what goes wrong? Has anyone noticed the same?
>
> Thank you!
> Panagiotis
You're probably running into this bug, which crushes the performance of
MultipleOutputs:

https://issues.apache.org/jira/browse/MAPREDUCE-1853

Apparently it's fixed in v0.21, so try to upgrade if you can.

I wasn't able to upgrade, however (we were also using Cloudera CDH,
which as you can see is based on 0.20).  What I eventually wound up
doing to work around it was to keep our own local copy of the
MultipleOutputs class (I called it BugFixMultipleOutputs_0_20), which I
manually patched with the fix.
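
For anyone wanting to try the same workaround, here is a rough sketch of the
general shape of such a patch. It assumes the slowdown comes from
MultipleOutputs rebuilding a per-output context on every write instead of
reusing it, which is how I read the JIRA issue; check the attached patch for
the real change. OutputContext and ContextCache are hypothetical stand-ins
for illustration, not Hadoop classes, and this is not the actual patch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for an expensive per-output context (not a Hadoop class). */
final class OutputContext {
    final String namedOutput;

    OutputContext(String namedOutput) {
        this.namedOutput = namedOutput;
    }
}

/** Builds each named output's context once and reuses it on later writes. */
final class ContextCache {
    private final Map<String, OutputContext> contexts = new HashMap<String, OutputContext>();
    private int builds = 0;

    OutputContext getContext(String namedOutput) {
        OutputContext ctx = contexts.get(namedOutput);
        if (ctx == null) {
            // Assumed to be the costly step: only done on the first write per output.
            ctx = new OutputContext(namedOutput);
            builds++;
            contexts.put(namedOutput, ctx);
        }
        return ctx;
    }

    /** Number of contexts actually constructed so far. */
    int buildCount() {
        return builds;
    }
}
```

With 280 named outputs, a cache like this would construct at most 280
contexts over the life of the task, rather than one per record written.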

HTH,

DR