MapReduce, mail # dev - InputFormat to regroup splits of underlying InputFormat to control number of map tasks


Re: InputFormat to regroup splits of underlying InputFormat to control number of map tasks
Robert Evans 2013-06-19, 15:01
This sounds similar to MultiFileInputFormat

http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/MultiFileInputFormat.java?revision=1239482&view=markup

It would be nice if you could take a look at it and see if there is
something we can do here to improve it/combine the two.

--Bobby

On 6/19/13 2:53 AM, "Nicolae Marasoiu" <[EMAIL PROTECTED]> wrote:

>Hi,
>
>When running map-reduce jobs with many splits, it would help performance
>to have fewer splits while maintaining data locality, so that the overhead
>of starting a map task (JVM creation, map executor ramp-up such as loading
>a Spring context, etc.) has less impact when frequently running jobs with
>little data and processing.
>
>I created such an AggregatingInputFormat that simply groups input splits
>into composite ones with the same location and creates a record reader that
>iterates over the record readers created by the underlying InputFormat for
>the underlying raw splits.
>
>Currently we intend to use it for HBase sharding, but I would also like to
>implement an optimal algorithm to ensure both fair distribution and
>locality, which I can describe if you find it useful to apply to
>multi-location sources such as replicated Kafka or HDFS.
>
>Thanks,
>waiting for your feedback,
>Nicu Marasoiu
>Adobe
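
The grouping step described in the quoted message (raw splits with the same location merged into composite splits) can be sketched roughly as follows. This is only an illustration, not the AggregatingInputFormat from the thread and not Hadoop's API: SplitGrouper, RawSplit, CompositeSplit and the maxPerGroup cap are hypothetical names, and each raw split is assumed to have a single preferred host, whereas real splits may report several.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for Hadoop's InputSplit; not the Hadoop API.
class SplitGrouper {

    /** A raw split as produced by the underlying input format (assumed shape). */
    static final class RawSplit {
        final String id;
        final String host;  // assumption: one preferred location per split
        final long length;

        RawSplit(String id, String host, long length) {
            this.id = id;
            this.host = host;
            this.length = length;
        }
    }

    /** A composite split: several raw splits that share one location. */
    static final class CompositeSplit {
        final String host;
        final List<RawSplit> parts = new ArrayList<>();

        CompositeSplit(String host) {
            this.host = host;
        }

        /** Total length is the sum of the wrapped raw splits. */
        long length() {
            long total = 0;
            for (RawSplit part : parts) {
                total += part.length;
            }
            return total;
        }
    }

    /**
     * Groups raw splits by host, starting a new composite split for a host
     * once the current one already holds maxPerGroup raw splits.
     * Composite splits are returned in order of creation.
     */
    static List<CompositeSplit> groupByHost(List<RawSplit> raw, int maxPerGroup) {
        if (maxPerGroup < 1) {
            throw new IllegalArgumentException("maxPerGroup must be >= 1");
        }
        Map<String, CompositeSplit> openByHost = new HashMap<>();
        List<CompositeSplit> result = new ArrayList<>();
        for (RawSplit split : raw) {
            CompositeSplit current = openByHost.get(split.host);
            if (current == null || current.parts.size() >= maxPerGroup) {
                current = new CompositeSplit(split.host);
                openByHost.put(split.host, current);
                result.add(current);
            }
            current.parts.add(split);
        }
        return result;
    }
}
```

A record reader for a composite split would then walk over parts in order, opening the underlying reader for each raw split in turn. The cap on group size is one simple way to keep composite splits from growing without bound; the fair-distribution algorithm mentioned in the message is not specified in the thread and is not attempted here.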