Re: InputFormat to regroup splits of underlying InputFormat to control number of map tasks
This sounds similar to MultiFileInputFormat:

http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/MultiFileInputFormat.java?revision=1239482&view=markup

It would be nice if you could take a look at it and see if there is
something we can do here to improve it/combine the two.

--Bobby

On 6/19/13 2:53 AM, "Nicolae Marasoiu" <[EMAIL PROTECTED]> wrote:

>Hi,
>
>When running map-reduce jobs with many splits, it would be nice from a
>performance perspective to have fewer splits while maintaining data
>locality, so that the overhead of starting a map task (JVM creation, map
>executor ramp-up such as loading a Spring context, etc.) has less impact
>when frequently running jobs with little data and processing.
>
>I created an AggregatingInputFormat that simply groups input splits
>into composite ones with the same location and creates a record reader
>that iterates over the record readers created by the underlying
>InputFormat for the underlying raw splits.
>
>Currently we intend to use it for HBase sharding, but I would also like to
>implement an optimal algorithm that ensures both fair distribution and
>locality, which I can describe if you find it useful for multi-location
>sources such as replicated Kafka or HDFS.
>
>Thanks,
>waiting for your feedback,
>Nicu Marasoiu
>Adobe
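
For readers following the thread, the grouping idea in the quoted message can be sketched roughly as below. This is only an illustration under stated assumptions, not the AggregatingInputFormat from the message and not the Hadoop API: SimpleSplit, CompositeSplit, SplitGrouper and ChainedReader are hypothetical stand-ins, the grouping key is assumed to be the first listed host of each split, and records are plain objects served by java.util.Iterator instead of Hadoop record readers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a raw input split: a name, a size, and its hosts. */
final class SimpleSplit {
    final String name;
    final long length;
    final List<String> hosts;

    SimpleSplit(String name, long length, List<String> hosts) {
        this.name = name;
        this.length = length;
        this.hosts = hosts;
    }
}

/** Hypothetical composite split: the raw splits that share one host. */
final class CompositeSplit {
    final String host;
    final List<SimpleSplit> parts = new ArrayList<>();

    CompositeSplit(String host) {
        this.host = host;
    }

    /** Total size of all raw splits in this composite. */
    long length() {
        long total = 0;
        for (SimpleSplit s : parts) {
            total += s.length;
        }
        return total;
    }
}

final class SplitGrouper {
    /** Groups raw splits by their first listed host (assumption), one composite per host. */
    static List<CompositeSplit> groupByFirstHost(List<SimpleSplit> raw) {
        Map<String, CompositeSplit> byHost = new LinkedHashMap<>();
        for (SimpleSplit s : raw) {
            // Splits without a known host fall into a single "" group.
            String host = s.hosts.isEmpty() ? "" : s.hosts.get(0);
            byHost.computeIfAbsent(host, CompositeSplit::new).parts.add(s);
        }
        return new ArrayList<>(byHost.values());
    }
}

/** Iterates over several underlying readers in sequence, one after another. */
final class ChainedReader<T> implements Iterator<T> {
    private final Iterator<? extends Iterator<T>> readers;
    private Iterator<T> current = Collections.emptyIterator();

    ChainedReader(List<? extends Iterator<T>> readers) {
        this.readers = readers.iterator();
    }

    @Override
    public boolean hasNext() {
        // Skip exhausted (or empty) readers until one has a record left.
        while (!current.hasNext() && readers.hasNext()) {
            current = readers.next();
        }
        return current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}
```

With this sketch, one map task would be started per CompositeSplit instead of per raw split, and the ChainedReader would feed it the records of each underlying reader in turn. Grouping by first host alone ignores replicas and balance across hosts, which is the part the message says a better algorithm would address.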