Re: InputFormat to regroup splits of underlying InputFormat to control number of map tasks
This sounds similar to MultiFileInputFormat.
It would be nice if you could take a look at it and see whether there is
something we can do to improve it or combine the two.
On 6/19/13 2:53 AM, "Nicolae Marasoiu" <[EMAIL PROTECTED]> wrote:
>When running map-reduce jobs with many splits, it would be nice from a
>performance perspective to have fewer splits while maintaining data
>locality, so that the fixed overhead of running a map task (JVM creation,
>map executor ramp-up such as loading a Spring context, etc.) matters less
>when frequently running jobs over little data with little processing.
>I created an AggregatingInputFormat that simply groups input splits
>into composite ones sharing the same location, and creates a record reader
>that iterates over the record readers created by the underlying InputFormat
>for the underlying raw splits.
>Currently we intend to use it for HBase sharding, but I would also like to
>implement an optimal algorithm that ensures both fair distribution and
>locality, which I can describe if you find it useful for sources with
>multiple locations, such as replicated Kafka or HDFS.
>Waiting for your feedback,
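
For reference, the grouping described in the quoted message could look roughly like the sketch below. It is only an illustration, not the AggregatingInputFormat code itself and not a Hadoop API: RawSplit, SplitGrouping, groupByHost and chainedReader are hypothetical names, and it assumes each raw split reports exactly one preferred host, whereas real splits can report several locations. A real implementation would wrap the underlying InputFormat's splits and record readers instead of plain lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical stand-in for an underlying input split:
// one preferred host plus the records it would yield.
final class RawSplit {
    final String host;
    final List<String> records;

    RawSplit(String host, List<String> records) {
        this.host = host;
        this.records = records;
    }
}

final class SplitGrouping {
    // Group raw splits by host, preserving first-seen host order.
    // Each map entry plays the role of one composite split (one map task).
    static Map<String, List<RawSplit>> groupByHost(List<RawSplit> splits) {
        Map<String, List<RawSplit>> groups = new LinkedHashMap<>();
        for (RawSplit s : splits) {
            groups.computeIfAbsent(s.host, h -> new ArrayList<>()).add(s);
        }
        return groups;
    }

    // Read the records of each underlying split in turn, the way a
    // composite record reader would chain the underlying readers.
    static Iterator<String> chainedReader(List<RawSplit> composite) {
        return new Iterator<String>() {
            private final Iterator<RawSplit> splitIt = composite.iterator();
            private Iterator<String> current = Collections.emptyIterator();

            @Override
            public boolean hasNext() {
                // Skip exhausted or empty splits until a record is found.
                while (!current.hasNext() && splitIt.hasNext()) {
                    current = splitIt.next().records.iterator();
                }
                return current.hasNext();
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return current.next();
            }
        };
    }

    public static void main(String[] args) {
        List<RawSplit> splits = Arrays.asList(
                new RawSplit("host-a", Arrays.asList("r1", "r2")),
                new RawSplit("host-b", Arrays.asList("r3")),
                new RawSplit("host-a", Arrays.asList("r4")));

        for (Map.Entry<String, List<RawSplit>> e : groupByHost(splits).entrySet()) {
            StringBuilder line = new StringBuilder(e.getKey()).append(":");
            Iterator<String> it = chainedReader(e.getValue());
            while (it.hasNext()) {
                line.append(' ').append(it.next());
            }
            System.out.println(line);
        }
    }
}
```

With the three sample splits in main, this yields two composite splits instead of three map tasks: host-a reads r1, r2, r4 and host-b reads r3. Note that grouping purely by host does nothing for fair distribution; that would need the balancing algorithm mentioned in the quoted message.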