You can use CombineFileInputFormat to achieve this with good locality
hints. It tries to pack together block splits that belong to one node
as much as possible.
On Tue, Mar 19, 2013 at 7:43 PM, jin keyao <[EMAIL PROTECTED]> wrote:
> hi Jens,
> thanks for your quick reply.
> actually I am afraid I didn't describe my question very well. Usually for
> example, we make 1 Block as an input, such as we are benchmarking Terasort,
> which is good for scheduling the map on the node contain the data.
> But in a typical scenario I am working on, I want to make the sequential
> blocks as an input, if I take the original example, what I want to do is to
> group 100 blocks into 10 inputs, each input contain 10 sequential blocks to
> be handled by map tasks
> each 10 blocks may distribute on several nodes, based on the block placement
> policy in Hadoop. so in this case, will the map task be scheduled on the
> node which have a nearest way to access those 10 blocks?
> or will hadoop only make sure the first block it's going to access is a
> local one?
> 2013/3/19 Jens Scheidtmann <[EMAIL PROTECTED]>
>> Dear Jin,
>> you wrote:
>> > my question is : will the map task created on the node which access his
>> > 10 blocks most fastest ?
>> hadoop tries hard to run the map tasks on the node, where the data is
>> stored. "Hadoop: The Definitive Guide" has some UML Sequence diagrams on
>> what happens for creation of map jvms. Sorry, I was not able to relocate
>> them on the web, yet (well, safaribooksonline.com ;-).
>> Depending on the specific data layout (e.g. record lengths), the map tasks
>> may need to read other blocks anyway, which may be off-node.
>> On how many nodes is your 100 blocks file stored? on 10?
>> If it is on one node, then you're likely running into map slot limits or
>> container limits.
>> Best regards,