Pedro Costa 2011-01-14, 11:09
Hi,
If a split has more than one location, does it mean that the split is replicated across all of those locations, or that the split is divided into several blocks, each of which is in one location?
Thanks, -- Pedro
Pedro Costa 2011-01-14, 11:40
I think the answer is that each location of the split corresponds to a replica.
-- Pedro
Harsh J 2011-01-14, 13:10
Yes, this is correct. But also, a logical MapReduce InputSplit is very different from a physical HDFS Block.
-- Harsh J www.harshj.com
Pedro Costa 2011-01-14, 13:23
What do you mean by that?
For example, if the location of an input split is /DataCenter1/Rack1/Node1, does this mean that it is the location of the namenode, and not the physical location of the data blocks?
-- Pedro
Harsh J 2011-01-14, 13:40
An InputSplit defines a Mapper's input and has characteristics similar to those of an HDFS block (offset, length, locations). However, an InputSplit is computed by an InputFormat class to suit the input's requirements (such as newline boundaries in text files, which HDFS does not take into account when it splits incoming data into blocks), and can thus span multiple blocks or be smaller than one block (for example, via minimum split size configurations).
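To make the difference between a split and a block concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It assumes the split-size rule that FileInputFormat's computeSplitSize is understood to apply, max(minSize, min(maxSize, blockSize)). The class and the splits() helper are hypothetical and only for illustration. The real InputFormat also handles record boundaries and a small slop factor for the last split, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not Hadoop code. Shows how the configured minimum and
// maximum split sizes can make a split larger or smaller than one block.
class SplitSizeSketch {

    // Assumed rule: splitSize = max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Hypothetical helper: cut a file into {offset, length} pairs.
    static List<long[]> splits(long fileLength, long splitSize) {
        List<long[]> out = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            long length = Math.min(splitSize, fileLength - offset);
            out.add(new long[] {offset, length});
        }
        return out;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;
        long fileLength = 200 * mb;

        // Defaults: one split per block.
        long perBlock = computeSplitSize(blockSize, 1, Long.MAX_VALUE);
        // Minimum above the block size: each split spans two blocks.
        long spanning = computeSplitSize(blockSize, 128 * mb, Long.MAX_VALUE);
        // Maximum below the block size: each split is half a block.
        long partial = computeSplitSize(blockSize, 1, 32 * mb);

        System.out.println(splits(fileLength, perBlock).size() + " splits of one block");
        System.out.println(splits(fileLength, spanning).size() + " splits spanning blocks");
        System.out.println(splits(fileLength, partial).size() + " splits smaller than a block");
    }
}
```

The offsets and lengths are purely logical: nothing in HDFS moves or gets re-blocked when the split size changes.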
-- Harsh J www.harshj.com
Owen O'Malley 2011-01-14, 17:33
On Fri, Jan 14, 2011 at 3:09 AM, Pedro Costa <[EMAIL PROTECTED]> wrote:
> If a split location contains more than one location, it means that
> this split file is replicated through all locations, or it means that
> a split is divided into several blocks, and each block is in one
> location?

It requests that the map runs on one of those machines or on the same rack as one of those machines. Currently there is no way to weight whether one machine in the list is "better" than another. If an input split covers multiple blocks, the InputFormat is best served by picking the top N machines that are close to a copy of most of the data, where N is roughly 3 to 5.
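The "top N machines" idea can be sketched as follows, in plain Java with no Hadoop dependency. This is one possible reading of the suggestion, not code from any InputFormat: the class, method, and host names are hypothetical. It counts only host-local bytes (rack locality is left out) and breaks ties by host name so the result is deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: rank hosts by how many bytes of a multi-block split
// they hold locally, and keep the best n as the split's locations.
class SplitHostsSketch {

    // blockHosts[i]   = hosts holding a replica of block i
    // bytesInSplit[i] = number of bytes of block i that fall inside the split
    static List<String> topHosts(String[][] blockHosts, long[] bytesInSplit, int n) {
        Map<String, Long> bytesPerHost = new HashMap<>();
        for (int i = 0; i < blockHosts.length; i++) {
            for (String host : blockHosts[i]) {
                bytesPerHost.merge(host, bytesInSplit[i], Long::sum);
            }
        }
        List<String> hosts = new ArrayList<>(bytesPerHost.keySet());
        // Most local bytes first; ties broken alphabetically.
        hosts.sort((a, b) -> {
            int byBytes = Long.compare(bytesPerHost.get(b), bytesPerHost.get(a));
            return byBytes != 0 ? byBytes : a.compareTo(b);
        });
        return new ArrayList<>(hosts.subList(0, Math.min(n, hosts.size())));
    }

    public static void main(String[] args) {
        // A split covering two full blocks and part of a third.
        String[][] blockHosts = {
            {"HostA", "HostB", "HostC"},
            {"HostB", "HostC", "HostD"},
            {"HostC", "HostD", "HostE"},
        };
        long[] bytesInSplit = {64, 64, 10};
        System.out.println(topHosts(blockHosts, bytesInSplit, 3));
    }
}
```

The returned list plays the role of the split's locations. It is only a hint to the scheduler about where the map would read most of its data locally.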
-- Owen
Pedro Costa 2012-03-07, 13:57
Hi,
In MapReduce, if the locations of a split are {HostA, HostB, HostC} and the corresponding map task runs on HostB, will the map task read the split from HostB?
Who is responsible for making the map task read the split from HostB: the JobTracker or the MapTask?
-- Best regards,