I see what you mean about block locality, that is at the regionserver level, transparent to the MR job. This doesn't happen only to the final mappers, some of the early mappers are rack local. The table is reasonably well distributed across the nodes but not perfectly (that is a question I have in a different thread).
Performance of the rack local mappers vs data local is roughly 2x slower, so the performance hit is significant.
On Aug 2, 2012, at 11:37 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:
> On Wed, Aug 1, 2012 at 11:31 PM, Bryan Keller <[EMAIL PROTECTED]> wrote:
>> I have an 8 node cluster and a table that is pretty well balanced with on average 36 regions/node. When I run a mapreduce job on the cluster against this table, the data locality of the mappers is poor, e.g 100 rack local mappers and only 188 data local mappers. I would expect nearly all of the mappers to be data local. DNS appears to be fine, i.e. the hostname in the splits is the same as the hostnames in the task attempts.
> Thanks for looking at this already, it's the first thing that came in
> mind when looking at the title.
>> The table isn't new and from what I understand, HDFS replication will eventually keep region data blocks local to the regionserver. Are there other reasons for data locality to be poor and any way to fix it?
> Block locality doesn't play a role here, TableInputFormat publishes
> where the region is but then where the data belonging to that region
> is is another matter that's not taken into account. In any case, even
> if the region was on node A and all the data happened to be on node B,
> C, and D, your mapper would still only talk to A since that's where
> the region is.
> So only 1 region server serves a region meaning that there's only 1
> node where you can send the map to in order to have data locality. I'm
> ready to guess that the the maps that aren't local are launched
> towards the end of the job, because you might have slower maps and/or
> not a perfect balance of regions per region server.
> For example, let's say one node is full of data-local maps but it
> still has 2-3 more to process while other nodes have availability. The
> JT has a locality timeout for each map so that if one node is just too
> busy it will fall back to rack-local nodes instead. In this example
> those 2-3 maps might get sent elsewhere.
> There are ways to tune this depending on which scheduler you are
> using, but it will mostly involve waiting more for each task to make
> sure they can get to the right node.
> At your scale it sounds more to me like over-optimizations. How big of
> a hit are you taking?