Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase >> mail # user >> Poor data locality of MR job

Copy link to this message
Poor data locality of MR job
I have an 8 node cluster and a table that is pretty well balanced with on average 36 regions/node. When I run a mapreduce job on the cluster against this table, the data locality of the mappers is poor, e.g 100 rack local mappers and only 188 data local mappers. I would expect nearly all of the mappers to be data local. DNS appears to be fine, i.e. the hostname in the splits is the same as the hostnames in the task attempts.

The performance of the rack local mappers is poor and causes overall scan performance to suffer.

The table isn't new and from what I understand, HDFS replication will eventually keep region data blocks local to the regionserver. Are there other reasons for data locality to be poor and any way to fix it?