HBase >> mail # user >> Poor data locality of MR job


Poor data locality of MR job
I have an 8-node cluster and a table that is fairly well balanced, averaging 36 regions per node. When I run a MapReduce job against this table, the data locality of the mappers is poor: e.g. only 188 data-local mappers versus 100 rack-local mappers. I would expect nearly all of the mappers to be data-local. DNS appears to be fine, i.e. the hostnames in the splits match the hostnames in the task attempts.
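For reference, a minimal sketch of how the data-local fraction above works out from the two mapper counts (the function name is mine, not a Hadoop API; the counts come from the job described here):

```python
def data_local_fraction(data_local, rack_local):
    """Fraction of map tasks that ran on a node holding their split's data.

    In Hadoop job counters these correspond to DATA_LOCAL_MAPS and
    RACK_LOCAL_MAPS.
    """
    total = data_local + rack_local
    return data_local / total if total else 0.0

# Counts from the job described above: 188 data-local, 100 rack-local.
print(f"{data_local_fraction(188, 100):.0%}")  # roughly 65% data-local
```

So about a third of the mappers are reading their region's blocks over the network, which matches the scan slowdown described below.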

The performance of the rack local mappers is poor and causes overall scan performance to suffer.

The table isn't new, and from what I understand, HDFS replication (and compactions rewriting HFiles) should eventually keep region data blocks local to the regionserver. What other reasons could there be for poor data locality, and is there any way to fix it?
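The locality being asked about can be made concrete with a small sketch of the metric HBase itself tracks (conceptually similar to its HDFSBlocksDistribution, though this is an illustrative reimplementation, not the actual HBase API): for each region, the fraction of its HFile bytes whose HDFS replicas include the host serving the region.

```python
def locality_index(serving_host, block_hosts, block_sizes):
    """Fraction of a region's HFile bytes with a replica on the host
    serving the region. 1.0 means fully data-local; 0.0 means every
    block must be read over the network.

    block_hosts:  block id -> set of hosts holding a replica
    block_sizes:  block id -> size in bytes
    """
    total = sum(block_sizes.values())
    local = sum(size for blk, size in block_sizes.items()
                if serving_host in block_hosts.get(blk, ()))
    return local / total if total else 0.0

# Toy example: two 128 MB blocks, only one replicated to the serving host.
hosts = {"b1": {"hostA", "hostB"}, "b2": {"hostC", "hostD"}}
sizes = {"b1": 128 * 2**20, "b2": 128 * 2**20}
print(locality_index("hostA", hosts, sizes))  # 0.5
```

A region ends up in the 0.0 case whenever it is reassigned to a regionserver that holds none of its block replicas (e.g. after a restart or a balancer move), and only a major compaction rewriting the HFiles brings the index back toward 1.0.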