I have recorded my troubleshooting process for a "random lagging region in MR scan" issue, and I share it here in case you meet a similar problem and need to diagnose it.
Full text with images here: http://blog.csdn.net/colorant/article/details/8510254
Text-only version below:
=== Problem observation ===
When scanning certain tables, there are always a few lagging slow map tasks (usually costing 150%~200% of the average task run time). The top 10 slowest tasks are usually located on the same Region Server, and if you run the same scan job multiple times, the slowest tasks and their locations do not change.
Judging only by the above behavior, you might suspect that the lagging Region Server has some problem which slows down the whole system. But the truth is: if you run the scan job on a different table, the lagging Region Server is not the same one. Say, with table 1, region server A has a lot of lagging tasks, while for table 2 it might be region server B that lags behind.
Last but not least, all these tables worked fine a few days ago. The problem seems to have occurred (or at least been observed) after a few cluster restarts.
=== Environment ===
1 master node + 4 data/region nodes, each with 4 disks, 48GB RAM, and 16 CPU cores.
Hadoop 1.1.1, HBase 0.94.1, with 24/20 map/reduce slots on each node.
Each table is around 50GB, with 64~96 regions distributed evenly across the 4 Region Servers. The data is generated, so each region has exactly the same number of KeyValues and almost exactly the same size. All tables have had a Major Compact done.
A MapReduce job is used to do the whole-table scan. Each region is assigned to a local map task, and the map task just scans the local region and counts rows. Since the number of map slots is equal to or larger than the number of regions, all tasks can be assigned within one batch.
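The job described above behaves much like HBase's built-in RowCounter job; a comparable run (with a hypothetical table name "usertable", and assuming the HBase jars are on the Hadoop classpath) would look like:

```shell
# Launch a map-per-region scan job that counts rows; with enough
# map slots, all region tasks run in a single wave.
hadoop jar $HBASE_HOME/hbase-0.94.1.jar rowcounter usertable
```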
=== Troubleshooting ===
My troubleshooting procedure is recorded below (with some path-finding and misleading work, but also a bit of knowledge gained as a byproduct, at least for a newbie like me).
== Any bottleneck? ==
First of all, monitor the lagging Region Server to check whether there is any bottleneck while the scan job runs. Nothing abnormal shows up: CPU and disk IO are OK and do not reach their peak, except that the overall disk IO throughput is a little bit lower than on the other Region Servers.
== Data locality? ==
If a region's data does not actually reside on the local data node, that can also lead to a hot-spot region, since the data will need to be read from other nodes.
To make sure that all data is actually read from the local data node, I did a second Major Compact on the table, to eliminate the possibility that regions had been relocated or rebalanced since the last Major Compact. Then I inspected the network IO while running the MapReduce scan job.
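Assuming a hypothetical table name "usertable", the re-compaction and a block-locality check could look like this (commands from the HBase 0.94 / Hadoop 1.x era):

```shell
# Trigger a major compaction so each region's data is rewritten
# on (and thus localized to) the region server currently hosting it.
echo "major_compact 'usertable'" | hbase shell

# List block locations for the table's HDFS files; every block of a
# region's store files should have a replica on the hosting node.
hadoop fsck /hbase/usertable -files -blocks -locations
```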
Knowledge: a simple, fast way to inspect network IO together with other system resources is the "dstat" tool, e.g. "dstat -cdnm" monitors CPU, disk IO, network IO, and memory/cache/buffer all together.
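For reference, a typical invocation (standard dstat options, with an explicit refresh interval):

```shell
# -c CPU usage, -d disk IO, -n network IO, -m memory/buffer/cache;
# the trailing number is the refresh interval in seconds.
dstat -cdnm 5
```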
The observation shows no data locality issue: all data is read from the local data node, and there is no notable network IO. The lagging issue still exists after another Major Compact. But one change was observed: after each Major Compact, the top 10 slow regions seem to change randomly, with only a weak relationship to the previous result (i.e. they are probably still on the same region server before/after the major compact).
Thus, this issue is not related to data locality.
== Cluster configuration ==
Since this problem is random across tables, I also wondered whether any configuration change I had made in the past days could impact the cluster's stability, e.g. memory-related settings, or fine-tuned parameters of the MapReduce framework.
First of all, I looked into the GC behavior, since GC does bring a lot of randomness, and a lot of settings may influence GC behavior: Hadoop/HBase heap size, GC strategy, Eden area size, HBase block cache enable/disable, etc.
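As an illustration of where these knobs live (values are hypothetical, not this cluster's real settings), they map to JVM options in hbase-env.sh:

```shell
# hbase-env.sh (illustrative values only)
export HBASE_HEAPSIZE=8000   # region server heap size in MB
# CMS GC strategy, young-gen (Eden) size, and GC logging for comparison:
export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -Xmn256m -verbose:gc -XX:+PrintGCDetails"
```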
After tuning and comparing different settings of these parameters (including restoring them to settings that I know worked before this problem occurred), the lagging issue still exists. Some settings do behave better in the sense of GC time, but they don't solve the lagging region issue.
Knowledge: disabling the HBase block cache reduces GC time a lot for whole-table-scan-like jobs; for my 50G of data, it saves about 10s of GC time, as observed with the jvisualvm GC plugin. By default, TableInputFormat disables the block cache (reasonably so: since all the data is accessed only once, it doesn't need to be cached), but if you are writing a custom InputFormat, you need to disable it yourself (via Scan.setCacheBlocks(false)).
Then I tried to tune some parameters related to the concurrency capability of HDFS/MapReduce/HBase, e.g. data node xceiver/handler count, RegionServer handler count, map slot number, client scan cache size, etc. These settings are synced across all nodes, so they should not cause random issues; but after all, I did change them in the past days, so to make sure, I double-checked and compared different settings of these parameters. Not surprisingly, they don't help with this issue.
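For reference, the parameters in question map to the following property names in the Hadoop 1.x / HBase 0.94 configuration files (values shown are placeholders, not recommendations):

```xml
<!-- hdfs-site.xml : data node concurrency -->
<property><name>dfs.datanode.max.xcievers</name><value>4096</value></property>
<property><name>dfs.datanode.handler.count</name><value>10</value></property>

<!-- hbase-site.xml : region server handlers and client scan caching -->
<property><name>hbase.regionserver.handler.count</name><value>30</value></property>
<property><name>hbase.client.scanner.caching</name><value>1000</value></property>

<!-- mapred-site.xml : map slots per task tracker -->
<property><name>mapred.tasktracker.map.tasks.maximum</name><value>24</value></property>
```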
Knowledge: when there is no obvious bottleneck, fine-tuning these concurrency-related parameters doesn't have a significant impact on overall cluster performance.
Thus, I believe cluster configuration can be ruled out.
== Profile on slow region server ==
Since for a specific table the lagging region server is fixed and reproducible across jobs, I used JVisualVM to sample the lagging map tasks on that Region Server, and also sampled the region server process itself. From the sampling results, the map tasks don't show a noticeable difference from other tasks. (That is reasonable, since for a scan most of the work is done by the Region Server; the client just counts the results.) Meanwhile, the Region Server's hotspot method distribution and percentages also don't show a noticeable difference from the other Region Servers (well, a bit more idle time).