HBase, mail # user - Data locality in HBase


Re: Data locality in HBase
Lars George 2012-06-21, 10:07
Hi Ben,

According to your fsck dump, the first copy is located on hadoop-143, which has all the blocks for the region. So if you check, I would assume that the region is currently open and served by hadoop-143, right?
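
One way to check this is to tally, per datanode, how many blocks of the HFile have a replica there. The following is a minimal sketch, not an HBase or Hadoop tool: the function name `replica_counts` is made up for illustration, and the input is the block list from the fsck dump quoted further down, with the mail quote markers removed and the wrapped lines rejoined.

```python
import re
from collections import Counter

# Block lines from the fsck dump quoted below in this thread
# (quote markers stripped, wrapped lines rejoined).
FSCK_BLOCKS = """\
0. blk_1832396139416350298_1296638 len=67108864 repl=3 [hadoop-145:50010, hadoop-143:50010, hadoop-144:50010]
1. blk_8910330590545256327_1296640 len=67108864 repl=3 [hadoop-143:50010, hadoop-157:50010, hadoop-159:50010]
2. blk_-3868612696419011016_1296640 len=67108864 repl=3 [hadoop-145:50010, hadoop-143:50010, hadoop-156:50010]
3. blk_-7551946394410945015_1296640 len=67108864 repl=3 [hadoop-145:50010, hadoop-143:50010, hadoop-157:50010]
4. blk_-1875839158119319613_1296640 len=67108864 repl=3 [hadoop-145:50010, hadoop-143:50010, hadoop-146:50010]
5. blk_-6953623390282045248_1296640 len=67108864 repl=3 [hadoop-143:50010, hadoop-157:50010, hadoop-144:50010]
6. blk_-3016727256928339770_1296640 len=67108864 repl=3 [hadoop-143:50010, hadoop-146:50010, hadoop-159:50010]
7. blk_3526351456802007773_1296640 len=67108864 repl=3 [hadoop-143:50010, hadoop-160:50010, hadoop-156:50010]
8. blk_5134681308608742320_1296640 len=67108864 repl=3 [hadoop-145:50010, hadoop-143:50010, hadoop-144:50010]
9. blk_-6875541109589395450_1296640 len=67108864 repl=3 [hadoop-145:50010, hadoop-143:50010, hadoop-156:50010]
10. blk_-553661064097182668_1296640 len=56068019 repl=3 [hadoop-143:50010, hadoop-146:50010, hadoop-159:50010]
"""

# One match per block line; group 1 is the comma-separated replica list.
BLOCK_RE = re.compile(
    r"^\d+\.\s+blk_\S+\s+len=\d+\s+repl=\d+\s+\[([^\]]*)\]", re.MULTILINE
)


def replica_counts(fsck_text):
    """Return (block count, Counter mapping host -> blocks with a replica on it)."""
    counts = Counter()
    blocks = 0
    for match in BLOCK_RE.finditer(fsck_text):
        blocks += 1
        # Drop the ":50010" port and de-duplicate hosts within one block.
        hosts = {entry.strip().split(":")[0] for entry in match.group(1).split(",")}
        counts.update(hosts)
    return blocks, counts


total, counts = replica_counts(FSCK_BLOCKS)
# Hosts holding the most blocks come first.
for host, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{host}: {n}/{total} blocks")
```

For this dump, hadoop-143 holds a replica of all 11 blocks, while hadoop-145 holds 6 of them, so a task placed on hadoop-143 can read the whole file locally.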

The TableInputFormat getSplits() will report that server to the MapReduce framework, so the task will run on that node and read the data locally. *Only* if you have speculative execution turned on will a duplicate task run in parallel on another, random node, and that task would need to do heaps of remote reads. That is why it is recommended to turn speculative execution off when using HBase and MapReduce in combination.
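
As a rough illustration of turning speculative execution off per job, the sketch below builds the `-D` options to pass on the command line. The helper `no_speculation_args` and the `hadoop1`/`hadoop2` labels are made up for this example. The property names are the Hadoop 1.x names and their Hadoop 2 equivalents; which pair applies depends on the Hadoop version in use, which this thread does not state. The options only take effect if the job driver parses generic options (for example via ToolRunner).

```python
# Property names for disabling speculative execution (map and reduce side).
# The "hadoop1"/"hadoop2" keys are labels invented for this sketch.
SPECULATION_PROPERTIES = {
    "hadoop1": (
        "mapred.map.tasks.speculative.execution",
        "mapred.reduce.tasks.speculative.execution",
    ),
    "hadoop2": (
        "mapreduce.map.speculative",
        "mapreduce.reduce.speculative",
    ),
}


def no_speculation_args(version="hadoop1"):
    """Return the -D options that set both speculation properties to false."""
    return [f"-D{name}=false" for name in SPECULATION_PROPERTIES[version]]


print(" ".join(no_speculation_args("hadoop1")))
```

The same properties can instead be set cluster-wide in the MapReduce site configuration, or in the job's Configuration object before submission.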

Lars

On Jun 21, 2012, at 6:57 AM, Ben Kim wrote:

> Hi Lars,
> Thanks a lot for your reply.
>
> As you said, a regionserver writes its HFiles so that all data blocks are
> located on the same physical machine unless the regionserver fails.
> I ran the following hadoop command to see the block locations of an HFile:
>
> *hadoop fsck
> /hbase/testtable/9488ef7fbd23b62b9bf85b722c015e90/testcf/08dc1940944b4952b23f0cbee51bcea8
> -files -locations -blocks*
>
> Here is the output:
>
> FSCK started by hadoop from /203.235.211.142 for path
>> /hbase/testtable/9488ef7fbd23b62b9bf85b722c015e90/testcf/08dc1940944b4952b23f0cbee51bcea8
>> at Thu Jun 21 13:40:11 KST 2012
>> /hbase/testtable/9488ef7fbd23b62b9bf85b722c015e90/testcf/08dc1940944b4952b23f0cbee51bcea8
>> 727156659 bytes, 11 block(s):  OK
>> 0. blk_1832396139416350298_1296638 len=67108864 repl=3 [hadoop-145:50010,
>> hadoop-143:50010, hadoop-144:50010]
>> 1. blk_8910330590545256327_1296640 len=67108864 repl=3 [hadoop-143:50010,
>> hadoop-157:50010, hadoop-159:50010]
>> 2. blk_-3868612696419011016_1296640 len=67108864 repl=3 [hadoop-145:50010,
>> hadoop-143:50010, hadoop-156:50010]
>> 3. blk_-7551946394410945015_1296640 len=67108864 repl=3 [hadoop-145:50010,
>> hadoop-143:50010, hadoop-157:50010]
>> 4. blk_-1875839158119319613_1296640 len=67108864 repl=3 [hadoop-145:50010,
>> hadoop-143:50010, hadoop-146:50010]
>> 5. blk_-6953623390282045248_1296640 len=67108864 repl=3 [hadoop-143:50010,
>> hadoop-157:50010, hadoop-144:50010]
>> 6. blk_-3016727256928339770_1296640 len=67108864 repl=3 [hadoop-143:50010,
>> hadoop-146:50010, hadoop-159:50010]
>> 7. blk_3526351456802007773_1296640 len=67108864 repl=3 [hadoop-143:50010,
>> hadoop-160:50010, hadoop-156:50010]
>> 8. blk_5134681308608742320_1296640 len=67108864 repl=3 [hadoop-145:50010,
>> hadoop-143:50010, hadoop-144:50010]
>> 9. blk_-6875541109589395450_1296640 len=67108864 repl=3 [hadoop-145:50010,
>> hadoop-143:50010, hadoop-156:50010]
>> 10. blk_-553661064097182668_1296640 len=56068019 repl=3 [hadoop-143:50010,
>> hadoop-146:50010, hadoop-159:50010]
>>
>> Status: HEALTHY
>> Total size:    727156659 B
>> Total dirs:    0
>> Total files:   1
>> Total blocks (validated):      11 (avg. block size 66105150 B)
>> Minimally replicated blocks:   11 (100.0 %)
>> Over-replicated blocks:        0 (0.0 %)
>> Under-replicated blocks:       0 (0.0 %)
>> Mis-replicated blocks:         0 (0.0 %)
>> Default replication factor:    3
>> Average block replication:     3.0
>> Corrupt blocks:                0
>> Missing replicas:              0 (0.0 %)
>> Number of data-nodes:          9
>> Number of racks:               1
>> FSCK ended at Thu Jun 21 13:40:11 KST 2012 in 4 milliseconds
>>
>
> As you can see, the data blocks of the HFile are stored across two different
> datanodes (hadoop-145 and hadoop-143).
>
> Let's say a map task runs on hadoop-145 and needs to access block 7. Then
> the map task needs to read block 7 remotely from the hadoop-143 server.
> Almost half of the data blocks are stored and accessed remotely. Judging
> from the above example, it's hard to say that data locality is being
> applied in HBase.
>
> Ben
>
>
> On Fri, Jun 15, 2012 at 5:21 PM, Lars George <[EMAIL PROTECTED]> wrote:
>
>> Hi Ben,
>>
>> See inline...
>>
>> On Jun 15, 2012, at 6:56 AM, Ben Kim wrote:
>>
>>> Hi,
>>>
>>> I've been posting questions to the mailing list quite often lately, and