HBase, mail # user - Data locality in HBase


Ben Kim 2012-06-15, 04:56
Re: Data locality in HBase
Lars George 2012-06-15, 08:21
Hi Ben,

See inline...

On Jun 15, 2012, at 6:56 AM, Ben Kim wrote:

> Hi,
>
> I've been posting questions to the mailing list quite often lately, and
> here goes another one, about data locality.
> I read the excellent blog post about data locality that Lars George wrote
> at http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html
>
> I understand data locality in HBase as locating a region on the region
> server where most of its data blocks reside.

It is the other way around: the region server process causes all the data it writes to be stored on the same physical machine, because HDFS places the first replica of each block on the data node local to the writer.
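
A minimal sketch of the placement behaviour described above, written as a toy simulation rather than real HDFS code. It assumes the default HDFS policy of putting the first replica of each block on the writer's own data node and ignores rack awareness; the node names and the helpers `place_block` and `local_fraction` are hypothetical.

```python
import random

def place_block(writer, nodes, replication=3, rng=random):
    """Toy model of default HDFS placement: first replica on the writer's node,
    remaining replicas on other nodes (rack awareness is ignored here)."""
    others = [n for n in nodes if n != writer]
    return [writer] + rng.sample(others, replication - 1)

def local_fraction(blocks, host):
    """Fraction of blocks that have at least one replica on `host`."""
    if not blocks:
        return 0.0
    return sum(host in replicas for replicas in blocks) / len(blocks)

nodes = [f"node{i}" for i in range(1, 7)]  # hypothetical six-node cluster
rng = random.Random(42)

# A region server running on node1 writes 20 blocks worth of store files.
blocks = [place_block("node1", nodes, rng=rng) for _ in range(20)]

print(local_fraction(blocks, "node1"))  # 1.0: every block has a local replica
```

In this model the region does not go looking for its blocks; the blocks end up local simply because the region server wrote them.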

> So that way fast data access is guaranteed when running an MR job, because
> each map/reduce task is run for each region on the tasktracker where the
> region is co-located.

Correct.

> But what if the data blocks of the region are evenly spread over multiple
> region-servers?

This will not happen unless the original server fails. In that case the region is moved to another server, which then has to do a lot of remote reads over the network. This is why there is work being done to allow custom placement policies in HDFS. That way you could store the entire region and all of its copies as complete units on three data nodes, and in case of a failure move the region to one of the two nodes holding the other copies. This is not available yet, but it is being worked on (so I heard).
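
The failure case can be sketched with a small simulation. This is only an illustration under stated assumptions: `write_region`, `locality`, the node names and the `pinned` option are hypothetical, and `pinned` stands in for the custom placement policy mentioned above, which is not an existing HDFS feature as far as this thread is concerned.

```python
import random

def write_region(writer, nodes, n_blocks, rng, pinned=None):
    """Return one replica list per block. The first replica goes to the writer.
    With `pinned` (hypothetical custom policy) every block uses the same two
    extra nodes; otherwise the extra nodes are picked at random per block."""
    others = [n for n in nodes if n != writer]
    blocks = []
    for _ in range(n_blocks):
        extras = list(pinned) if pinned else rng.sample(others, 2)
        blocks.append([writer] + extras)
    return blocks

def locality(blocks, host):
    """Fraction of blocks with a replica on `host`."""
    return sum(host in replicas for replicas in blocks) / len(blocks)

rng = random.Random(7)
nodes = [f"node{i}" for i in range(1, 11)]  # hypothetical ten-node cluster

default_blocks = write_region("node1", nodes, 100, rng)
pinned_blocks = write_region("node1", nodes, 100, rng, pinned=("node2", "node3"))

# node1 fails and the region is reassigned to node2 in both scenarios.
print("default policy:", locality(default_blocks, "node2"))
print("pinned policy: ", locality(pinned_blocks, "node2"))  # 1.0
```

With per-block random placement only part of the region's blocks happen to have a replica on the new host, so the rest must be read remotely. With whole-region placement the new host already holds a full copy.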

> Does an MR task have to remotely access the data blocks from other
> regionservers?

For the above failure case, it would be the region server accessing the remote data, yes.

> How good is HBase at locating data blocks where a region resides?

That is again the wrong way around. HBase has no clue as to where blocks reside, nor does it know that the file system in fact uses separate blocks. HBase stores files; HDFS does the block magic under the hood, transparently to HBase.
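
The layering can be sketched as follows. This is a toy model, not HBase or HDFS code: `ToyBlockFS`, `flush_store_file` and the file path are hypothetical names chosen for illustration. The client side only writes and reads whole files, while the file system decides how to cut them into blocks.

```python
class ToyBlockFS:
    """Toy stand-in for HDFS: stores files as fixed-size blocks internally."""

    def __init__(self, block_size):
        self.block_size = block_size
        self._blocks = {}  # path -> list of byte chunks, hidden from clients

    def write(self, path, data: bytes):
        size = self.block_size
        self._blocks[path] = [data[i:i + size] for i in range(0, len(data), size)]

    def read(self, path) -> bytes:
        return b"".join(self._blocks[path])

    def block_count(self, path):
        # Internal detail, exposed only for this illustration.
        return len(self._blocks[path])

def flush_store_file(fs, path, cells):
    """Toy stand-in for the HBase side: it only deals with whole files."""
    fs.write(path, b"".join(cells))
    return fs.read(path)

path = "/hbase/t1/region-a/cf/file1"  # hypothetical path
cells = [b"row%03d=value;" % i for i in range(50)]  # 50 cells of 13 bytes each

for block_size in (64, 256):
    fs = ToyBlockFS(block_size)
    data = flush_store_file(fs, path, cells)
    print(block_size, fs.block_count(path), len(data))
# prints:
# 64 11 650
# 256 3 650
```

Changing the block size changes how many blocks the file system keeps, but the client writes and reads the same 650 bytes either way.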

> Also, is it correct to say that if I set a smaller data block size, data
> locality gets worse, and if the data block size gets bigger, data locality
> gets better?

This is not applicable here. I assume it stems from the above confusion about which system handles the blocks, HBase or HDFS. See above.

HTH,
Lars

>
> Best regards,
> --
>
> *Benjamin Kim*
> *benkimkimben at gmail*
Ted Yu 2012-06-21, 05:19
Ben Kim 2012-06-21, 04:57
Lars George 2012-06-21, 10:07
Michael Segel 2012-06-21, 12:45
Ben Kim 2012-06-27, 06:38