

Re: Shared HDFS for HBase and MapReduce
When you run an MR job with HBase as a source/sink, you use the HBase API
under the hood (get, put, scan). That API is how your client (in this case
the map or reduce tasks) interacts with the region servers. Data locality in
an MR job is achieved by having the tasks run on the same physical nodes as
the region servers so that communication over the network is minimal.
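
A rough way to picture the task-placement point above is the toy model below. It assumes one input split per region, each tagged with the host of the region server that currently serves it, and a scheduler that prefers that host when a slot is free. The names `Region`, `make_splits`, and `schedule` are made up for illustration and are not the HBase or Hadoop API.

```python
# Toy model only: these names are illustrative and are not the HBase or Hadoop API.
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    region_server: str  # host currently serving this region


def make_splits(regions):
    """One input split per region, tagged with the host that serves it."""
    return [(r.name, r.region_server) for r in regions]


def schedule(splits, free_slots):
    """Place each task on its preferred host if a slot is free there,
    otherwise on any host with a free slot (a remote read over the network)."""
    slots = dict(free_slots)  # copy so the caller's dict is not mutated
    placement = {}
    for region_name, preferred in splits:
        if slots.get(preferred, 0) > 0:
            host = preferred
        else:
            host = next(h for h, n in slots.items() if n > 0)
        slots[host] -= 1
        placement[region_name] = (host, host == preferred)
    return placement


regions = [
    Region("r1", "node-a"),
    Region("r2", "node-b"),
    Region("r3", "node-b"),
]
placement = schedule(make_splits(regions), {"node-a": 1, "node-b": 1, "node-c": 1})
for name, (host, is_local) in placement.items():
    print(name, host, "local" if is_local else "remote")
```

In this sketch the third task cannot get a slot on node-b, so it runs on node-c and has to reach its region server over the network. That is the situation you are in for every task if the task trackers and region servers sit on disjoint sets of nodes.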

The data locality for the region servers is a different conversation. That
is about the region server process talking to the local datanode for its
underlying HFiles rather than talking to remote ones. That has nothing to
do with the MR jobs talking to HBase.
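
To make the second kind of locality concrete, here is a minimal sketch that scores a region server by the fraction of its HFile blocks that have a replica on its own host. The function `locality_index` and the block layout are hypothetical, and the unweighted fraction is a simplifying assumption, not how HBase itself reports locality.

```python
# Toy model only: hypothetical names, not the HDFS or HBase API.
def locality_index(block_replicas, local_host):
    """Fraction of blocks that have at least one replica on local_host.

    block_replicas: list of sets, one set of replica hosts per HFile block.
    """
    if not block_replicas:
        return 0.0
    local = sum(1 for hosts in block_replicas if local_host in hosts)
    return local / len(block_replicas)


# Four HFile blocks, three replicas each; the region server runs on node-a.
blocks = [
    {"node-a", "node-b", "node-c"},
    {"node-a", "node-c", "node-d"},
    {"node-b", "node-c", "node-d"},  # no local replica: read goes over the network
    {"node-a", "node-b", "node-d"},
]
index = locality_index(blocks, "node-a")
print(index)
```

Note that nothing in this calculation involves the MR tasks at all; it only concerns the region server and the datanodes.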

On Wed, Jun 6, 2012 at 1:27 PM, Atif Khan <[EMAIL PROTECTED]> wrote:

> Thanks Amandeep!
>
> I think what I was saying is that we are trying to support both types of
> workloads: realtime transactional workloads, and batch processing
> for data analysis.  The big question is whether a single HDFS cluster should
> be shared between the two workloads.
>
> The point that you are trying to make (if I am understanding you correctly)
> is about data locality.
>
> Amandeep Khurana: "Having a common HDFS cluster and using part of the
> nodes as HBase RS and part as the Hadoop TTs doesn't solve the problem of
> moving data from the HBase RS to the tasks you'll run as a part of your MR
> jobs if HBase is your source/sink. You will still be reading/writing over
> the network."
>
> When running MR jobs over HBase, data locality is provided by HBase (please
> see http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html and
> also HBase: The Definitive Guide by Lars George, page 298, "MapReduce
> Locality").
> In other words, the computation will be exported to where the data is,
> therefore limiting the need to transfer data over the network.  Proper data
> locality has a big impact on the overall performance.
>
> So I believe that a common HDFS cluster does not imply logical segregation
> between HBase RS and Hadoop TTs.  Therefore, your point seems in
> contradiction with Lars George's statement.
>
> Thoughts?
>
>
> --
> View this message in context:
> http://apache-hbase.679495.n3.nabble.com/Shared-HDFS-for-HBase-and-MapReduce-tp4018856p4018884.html
> Sent from the HBase - Developer mailing list archive at Nabble.com.
>