

Re: Rack and Datacenter Awareness
There isn't any multi-DC support in Hadoop core right now. Block placement
and work scheduling are done on the assumption that there is a bandwidth cost
to working across the backbone, but that it is still gigabit rate all the way
up the tree. Even if your cross-site bandwidth is that high, other
assumptions in the Hadoop code surface:

   - latency of cross-rack communications is low and not significantly
   different from intra-rack comms
   - protocols between machines can assume that packet failures are the
   kind that surface on a LAN, not a WAN (lower failure rate, less
   buffered/bursty traffic)
   - external infrastructure (like DNS and NTP) is consistent everywhere
   - network partitions are rare enough, and significant enough, that
   reacting by re-replicating data is justified -an expensive operation and
   overkill for a transient WAN outage
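To make the rack-only assumption concrete, here is a rough Java sketch (not Hadoop's actual code; the class and topology data are illustrative) of the default HDFS replica placement policy. Note the topology path carries a single rack tier -there is no datacenter level for the placement logic to reason about:

```java
import java.util.*;

// Illustrative sketch (NOT Hadoop source) of default HDFS replica placement:
// first replica on the writer's rack, second on a different rack, third on
// the same rack as the second. The topology path is rack-only.
public class ReplicaPlacementSketch {
    // hostname -> rack path, as a topology script would return it
    static final Map<String, String> TOPOLOGY = Map.of(
        "node1", "/rack1", "node2", "/rack1",
        "node3", "/rack2", "node4", "/rack2");

    static String rackOf(String host) {
        return TOPOLOGY.getOrDefault(host, "/default-rack");
    }

    /** Choose racks for three replicas of a block written from writerHost. */
    static List<String> placeReplicas(String writerHost) {
        String localRack = rackOf(writerHost);
        // second replica: any rack other than the writer's
        String remoteRack = TOPOLOGY.values().stream()
            .filter(r -> !r.equals(localRack))
            .findFirst().orElse(localRack);
        // third replica shares the second replica's rack
        return List.of(localRack, remoteRack, remoteRack);
    }

    public static void main(String[] args) {
        System.out.println(placeReplicas("node1"));
    }
}
```

With only one level in the path, a "remote rack" across a WAN link is indistinguishable from one in the next aisle.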

ZooKeeper will be extra-brittle here, as group membership protocols tend to
have aggressive timeouts so that loss of members is detected fast. I'd be
really surprised if it worked well across >1 site.
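The timeout point can be made concrete with ZooKeeper's documented defaults: the server clamps a client's requested session timeout between 2x and 20x tickTime. The code below is an illustrative sketch, not ZooKeeper's implementation, but the multipliers are the real defaults:

```java
// Sketch of how a ZooKeeper server bounds a client's requested session
// timeout. Defaults: minSessionTimeout = 2 * tickTime,
// maxSessionTimeout = 20 * tickTime. With the stock tickTime of 2000 ms,
// a session can expire after as little as 4 s of silence -easily tripped
// by a transient WAN hiccup between sites.
public class ZkSessionTimeoutSketch {
    static int negotiateTimeout(int requestedMs, int tickTimeMs) {
        int min = 2 * tickTimeMs;   // server-side floor
        int max = 20 * tickTimeMs;  // server-side ceiling
        return Math.max(min, Math.min(max, requestedMs));
    }

    public static void main(String[] args) {
        // default tickTime = 2000 ms; a client asking for 60 s is capped at 40 s
        System.out.println(negotiateTimeout(60_000, 2000));
    }
}
```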

WANdisco do provide multi-DC support for Hadoop platforms:
http://www.wandisco.com/hadoop/non-stop-hadoop -I don't know what that
does for applications running on the cluster, or ones that depend on ZK.

I suspect that anyone working across sites is going to have to run ZK &
Accumulo at each one, and treat operations that span sites as separate
queries whose results need to be merged -something that would need to go
into the query engines, though for now you could build a workflow that ran
an MR job on each site, then a final reduce on one of the sites.
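A hypothetical sketch of that final merge step, assuming each site's MR job emits (key, count) pairs -the class and method names here are illustrative, not an Accumulo or Hadoop API:

```java
import java.util.*;

// Hypothetical "final reduce": each site runs its own MR job and ships
// back a map of (key, count) partial results; one site merges them.
public class CrossSiteMergeSketch {
    static Map<String, Long> mergeSites(List<Map<String, Long>> perSite) {
        Map<String, Long> merged = new TreeMap<>();
        for (Map<String, Long> site : perSite)
            site.forEach((k, v) -> merged.merge(k, v, Long::sum));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> dc1 = Map.of("alpha", 3L, "beta", 1L);
        Map<String, Long> dc2 = Map.of("beta", 2L, "gamma", 5L);
        System.out.println(mergeSites(List.of(dc1, dc2)));
    }
}
```

Only the small per-site summaries cross the WAN, rather than the raw blocks each Mapper reads.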

-Steve

On 22 January 2014 19:01, Jeff N <[EMAIL PROTECTED]> wrote:

> @Adam
> I am currently interested in the latter half of your second question. My
> main interest lies in determining how to optimize data processing. If I
> have two data centers that are geographically far apart, and I am working
> on a local machine but need data from the second data center, how do I
> have the processing occur in the second data center? The constraints of
> this problem include a lack of empirical knowledge of which HDFS node
> contains the data, beyond its place in the network topology. Furthermore,
> it pertains to Map/Reduce jobs that use the AccumuloInputFormat. Is it
> possible to have the distant data center process my Mapper and send me the
> resulting data set, instead of processing the Mapper locally and making
> numerous network queries?
> --
> View this message in context:
> http://apache-accumulo.1065345.n5.nabble.com/Rack-and-Datacenter-Awareness-tp7193p7225.html
> Sent from the Developers mailing list archive at Nabble.com.
>
