We have two Hadoop clusters in two separate buildings. Both clusters
are loading the same data from the same sources (the second cluster is
We're looking at how we can recover the primary cluster and catch it
back up again, since new data will continue to feed into the DR cluster.
It's been suggested we use rsync across the network, but my concern is
that the amount of data we would have to copy would take several days
(at a minimum) to sync, even with our dual-bonded 1 Gbit network cards.
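
To put a rough number on that concern, here is a back-of-the-envelope
sketch in Python. The data sizes and the 70% efficiency factor are
hypothetical placeholders, not measurements from our clusters; the only
figure taken from our setup is the 2 Gbit/s nominal rate of the bonded
pair, and that assumes the bond actually aggregates a single transfer.

```python
SECONDS_PER_DAY = 86400


def transfer_days(data_tb, link_gbit=2.0, efficiency=0.7):
    """Estimate the days needed to copy data_tb terabytes over one link.

    data_tb    -- data volume in decimal terabytes (hypothetical input)
    link_gbit  -- nominal link rate in Gbit/s (2.0 = two bonded 1 Gbit NICs)
    efficiency -- assumed fraction of the nominal rate actually achieved
    """
    data_bits = data_tb * 1e12 * 8                  # terabytes -> bits
    usable_bits_per_s = link_gbit * 1e9 * efficiency
    return data_bits / usable_bits_per_s / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical cluster sizes, for illustration only.
    for size_tb in (50, 100, 200):
        print(f"{size_tb} TB: about {transfer_days(size_tb):.1f} days")
```

Plugging in the real data volume and a measured throughput in place of
the placeholders would show whether "several days" is optimistic.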
I'm curious if anyone has come up with a solution short of just loading
the source logs into HDFS. Is there even a way to rsync two clusters
and get them in sync? I've been googling around and haven't found anything of