hadoopman 2010-11-30, 03:59
We have two Hadoop clusters in two separate buildings. Both clusters are loading the same data from the same sources (the second cluster is for DR).
We're looking at how to recover the primary cluster and catch it back up, since new data will continue to feed into the DR cluster in the meantime. It's been suggested that we use rsync across the network; however, my concern is that the amount of data we would have to copy would take several days (at a minimum) to sync, even with our dual bonded 1 gig network cards.
I'm curious whether anyone has come up with a solution short of just loading the source logs into HDFS. Is there even a way to rsync two clusters and get them in sync? I've been googling around but haven't found anything of substance yet.
Thanks!
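The transfer-time concern above can be made concrete with a rough back-of-the-envelope estimate. The sketch below is only illustrative: the thread never states the data volume, so the 100 TB figure is a hypothetical assumption, and the 2 Gbit/s rate assumes the two bonded 1 Gbit/s cards deliver their full combined line rate, which real transfers rarely reach.

```python
def transfer_days(data_bytes: float, link_bits_per_sec: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move data_bytes over a link, given the usable fraction of its rate."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bytes_per_sec = link_bits_per_sec / 8 * efficiency
    return data_bytes / bytes_per_sec / 86_400  # 86,400 seconds per day


# Hypothetical figures, not taken from the thread.
DATASET_BYTES = 100e12   # assumed 100 TB of data to copy
BONDED_LINK_BPS = 2e9    # two bonded 1 Gbit/s NICs, ideal case

days = transfer_days(DATASET_BYTES, BONDED_LINK_BPS)
print(f"{days:.1f} days at full line rate")  # prints "4.6 days at full line rate"
```

Passing a lower `efficiency` (for example 0.5) shows how quickly the estimate grows once protocol overhead and competing traffic are taken into account.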
-
Re: HDFS Rsync process??
Steve Loughran 2010-11-30, 10:51
On 30/11/10 03:59, hadoopman wrote:
> We have two Hadoop clusters in two separate buildings. Both clusters
> are loading the same data from the same sources (the second cluster is
> for DR).
>
> We're looking at how we can recover the primary cluster and catch it
> back up again as new data will continue to feed into the DR cluster.
> It's been suggested we use rsync across the network however my concern
> is the amount of data we would have to copy over would take several days
> (at a minimum) to sync them even with our dual bonded 1 gig network cards.
>
> I'm curious if anyone has come up with a solution short of just loading
> the source logs into HDFS. Is there a way to even rsync two clusters and
> get them in sync? Been googling around. Haven't found anything of
> substances yet.

You don't need all the files in the cluster in sync, as a lot of them are intermediate and transient files.
Instead, use dfscopy to copy the source files to the two clusters. It runs across the machines in the cluster and is also designed to work across Hadoop versions, with some limitations.
-
Re: HDFS Rsync process??
Alejandro Abdelnur 2010-11-30, 11:18
The other approach, if the DR cluster is idle or has enough excess capacity, would be to run all the jobs on the input data in both clusters and checksum the outputs to ensure everything is consistent. You could also take advantage of the second cluster to distribute ad hoc queries between the two.
Alejandro
On Tue, Nov 30, 2010 at 6:51 PM, Steve Loughran <[EMAIL PROTECTED]> wrote:
> On 30/11/10 03:59, hadoopman wrote:
>
>> We have two Hadoop clusters in two separate buildings. Both clusters
>> are loading the same data from the same sources (the second cluster is
>> for DR).
>>
>> We're looking at how we can recover the primary cluster and catch it
>> back up again as new data will continue to feed into the DR cluster.
>> It's been suggested we use rsync across the network however my concern
>> is the amount of data we would have to copy over would take several days
>> (at a minimum) to sync them even with our dual bonded 1 gig network cards.
>>
>> I'm curious if anyone has come up with a solution short of just loading
>> the source logs into HDFS. Is there a way to even rsync two clusters and
>> get them in sync? Been googling around. Haven't found anything of
>> substances yet.
>
> you don't need all the files in the cluster in sync as a lot of them are
> intermediate and transient files.
>
> Instead use dfscopy to copy source files to the two clusters, this runs
> across the machines in the cluster and is also designed to work across
> hadoop versions, with some limitations.
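The checksum comparison suggested in this reply could be sketched roughly as follows. This is a minimal illustration, not anyone's actual tooling: it assumes each cluster can produce a manifest mapping output paths to checksum strings by whatever means is available, and the paths and checksum values shown are hypothetical.

```python
from typing import Dict, List, NamedTuple


class ManifestDiff(NamedTuple):
    missing_on_dr: List[str]        # present on primary only
    missing_on_primary: List[str]   # present on DR only
    mismatched: List[str]           # present on both, checksums differ


def compare_manifests(primary: Dict[str, str], dr: Dict[str, str]) -> ManifestDiff:
    """Compare two {path: checksum} manifests, one per cluster."""
    missing_on_dr = sorted(p for p in primary if p not in dr)
    missing_on_primary = sorted(p for p in dr if p not in primary)
    mismatched = sorted(p for p in primary if p in dr and primary[p] != dr[p])
    return ManifestDiff(missing_on_dr, missing_on_primary, mismatched)


# Hypothetical manifests for the outputs of the same job on each cluster.
primary_outputs = {
    "/out/2010-11-29/part-00000": "a1b2",
    "/out/2010-11-29/part-00001": "c3d4",
}
dr_outputs = {
    "/out/2010-11-29/part-00000": "a1b2",
    "/out/2010-11-29/part-00001": "ffff",
    "/out/2010-11-30/part-00000": "e5f6",
}

diff = compare_manifests(primary_outputs, dr_outputs)
print(diff.mismatched)          # ['/out/2010-11-29/part-00001']
print(diff.missing_on_primary)  # ['/out/2010-11-30/part-00000']
```

Only the paths reported in the diff would then need to be re-run or copied, rather than the whole output set.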
-
Re: HDFS Rsync process??
hadoopman 2010-11-30, 18:46
On 11/30/2010 03:51 AM, Steve Loughran wrote:
> On 30/11/10 03:59, hadoopman wrote:
>
> you don't need all the files in the cluster in sync as a lot of them
> are intermediate and transient files.
>
> Instead use dfscopy to copy source files to the two clusters, this
> runs across the machines in the cluster and is also designed to work
> across hadoop versions, with some limitations.
Page 70 of the O'Reilly Hadoop book talks about using distcp to copy data between two HDFS clusters. I'm curious whether something like that would also work. Would I just reference the namenode (namenode1) of each cluster when initiating the copy? Still playing with it. Figured I should ask :-)
Thanks
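For reference, a distcp invocation between two clusters generally takes a source URI and a destination URI, each naming its own cluster's namenode. The sketch below only assembles such a command line; the hostnames, the path, and port 8020 (a common namenode RPC port) are assumptions for illustration and would need to match the actual clusters.

```python
from typing import List


def distcp_command(src_namenode: str, dst_namenode: str, path: str,
                   port: int = 8020, update: bool = True) -> List[str]:
    """Build a `hadoop distcp` argument list copying `path` between two clusters."""
    cmd = ["hadoop", "distcp"]
    if update:
        # -update skips files that already exist unchanged at the destination.
        cmd.append("-update")
    cmd.append(f"hdfs://{src_namenode}:{port}{path}")
    cmd.append(f"hdfs://{dst_namenode}:{port}{path}")
    return cmd


# Hypothetical hostnames: copy from the DR cluster back to the recovered primary.
cmd = distcp_command("namenode1.dr.example.com",
                     "namenode1.primary.example.com",
                     "/data/logs")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine with the Hadoop client installed. Because distcp runs as a MapReduce job, the copy is spread across the cluster's machines rather than funneled through a single host as a plain rsync would be.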