The OP hasn't provided enough information to even start trying to make a real recommendation on how to solve this problem.
On Aug 4, 2012, at 7:32 AM, Nitin Kesarwani <[EMAIL PROTECTED]> wrote:
> Given the size of data, there can be several approaches here:
> 1. Moving the boxes
> Not possible, as I suppose the data must be needed for 24x7 analytics.
> 2. Mirroring the data.
> This is a good solution. However, if data is being written/removed
> continuously (as part of a live system), there is a chance of losing some
> of the data while the mirroring happens, unless
> a) You block writes/updates during that time (if you do so, that would be
> as good as unplugging and moving the machine around), or,
> b) Keep track of what was modified since you started the mirroring.
> I would recommend going with 2b) because it minimizes downtime. Here is
> how I think you can do it, using some of the tools provided by Hadoop:
> a) You can use a fast distributed copying tool to copy large chunks of
> data. Before you kick this off, you can create a utility that tracks
> the modifications made to your live system while copying is going on
> in the background. The utility will log the modifications into an audit
> b) Once you're done copying the files, let the new data store catch up
> by replaying the real-time modifications that were recorded in your
> utility's log file. Once synced up, you can begin the minimal downtime
> by switching off the JobTracker in the live cluster so that
> new files are not created.
> c) As soon as you reach the last chunk of copying, change the DNS entries
> so that the hostnames referenced by the Hadoop jobs point to the new
> d) Turn on the JobTracker for the new cluster.
> e) Enjoy a drink with the money you saved by not using paid third-party
> solutions and pat yourself on the back! ;)
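The catch-up part of steps a) and b) can be sketched as follows. This is only a minimal illustration, assuming the tracking utility writes one tab-separated line per change (epoch timestamp, operation, path). That log format and the names `parse_audit_line` and `pending_changes` are hypothetical; Hadoop does not provide them. The sketch collapses the log to the last operation per path, so the catch-up pass touches each file only once.

```python
from typing import Dict, Iterable, Tuple


def parse_audit_line(line: str) -> Tuple[float, str, str]:
    """Split one hypothetical audit line: '<epoch seconds>\\t<PUT|DELETE>\\t<path>'."""
    ts, op, path = line.rstrip("\n").split("\t", 2)
    return float(ts), op, path


def pending_changes(lines: Iterable[str], copy_started: float) -> Dict[str, str]:
    """Return {path: last operation} for changes logged at or after copy_started."""
    latest: Dict[str, Tuple[float, str]] = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        ts, op, path = parse_audit_line(line)
        if ts < copy_started:
            continue  # already covered by the bulk copy
        # Keep only the most recent operation seen for each path.
        if path not in latest or ts >= latest[path][0]:
            latest[path] = (ts, op)
    return {path: op for path, (_, op) in latest.items()}


# Example with made-up log entries; the bulk copy started at t=200.
example_log = [
    "100.0\tPUT\t/data/a",
    "205.0\tPUT\t/data/b",
    "210.0\tDELETE\t/data/b",
    "220.0\tPUT\t/data/c",
]
for path, op in sorted(pending_changes(example_log, copy_started=200.0).items()):
    print(op, path)
```

The shorter the bulk copy, the fewer entries this replay has to process before the final switch-over.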
> The key to the above solution is to make the data copying in step a) as
> fast as possible. The less time it takes, the less there is in the audit
> trail, and the shorter the overall downtime.
> You can develop an in-house solution for this, or use DistCp, provided by
> Hadoop, which copies the data over using Map/Reduce.
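As a rough illustration of the DistCp route, the sketch below only builds the command line for an incremental copy; it does not run it. The NameNode host names, port, and paths are placeholders, and `distcp_command` is a hypothetical helper. `-update` and `-m` are DistCp options, but check their exact behaviour against the documentation for your Hadoop version before relying on them.

```python
import shlex
from typing import List


def distcp_command(src: str, dst: str, max_maps: int = 20, update: bool = True) -> List[str]:
    """Build an argv list for 'hadoop distcp' (placeholder values, not tuned)."""
    cmd = ["hadoop", "distcp"]
    if update:
        cmd.append("-update")  # copy only files that differ at the destination
    cmd += ["-m", str(max_maps), src, dst]  # -m caps the number of map tasks
    return cmd


# Placeholder cluster addresses, for illustration only.
argv = distcp_command("hdfs://old-nn:8020/data", "hdfs://new-nn:8020/data")
print(" ".join(shlex.quote(arg) for arg in argv))
```

Re-running the same command after the bulk copy is one way to pick up files that changed in the meantime, in addition to or instead of a custom audit log.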
> On Sat, Aug 4, 2012 at 3:27 AM, Michael Segel <[EMAIL PROTECTED]> wrote:
>> Sorry, at 1PB of disk... compression isn't really going to help a whole
>> heck of a lot. Your networking bandwidth will be your bottleneck.
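To put the bandwidth point in numbers, here is a back-of-the-envelope calculation. The link speeds and the assumption of a dedicated, fully utilized link are illustrative and not figures from this thread; 1 PB is taken as 10^15 bytes, and `transfer_days` is a hypothetical helper.

```python
def transfer_days(data_bytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move data_bytes over a link of link_gbps gigabits per second.

    efficiency is the usable fraction of the link (1.0 = perfectly utilized).
    """
    bits = data_bytes * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86400  # seconds per day


PB = 10**15  # 1 PB in bytes (decimal)
for gbps in (1, 10):
    print(f"{gbps} Gbit/s: {transfer_days(PB, gbps):.1f} days")
```

Even at a sustained 10 Gbit/s the bulk copy takes more than nine days, which is why the downtime and access-pattern questions matter more than compression.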
>> So let's look at the problem.
>> How much down time can you afford?
>> What does your hardware look like?
>> How much space do you have in your current data center?
>> You have 1PB of data. OK, what does the access pattern look like?
>> There are a couple of ways to slice and dice this. How many trucks do you
>> On Aug 3, 2012, at 4:24 PM, Harit Himanshu <[EMAIL PROTECTED]>
>>> Moving 1 PB of data would take loads of time,
>>> - check if this new data center provides something similar to
>>> - consider multipart uploading of the data
>>> - consider compressing the data
>>> On Aug 3, 2012, at 2:19 PM, Patai Sangbutsarakum wrote:
>>>> Thanks for the response.
>>>> Physical move is not a choice in this case. I'm purely looking at copying
>>>> the data and how to catch up with updates to a file while it is being
>>>> On Fri, Aug 3, 2012 at 12:40 PM, Chen He <[EMAIL PROTECTED]> wrote:
>>>>> sometimes, physically moving hard drives helps. :)
>>>>> On Aug 3, 2012 1:50 PM, "Patai Sangbutsarakum" <[EMAIL PROTECTED]>
>>>>>> Hi Hadoopers,
>>>>>> We have a plan to migrate our Hadoop cluster to a different datacenter
>>>>>> where we can triple the size of the cluster.
>>>>>> Currently, our 0.20.2 cluster has around 1PB of data. We use only
>>>>>> I would like to get some input on how we are going to handle transferring