Re: Re: distcp question
kojie.fu 2012-10-12, 19:20
Date: 2012-10-13 03:19
Subject: Re: distcp question
Thanks for the advice.
Before I push or pull with distcp, are there any tests I can run first?
I am not 100% sure that I have webhdfs set up properly.
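One low-risk check, sketched here under assumptions: WebHDFS answers plain HTTP requests of the form `http://<namenode>:<port>/webhdfs/v1/<path>?op=...`, so listing a directory over REST should show whether the endpoint is reachable before any distcp job is launched. The host `target-nn.example.com`, port 50070 and path `/tmp` are placeholders, not values from this thread, and the helper names `webhdfs_url` and `list_status` are made up for illustration.

```python
import json
import urllib.parse
import urllib.request


def webhdfs_url(host, port, path, op, user=None):
    """Build a WebHDFS REST URL, e.g. http://host:port/webhdfs/v1/tmp?op=LISTSTATUS."""
    query = {"op": op}
    if user is not None:
        # Pseudo-authentication parameter, only meaningful on non-secure clusters.
        query["user.name"] = user
    quoted_path = urllib.parse.quote(path)  # path is expected to start with "/"
    return "http://%s:%d/webhdfs/v1%s?%s" % (
        host, port, quoted_path, urllib.parse.urlencode(query))


def list_status(host, port, path, user=None, timeout=10):
    """Issue LISTSTATUS and return the entry names found under `path`.

    A urllib.error.URLError or HTTPError here suggests the endpoint is
    unreachable or WebHDFS is not enabled, which is what we want to find out.
    """
    url = webhdfs_url(host, port, path, "LISTSTATUS", user)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = json.load(resp)
    return [entry["pathSuffix"] for entry in body["FileStatuses"]["FileStatus"]]


if __name__ == "__main__":
    # Placeholder address: substitute your own NameNode host, port and path.
    print(webhdfs_url("target-nn.example.com", 50070, "/tmp", "LISTSTATUS"))
```

If `list_status(...)` returns the same names that `hadoop fs -ls` shows for that directory, the REST side is at least answering; it does not prove that a full distcp run will succeed.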
On Fri, Oct 12, 2012 at 1:01 PM, J. Rottinghuis <[EMAIL PROTECTED]> wrote:
> Are you doing a push from the source cluster or a pull from the target
> cluster?
> Doing a pull with distcp using hftp (to accommodate version differences)
> has the advantage of slightly fewer transfers of blocks over the TORs. Each
> block is read from exactly the datanode where it is located, and on the
> target side (where the mappers run) the first write is to the local
> datanode. With RF=3 each block transfers out of the source TOR, into the
> target TOR, out of the first target-cluster TOR into a different
> target-cluster TOR for replicas 2 & 3. Overall 2 times out, and 2 times in.
> Doing a pull with webhdfs://, the proxy server has to collect all blocks
> from the source DNs, and then they get pulled to the target machine.
> The situation is similar to the one above, with one extra transfer of all
> data going through the "proxy" server.
> Doing a push with webhdfs:// on the target cluster side, the mapper has to
> collect all blocks from one or more files (depending on # mappers used) and
> send them to the proxy server, which then writes blocks to the target
> cluster. The advantage on the target cluster is that the blocks of a large
> multi-block file get spread over different datanodes on the target side.
> But if I'm counting correctly, you'll have the most data transfer. Out of
> each source DN, through source cluster mapper DN, through target proxy
> server, to target DN, and out/in again for replicas 2&3.
> So convenience and setup aside, I think the first option would involve the
> fewest network transfers.
> Now if your clusters are separated over a WAN, then this may not matter
> at all.
> Just something to think about.
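The counting in the message above can be written down as a small tally. This is only a rough sketch of how I read that description with RF=3: each tuple is one network hop per block, the hop labels are my own wording, and the model ignores data locality and the in-rack write of the third replica. The 100 TB figure is the dataset size mentioned in the original question.

```python
# Hops per block for each option, as read from the description above.
OPTIONS = {
    "pull via hftp": [
        ("source DN", "target mapper DN"),              # out of source TOR, into target TOR
        ("target mapper DN", "target DN, other rack"),  # replicas 2 & 3
    ],
    "pull via webhdfs": [
        ("source DN", "source proxy"),                  # the one extra transfer
        ("source proxy", "target mapper DN"),
        ("target mapper DN", "target DN, other rack"),
    ],
    "push via webhdfs": [
        ("source DN", "source mapper DN"),
        ("source mapper DN", "target proxy"),
        ("target proxy", "target DN"),
        ("target DN", "target DN, other rack"),
    ],
}


def transfers_per_block(option):
    """Number of network hops each block makes under the given option."""
    return len(OPTIONS[option])


def total_moved_tb(option, dataset_tb):
    """Approximate data volume crossing the network, in TB, for the whole copy."""
    return transfers_per_block(option) * dataset_tb


for name in OPTIONS:
    print("%-18s %d transfers per block, roughly %d TB moved for 100 TB"
          % (name, transfers_per_block(name), total_moved_tb(name, 100)))
```

Under these assumptions the hftp pull comes out lowest, which matches the conclusion in the message; over a WAN link the cross-cluster hop would likely dominate regardless.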
> On Fri, Oct 12, 2012 at 8:37 AM, Harsh J <[EMAIL PROTECTED]> wrote:
> > Rita,
> > I believe, per the implementation, that webhdfs:// URIs should work
> > fine. Please give it a try and let us know.
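For anyone trying this, a minimal sketch of how the two invocations might be assembled. It assumes the usual `hadoop distcp <src> <dst>` form; the host names, ports and `/data` path are placeholders that do not come from this thread, and `distcp_command` is a hypothetical helper. A pull would be launched on the target cluster and a push on the source cluster.

```python
import shlex


def distcp_command(src_uri, dst_uri):
    """Return the argv list for a plain `hadoop distcp <src> <dst>` run."""
    return ["hadoop", "distcp", src_uri, dst_uri]


# Pull: run on the target cluster, reading the source over hftp.
pull_hftp = distcp_command("hftp://source-nn.example.com:50070/data",
                           "hdfs://target-nn.example.com:8020/data")

# Push: run on the source cluster, writing to the target over webhdfs.
push_webhdfs = distcp_command("hdfs://source-nn.example.com:8020/data",
                              "webhdfs://target-nn.example.com:50070/data")

for argv in (pull_hftp, push_webhdfs):
    # Print the command for review instead of executing it.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing the command first makes it easy to review the URIs; once they look right, the same list could be handed to `subprocess.run(argv, check=True)`.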
> > On Fri, Oct 12, 2012 at 7:14 PM, Rita <[EMAIL PROTECTED]> wrote:
> > > I have 2 different versions of Hadoop running. I need to copy a large
> > > amount of data (100 TB) from one cluster to another. I know distcp is the
> > > way to do it. On the target cluster I have webhdfs running. Would that work?
> > >
> > > The DistCp manual says I need to use "HftpFileSystem". Is that required,
> > > or will webhdfs do the task?
> > >
> > > --
> > > --- Get your facts first, then you can distort them as you please.--
> > --
> > Harsh J
--- Get your facts first, then you can distort them as you please.--