Hadoop >> mail # user >> distcp question

Re: distcp question

Are you doing a push from the source cluster or a pull from the target cluster?

Doing a pull with distcp using hftp (to accommodate version differences)
has the advantage of slightly fewer block transfers over the TORs
(top-of-rack switches). Each block is read from exactly the datanode where
it is located, and on the target side (where the mappers run) the first
write goes to the local datanode. With RF=3, each block transfers out of
the source TOR, into the target TOR, then out of the first target-cluster
TOR and into a different target-cluster TOR for replicas 2 and 3. Overall,
two times out and two times in.
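As a sketch, this pull over hftp would be launched from the target cluster. The hostnames, ports, and paths below are made-up placeholders, not taken from this thread (50070 was the NameNode HTTP port and 8020 the RPC port in releases of that era):

```shell
# Pull-style copy, run on the TARGET cluster.
# hftp:// is read-only and HTTP-based, so it tolerates version differences
# between the two clusters. All hostnames and paths are placeholders.
SRC="hftp://source-nn.example.com:50070/data/logs"
DST="hdfs://target-nn.example.com:8020/data/logs"
# Echoed here as a dry run; drop the echo to actually launch the copy job.
echo hadoop distcp "$SRC" "$DST"
```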

Doing a pull with webhdfs://, the proxy server has to collect all blocks
from the source DNs, and then they get pulled to the target machine. The
situation is similar to the above, with one extra transfer: all data goes
through the "proxy" server.
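The webhdfs pull looks the same apart from the scheme on the source URI (again, placeholder hosts and paths):

```shell
# Pull-style copy over webhdfs, also run on the TARGET cluster.
# Placeholder hostnames; substitute your own NameNodes.
SRC="webhdfs://source-nn.example.com:50070/data/logs"
DST="hdfs://target-nn.example.com:8020/data/logs"
echo hadoop distcp "$SRC" "$DST"   # dry run; remove the echo to execute
```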

Doing a push with webhdfs:// on the target cluster side, the mapper has to
collect all blocks of one or more files (depending on the number of mappers
used) and send them to the proxy server, which then writes the blocks to
the target cluster. The advantage on the target side is that the blocks of
a large multi-block file get spread over different datanodes. But if I'm
counting correctly, this has the most data transfer: out of each source DN,
through a source-cluster mapper DN, through the target proxy server, to a
target DN, and out/in again for replicas 2 and 3.
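A push is the mirror image: it runs on the source cluster, with webhdfs on the destination URI (placeholders again):

```shell
# Push-style copy, run on the SOURCE cluster, writing over webhdfs.
# Placeholder hostnames; substitute your own NameNodes.
SRC="hdfs://source-nn.example.com:8020/data/logs"
DST="webhdfs://target-nn.example.com:50070/data/logs"
echo hadoop distcp "$SRC" "$DST"   # dry run; remove the echo to execute
```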

So, convenience and setup aside, I think the first option involves the
fewest network transfers. If your clusters are separated by a WAN, though,
this may not matter at all.

Just something to think about.


On Fri, Oct 12, 2012 at 8:37 AM, Harsh J <[EMAIL PROTECTED]> wrote:

> Rita,
> I believe, per the implementation, that webhdfs:// URIs should work
> fine. Please give it a try and let us know.
> On Fri, Oct 12, 2012 at 7:14 PM, Rita <[EMAIL PROTECTED]> wrote:
> > I have 2 different versions of Hadoop running. I need to copy a significant
> > amount of data (100 TB) from one cluster to another. I know distcp is the
> > way to do it. On the target cluster I have webhdfs running. Would that work?
> >
> > The DistCp manual says I need to use "HftpFileSystem". Is that necessary,
> > or will webhdfs do the task?
> >
> >
> >
> > --
> > --- Get your facts first, then you can distort them as you please.--
> --
> Harsh J