Hadoop, mail # user - distcp question


Re: distcp question
Rita 2012-10-12, 19:19
Thanks for the advice.

Before I push or pull, are there any tests I can run before I do the
distcp? I am not 100% sure I have webhdfs set up properly.
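
One possible smoke test before starting a long copy is to ask the WebHDFS
REST endpoint for a directory listing. The sketch below is only an
illustration: the helper names webhdfs_url and list_status are made up here,
and it assumes the NameNode serves WebHDFS over plain HTTP (typically with
dfs.webhdfs.enabled set to true) with no Kerberos or other authentication.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def webhdfs_url(host, port, path, op="LISTSTATUS", user=None):
    """Build a WebHDFS REST URL: http://host:port/webhdfs/v1/<path>?op=<op>."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    params = {"op": op}
    if user:
        # user.name is the pseudo-authentication parameter (assumes security is off).
        params["user.name"] = user
    return "http://%s:%d/webhdfs/v1%s?%s" % (host, port, quote(path), urlencode(params))


def list_status(host, port, path="/", user=None, timeout=10):
    """Return the entry names under `path`.

    Raises an exception if the endpoint is unreachable or answers with an
    HTTP error, which is the signal that WebHDFS is not set up as expected.
    """
    with urlopen(webhdfs_url(host, port, path, user=user), timeout=timeout) as resp:
        body = json.load(resp)
    return [entry["pathSuffix"] for entry in body["FileStatuses"]["FileStatus"]]
```

Calling list_status("target-nn.example.com", 50070, "/") (a hypothetical host
name and an assumed NameNode HTTP port) should return the top-level
directory names if WebHDFS is answering. This only checks read access; a
small trial distcp of one directory would still be needed to confirm writes.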
On Fri, Oct 12, 2012 at 1:01 PM, J. Rottinghuis <[EMAIL PROTECTED]> wrote:

> Rita,
>
> Are you doing a push from the source cluster or a pull from the target
> cluster?
>
> Doing a pull with distcp using hftp (to accommodate version differences)
> has the advantage of slightly fewer transfers of blocks over the TORs. Each
> block is read from exactly the datanode where it is located, and on the
> target side (where the mappers run) the first write is to the local
> datanode. With RF=3 each block transfers out of the source TOR, into the
> target TOR, and out of the first target-cluster TOR into a different
> target-cluster TOR for replicas 2 & 3. Overall, 2 times out and 2 times in.
>
> Doing a pull with webhdfs://, the proxy server has to collect all blocks
> from the source DNs, then they get pulled to the target machine.
> The situation is similar to the above, with one extra transfer of all data
> going through the "proxy" server.
>
> Doing a push with webhdfs:// on the target cluster side, the mapper has to
> collect all blocks from one or more files (depending on # mappers used) and
> send them to the proxy server, which then writes blocks to the target
> cluster. The advantage on the target cluster is that the blocks of a large
> multi-block file get spread over different datanodes on the target side.
> But if I'm counting correctly, you'll have the most data transfer: out of
> each source DN, through the source-cluster mapper DN, through the target
> proxy server, to the target DN, and out/in again for replicas 2 & 3.
>
> So convenience and setup aside, I think the first option would involve the
> fewest network transfers.
> Now if your clusters are separated by a WAN, then this may not matter
> at all.
>
> Just something to think about.
>
> Cheers,
>
> Joep
>
>
> On Fri, Oct 12, 2012 at 8:37 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>
> > Rita,
> >
> > I believe, per the implementation, that webhdfs:// URIs should work
> > fine. Please give it a try and let us know.
> >
> > On Fri, Oct 12, 2012 at 7:14 PM, Rita <[EMAIL PROTECTED]> wrote:
> > > I have 2 different versions of Hadoop running. I need to copy a
> > > significant amount of data (100 TB) from one cluster to another. I know
> > > distcp is the way to go. On the target cluster I have webhdfs running.
> > > Would that work?
> > >
> > > The DistCp manual says I need to use "HftpFileSystem". Is that necessary,
> > > or will webhdfs do the task?
> > >
> > >
> > >
> > > --
> > > --- Get your facts first, then you can distort them as you please.--
> >
> >
> >
> > --
> > Harsh J
> >
>

--
--- Get your facts first, then you can distort them as you please.--