Hadoop, mail # user - Re: Re: distcp question


kojie.fu 2012-10-12, 19:20
Re: Re: distcp question
Rita 2012-10-12, 20:40
Never mind. Figured it out.
On Fri, Oct 12, 2012 at 3:20 PM, kojie.fu <[EMAIL PROTECTED]> wrote:

>
>
>
>
>
> kojie.fu
>
> From: Rita
> Date: 2012-10-13 03:19
> To: common-user
> Subject: Re: distcp question
> Thanks for the advice.
>
> Before I push or pull, are there any tests I can run before I do the
> distcp? I am not 100% sure I have my webhdfs set up properly.
>
>
>
>
> On Fri, Oct 12, 2012 at 1:01 PM, J. Rottinghuis <[EMAIL PROTECTED]> wrote:
>
> > Rita,
> >
> > Are you doing a push from the source cluster or a pull from the target
> > cluster?
> >
> > Doing a pull with distcp using hftp (to accommodate version differences)
> > has the advantage of slightly fewer transfers of blocks over the TORs.
> > Each block is read from exactly the datanode where it is located, and on
> > the target side (where the mappers run) the first write is to the local
> > datanode. With RF=3 each block transfers out of the source TOR, into the
> > target TOR, out of the first target-cluster TOR into a different
> > target-cluster TOR for replicas 2 & 3. Overall 2 times out, and 2 times in.
> >
> > Doing a pull with webhdfs://, the proxy server has to collect all blocks
> > from the source DNs, then they get pulled to the target machine.
> > The situation is similar to the above, with one extra transfer of all
> > data going through the "proxy" server.
> >
> > Doing a push with webhdfs:// on the target cluster side, the mapper has to
> > collect all blocks from one or more files (depending on # mappers used) and
> > send them to the proxy server, which then writes blocks to the target
> > cluster. The advantage on the target cluster is that the blocks of a large
> > multi-block file get spread over different datanodes on the target side.
> > But if I'm counting correctly, you'll have the most data transfer: out of
> > each source DN, through the source cluster mapper DN, through the target
> > proxy server, to the target DN, and out/in again for replicas 2 & 3.
> >
> > So convenience and setup aside, I think the first option would involve the
> > fewest network transfers.
> > Now if your clusters are separated by a WAN, then this may not matter
> > at all.
> >
> > Just something to think about.
> >
> > Cheers,
> >
> > Joep
> >
> >
> > On Fri, Oct 12, 2012 at 8:37 AM, Harsh J <[EMAIL PROTECTED]> wrote:
> >
> > > Rita,
> > >
> > > I believe, per the implementation, that webhdfs:// URIs should work
> > > fine. Please give it a try and let us know.
> > >
> > > On Fri, Oct 12, 2012 at 7:14 PM, Rita <[EMAIL PROTECTED]> wrote:
> > > > I have 2 different versions of Hadoop running. I need to copy a
> > > > significant amount of data (100 TB) from one cluster to another. I
> > > > know distcp is the way to do it. On the target cluster I have webhdfs
> > > > running. Would that work?
> > > >
> > > > The DistCp manual says I need to use "HftpFileSystem". Is that
> > > > necessary, or will webhdfs do the task?
> > > >
> > > >
> > > >
> > > > --
> > > > --- Get your facts first, then you can distort them as you please.--
> > >
> > >
> > >
> > > --
> > > Harsh J
> > >
> >
>
>
>
> --
> --- Get your facts first, then you can distort them as you please.--
>
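
Joep's hop counting for the three options can be written down as a small sketch. The hop lists below paraphrase his description. The option labels, the helper names, and the assumption that every listed hop crosses a top-of-rack (TOR) switch are illustrative simplifications, not measurements; the 100 TB figure is the data size from the original question.

```python
# Rough model of the transfer counting in the thread.
# Assumption: every hop listed here crosses a TOR switch (worst case), and
# replicas 2 and 3 cost one extra hop into a second target rack.
OPTIONS = {
    "pull via hftp": [
        ("source DN", "target mapper + local DN"),
        ("target rack 1", "target rack 2 (replicas 2 and 3)"),
    ],
    "pull via webhdfs": [
        ("source DN", "source proxy"),
        ("source proxy", "target mapper + local DN"),
        ("target rack 1", "target rack 2 (replicas 2 and 3)"),
    ],
    "push via webhdfs": [
        ("source DN", "source mapper"),
        ("source mapper", "target proxy"),
        ("target proxy", "target DN"),
        ("target rack 1", "target rack 2 (replicas 2 and 3)"),
    ],
}


def hops_per_block(option):
    """Number of network hops each block makes under the given option."""
    return len(OPTIONS[option])


def total_traffic_tb(option, data_tb):
    """Total data moved over the network, in TB, for data_tb of source data."""
    return hops_per_block(option) * data_tb


def cheapest_option():
    """Option with the fewest hops per block."""
    return min(OPTIONS, key=hops_per_block)


if __name__ == "__main__":
    for name in OPTIONS:
        print(name, hops_per_block(name), "hops,",
              total_traffic_tb(name, 100), "TB moved for 100 TB of data")
    print("fewest transfers:", cheapest_option())
```

Under these assumptions the model agrees with the conclusion in the thread that the hftp pull involves the fewest transfers. Over a WAN link the inter-cluster hop likely dominates, so the difference may matter much less.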

--
--- Get your facts first, then you can distort them as you please.--
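
On the question of testing a webhdfs setup before running distcp, the thread does not say what was actually done. One possible sanity check is to call the WebHDFS REST API directly and list a directory. The sketch below makes several assumptions: the NameNode serves HTTP on port 50070 (the usual default for Hadoop releases of that era), WebHDFS is enabled on the cluster, security is off so no credentials are needed, and `nn.example.com` is a placeholder host name. The helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def liststatus_url(host, path="/", port=50070):
    """Build the WebHDFS LISTSTATUS URL for an absolute HDFS path."""
    return "http://%s:%d/webhdfs/v1%s?op=LISTSTATUS" % (host, port, quote(path))


def entry_names(body):
    """Extract entry names from a LISTSTATUS JSON response body."""
    statuses = json.loads(body)["FileStatuses"]["FileStatus"]
    return [status["pathSuffix"] for status in statuses]


def check_webhdfs(host, path="/", port=50070, timeout=10):
    """List a directory over WebHDFS; raises if the endpoint is unreachable."""
    with urlopen(liststatus_url(host, path, port), timeout=timeout) as resp:
        return entry_names(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder host: replace with the NameNode of the cluster under test.
    print(check_webhdfs("nn.example.com", "/"))
```

If the listing comes back with the expected entries, the HTTP side of webhdfs is at least reachable. It does not prove that a full distcp run will succeed, so copying a small test directory first is still a reasonable next step.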
Rita 2012-10-12, 13:44
Harsh J 2012-10-12, 15:37
J. Rottinghuis 2012-10-12, 17:01
Rita 2012-10-12, 19:19