Re: distcp question
Thanks for the advice.

Before I push or pull, are there any tests I can run before I do the
distcp? I am not 100% sure I have webhdfs set up properly.
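Would something like this be a reasonable sanity check, assuming the
default namenode HTTP port of 50070 (hostnames below are made up)?

  # hit the WebHDFS REST endpoint directly; should return a JSON listing
  curl -i "http://target-nn:50070/webhdfs/v1/tmp?op=LISTSTATUS"

  # or list the same path through the Hadoop client over webhdfs
  hadoop fs -ls webhdfs://target-nn:50070/tmp
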
On Fri, Oct 12, 2012 at 1:01 PM, J. Rottinghuis <[EMAIL PROTECTED]> wrote:

> Rita,
>
> Are you doing a push from the source cluster or a pull from the target
> cluster?
>
> Doing a pull with distcp using hftp (to accommodate version differences)
> has the advantage of slightly fewer block transfers over the top-of-rack
> (TOR) switches. Each block is read from exactly the datanode where it is
> located, and on the target side (where the mappers run) the first write
> is to the local datanode. With RF=3, each block transfers out of the
> source TOR, into the target TOR, then out of the first target-cluster
> TOR into a different target-cluster TOR for replicas 2 & 3. Overall: two
> times out, two times in.
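>
> As a sketch (hostnames and paths made up; hftp answers on the namenode
> HTTP port, typically 50070), that pull, run on the target cluster,
> would look something like:
>
>   hadoop distcp hftp://source-nn:50070/src/path /dst/path
>
> where the bare destination path resolves against the target cluster's
> default filesystem, since that's where the job runs.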
>
> Doing a pull with webhdfs://, the proxy server has to collect all blocks
> from the source DNs, and then they get pulled to the target machine. The
> situation is similar to the one above, with the one extra transfer of
> all data going through the "proxy" server.
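>
> E.g., again with made-up names (webhdfs also answers on the namenode
> HTTP port):
>
>   hadoop distcp webhdfs://source-nn:50070/src/path /dst/path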
>
> Doing a push with webhdfs:// on the target cluster side, the mapper has
> to collect all blocks from one or more files (depending on the # of
> mappers used) and send them to the proxy server, which then writes the
> blocks to the target cluster. The advantage on the target side is that
> the blocks of a large multi-block file get spread over different
> datanodes. But if I'm counting correctly, you'll have the most data
> transfer: out of each source DN, through a source-cluster mapper DN,
> through the target proxy server, to the target DN, and out/in again for
> replicas 2 & 3.
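>
> The push variant, run on the source cluster, would be roughly:
>
>   hadoop distcp /src/path webhdfs://target-nn:50070/dst/path
>
> (names made up again; here the bare source path resolves against the
> source cluster's default filesystem).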
>
> So convenience and setup aside, I think the first option involves the
> fewest network transfers. Now, if your clusters are separated over a
> WAN, this may not matter at all.
>
> Just something to think about.
>
> Cheers,
>
> Joep
>
>
> On Fri, Oct 12, 2012 at 8:37 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>
> > Rita,
> >
> > I believe, per the implementation, that webhdfs:// URIs should work
> > fine. Please give it a try and let us know.
> >
> > On Fri, Oct 12, 2012 at 7:14 PM, Rita <[EMAIL PROTECTED]> wrote:
> > > I have 2 different versions of Hadoop running. I need to copy a
> > > significant amount of data (100 TB) from one cluster to another. I
> > > know distcp is the way to do it. On the target cluster I have
> > > webhdfs running. Would that work?
> > >
> > > The DistCp manual says I need to use "HftpFileSystem". Is that
> > > necessary, or will webhdfs do the task?
> > >
> > >
> > >
> > > --
> > > --- Get your facts first, then you can distort them as you please.--
> >
> >
> >
> > --
> > Harsh J
> >
>

--
--- Get your facts first, then you can distort them as you please.--