MapReduce, mail # user - Re: Copy Vs DistCP


Re: Copy Vs DistCP
Lance Norskog 2013-04-12, 17:15
Shell 'cp' only works if you use 'fuse', which makes the HDFS file
system visible as a Unix mounted file system. Otherwise, Unix programs
cannot read or write HDFS files.
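
As a rough illustration of the point above: once HDFS is mounted through FUSE, an ordinary program can copy files with plain file APIs. The sketch below is Python; the mount point `/mnt/hdfs` and the helper name `copy_on_mount` are assumptions made for illustration, not anything from this thread.

```python
import shutil
from pathlib import Path


def copy_on_mount(mount_root, src_rel, dst_rel):
    """Copy one file between two paths under the same mounted file system.

    mount_root is a hypothetical FUSE mount point for HDFS (for example
    /mnt/hdfs). Any ordinary directory behaves the same way, which is the
    point: the program only sees regular OS paths.
    """
    root = Path(mount_root)
    src = root / src_rel
    dst = root / dst_rel
    dst.parent.mkdir(parents=True, exist_ok=True)  # create the target directory if missing
    shutil.copyfile(src, dst)  # plain byte-stream copy through the OS file API
    return dst
```

Without such a mount, the same call fails because the path does not exist for the operating system; in that case `hadoop fs -cp` or the FileSystem API is the way to reach the files.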

On 04/11/2013 09:52 AM, KayVajj wrote:
> Summing up, what would be the recommendations for copying?
>
> 1) DistCP
> 2) shell cp command
> 3) Using the FileSystem API (FileUtil, to be precise) inside a Java program
> 4) An MR job with an identity mapper and no reducer (maybe this is what
> DistCP does)
>
>
> I did not run any comparisons, as my dev cluster has just two nodes
> and I am not sure how this would perform on a production cluster.
>
> Kay
>
>
> On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas <[EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>> wrote:
>
>     Yes, makes sense... cp is serialized and simpler, and does not
>     rely on the JobTracker, whereas distcp only submits a job and
>     waits for completion.
>     So it can fail if tasks start to fail or time out.
>     I have seen distcp fail and hang before, albeit not often.
>
>     Sent from my iPhone
>
>     On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov
>     <[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>> wrote:
>
>>     If the cluster is busy with other jobs, distcp will wait for free map
>>     slots. Regular cp is more reliable and predictable, especially if
>>     you need to copy just several GB.
>>
>>     On Apr 10, 2013 6:31 PM, "Azuryy Yu" <[EMAIL PROTECTED]
>>     <mailto:[EMAIL PROTECTED]>> wrote:
>>
>>         The cp command is not parallel. It just calls FileSystem, even
>>         if DFSClient has multiple threads.
>>
>>         DistCp can work well on the same cluster.
>>
>>
>>         On Thu, Apr 11, 2013 at 8:17 AM, KayVajj
>>         <[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>> wrote:
>>
>>             The File System copy utility copies files byte by byte, if
>>             I'm not wrong. Could it be that the cp command
>>             works with blocks and moves them, which could be
>>             significantly more efficient?
>>
>>
>>             Also, how does the cp command work if the file is
>>             distributed across different data nodes?
>>
>>             Thanks
>>             Kay
>>
>>
>>             On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas
>>             <[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>> wrote:
>>
>>                 DistCP is a full-blown MapReduce job (mapper only,
>>                 where the mappers do a "fully" parallel copy to the
>>                 destination).
>>
>>                 CP appears (correct me if I'm wrong) to simply invoke
>>                 the FileSystem and issue a copy command for every
>>                 source file.
>>
>>                 I have an additional question: how is CP, when it is
>>                 internal to a cluster, optimized (if at all)?
>>
>>
>>
>>                 On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣
>>                 <[EMAIL PROTECTED]
>>                 <mailto:[EMAIL PROTECTED]>> wrote:
>>
>>                     Hi,
>>                     I think it's better to use Copy within the same cluster
>>                     and distCP between clusters. The cp
>>                     command is a hadoop internal parallel process and
>>                     will not copy files locally.
>>                     ------------------------------------------------------------------------
>>                     麦树荣
>>                     *From:* KayVajj <mailto:[EMAIL PROTECTED]>
>>                     *Date:* 2013-04-11 06:20
>>                     *To:* [EMAIL PROTECTED]
>>                     <mailto:[EMAIL PROTECTED]>
>>                     *Subject:* Copy Vs DistCP
>>                     I have a few questions regarding the usage of
>>                     DistCP for copying files in the same cluster.
>>
>>
>>                     1) Which one is better within the same cluster, and
>>                     what factors (like file size, etc.) would influence
>>                     the choice of one over the other?
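
The thread contrasts two copy strategies: cp, where a single client copies files one after another, and DistCp, a mapper-only job whose map tasks each copy a share of the files. A minimal local sketch of that difference in Python follows. It uses threads on a local directory as stand-ins for map tasks, so it models only the scheduling shape, not HDFS or MapReduce. The names `serial_copy`, `parallel_copy`, and the `workers` parameter are hypothetical.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def _copy_one(src, dst_dir):
    """Copy a single file into dst_dir and return the new path."""
    target = Path(dst_dir) / Path(src).name
    shutil.copyfile(src, target)
    return target


def serial_copy(sources, dst_dir):
    """cp-like shape: one client copies the files one after another."""
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    return [_copy_one(src, dst_dir) for src in sources]


def parallel_copy(sources, dst_dir, workers=4):
    """DistCp-like shape: the file list is spread over several workers."""
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps the input order; list() re-raises any worker exception
        return list(pool.map(lambda src: _copy_one(src, dst_dir), sources))
```

Both functions assume the source files have distinct base names. The parallel version only pays off when there are enough files to keep the workers busy, which mirrors the remark in the thread that plain cp is the more predictable choice for a few GB.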