MapReduce >> mail # user >> Re: Copy Vs DistCP


Re: Copy Vs DistCP
On Sun, Apr 14, 2013 at 1:13 AM, Mathias Herberts <[EMAIL PROTECTED]> wrote:

> That was a hidden shameless plug Ted ;-)
>

Well, I will admit it was a shameless correction to Lance's absolute and
incorrect claim.

> The main disadvantage of fs -cp is that all data has to transit via the
> machine you issue the command on; depending on the size of the data you want
> to copy, that can be a killer. DistCp is distributed, as its name implies, so
> there is no bottleneck of this kind.
>

This is absolutely true.  Distcp dominates cp for large copies.  On the
other hand, cp dominates distcp for convenience.
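
To make the bottleneck concrete, here is a rough Python sketch of what a client-side copy amounts to: one process reads each buffer from the source and writes it to the destination, so every byte crosses that one machine. The in-memory streams stand in for HDFS input and output streams, and the function name `copy_through_client` and the 64 KiB buffer size are made up for illustration; this is not Hadoop's actual code.

```python
import io


def copy_through_client(src, dst, buffer_size=64 * 1024):
    """Copy src to dst through this process and return the bytes moved.

    Hypothetical helper: src and dst are any file-like objects. In a real
    cluster they would be remote streams, but the loop has the same shape.
    """
    total = 0
    while True:
        chunk = src.read(buffer_size)  # data arrives at the client...
        if not chunk:
            break
        dst.write(chunk)               # ...and leaves it again
        total += len(chunk)
    return total


# Stand-ins for a source file and a destination file.
source = io.BytesIO(b"x" * 200_000)
target = io.BytesIO()
moved = copy_through_client(source, target)
print(moved, target.getvalue() == source.getvalue())
```

However many data nodes hold the blocks, a loop like this is limited by the network and CPU of the single machine running it.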

In my own experience, I love cp when copying relatively small amounts of
data (tens of GB) where the available bandwidth of about a GB/s allows the
copy to complete in less time than it takes distcp to get started.

At larger sizes (hundreds of GB and up), the startup time of distcp doesn't
matter because once it gets going, it moves data much faster.
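
The trade-off can be put into a back-of-the-envelope model. The sketch below is only illustrative: the 1 GB/s cp rate comes from the figure above, but the 30 second distcp startup time and the 5 GB/s aggregate distcp rate are assumed numbers, not measurements, and the function names are hypothetical.

```python
def cp_seconds(size_gb, cp_rate=1.0):
    """Time for a single-client cp: no startup cost, one machine's bandwidth."""
    return size_gb / cp_rate


def distcp_seconds(size_gb, startup_s=30.0, distcp_rate=5.0):
    """Time for distcp: fixed job startup (assumed) plus a parallel copy."""
    return startup_s + size_gb / distcp_rate


def break_even_gb(cp_rate=1.0, startup_s=30.0, distcp_rate=5.0):
    """Size at which both take equally long; above it distcp wins."""
    if distcp_rate <= cp_rate:
        raise ValueError("distcp never catches up unless it is faster than cp")
    # Solve size / cp_rate == startup_s + size / distcp_rate for size.
    return startup_s * cp_rate * distcp_rate / (distcp_rate - cp_rate)


for size in (10, 100, 1000):
    print(size, "GB: cp", cp_seconds(size), "s, distcp", distcp_seconds(size), "s")
print("break-even at", break_even_gb(), "GB")
```

With these assumed numbers the break-even point is 37.5 GB, which matches the rule of thumb: cp for tens of GB, distcp for hundreds of GB and up. On a busy cluster the effective startup time would be longer, which pushes the break-even point higher.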

> On Apr 14, 2013 6:15 AM, "Ted Dunning" <[EMAIL PROTECTED]> wrote:
>
>>
>> Lance,
>>
>> Never say never.
>>
>> Linux programs can read from the right kind of Hadoop cluster without
>> using FUSE.
>>
>>
>>
>>
>> On Fri, Apr 12, 2013 at 10:15 AM, Lance Norskog <[EMAIL PROTECTED]> wrote:
>>
>>>  Shell 'cp' only works if you use 'fuse', which makes the HDFS file
>>> system visible as a Unix mounted file system. Otherwise, Unix programs
>>> cannot read or write HDFS files.
>>>
>>> On 04/11/2013 09:52 AM, KayVajj wrote:
>>>
>>>    Summing up what would be the recommendations for copy
>>>
>>>  1) DistCP
>>>  2) shell cp command
>>>  3) Using File System API(FileUtils to be precise) inside of a Java
>>> program
>>>  4) An MR job with an Identity Mapper and no Reducer (maybe this is what
>>> DistCP does)
>>>
>>>
>>>  I did not run any comparisons as my dev cluster is just a two node
>>> cluster and not sure how this would perform on a production cluster.
>>>
>>>  Kay
>>>
>>>
>>> On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas <[EMAIL PROTECTED]> wrote:
>>>
>>>>  Yes, makes sense... cp is serialized and simpler, and does not rely
>>>> on the JobTracker, whereas distcp only submits a job and waits for
>>>> completion, so it can fail if tasks start to fail or time out.
>>>>  I have seen distcp fail and hang before, albeit not often.
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov <[EMAIL PROTECTED]>
>>>> wrote:
>>>>
>>>>   If the cluster is busy with other jobs, distcp will wait for free map
>>>> slots. Regular cp is more reliable and predictable, especially if you need
>>>> to copy just several GB.
>>>> On Apr 10, 2013 6:31 PM, "Azuryy Yu" <[EMAIL PROTECTED]> wrote:
>>>>
>>>>>  The cp command is not parallel; it just calls FileSystem, even though
>>>>> DFSClient is multithreaded.
>>>>>
>>>>>  DistCp can work well on the same cluster.
>>>>>
>>>>>
>>>>>  On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>>  The File System Copy utility copies files byte by byte, if I'm not
>>>>>> wrong. Could it be possible that the cp command works with blocks and moves
>>>>>> them, which could be significantly more efficient?
>>>>>>
>>>>>>
>>>>>>  Also, how does the cp command work if the file is distributed across
>>>>>> different data nodes?
>>>>>>
>>>>>>  Thanks
>>>>>>  Kay
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>>  DistCP is a full-blown MapReduce job (mapper only, where the
>>>>>>> mappers do a "fully" parallel copy to the destination).
>>>>>>>
>>>>>>> CP appears (correct me if I'm wrong) to simply invoke the FileSystem
>>>>>>> and issue a copy command for every source file.
>>>>>>>
>>>>>>>  I have an additional question: how is a CP that is internal to a
>>>>>>> cluster optimized (if at all)?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>  On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <[EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>>  Hi,
>>>>>>>>
>>>>>>>> I think it's better using Copy in the same cluster while using