Re: Copy Vs DistCP
To sum up, these seem to be the options for copying:

1) DistCP
2) The shell cp command
3) The FileSystem API (FileUtil, to be precise) inside a Java program
4) An MR job with an identity mapper and no reducer (maybe this is what
DistCP does)

I did not run any comparisons, as my dev cluster has only two nodes and I am
not sure how these would perform on a production cluster.
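
To make the difference between a serial, client-side copy (options 2 and 3) and a parallel copy (options 1 and 4) a bit more concrete, here is a rough sketch. It is not Hadoop code: it uses plain java.nio files on the local disk as a stand-in for HDFS, and a thread pool as a stand-in for map tasks. The names CopySketch, copyOne, copySerial and copyParallel are made up for illustration. In a real Hadoop program the streams would presumably come from the FileSystem API (or FileUtil.copy would be called directly) instead.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CopySketch {

    // Stream one file through this process, buffer by buffer.
    // Bytes flow source -> this client -> destination.
    static long copyOne(Path src, Path dstDir) throws IOException {
        Path dst = dstDir.resolve(src.getFileName());
        long total = 0;
        byte[] buf = new byte[4096];
        try (InputStream in = Files.newInputStream(src);
             OutputStream out = Files.newOutputStream(dst)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
                total += n;
            }
        }
        return total;
    }

    // cp-like (assumption): one file after another, in a single thread.
    static long copySerial(List<Path> sources, Path dstDir) throws IOException {
        long total = 0;
        for (Path src : sources) {
            total += copyOne(src, dstDir);
        }
        return total;
    }

    // DistCP-like (assumption): files are handed to a pool of workers,
    // which stand in for map tasks copying in parallel.
    static long copyParallel(List<Path> sources, Path dstDir, int workers)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (Path src : sources) {
                Callable<Long> task = () -> copyOne(src, dstDir);
                pending.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : pending) {
                total += f.get(); // a failed worker surfaces here
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Create two small source files in a temporary directory.
        Path srcDir = Files.createTempDirectory("copy-src");
        Path a = Files.write(srcDir.resolve("a.txt"), "hello".getBytes());
        Path b = Files.write(srcDir.resolve("b.txt"), "distcp".getBytes());
        List<Path> sources = Arrays.asList(a, b);

        Path serialDst = Files.createTempDirectory("copy-serial");
        Path parallelDst = Files.createTempDirectory("copy-parallel");

        System.out.println("serial bytes:   " + copySerial(sources, serialDst));
        System.out.println("parallel bytes: " + copyParallel(sources, parallelDst, 2));
    }
}
```

The sketch only shows the shape of the two approaches. It says nothing about which is faster on a real cluster, where scheduling overhead and free map slots matter.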

Kay
On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas <[EMAIL PROTECTED]> wrote:

> Yes, makes sense... cp is serialized and simpler, and does not rely on the
> jobtracker, whereas distcp only submits a job and waits for completion,
> so it can fail if tasks start to fail or time out.
> I have seen distcp fail and hang before, albeit not often.
>
> Sent from my iPhone
>
> On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov <[EMAIL PROTECTED]>
> wrote:
>
> If the cluster is busy with other jobs, distcp will wait for free map slots.
> Regular cp is more reliable and predictable, especially if you need to copy
> just several GB.
> On Apr 10, 2013 6:31 PM, "Azuryy Yu" <[EMAIL PROTECTED]> wrote:
>
>> The cp command is not parallel. It just calls FileSystem, even though
>> DFSClient has multiple threads.
>>
>> DistCp can work well on the same cluster.
>>
>>
>> On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <[EMAIL PROTECTED]> wrote:
>>
>>> The File System copy utility copies files byte by byte, if I'm not wrong.
>>> Could it be that the cp command works with blocks and moves them,
>>> which could be significantly more efficient?
>>>
>>>
>>> Also, how does the cp command work if the file is distributed across
>>> different data nodes?
>>>
>>> Thanks
>>> Kay
>>>
>>>
>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <[EMAIL PROTECTED]> wrote:
>>>
>>>> DistCP is a full-blown MapReduce job (mapper only, where the mappers do
>>>> a "fully" parallel copy to the destination).
>>>>
>>>> CP appears (correct me if I'm wrong) to simply invoke the FileSystem and
>>>> issue a copy command for every source file.
>>>>
>>>> I have an additional question: how is CP, which is internal to a cluster,
>>>> optimized (if at all)?
>>>>
>>>>
>>>>
>>>> On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I think it's better to use cp within the same cluster and DistCP
>>>>> between clusters. The cp command is a Hadoop-internal parallel process and
>>>>> will not copy files locally.
>>>>>
>>>>> ------------------------------
>>>>>  麦树荣
>>>>>
>>>>>  *From:* KayVajj <[EMAIL PROTECTED]>
>>>>> *Date:* 2013-04-11 06:20
>>>>> *To:* [EMAIL PROTECTED]
>>>>> *Subject:* Copy Vs DistCP
>>>>> I have a few questions regarding the usage of DistCP for copying
>>>>> files within the same cluster.
>>>>>
>>>>>
>>>>> 1) Which one is better within the same cluster, and what factors (like
>>>>> file size etc.) would influence the choice of one over the other?
>>>>>
>>>>> 2) When we run a cp command like the one below from a client node of the
>>>>> cluster (not a data node), how does the cp command work?
>>>>>       i) like an MR job
>>>>>      ii) copy the files locally and then copy them back to the new
>>>>> location
>>>>>
>>>>>  Example of the copy command
>>>>>
>>>>>  hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>>
>>>>>  Thanks, your responses are appreciated.
>>>>>
>>>>>  -- Kay
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jay Vyas
>>>> http://jayunit100.blogspot.com
>>>>
>>>
>>>
>>