Re: Copy Vs DistCP
DistCP is preferred for your requirements.
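
For reference, the two command-line options compared in this thread can be sketched roughly as below. This is only a sketch under stated assumptions: the helper names (`run`, `copy_serial`, `copy_distcp`), the `DRY_RUN` switch and the example paths are made up for illustration. Only `hdfs dfs -cp` and `hadoop distcp` are actual Hadoop commands. By default the script just prints the commands, so it can be read and tried without a cluster.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# DRY_RUN is a hypothetical switch; set DRY_RUN=0 on a machine
# with a configured Hadoop client to actually run the copies.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Serial copy through the FileSystem client; no MapReduce job is submitted.
copy_serial() {
    run hdfs dfs -cp "$1" "$2"
}

# Map-only MapReduce copy; on a busy cluster it waits for free map slots.
copy_distcp() {
    run hadoop distcp "$1" "$2"
}

# Example calls with placeholder paths (hypothetical locations).
copy_serial /some_location/file /new_location/
copy_distcp /some_location /new_location
```

Which of the two is the better fit likely depends on the points raised in the quoted replies: data volume, how busy the cluster is, and whether depending on the JobTracker is acceptable.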
On Fri, Apr 12, 2013 at 12:52 AM, KayVajj <[EMAIL PROTECTED]> wrote:

> Summing up, what would be the recommendations for copying?
>
> 1) DistCP
> 2) shell cp command
> 3) Using the FileSystem API (FileUtils to be precise) inside a Java program
> 4) An MR job with an identity mapper and no reducer (maybe this is what DistCP
> does)
>
>
> I did not run any comparisons, as my dev cluster is just a two-node cluster
> and I am not sure how this would perform on a production cluster.
>
> Kay
>
>
> On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas <[EMAIL PROTECTED]> wrote:
>
>> Yes, makes sense. cp is serialized and simpler, and does not rely on the
>> JobTracker, whereas distcp only submits a job and waits for
>> completion.
>> So it can fail if tasks start to fail or time out.
>> I have seen distcp fail and hang before, albeit not often.
>>
>> Sent from my iPhone
>>
>> On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov <[EMAIL PROTECTED]>
>> wrote:
>>
>> If the cluster is busy with other jobs, distcp will wait for free map slots.
>> Regular cp is more reliable and predictable, especially if you need to copy
>> just several GB.
>> On Apr 10, 2013 6:31 PM, "Azuryy Yu" <[EMAIL PROTECTED]> wrote:
>>
>>> The cp command is not parallel. It just calls FileSystem, even though
>>> DFSClient is multi-threaded.
>>>
>>> DistCp can work well on the same cluster.
>>>
>>>
>>> On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <[EMAIL PROTECTED]> wrote:
>>>
>>>> The File System Copy utility copies files byte by byte, if I'm not
>>>> wrong. Could it be possible that the cp command works with blocks and moves
>>>> them, which could be significantly more efficient?
>>>>
>>>>
>>>> Also, how does the cp command work if the file is distributed across
>>>> different data nodes?
>>>>
>>>> Thanks
>>>> Kay
>>>>
>>>>
>>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> DistCP is a full-blown MapReduce job (mapper only, where the mappers
>>>>> do a "fully" parallel copy to the destination).
>>>>>
>>>>> CP appears (correct me if I'm wrong) to simply invoke the FileSystem
>>>>> and issue a copy command for every source file.
>>>>>
>>>>> I have an additional question: how is CP, which is internal to a
>>>>> cluster, optimized (if at all)?
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I think it's better to use Copy within the same cluster and DistCP
>>>>>> between clusters. The cp command is a Hadoop-internal parallel process and
>>>>>> will not copy files locally.
>>>>>>
>>>>>> ------------------------------
>>>>>>  麦树荣
>>>>>>
>>>>>>  *From:* KayVajj <[EMAIL PROTECTED]>
>>>>>> *Date:* 2013-04-11 06:20
>>>>>> *To:* [EMAIL PROTECTED]
>>>>>> *Subject:* Copy Vs DistCP
>>>>>> I have a few questions regarding the usage of DistCP for copying
>>>>>> files in the same cluster.
>>>>>>
>>>>>>
>>>>>> 1) Which one is better within the same cluster, and what factors (like
>>>>>> file size, etc.) would influence the usage of one over the other?
>>>>>>
>>>>>>  2) When we run a cp command like the one below from a client node of the
>>>>>> cluster (not a data node), how does the cp command work?
>>>>>>       i) like an MR job
>>>>>>      ii) copy the files locally and then copy them back to the new
>>>>>> location.
>>>>>>
>>>>>>  Example of the copy command
>>>>>>
>>>>>>  hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>>>
>>>>>>  Thanks, your responses are appreciated.
>>>>>>
>>>>>>  -- Kay
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jay Vyas
>>>>> http://jayunit100.blogspot.com
>>>>>
>>>>
>>>>
>>>
>
Amal G Jose 2013-04-15, 18:10