MapReduce, mail # user - Re: Copy Vs DistCP


Re: Copy Vs DistCP
Azuryy Yu 2013-04-11, 10:51
yes, you are right.
On Thu, Apr 11, 2013 at 3:40 PM, Hemanth Yamijala <[EMAIL PROTECTED]> wrote:

> AFAIK, the cp command runs entirely in the DFS client. It reads bytes from
> the InputStream created when the source file is opened and writes them to
> the OutputStream of the destination file. It does not operate at the level
> of data blocks. The io.file.buffer.size configuration property sets the
> size of the copy buffer; it defaults to 4096 bytes.
>
> Thanks
> Hemanth
>
>
> On Thu, Apr 11, 2013 at 9:42 AM, KayVajj <[EMAIL PROTECTED]> wrote:
>
>> If the cp command is not parallel, how does it work for a file partitioned
>> across various data nodes?
>>
>>
>> On Wed, Apr 10, 2013 at 6:30 PM, Azuryy Yu <[EMAIL PROTECTED]> wrote:
>>
>>> The cp command is not parallel. It just calls FileSystem, even though
>>> DFSClient has multiple threads.
>>>
>>> DistCp can work well on the same cluster.
>>>
>>>
>>> On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <[EMAIL PROTECTED]> wrote:
>>>
>>>> The FileSystem copy utility copies files byte by byte, if I'm not
>>>> wrong. Could the cp command instead work with blocks and move them,
>>>> which could be significantly more efficient?
>>>>
>>>>
>>>> Also, how does the cp command work if the file is distributed across
>>>> different data nodes?
>>>>
>>>> Thanks
>>>> Kay
>>>>
>>>>
>>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> DistCp is a full-blown MapReduce job (map-only, where the mappers
>>>>> do a "fully" parallel copy to the destination).
>>>>>
>>>>> cp appears (correct me if I'm wrong) to simply invoke the FileSystem
>>>>> and issue a copy command for every source file.
>>>>>
>>>>> I have an additional question: how is a cp that is internal to a
>>>>> cluster optimized (if at all)?
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I think it's better to use cp within the same cluster and DistCp
>>>>>> between clusters. The cp command is a Hadoop-internal parallel process
>>>>>> and will not copy files locally.
>>>>>>
>>>>>> ------------------------------
>>>>>>  麦树荣
>>>>>>
>>>>>> From: KayVajj <[EMAIL PROTECTED]>
>>>>>> Date: 2013-04-11 06:20
>>>>>> To: [EMAIL PROTECTED]
>>>>>> Subject: Copy Vs DistCP
>>>>>> I have a few questions regarding the usage of DistCP for copying
>>>>>> files in the same cluster.
>>>>>>
>>>>>>
>>>>>> 1) Which one is better within the same cluster, and what factors (like
>>>>>> file size etc.) would influence the usage of one over the other?
>>>>>>
>>>>>> 2) When we run a cp command like the one below from a client node of the
>>>>>> cluster (not a data node), how does the cp command work?
>>>>>>       i) like an MR job
>>>>>>      ii) it copies the files locally and then copies them back to the new
>>>>>> location.
>>>>>>
>>>>>>  Example of the copy command
>>>>>>
>>>>>>  hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>>>
>>>>>>  Thanks, your responses are appreciated.
>>>>>>
>>>>>>  -- Kay
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jay Vyas
>>>>> http://jayunit100.blogspot.com
>>>>>
>>>>
>>>>
>>>
>>
>
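
For illustration, the stream copy Hemanth describes above can be sketched in plain Java. This is a minimal sketch under stated assumptions, not Hadoop's actual implementation: it uses only java.io, the class name StreamCopySketch and its copy helper are hypothetical, and in-memory byte arrays stand in for the HDFS input and output streams. The 4096 buffer size is the io.file.buffer.size default quoted in the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical illustration class; not part of Hadoop.
class StreamCopySketch {
    // Assumption: mirrors the io.file.buffer.size default mentioned in the thread.
    static final int DEFAULT_BUFFER_SIZE = 4096;

    // Reads from 'in' one buffer at a time and writes each chunk to 'out'.
    // Returns the total number of bytes copied.
    static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int n;
        // read() returns -1 at end of stream; otherwise the number of bytes read.
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-ins for the source and destination file streams.
        byte[] data = new byte[10_000];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), sink, DEFAULT_BUFFER_SIZE);
        System.out.println("copied " + copied + " bytes");
    }
}
```

In a loop like this every byte passes through the single client process, one buffer at a time, which matches the thread's point that cp is sequential and does not work on blocks, in contrast to the parallel map tasks of DistCp.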