Re: Copy Vs DistCP
AFAIK, the cp command runs entirely in the DFS client. It reads bytes from
the InputStream created when the source file is opened and writes them to
the OutputStream of the destination file. It does not work at the level of
data blocks. The io.file.buffer.size setting determines the size of the
buffer used for the copy; it is 4096 bytes by default.
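
To illustrate the description above, here is a minimal sketch of a
client-side, buffered stream copy. It is written in Python with in-memory
streams rather than against the Hadoop Java API, so the function name
copy_stream and the BUFFER_SIZE constant are illustrative assumptions, not
Hadoop identifiers. It only shows the general pattern: read a buffer from
the input stream, write it to the output stream, repeat until the input is
exhausted.

```python
import io

# Assumption: mirrors the 4096-byte io.file.buffer.size default mentioned above.
BUFFER_SIZE = 4096


def copy_stream(src, dst, buffer_size=BUFFER_SIZE):
    """Copy all bytes from src to dst through a fixed-size buffer.

    Returns the total number of bytes copied. Hypothetical helper for
    illustration only; it is not part of Hadoop.
    """
    total = 0
    while True:
        chunk = src.read(buffer_size)  # read at most buffer_size bytes
        if not chunk:                  # empty read means end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    source = io.BytesIO(b"x" * 10000)  # stands in for the opened source file
    target = io.BytesIO()              # stands in for the destination file
    print(copy_stream(source, target))
```

Because every byte passes through the one process running the loop, a copy
of this kind is sequential no matter how the file's blocks are spread over
data nodes.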

Thanks
Hemanth
On Thu, Apr 11, 2013 at 9:42 AM, KayVajj <[EMAIL PROTECTED]> wrote:

> If the cp command is not parallel, how does it work for a file partitioned
> across various data nodes?
>
>
> On Wed, Apr 10, 2013 at 6:30 PM, Azuryy Yu <[EMAIL PROTECTED]> wrote:
>
>> The cp command is not parallel. It just calls FileSystem, even if
>> DFSClient has multiple threads.
>>
>> DistCp can work well on the same cluster.
>>
>>
>> On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <[EMAIL PROTECTED]> wrote:
>>
>>> The File System Copy utility copies files byte by byte, if I'm not wrong.
>>> Could the cp command instead work with blocks and move them, which could
>>> be significantly more efficient?
>>>
>>>
>>> Also, how does the cp command work if the file is distributed across
>>> different data nodes?
>>>
>>> Thanks
>>> Kay
>>>
>>>
>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <[EMAIL PROTECTED]> wrote:
>>>
>>>> DistCp is a full-blown MapReduce job (mapper only, where the mappers do
>>>> a "fully" parallel copy to the destination).
>>>>
>>>> cp appears (correct me if I'm wrong) to simply invoke the FileSystem and
>>>> issue a copy command for every source file.
>>>>
>>>> I have an additional question: how is cp within a single cluster
>>>> optimized (if at all)?
>>>>
>>>>
>>>>
>>>> On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I think it's better to use cp within the same cluster and DistCp
>>>>> between clusters. The cp command is a Hadoop-internal parallel process
>>>>> and will not copy files locally.
>>>>>
>>>>> ------------------------------
>>>>>  麦树荣
>>>>>
>>>>>  *From:* KayVajj <[EMAIL PROTECTED]>
>>>>> *Date:* 2013-04-11 06:20
>>>>> *To:* [EMAIL PROTECTED]
>>>>> *Subject:* Copy Vs DistCP
>>>>> I have a few questions regarding the usage of DistCp for copying
>>>>> files within the same cluster.
>>>>>
>>>>>
>>>>> 1) Which one is better within the same cluster, and what factors (like
>>>>> file size etc.) would influence the choice of one over the other?
>>>>>
>>>>> 2) When we run a cp command like the one below from a client node of the
>>>>> cluster (not a data node), how does the cp command work?
>>>>>      i) like an MR job
>>>>>     ii) by copying the files locally and then copying them back to the
>>>>> new location
>>>>>
>>>>>  Example of the copy command
>>>>>
>>>>>  hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>>
>>>>>  Thanks, your responses are appreciated.
>>>>>
>>>>>  -- Kay
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jay Vyas
>>>> http://jayunit100.blogspot.com
>>>>
>>>
>>>
>>
>