Re: Hadoop noob question
This is what I would say:

The number of maps is decided as follows. Since it's a good idea to get
each map to copy a reasonable amount of data to minimize overheads in task
setup, each map copies at least 256 MB (unless the total size of the input
is less, in which case one map handles it all). For example, 1 GB of files
will be given four map tasks. When the data size is very large, it becomes
necessary to limit the number of maps in order to limit bandwidth and
cluster utilization. By default, the maximum number of maps is 20 per
(tasktracker) cluster node. For example, copying 1,000 GB of files to a
100-node cluster will allocate 2,000 maps (20 per node), so each will copy
512 MB on average. This can be reduced by specifying the -m argument to
*distcp*. For example, -m 1000 would allocate 1,000 maps, each copying 1 GB
on average.
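
For instance, capping a copy at 1,000 maps would look like this (the
namenode addresses and paths here are just placeholders):

    hadoop distcp -m 1000 hdfs://nn1:8020/source hdfs://nn2:8020/destination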

HTH

Warm Regards,
Tariq
cloudfront.blogspot.com
On Sun, May 12, 2013 at 6:35 PM, Rahul Bhattacharjee <
[EMAIL PROTECTED]> wrote:

> Soon after replying I realized something else related to this.
>
> Say we have a single file in HDFS (HDFS configured for the default block
> size of 64 MB) and the size of the file is 1 GB. Now, if we use distcp to
> move it from the current HDFS to another one, would there be any
> parallelism, or would just a single map task be fired?
>
> As per what I have read, a mapper is launched for a complete file or a
> set of files; it doesn't operate at the block level. So no parallelism,
> even if the file resides in HDFS.
>
> Thanks,
> Rahul
>
>
> On Sun, May 12, 2013 at 6:28 PM, Rahul Bhattacharjee <
> [EMAIL PROTECTED]> wrote:
>
>> Yeah, you are right. I misread your earlier post.
>>
>> Thanks,
>> Rahul
>>
>>
>> On Sun, May 12, 2013 at 6:25 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>>
>>> I had said that if you use distcp to copy data *from localFS to HDFS*,
>>> then you won't be able to exploit parallelism, as the entire file is
>>> present on a single machine. So no multiple TTs (TaskTrackers).
>>>
>>> Please comment if you think I am wrong somewhere.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>> Yes, it's an MR job under the hood. My question was about your comment
>>>> that using distcp you lose the benefits of Hadoop's parallel processing.
>>>> I think the MR job of distcp divides files into individual map tasks
>>>> based on the total size of the transfer, so multiple mappers would still
>>>> be spawned if the size of the transfer is huge, and they would work in
>>>> parallel.
>>>>
>>>> Correct me if there is anything wrong!
>>>>
>>>> Thanks,
>>>> Rahul
>>>>
>>>>
>>>> On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> No. distcp is actually a MapReduce job under the hood.
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
>>>>> [EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Thanks to both of you!
>>>>>>
>>>>>> Rahul
>>>>>>
>>>>>>
>>>>>> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> you can do that using file:///
>>>>>>>
>>>>>>> Example:
>>>>>>>
>>>>>>> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
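>>>>>>>
>>>>>>> and a sketch of the other direction, local FS into HDFS (the paths
>>>>>>> are placeholders again):
>>>>>>>
>>>>>>> hadoop distcp file:///Users/myhome/Desktop/somefile hdfs://localhost:8020/somedir/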
>>>>>>>
>>>>>>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> @Tariq, can you point me to some resource which shows how distcp is
>>>>>>>> used to upload files from local to HDFS?
>>>>>>>>
>>>>>>>> Isn't distcp an MR job? Wouldn't it need the data to be already
>>>>>>>> present in Hadoop's FS?
>>>>>>>>
>>>>>>>> Rahul
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <
>>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>>
>>>>>>>>