MapReduce >> mail # user >> Re: Need help about task slots


Mohammad Tariq 2013-05-11, 17:37
Mohammad Tariq 2013-05-12, 12:09
Rahul Bhattacharjee 2013-05-12, 12:33
Rahul Bhattacharjee 2013-05-12, 12:34
Re: Need help about task slots
Hahaha..I think we could continue this over there..

Warm Regards,
Tariq
cloudfront.blogspot.com
On Sun, May 12, 2013 at 6:04 PM, Rahul Bhattacharjee <
[EMAIL PROTECTED]> wrote:

> Sorry for my blunder as well. My previous post was meant for Tariq but went
> to the wrong thread.
>
> Thanks.
> Rahul
>
>
> On Sun, May 12, 2013 at 6:03 PM, Rahul Bhattacharjee <
> [EMAIL PROTECTED]> wrote:
>
>> Oh! I thought distcp works on complete files rather than with one mapper
>> per data block.
>> So I guess parallelism would still be there if there are multiple files.
>> Please correct me if there is anything wrong.
>>
>> Thanks,
>> Rahul
>>
>>
>> On Sun, May 12, 2013 at 5:39 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>>
>>> @Rahul: I'm sorry, I am not aware of any such document. But you could
>>> use distcp for a local-to-HDFS copy:
>>>
>>> bin/hadoop distcp file:///home/tariq/in.txt hdfs://localhost:9000/
>>>
>>> And yes, when you use distcp from local to HDFS you don't get the benefit
>>> of parallelism, as the source data is not stored in a distributed fashion.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sat, May 11, 2013 at 11:07 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hello guys,
>>>>
>>>> My 2 cents:
>>>>
>>>> Actually, the no. of mappers is primarily governed by the no. of InputSplits
>>>> created by the InputFormat you are using, and the no. of reducers by the no.
>>>> of partitions you get after the map phase. Having said that, you should
>>>> also keep the no. of slots available per slave in mind, along with the
>>>> available memory. But as a general rule you could use this approach:
>>>>
>>>> Take the no. of virtual CPUs * 0.75, and that's the no. of slots you can
>>>> configure. For example, if you have 12 physical cores (or 24 virtual
>>>> cores), you would have 24 * 0.75 = 18 slots. Now, based on your requirement,
>>>> you could choose how many mappers and reducers you want to use. With 18 MR
>>>> slots, you could have 9 mappers and 9 reducers, or 12 mappers and 6 reducers,
>>>> or whatever split works for you.
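
The rule of thumb above can be sketched in Python. This is only an illustration: the function names `total_slots` and `reduce_slots` are hypothetical, and the 0.75 factor is simply the one quoted in the message, not a Hadoop setting.

```python
import math

def total_slots(virtual_cpus, factor=0.75):
    """Rule of thumb from the message: slots = virtual CPUs * 0.75, rounded down."""
    return math.floor(virtual_cpus * factor)

def reduce_slots(slots, map_slots):
    """Slots left for reducers once map_slots are reserved for mappers."""
    if not 0 <= map_slots <= slots:
        raise ValueError("map_slots must be between 0 and the total slot count")
    return slots - map_slots

slots = total_slots(24)           # 12 physical cores, 24 virtual cores
print(slots)                      # 18
print(reduce_slots(slots, 9))     # 9
print(reduce_slots(slots, 12))    # 6
```

Memory per task is not modelled here; as the message says, it has to be checked separately.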
>>>>
>>>> I don't know if it makes much sense, but it works pretty decently for me.
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 8:57 PM, Rahul Bhattacharjee <
>>>> [EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am also new to the Hadoop world. Here is my take on your question; if
>>>>> there is something missing, others will surely correct it.
>>>>>
>>>>> Pre-YARN, the slots are fixed and computed based on the crunching
>>>>> capacity of the datanode hardware. Once the slots per datanode are
>>>>> ascertained, they are divided into map and reduce slots; that goes
>>>>> into the config files and remains fixed until changed. In YARN, it's
>>>>> decided at runtime based on the requirements of the particular task. It's
>>>>> quite possible for one datanode to be running 10 tasks at a certain point
>>>>> in time while another, similar datanode is running only 4 tasks.
>>>>>
>>>>> Coming to your question: the number of map tasks is decided based on the
>>>>> data set size, the DFS block size, and the input format. Generally, for
>>>>> file-based input formats it's one mapper per data block; however, there
>>>>> are ways to change this using configuration settings. The number of reduce
>>>>> tasks is set in the job configuration.
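
As a rough illustration of the "one mapper per data block" case, the sketch below estimates the map task count from file sizes and block size. The function name `estimate_map_tasks` is hypothetical, and the result is only an approximation: the actual count comes from the splits the InputFormat produces, which can differ.

```python
def estimate_map_tasks(file_sizes, block_size):
    """Approximate map task count, assuming one mapper per block.

    Blocks are counted per file because a block never spans two files.
    """
    # -(-a // b) is integer ceiling division.
    return sum(-(-size // block_size) for size in file_sizes)

MB = 1024 * 1024
# A 200 MB file and a 10 MB file with a 64 MB block size.
print(estimate_map_tasks([200 * MB, 10 * MB], 64 * MB))  # 4 + 1 = 5
```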
>>>>>
>>>>> The general rule, as I have read in various documents, is that mappers
>>>>> should run for at least a minute, so you can run a sample to find a data
>>>>> block size that makes your mappers run for more than a minute. It
>>>>> again depends on your SLA: if you are not bound by a very tight
>>>>> SLA, you can choose to run fewer mappers at the expense of a higher runtime.
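
The sample-run idea can be turned into simple arithmetic. The sketch below assumes mapper throughput scales linearly from the sample, which may not hold in practice; `min_input_bytes` and the sample figures are hypothetical.

```python
import math

def min_input_bytes(sample_bytes, sample_seconds, target_seconds=60):
    """Smallest per-mapper input that keeps a mapper busy for target_seconds,
    assuming the throughput measured on the sample run stays constant."""
    return math.ceil(sample_bytes * target_seconds / sample_seconds)

MB = 1024 * 1024
# Assumed sample: one mapper processed 64 MB in 20 seconds.
size = min_input_bytes(64 * MB, 20)
print(size // MB)  # 192
```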
>>>>>
>>>>> But again, it's all theory; I'm not sure how these things are handled in
>>>>> actual prod clusters.
>>>>>
>>>>> HTH,
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Rahul
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 8:02 PM, Shashidhar Rao <