In the Hadoop ecosystem, the number of map tasks is decided by the job, based on the number of input splits. Setting mapred.map.tasks does not guarantee that only that many map tasks are launched. What worked for you is that, by setting a value for mapred.min.split.size, you specified the minimum volume of data each map task should process.
So in your case there were really 9 input splits, but when you imposed a constraint on the minimum data a map task should handle, the number of map tasks came down to 3.
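To make the relationship concrete, here is a rough sketch of how the split size bounds translate into a map task count. It is a simplified model, not Hadoop's actual code: it assumes a single splittable file, takes the split size as max(min split, min(max split, block size)), and ignores the slack Hadoop allows on the last split as well as anything CombineHiveInputFormat does. The function names and the sizes used are illustrative only, not the settings from this thread.

```python
import sys

MB = 1024 * 1024


def split_size(block_size, min_split=1, max_split=sys.maxsize):
    """Simplified split size: the block size, clamped by the min/max bounds.

    min_split and max_split stand in for mapred.min.split.size and
    mapred.max.split.size (both in bytes).
    """
    return max(min_split, min(max_split, block_size))


def estimate_map_tasks(file_size, block_size, min_split=1, max_split=sys.maxsize):
    """Estimate map tasks for one splittable file: ceil(file_size / split_size)."""
    size = split_size(block_size, min_split, max_split)
    return -(-file_size // size)  # integer ceiling division


if __name__ == "__main__":
    file_size = 500 * MB   # illustrative; the file in the thread is "about 500M"
    block_size = 64 * MB   # default block size mentioned in the thread

    # Defaults: the split size equals the block size.
    print(estimate_map_tasks(file_size, block_size))

    # Raising the minimum split size gives fewer, larger splits.
    print(estimate_map_tasks(file_size, block_size, min_split=200 * MB))

    # Lowering the maximum split size gives more, smaller splits.
    print(estimate_map_tasks(file_size, block_size, max_split=32 * MB))
```

In this model a requested task count never appears as an input: only the split size bounds and the block size decide the result, which is why changing mapred.min.split.size had an effect where mapred.map.tasks did not.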
Bejoy K S
From: "Daniel,Wu" <[EMAIL PROTECTED]>
Date: Thu, 25 Aug 2011 20:02:43
To: <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: Re:Re:Re: Re: RE: Why a sql only use one map task?
after I set
Then it kicks off 3 map tasks (the file I have is 500M). So it looks like we need to set mapred.min.split.size instead of mapred.map.tasks to control how many maps are kicked off.
At 2011-08-25 19:38:30,"Daniel,Wu" <[EMAIL PROTECTED]> wrote:
It works after I set it as you said, but it looks like I can't control the number of map tasks; it always uses 9 maps, even if I set
Kind | % Complete | Num Tasks | Pending | Running | Complete | Killed | Failed/Killed
     |            | 9         | 0       | 0       | 9        | 0      | 0 / 0
     |            | 1         | 0       | 0       | 1        | 0      | 0 / 0
At 2011-08-25 06:35:38,"Ashutosh Chauhan" <[EMAIL PROTECTED]> wrote:
This may be because CombineHiveInputFormat is combining your splits in one map task. If you don't want that to happen, do:
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
2011/8/24 Daniel,Wu<[EMAIL PROTECTED]>
I pasted the information below; the map capacity is 6. No matter how I set mapred.map.tasks, for example to 3, it doesn't work: it always uses 1 map task (please see the completed job information).
Cluster Summary (Heap Size is 16.81 MB/966.69 MB)
Running Map Tasks | Running Reduce Tasks | Total Submissions | Nodes | Occupied Map Slots | Occupied Reduce Slots | Reserved Map Slots | Reserved Reduce Slots | Map Task Capacity | Reduce Task Capacity | Avg. Tasks/Node | Blacklisted Nodes | Excluded Nodes
Jobid | Priority | User | Name | Map % Complete | Map Total | Maps Completed | Reduce % Complete | Reduce Total | Reduces Completed | Job Scheduling Information | Diagnostic Info
job_201108242119_0001 | NORMAL | oracle | select count(*) from test(Stage-1) | 100.00%
job_201108242119_0002 | NORMAL | oracle | select count(*) from test(Stage-1) | 100.00%
job_201108242119_0003 | NORMAL | oracle | select count(*) from test(Stage-1) | 100.00%
job_201108242119_0004 | NORMAL | oracle | select period_key,count(*) from...period_key(Stage-1) | 100.00%
job_201108242119_0005 | NORMAL | oracle | select period_key,count(*) from...period_key(Stage-1) | 100.00%
job_201108242119_0006 | NORMAL | oracle | select period_key,count(*) from...period_key(Stage-1) | 100.00%
At 2011-08-24 18:19:38,wd <[EMAIL PROTECTED]> wrote:
>What about your total Map Task Capacity?
>you may check it from http://your_jobtracker:50030/jobtracker.jsp
>2011/8/24 Daniel,Wu <[EMAIL PROTECTED]>:
>> I checked my settings; all are at the default values. So per the book
>> "Hadoop: The Definitive Guide", the split size should be 64M. And the file
>> size is about 500M, so that's about 8 splits. And from the map job
>> information (after the map job is done), I can see it gets 8 splits from one
>> node. But anyhow it starts only one map task.
>> At 2011-08-24 02:28:18,"Aggarwal, Vaibhav" <[EMAIL PROTECTED]> wrote:
>> If you actually have splittable files you can set the following setting to
>> create more splits:
>> mapred.max.split.size appropriately.
>> From: Daniel,Wu [mailto:[EMAIL PROTECTED]]
>> Sent: Tuesday, August 23, 2011 6:51 AM
>> To: hive
>> Subject: Why a sql only use one map task?
>> I run the following simple sql
>> select count(*) from sales;