-
replicated join and map tasks
shan s 2012-04-30, 14:54
Sorry for the previous incomplete message. Here is take 2:
When I use a replicated join, only 2 map tasks get scheduled (compared to 100+ tasks for the other steps). What is the idea behind this? What setting do I use to override this behaviour?
Also, a basic question: does Hadoop decide the map task capacity, or does it simply follow the configuration?
Map Task Capacity   Reduce Task Capacity   Avg. Tasks/Node   Blacklisted Nodes   Excluded Nodes
64                  20                     1.00
Thanks, Prashant.
-
Re: replicated join and map tasks
Ashish Gite 2012-04-30, 17:38
Hadoop decides map task capacity based on file size and HDFS block size (the usual default is either 64 MB or 128 MB).
Reducer capacity can be configured via the Pig config.
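To make the block-size rule of thumb concrete, here is a minimal Python sketch. It estimates the number of map tasks for a single input file, assuming one input split per HDFS block and no split combination. Note that this is the number of map tasks for a job, not the cluster-wide capacity shown on the JobTracker page. The function name and the file sizes are hypothetical and do not come from the original poster's job.

```python
MB = 1024 * 1024


def estimate_map_tasks(file_size_bytes, block_size_bytes):
    """Estimate map tasks for one file, assuming one split per HDFS block.

    Simplification: ignores split combination and custom input formats.
    """
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division: a partially filled last block still needs a map.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


# Hypothetical file sizes, chosen only for illustration.
print(estimate_map_tasks(6400 * MB, 64 * MB))  # 100
print(estimate_map_tasks(100 * MB, 64 * MB))   # 2
```

Under these assumptions, a small input yields only a handful of map tasks regardless of how many map slots the cluster has.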
-
Re: replicated join and map tasks
Prashant Kommireddi 2012-04-30, 18:29
2 map tasks for the join vs. 100+ in other steps: what are the "other" steps here?
As for your 2nd question, I think you are asking about the Map and Reduce Task Capacity shown on the JobTracker page? That is governed by configuration properties set before Hadoop is started on the cluster.
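As a rough illustration of how the capacity figures on the JobTracker page follow from configuration, the sketch below sums per-node slot settings. It assumes a Hadoop 1.x (MRv1) cluster, where each TaskTracker's slots are set by `mapred.tasktracker.map.tasks.maximum` and `mapred.tasktracker.reduce.tasks.maximum`. The node list and slot counts are made up and are not the poster's cluster.

```python
MAP_KEY = "mapred.tasktracker.map.tasks.maximum"
REDUCE_KEY = "mapred.tasktracker.reduce.tasks.maximum"

# Hypothetical per-TaskTracker settings; the values are invented for illustration.
tasktrackers = [
    {MAP_KEY: 4, REDUCE_KEY: 2},
    {MAP_KEY: 4, REDUCE_KEY: 2},
    {MAP_KEY: 8, REDUCE_KEY: 1},
]


def cluster_capacity(trackers):
    """Return (map capacity, reduce capacity) as the sum of per-node slots."""
    map_capacity = sum(t[MAP_KEY] for t in trackers)
    reduce_capacity = sum(t[REDUCE_KEY] for t in trackers)
    return map_capacity, reduce_capacity


print(cluster_capacity(tasktrackers))  # (16, 5)
```

Capacity computed this way is an upper bound on concurrently running tasks. It does not determine how many map tasks a particular job gets.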
-
Re: replicated join and map tasks
shan s 2012-05-02, 05:58
By other steps, I mainly mean the other default joins in the script.
The point is that when I use a 'replicated' join, 2 map tasks are scheduled. When I use a "default" join, 100+ map tasks are scheduled. How do we explain this decision process? How can I increase the actual number of maps scheduled for replicated joins?
-
Re: replicated join and map tasks
Rajgopal Vaithiyanathan 2012-05-02, 10:01
That doesn't seem right. Try running `EXPLAIN` on your script. Could you please post the Pig script here?
-
Re: replicated join and map tasks
Thejas Nair 2012-05-02, 21:30
In a replicated join, the number of maps spawned should be the same as the number of splits for the first join input. In the case of a default join, there would be additional map tasks for the 2nd input's splits. But if you are able to run the replicated join without running out of memory, then the 2nd input is likely to have only a handful of splits.
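The counting rule described above can be sketched in a few lines of Python. This only restates the rule in code form. The function name and the split counts are hypothetical, and the sketch does not model how Pig actually plans the job.

```python
def map_tasks_for_join(first_input_splits, second_input_splits, replicated):
    """Map task count for a two-way join, following the rule stated above.

    Replicated join: maps are spawned only for the first input's splits,
    since the second input is loaded into memory by each map.
    Default join: the second input's splits need map tasks as well.
    """
    if replicated:
        return first_input_splits
    return first_input_splits + second_input_splits


# Hypothetical split counts, for illustration only.
print(map_tasks_for_join(2, 3, replicated=True))   # 2
print(map_tasks_for_join(2, 3, replicated=False))  # 5
```

Because a replicated second input has to fit in memory, it normally contributes only a few splits, so by this rule the two join types should differ by only a handful of map tasks.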
Questions for you:
1. Was the replicated join successful?
2. Do you have pig.splitCombination turned on? (It's on by default.)
3. What version of Pig are you using?
4. What is the size of each input to the join?
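For context on question 2, here is a deliberately simplified Python sketch of what split combination does: small input splits are packed together so that fewer map tasks are launched. It assumes a plain greedy packing up to a size limit (in Pig the limit is controlled by `pig.maxCombinedSplitSize`). Pig's real algorithm is more involved, and the function name and split sizes are hypothetical. Whether split combination explains the 2 map tasks in this thread is not established.

```python
def combine_splits(split_sizes, max_combined_size):
    """Greedily pack consecutive splits into groups of at most max_combined_size.

    Each resulting group would be processed by a single map task.
    Simplified model only; not Pig's actual implementation.
    """
    groups = []
    current, current_size = [], 0
    for size in split_sizes:
        # Start a new group when adding this split would exceed the limit.
        if current and current_size + size > max_combined_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical input: 100 small splits of 1 MB each, combined up to 64 MB.
groups = combine_splits([1] * 100, max_combined_size=64)
print(len(groups))  # 2
```

In this toy model, 100 small splits collapse into 2 map tasks, which is why the state of pig.splitCombination matters when map counts look unexpectedly low.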
Thanks, Thejas