In a replicated join, the number of maps spawned should be the same as the
number of splits for the first join input.
With the default join, there would be additional map tasks for the second
input's splits. But if you are able to run the replicated join without
running out of memory, then the second input likely has only a
handful of splits.
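For illustration, a minimal sketch of the two join variants (aliases and
paths here are made up):

```
-- 'big' has many splits; 'small' must fit in memory for a replicated join.
big   = LOAD 'big_input'   AS (id:int, val:chararray);
small = LOAD 'small_input' AS (id:int, name:chararray);

-- Replicated (fragment-replicate) join: map-only; the small input is
-- shipped to every map, so the map count tracks the splits of the
-- first (left) input only.
j1 = JOIN big BY id, small BY id USING 'replicated';

-- Default join: map tasks are spawned for the splits of BOTH inputs,
-- followed by a reduce phase.
j2 = JOIN big BY id, small BY id;
```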
Questions for you:
1. Was the replicated join successful?
2. Do you have pig.splitCombination turned on? (It's on by default.)
3. What version of Pig are you using?
4. What is the size of each input to the join?
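If split combination turns out to be what is collapsing your input into 2
maps, a sketch of the knobs to try at the top of the script (standard Pig
properties, but verify against your version's documentation):

```
-- Disable combining of small splits so each split gets its own map task:
set pig.splitCombination false;

-- Or keep combination but cap the combined split size (value in bytes;
-- 134217728 = 128 MB is just an example):
set pig.maxCombinedSplitSize 134217728;
```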
On 5/1/12 10:58 PM, shan s wrote:
> By other steps, I mainly mean other default joins in the script.
> The point is that when I use a 'Replicated' join, 2 map tasks are
> scheduled. When I use the "default" join, 100+ map tasks are scheduled.
> How do we explain this decision process?
> How can I increase the actual number of maps scheduled for replicated joins?
> On Mon, Apr 30, 2012 at 11:59 PM, Prashant Kommireddi<[EMAIL PROTECTED]> wrote:
>> 2 map tasks for join vs 100+ in other steps, what are "other" steps here?
>> Your 2nd question: I think you are asking about the Map and Reduce Task
>> capacity mentioned on the JobTracker page? That is governed by
>> configuration properties set before Hadoop is started on the cluster.
>> On Mon, Apr 30, 2012 at 7:54 AM, shan s<[EMAIL PROTECTED]> wrote:
>>> Sorry for the previous incomplete message.
>>> Here is the take 2:
>>> When I use a Replicated Join, only 2 map tasks get scheduled (compared to
>>> 100+ tasks for the other steps).
>>> What is the idea behind this? What setting do I use to override this?
>>> Also, a basic question.
>>> Does Hadoop decide the map task capacity, or does it simply follow the
>>> values shown on the JobTracker page?
>>>
>>> Map Task Capacity  Reduce Task Capacity  Avg. Tasks/Node  Blacklisted Nodes  Excluded Nodes
>>> 64                 20                    1.00
>>>
>>> Thanks, Prashant.