Re: limit map tasks for load function
You would only be able to set it for the script... which means it will
apply to all 8 jobs. However, my guess is that you don't need to control
the number of map tasks per machine.
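
For what it's worth, a per-script override would look something like this
in Pig Latin (the value 1 is illustrative only):

    -- applies to every job this script spawns, not just the first one
    set mapred.tasktracker.map.tasks.maximum 1;

Note that on classic MapReduce the TaskTracker usually reads this property
once at daemon startup, so a per-job or per-script override may simply be
ignored by the cluster.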
On Sun, Nov 3, 2013 at 4:21 PM, John <[EMAIL PROTECTED]> wrote:

> Thanks for your answer! How can I set the
> mapred.tasktracker.map.tasks.maximum value only for this specific job?
> For example, the Pig script creates 8 jobs, and I only want to modify
> this value for the first job. I think there is no option in Pig Latin
> to influence this value?
>
> kind regards
>
>
>
>
> 2013/11/4 Pradeep Gollakota <[EMAIL PROTECTED]>
>
> > I think you're misunderstanding how HBaseStorage works. HBaseStorage
> > uses the HBaseInputFormat under the hood. The number of map tasks
> > that are spawned depends on the number of regions you have. The map
> > tasks are spawned such that each task is local to the region it is
> > reading from. You will typically not have to worry about problems
> > such as this with MapReduce. If you do have performance concerns, you
> > can set the mapred.tasktracker.map.tasks.maximum setting in the job
> > conf and it will not affect all the other jobs.
> >
> >
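
For context, a plain HBaseStorage load looks roughly like this (table
name and column family are hypothetical); the map task count follows the
table's region count rather than anything declared in the script:

    -- one map task per region of 'mytable', scheduled local to its region
    raw = LOAD 'hbase://mytable'
          USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:*', '-loadKey true');
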
> > On Sun, Nov 3, 2013 at 3:04 PM, John <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > >
> > > Is it possible to limit the number of map slots used for the load
> > > function? For example, I have 5 nodes with 10 map slots (each node
> > > has 2 slots for every CPU), and I want only one map task on every
> > > node. Is there a way to set this only for the load function? I know
> > > there is an option called "mapred.tasktracker.map.tasks.maximum",
> > > but this would influence every MapReduce job. I want to influence
> > > the number only for this specific job.
> > >
> > > My use case is the following: I'm using a modified version of the
> > > HBaseStorage function. I try to load, for example, 10 different
> > > rowkeys with very big column sizes and join them afterwards. Since
> > > the columns all have the same column family, every row can be
> > > stored on a different server. For example, rowkeys 1-5 are stored
> > > on node1 and the other rowkeys on the other nodes. So if I create a
> > > Pig script to load the 10 keys and join them afterwards, this will
> > > end up in 1 MapReduce job with 10 map tasks and some reduce tasks
> > > (depending on the parallel factor). The problem is that 2 map tasks
> > > will be created on node1, because there are 2 slots available. This
> > > means each task is simultaneously reading a large number of columns
> > > from the local hard drive. Maybe I'm wrong, but this should be a
> > > performance issue?! It should be faster to read each rowkey one
> > > after another!?
> > >
> > > kind regards
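
A rough Pig Latin sketch of the scenario John describes (all names are
hypothetical, and his modified HBaseStorage may select rowkeys
differently; stock HBaseStorage can restrict the key range with its
-gt/-lt options):

    -- two single-rowkey loads over the same table, joined on the row key;
    -- the join runs as one MapReduce job, with PARALLEL setting the
    -- number of reduce tasks
    a = LOAD 'hbase://mytable' USING
        org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:*', '-loadKey true -gt row0 -lt row2');
    b = LOAD 'hbase://mytable' USING
        org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:*', '-loadKey true -gt row1 -lt row3');
    joined = JOIN a BY $0, b BY $0 PARALLEL 4;

How many of those map tasks land on the same node is up to the scheduler
and the region locations, which is why the per-node slot limit John asks
for has no direct Pig Latin equivalent.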