Pig >> mail # user >> limit map tasks for load function


Re: limit map tasks for load function
Okay, maybe you are right. Thanks.
2013/11/4 Pradeep Gollakota <[EMAIL PROTECTED]>

> You would only be able to set it for the script... which means it will
> apply to all 8 jobs. However, my guess is that you don't need to control
> the number of map tasks per machine.
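
A minimal sketch of that script-wide behavior, in Pig Latin (the table
name, column family, and schema are assumptions; note also that stock
Hadoop 1.x tasktrackers read mapred.tasktracker.map.tasks.maximum from
their own mapred-site.xml at startup, so a job-conf override may be
ignored entirely):

    -- a SET statement goes into the job conf of every job Pig compiles
    -- from this script; it cannot target only the first of the 8 jobs
    SET mapred.tasktracker.map.tasks.maximum 1;

    rows = LOAD 'hbase://mytable'
           USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:*', '-loadKey true')
           AS (rowkey:chararray, cols:map[]);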
>
>
> On Sun, Nov 3, 2013 at 4:21 PM, John <[EMAIL PROTECTED]> wrote:
>
> > Thanks for your answer! How can I set the
> > mapred.tasktracker.map.tasks.maximum value only for this specific
> > job? For example, the Pig script creates 8 jobs, and I only want to
> > modify this value for the first job. I think there is no option in
> > Pig Latin to influence this value?
> >
> > kind regards
> >
> > 2013/11/4 Pradeep Gollakota <[EMAIL PROTECTED]>
> >
> > > I think you're misunderstanding how HBaseStorage works. HBaseStorage
> > > uses the HBaseInputFormat underneath the hood. The number of map
> > > tasks that are spawned depends on the number of regions you have.
> > > The map tasks are spawned such that the tasks are local to the
> > > regions they're reading from. You will typically not have to worry
> > > about problems such as this with MapReduce. If you do have some
> > > performance concerns, you can set the
> > > mapred.tasktracker.map.tasks.maximum setting in the job conf and it
> > > will not affect all the other jobs.
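
To make the region-driven split count concrete, a hedged sketch (the
table name, column family, and key range are assumptions): HBaseStorage
spawns one map task per region covered by the scan, so the only lever in
the script itself is narrowing the scanned key range with the -gt/-lt
family of HBaseStorage options.

    -- one map task per region the scan touches; PARALLEL affects only
    -- the reduce side, never the number of map tasks
    rows = LOAD 'hbase://mytable'
           USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
               'cf:*', '-loadKey true -gte rowkey01 -lt rowkey06')
           AS (rowkey:chararray, cols:map[]);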
> > >
> > >
> > > On Sun, Nov 3, 2013 at 3:04 PM, John <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi,
> > > >
> > > > Is it possible to limit the number of map slots used for the load
> > > > function? For example, I have 5 nodes with 10 map slots in total
> > > > (each node has 2 slots, one for every CPU), and I want only one
> > > > map task per node. Is there a way to set this only for the load
> > > > function? I know there is an option called
> > > > "mapred.tasktracker.map.tasks.maximum", but this would influence
> > > > every MapReduce job. I want to influence the number only for this
> > > > specific job.
> > > >
> > > > My use case is the following: I'm using a modified version of the
> > > > HBaseStorage function. I try to load, for example, 10 different
> > > > rowkeys with very big column sizes and join them afterwards. Since
> > > > the columns all have the same column family, every row can be
> > > > stored on a different server. For example, rowkeys 1-5 are stored
> > > > on node1 and the other rowkeys on the other nodes. So if I create
> > > > a Pig script to load the 10 keys and join them afterwards, this
> > > > will end up in 1 MapReduce job with 10 map tasks and some reduce
> > > > tasks (depending on the parallel factor). The problem is that 2
> > > > map tasks will be created on node1, because there are 2 slots
> > > > available. This means every task is simultaneously reading a large
> > > > number of columns from the local hard drive. Maybe I'm wrong, but
> > > > this should be a performance issue?! It should be faster to read
> > > > each rowkey one after another!?
> > > >
> > > > kind regards
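
A hedged reconstruction of the script described in the original question,
with the stock HBaseStorage standing in for the modified version (the
table name, column family, and key ranges are made up):

    -- both loads and the join compile into one MapReduce job; the map
    -- count equals the number of regions scanned, so node1 can receive
    -- two concurrent map tasks simply because it has two free slots
    a = LOAD 'hbase://mytable'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'cf:*', '-loadKey true -gte rowkey01 -lte rowkey05')
        AS (rowkey:chararray, cols:map[]);
    b = LOAD 'hbase://mytable'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'cf:*', '-loadKey true -gte rowkey06 -lte rowkey10')
        AS (rowkey:chararray, cols:map[]);
    j = JOIN a BY rowkey, b BY rowkey PARALLEL 2;  -- reducers only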