Re: Duplicate rows when using regular expression

I disabled it and it worked. However, to see the number of tasks that
got re-scheduled, I went to the map/reduce admin page -> Completed Jobs
-> clicked one job, and tried to look inside the map and reduce tasks,
but I couldn't see anything related to speculative execution. Can you
please let me know where exactly I should look? I am trying to see the
number of tasks that were re-scheduled or scheduled in parallel.
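
On Hadoop 0.20/1.x the JobTracker pages do not label attempts as
speculative; the losing copies normally show up as killed task attempts
on a task's attempt list. As a rough command-line check (a sketch only:
the job id below is hypothetical, and the arguments assume the Hadoop
1.x "hadoop job" CLI):

    # Hypothetical job id -- take the real one from the JobTracker
    # "Completed Jobs" page.
    hadoop job -status job_201203240001_0042

    # List attempt ids for completed map tasks; a task that shows more
    # than one attempt had a speculative (or retried) copy scheduled.
    hadoop job -list-attempt-ids job_201203240001_0042 MAP completed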

On Sat, Mar 24, 2012 at 8:19 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:

> JobTracker
>
> On Sat, Mar 24, 2012 at 8:15 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>
> > Thanks!! Is there a place where I can see if a task was re-scheduled?
> >
> > On Sat, Mar 24, 2012 at 6:28 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
> >
> > > Read about it here:
> > > http://developer.yahoo.com/hadoop/tutorial/module4.html
> > >
> > > A task can get rescheduled and run in parallel; this happens when
> > > Hadoop "thinks" the task is slow relative to the other tasks in the
> > > job. This is to make sure that free slots in the cluster can be
> > > used to run tasks that (Hadoop thinks) have slowed down because of
> > > a problem with a particular node (slow disk, bad memory, ...).
> > >
> > > In your case, my guess is that one of the parts is larger relative
> > > to the others and the corresponding task is being rescheduled. It's
> > > a guess and I might be wrong, but worth trying.
> > >
> > > Depending on the phase that is writing to the DB, you can set
> > > "mapred.map.tasks.speculative.execution" or
> > > "mapred.reduce.tasks.speculative.execution" to false.
> > >
> > > Thanks,
> > > Prashant
> > > On Sat, Mar 24, 2012 at 6:00 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > >
> > > > No, I don't have it turned off. Can you please explain what might
> > > > be happening because of that, and how to debug if that is indeed
> > > > the problem?
> > > >
> > > > On Sat, Mar 24, 2012 at 5:30 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Do you have speculative execution turned off?
> > > > >
> > > > > On Sat, Mar 24, 2012 at 5:25 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > I don't have my script handy, but all I am doing is something like:
> > > > > >
> > > > > > A = LOAD '$in' USING PigStorage('\t')
> > > > > >     AS (col:chararray, col2:chararray);
> > > > > > STORE A INTO '{Table}' USING
> > > > > >     com.vertica.pig.VerticaStorer('localhost', 'verticadb502', '5935', 'user');
> > > > > >
> > > > > > When I run it as
> > > > > >
> > > > > >     pig -f script6.pig -p in="/examples/2/part-m-0000[0-4]"
> > > > > >
> > > > > > it creates 2 rows for every record, but if I run it
> > > > > > individually 4 times, giving the actual file names, there are
> > > > > > no duplicates.
> > > > > >
> > > > > > On Sat, Mar 24, 2012 at 1:36 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > > Can you provide the script you're running? That will help
> > > > > > > people better understand what you're doing.
> > > > > > >
> > > > > > > On Saturday, March 24, 2012, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > > > > > > > Could someone please help me understand this, or give me
> > > > > > > > some pointers?
> > > > > > > >
> > > > > > > > On Fri, Mar 23, 2012 at 4:57 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > > > > > > >
> > > > > > > >> I am running a script to load data into the database.
> > > > > > > >> When I use [0-4] I see 2 rows being created for every
> > > > > > > >> record that I process, but when I run the files
> > > > > > > >> individually it works. Could someone please help me
> > > > > > > >> understand or troubleshoot this behaviour?
> > > > > > > >>
> > > > > > > >> pig -f script6.pig -p in="/examples/2/part-m-0000[0-4]"  --creates
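
The behaviour above is consistent with the speculative-execution guess:
a STORE that writes straight to an external database has side effects,
so when Hadoop schedules a speculative copy of a task both attempts
issue their inserts, and killing the losing attempt does not take its
rows back out. A minimal sketch of the two checks discussed in this
thread, assuming the Hadoop 0.20/1.x property names quoted above and
the same script invocation:

    # Confirm the glob matches exactly the five expected part files.
    hadoop fs -ls '/examples/2/part-m-0000[0-4]'

    # Re-run with speculative execution disabled for both phases;
    # -D properties must come before Pig's other arguments.
    pig -Dmapred.map.tasks.speculative.execution=false \
        -Dmapred.reduce.tasks.speculative.execution=false \
        -f script6.pig -p in="/examples/2/part-m-0000[0-4]"

Equivalently, Pig 0.8 and later accepts
"set mapred.map.tasks.speculative.execution false;" at the top of the
script itself.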