Pig, mail # user - Duplicate rows when using regular expression


Re: Duplicate rows when using regular expression
Prashant Kommireddi 2012-03-25, 03:19
You can see rescheduled (speculative) task attempts in the JobTracker web UI.
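The speculative-execution properties discussed later in this thread can be set directly at the top of the Pig script. A minimal sketch; the table name, connection details, and column schema below are placeholders taken from the thread, not verified values:

```pig
-- Disable speculative execution so a speculatively re-run task attempt
-- cannot write the same records to Vertica twice. Whether the map or the
-- reduce setting matters depends on which phase performs the STORE.
set mapred.map.tasks.speculative.execution false;
set mapred.reduce.tasks.speculative.execution false;

-- Skeleton from the thread (placeholders, not tested against a cluster):
A = LOAD '$in' USING PigStorage('\t') AS (col:chararray, col2:chararray);
STORE A INTO '{Table}' USING com.vertica.pig.VerticaStorer('localhost','verticadb502','5935','user');
```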

On Sat, Mar 24, 2012 at 8:15 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:

> Thanks!! Is there a place where I can see if a task was re-scheduled?
>
> On Sat, Mar 24, 2012 at 6:28 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
>
> > Read about it here:
> > http://developer.yahoo.com/hadoop/tutorial/module4.html
> >
> > A task can get rescheduled and run in parallel. This happens when Hadoop
> > "thinks" the task is slow relative to the other tasks in the job, and is
> > meant to ensure that free slots in the cluster are used to re-run tasks
> > that (Hadoop thinks) have slowed down because of problems on a particular
> > node (slow disk, bad memory, ...).
> >
> > In your case, my guess is that one of the parts is larger than the others
> > and the corresponding task is being rescheduled. It's a guess and I might
> > be wrong, but worth trying.
> >
> > Depending on which phase writes to the DB, you can set
> > "mapred.map.tasks.speculative.execution" or
> > "mapred.reduce.tasks.speculative.execution" to false.
> >
> > Thanks,
> > Prashant
> >
> >
> >
> > On Sat, Mar 24, 2012 at 6:00 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> >
> > > No, I don't have it turned off. Can you please explain what might be
> > > happening because of that, and how to debug if that is indeed the
> > > problem?
> > >
> > >
> > > On Sat, Mar 24, 2012 at 5:30 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
> > >
> > > > Do you have speculative execution turned off?
> > > >
> > > > On Sat, Mar 24, 2012 at 5:25 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > I don't have my script handy but all I am doing is something like:
> > > > >
> > > > > A = LOAD '$in' USING PigStorage('\t') AS (col:chararray, col2:chararray);
> > > > > STORE A INTO '{Table}' USING
> > > > > com.vertica.pig.VerticaStorer('localhost','verticadb502','5935','user');
> > > > >
> > > > >
> > > > > When I run it as pig -f script6.pig -p in="/examples/2/part-m-0000[0-4]"
> > > > > it creates 2 rows, but if I run it individually giving the actual
> > > > > file names then it doesn't have any duplicates.
> > > > > On Sat, Mar 24, 2012 at 1:36 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > Can you provide the script you're running? That will help people
> > > > > > better understand what you're doing.
> > > > > >
> > > > > > On Saturday, March 24, 2012, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > > > > > > Could someone please help me understand or give me some pointers?
> > > > > > >
> > > > > > > On Fri, Mar 23, 2012 at 4:57 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > > > > > >
> > > > > > >> I am running a script to load data into the database. When I use
> > > > > > >> [0-4] I see 2 rows being created for every record that I process,
> > > > > > >> but when I run them individually then it works. Could someone
> > > > > > >> please help me understand or troubleshoot this behaviour?
> > > > > > >>
> > > > > > >>
> > > > > > >> pig -f script6.pig -p in="/examples/2/part-m-0000[0-4]"  --creates 2 rows
> > > > > > >>
> > > > > > >> pig -f script6.pig -p in="/examples/2/part-m-00000"  --works
> > > > > > >>
> > > > > > >> pig -f script6.pig -p in="/examples/2/part-m-00001"  --works
> > > > > > >>
> > > > > > >> pig -f script6.pig -p in="/examples/2/part-m-00002"  --works
> > > > > > >>
> > > > > > >> pig -f script6.pig -p in="/examples/2/part-m-00003"  --works
> > > > > > >>
> > > > > > >> pig -f script6.pig -p in="/examples/2/part-m-00004"  --works
> > > > > > >>
> > > > > >
> > > > > > --
> > > > > > *Note that I'm no longer using my Yahoo! email address. Please
> > > > > > email me at [EMAIL PROTECTED] going forward.*
> > > > >
> > > >
> > >
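As a side note, the bracket glob used in the commands above can be sanity-checked locally with ordinary shell globbing. A hypothetical sketch using made-up /tmp paths, not the HDFS paths from the thread:

```shell
# Create five empty files named like Hadoop part files (made-up local paths).
mkdir -p /tmp/glob-demo
touch /tmp/glob-demo/part-m-00000 /tmp/glob-demo/part-m-00001 \
      /tmp/glob-demo/part-m-00002 /tmp/glob-demo/part-m-00003 \
      /tmp/glob-demo/part-m-00004

# The character class [0-4] matches all five files at once, which is why a
# single pig invocation with this pattern processes every part in one job.
ls /tmp/glob-demo/part-m-0000[0-4]
```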