Re: Duplicate rows when using regular expression
Read about it here http://developer.yahoo.com/hadoop/tutorial/module4.html

A task can get rescheduled and run in parallel with the original attempt. This
happens when Hadoop "thinks" the task is slow relative to other tasks in the
job. The idea is to use free slots in the cluster to re-run tasks that (Hadoop
thinks) have slowed down because of a problem on a particular node (slow disk,
bad memory ...).

In your case, my guess is that one of the parts is larger relative to the
others and the corresponding task is being run a second time speculatively, so
both attempts write the same rows to the DB. It's a guess and I might be
wrong, but it's worth trying.

Based on the phase that is writing to DB, you can set
"mapred.map.tasks.speculative.execution"
or "mapred.reduce.tasks.speculative.execution" to false.
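
For example, here is a minimal sketch of turning both off from inside the Pig
script. It assumes a Pig version whose "set" command passes Hadoop properties
through to the job; the property names are the two mentioned above, and the
lines would go at the top of the existing script.

```pig
-- Assumption: this Pig version forwards "set" properties to the Hadoop job.
-- Disable speculative execution so no duplicate task attempt writes to the DB.
-- Only the phase that writes to the DB strictly needs it; both are shown here.
set mapred.map.tasks.speculative.execution false;
set mapred.reduce.tasks.speculative.execution false;
```

If the duplicates disappear with these set, speculative attempts were the
likely cause.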

Thanks,
Prashant

On Sat, Mar 24, 2012 at 6:00 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:

> No I don't have it turned off. Can you please explain what might be
> happening because of that? And how to debug if that indeed is the problem.
>
>
> On Sat, Mar 24, 2012 at 5:30 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
>
> > Do you have speculative execution turned off?
> >
> > On Sat, Mar 24, 2012 at 5:25 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> >
> > > I don't have my script handy but all I am doing is something like:
> > >
> > > A = LOAD '$in' USING PigStorage('\t') AS (col:chararray, col2:chararray);
> > > STORE A INTO '{Table}' USING
> > > com.vertica.pig.VerticaStorer('localhost', 'verticadb502', '5935', 'user');
> > >
> > >
> > > When I run as pig -f script6.pig -p in="/examples/2/part-m-0000[0-4]" it
> > > creates 2 rows
> > >
> > > but if I run them individually 4 times giving the actual file names then it
> > > doesn't have any duplicates
> > >
> > > On Sat, Mar 24, 2012 at 1:36 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
> > >
> > > > Can you provide the script you're running? That will help people better
> > > > understand what you're doing.
> > > >
> > > > On Saturday, March 24, 2012, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > > > > Could someone please help me understand this or give me some pointers?
> > > > >
> > > > > On Fri, Mar 23, 2012 at 4:57 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > >> I am running a script to load data into the database. When I use [0-4] I
> > > > >> see 2 rows being created for every record that I process. But when I run
> > > > >> them individually then it works. Could someone please help me understand
> > > > >> or troubleshoot this behaviour?
> > > > >>
> > > > >>
> > > > >> pig -f script6.pig -p in="/examples/2/part-m-0000[0-4]" --creates 2 rows
> > > > >>
> > > > >> pig -f script6.pig -p in="/examples/2/part-m-00000" --works
> > > > >>
> > > > >> pig -f script6.pig -p in="/examples/2/part-m-00001" --works
> > > > >>
> > > > >> pig -f script6.pig -p in="/examples/2/part-m-00002" --works
> > > > >>
> > > > >> pig -f script6.pig -p in="/examples/2/part-m-00003" --works
> > > > >>
> > > > >> pig -f script6.pig -p in="/examples/2/part-m-00004" --works
> > > > >>
> > > > >
> > > >
> > > > --
> > > > *Note that I'm no longer using my Yahoo! email address. Please email me at
> > > > [EMAIL PROTECTED] going forward.*
> > > >
> > >
> >
>