Pig >> mail # user >> Duplicate rows when using regular expression


Re: Duplicate rows when using regular expression
Thanks!! Is there a place where I can see if a task was rescheduled?

On Sat, Mar 24, 2012 at 6:28 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:

> Read about it here http://developer.yahoo.com/hadoop/tutorial/module4.html
>
> A task could get rescheduled and run in parallel; this happens when Hadoop
> "thinks" the task is slow relative to other tasks in the job. This is to
> make sure the free slots in the cluster can be used to rerun tasks that
> (Hadoop thinks) have slowed down due to issues with a particular node
> (slow disk, bad memory, ...).
>
> In your case, my guess is one of the part files is larger relative to the
> others and the corresponding task is being rescheduled. It's a guess and I
> might be wrong, but it's worth trying.
>
> Based on the phase that is writing to the DB, you can set
> "mapred.map.tasks.speculative.execution" or
> "mapred.reduce.tasks.speculative.execution" to false.
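[Editor's note: the two properties above can be set from within the Pig script itself. A minimal sketch, using the Hadoop 1.x-era property names from the mail; which one to disable depends on whether the map or reduce phase is the one writing to the DB:]

```pig
-- Sketch, assuming the Vertica write happens in the map phase:
-- disable speculative execution so a duplicate task attempt
-- cannot write the same records twice.
SET mapred.map.tasks.speculative.execution false;
SET mapred.reduce.tasks.speculative.execution false;
```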
>
> Thanks,
> Prashant
>
>
>
> On Sat, Mar 24, 2012 at 6:00 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>
> > No, I don't have it turned off. Can you please explain what might be
> > happening because of that? And how do I debug it if that indeed is the
> > problem?
> >
> >
> > On Sat, Mar 24, 2012 at 5:30 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
> >
> > > Do you have speculative execution turned off?
> > >
> > > On Sat, Mar 24, 2012 at 5:25 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > >
> > > > I don't have my script handy but all I am doing is something like:
> > > >
> > > > A = LOAD '$in' USING PigStorage('\t') AS (col:chararray, col2:chararray);
> > > > STORE A INTO '{Table}' USING
> > > > com.vertica.pig.VerticaStorer('localhost', 'verticadb502', '5935', 'user');
> > > >
> > > >
> > > > When I run it as pig -f script6.pig -p in="/examples/2/part-m-0000[0-4]"
> > > > it creates 2 rows, but if I run them individually 4 times giving the
> > > > actual file names then it doesn't have any duplicates.
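[Editor's note: the bracket glob in in="/examples/2/part-m-0000[0-4]" matches all five part files at once, so Pig loads them in a single job rather than five separate ones. A minimal local-filesystem sketch of the same `[a-b]` range syntax (HDFS path globbing accepts the same ranges; the directory name is hypothetical):]

```shell
# Create five empty stand-ins for the part files in a throwaway directory.
mkdir -p /tmp/glob-demo
touch /tmp/glob-demo/part-m-0000{0..4}

# One glob matches all five files, the way a single Pig LOAD would see them.
ls /tmp/glob-demo/part-m-0000[0-4] | wc -l   # prints 5
```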
> > > > On Sat, Mar 24, 2012 at 1:36 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Can you provide the script you're running? That will help people
> > > > > better understand what you're doing.
> > > > >
> > > > > On Saturday, March 24, 2012, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > > > > > Could someone please help me understand or give me some pointers?
> > > > > >
> > > > > > On Fri, Mar 23, 2012 at 4:57 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > >> I am running a script to load data into the database. When I use
> > > > > >> [0-4] I see 2 rows being created for every record that I process.
> > > > > >> But when I run them individually then it works. Could someone
> > > > > >> please help me understand or troubleshoot this behaviour?
> > > > > >>
> > > > > >>
> > > > > >> pig -f script6.pig -p in="/examples/2/part-m-0000[0-4]"  --creates 2 rows
> > > > > >>
> > > > > >> pig -f script6.pig -p in="/examples/2/part-m-00000"  --works
> > > > > >>
> > > > > >> pig -f script6.pig -p in="/examples/2/part-m-00001"  --works
> > > > > >>
> > > > > >> pig -f script6.pig -p in="/examples/2/part-m-00002"  --works
> > > > > >>
> > > > > >> pig -f script6.pig -p in="/examples/2/part-m-00003"  --works
> > > > > >>
> > > > > >> pig -f script6.pig -p in="/examples/2/part-m-00004"  --works
> > > > > >>
> > > > > >
> > > > >
> > > > > --
> > > > > *Note that I'm no longer using my Yahoo! email address. Please email
> > > > > me at [EMAIL PROTECTED] going forward.*
> > > >
> > >
> >
>