-
Duplicate rows when using regular expression
Mohit Anchlia 2012-03-23, 23:57
I am running a script to load data into the database. When I use [0-4] I see 2 rows being created for every record that I process, but when I run the files individually it works. Could someone please help me understand or troubleshoot this behaviour?
pig -f script6.pig -p in="/examples/2/part-m-0000[0-4]" --creates 2 rows
pig -f script6.pig -p in="/examples/2/part-m-00000" --works
pig -f script6.pig -p in="/examples/2/part-m-00001" --works
pig -f script6.pig -p in="/examples/2/part-m-00002" --works
pig -f script6.pig -p in="/examples/2/part-m-00003" --works
pig -f script6.pig -p in="/examples/2/part-m-00004" --works
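Note that `[0-4]` here is a path glob (a character range), not a regular expression. The following is a local sketch only, assuming a throwaway temp directory with made-up empty files; it shows which names such a pattern matches, using the shell's own globbing as a stand-in for Hadoop's path globbing, which also supports `[a-b]` ranges.

```shell
#!/usr/bin/env bash
# Local illustration: create empty files named like the part files in the thread,
# plus one extra (part-m-00005) that the [0-4] range should NOT match.
demo_dir="$(mktemp -d)"
for i in 0 1 2 3 4 5; do
  : > "${demo_dir}/part-m-0000${i}"
done

# Left unquoted, the pattern is expanded by the shell into the matching file names.
matched=0
for f in "${demo_dir}"/part-m-0000[0-4]; do
  echo "matched: $(basename "$f")"
  matched=$((matched + 1))
done
echo "total: ${matched}"

# Clean up the temporary files.
rm -rf "${demo_dir}"
```

On a cluster, something like `hadoop fs -ls '/examples/2/part-m-0000[0-4]'` should list what the quoted pattern actually matches in HDFS, which is a quick way to rule out the input path itself as the source of duplicates.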
-
Re: Duplicate rows when using regular expression
Mohit Anchlia 2012-03-24, 18:48
Could someone please help me understand this or give me some pointers?
-
Re: Duplicate rows when using regular expression
Bill Graham 2012-03-24, 20:36
Can you provide the script you're running? That will help people better understand what you're doing.
-- *Note that I'm no longer using my Yahoo! email address. Please email me at [EMAIL PROTECTED] going forward.*
-
Re: Duplicate rows when using regular expression
Mohit Anchlia 2012-03-25, 00:25
I don't have my script handy, but all I am doing is something like:
A = LOAD '$in' USING PigStorage('\t') AS (col:chararray, col2:chararray);
STORE A INTO '{Table}' USING com.vertica.pig.VerticaStorer('localhost', 'verticadb502', '5935', 'user');
When I run it as pig -f script6.pig -p in="/examples/2/part-m-0000[0-4]" it creates 2 rows,
but if I run them individually 4 times, giving the actual file names, then it doesn't have any duplicates.
-
Re: Duplicate rows when using regular expression
Prashant Kommireddi 2012-03-25, 00:30
Do you have speculative execution turned off?
-
Re: Duplicate rows when using regular expression
Mohit Anchlia 2012-03-25, 01:00
No, I don't have it turned off. Can you please explain what might be happening because of that, and how to debug whether that is indeed the problem?
-
Re: Duplicate rows when using regular expression
Prashant Kommireddi 2012-03-25, 01:28
Read about it here: http://developer.yahoo.com/hadoop/tutorial/module4.html
A task could get rescheduled and run in parallel; this happens when Hadoop "thinks" the task is slow relative to other tasks in the job. This is to make sure the free slots in the cluster can be used to run tasks that (Hadoop thinks) have slowed down due to issues with a particular node (slow disk, bad memory ...).
In your case, my guess is that one of the parts is larger relative to the others and the corresponding task is being rescheduled. It's a guess and I might be wrong, but it's worth trying.
Based on the phase that is writing to the DB, you can set "mapred.map.tasks.speculative.execution" or "mapred.reduce.tasks.speculative.execution" to false.
Thanks,
Prashant
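One way to try the suggestion above is to pass the two properties on the pig command line. This is a minimal sketch, not the poster's actual setup: `run_load` and the `PIG_BIN` override are hypothetical names introduced here so the command can be dry-run with `echo`, and the `-D` properties are placed first on the assumption that Pig expects Java-style properties ahead of its other options.

```shell
#!/usr/bin/env bash
# PIG_BIN is a hypothetical override; it defaults to the real `pig` launcher.
PIG_BIN="${PIG_BIN:-pig}"

# run_load (hypothetical helper): run script6.pig over the given input glob
# with speculative execution disabled for both the map and reduce phases.
run_load() {
  local input_glob="$1"
  "$PIG_BIN" \
    -Dmapred.map.tasks.speculative.execution=false \
    -Dmapred.reduce.tasks.speculative.execution=false \
    -f script6.pig \
    -p "in=${input_glob}"
}

# Example call (not executed here); the glob is quoted so the local shell
# passes it through unexpanded:
# run_load '/examples/2/part-m-0000[0-4]'
```

Disabling both phases is the blunt option; since a plain LOAD followed by STORE is typically a map-only job, turning off only the map-side property may be enough.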
-
Re: Duplicate rows when using regular expression
Mohit Anchlia 2012-03-25, 03:15
Thanks!! Is there a place where I can see if a task was re-scheduled?
-
Re: Duplicate rows when using regular expression
Prashant Kommireddi 2012-03-25, 03:19
JobTracker
-
Re: Duplicate rows when using regular expression
Mohit Anchlia 2012-03-27, 21:42
I disabled it and it worked. However, in order to see the number of tasks that got re-scheduled, I went to the map/reduce admin page -> Completed Job -> clicked one job and tried to look inside the map tasks and reducers, but I couldn't see anything related to speculative execution. Can you please let me know where exactly I should look for it? I am trying to see the number of tasks that were re-scheduled or scheduled in parallel.
-
Re: Duplicate rows when using regular expression
Prashant Kommireddi 2012-03-28, 05:03
It usually shows up as KILLED tasks. Take a look under "FAILED/KILLED Task Attempts" and drill down to "task_".
-Prashant
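As a rough aid for that drill-down: attempt IDs in this generation of Hadoop have the form `attempt_<jobtracker-start>_<job>_<m|r>_<task>_<attempt>`, so a task that was launched more than once appears with several attempt suffixes. The sketch below groups attempt IDs (for example, copied from the JobTracker pages) by task. The function name is hypothetical, and a second attempt can come from a failure retry as well as from speculation, so treat the result as a hint only.

```shell
#!/usr/bin/env bash
# list_multi_attempt_tasks (hypothetical helper): reads attempt IDs on stdin,
# one per line, and prints "<task_id> <attempt_count>" for every task that
# had more than one attempt.
list_multi_attempt_tasks() {
  awk -F_ '
    /^attempt_/ {
      # Fields: attempt, jobtracker start time, job number, m|r, task number, attempt number
      task = "task_" $2 "_" $3 "_" $4 "_" $5
      count[task]++
    }
    END {
      for (task in count)
        if (count[task] > 1) print task, count[task]
    }
  ' | sort
}

# Example with made-up attempt IDs (not executed here):
# printf '%s\n' \
#   attempt_201203241200_0001_m_000000_0 \
#   attempt_201203241200_0001_m_000003_0 \
#   attempt_201203241200_0001_m_000003_1 | list_multi_attempt_tasks
```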