|
Tom Melendez
2012-01-18, 22:33
Steve Lewis
2012-01-18, 23:46
Alex Kozlov
2012-01-18, 23:51
Steve Lewis
2012-01-19, 00:00
Leonardo Urbina
2012-01-19, 00:01
Steve Lewis
2012-01-19, 00:21
Raj V
2012-01-19, 00:50
Steve Lewis
2012-01-19, 02:10
Michael Segel
2012-01-19, 03:08
Raj Vishwanthan
2012-01-19, 03:28
Michel Segel
2012-01-19, 12:08
|
-
Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec
Tom Melendez 2012-01-18, 22:33
Sounds like mapred.task.timeout? The default is 10 minutes.
http://hadoop.apache.org/common/docs/current/mapred-default.html

Thanks,
Tom

On Wed, Jan 18, 2012 at 2:05 PM, Steve Lewis <[EMAIL PROTECTED]> wrote:
> The map tasks fail timing out after 600 sec.
> I am processing one 9 GB file with 16,000,000 records. Each record (think
> of it as a line) generates hundreds of key value pairs.
> The job is unusual in that the output of the mapper in terms of records or
> bytes is orders of magnitude larger than the input.
> I have no idea what is slowing down the job except that the problem is in
> the writes.
>
> If I change the job to merely bypass a fraction of the context.write
> statements the job succeeds.
> This is one map task that failed and one that succeeded - I cannot
> understand how a write can take so long
> or what else the mapper might be doing
>
> JOB FAILED WITH TIMEOUT
>
> *Parser*
>   TotalProteins              90,103
>   NumberFragments            10,933,089
> *FileSystemCounters*
>   HDFS_BYTES_READ            67,245,605
>   FILE_BYTES_WRITTEN         444,054,807
> *Map-Reduce Framework*
>   Combine output records     10,033,499
>   Map input records          90,103
>   Spilled Records            10,032,836
>   Map output bytes           3,520,182,794
>   Combine input records      10,844,881
>   Map output records         10,933,089
>
> Same code but fewer writes
> JOB SUCCEEDED
>
> *Parser*
>   TotalProteins              90,103
>   NumberFragments            206,658,758
> *FileSystemCounters*
>   FILE_BYTES_READ            111,578,253
>   HDFS_BYTES_READ            67,245,607
>   FILE_BYTES_WRITTEN         220,169,922
> *Map-Reduce Framework*
>   Combine output records     4,046,128
>   Map input records          90,103
>   Spilled Records            4,046,128
>   Map output bytes           662,354,413
>   Combine input records      4,098,609
>   Map output records         2,066,588
>
> Any bright ideas?
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
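The counters in the original post allow a rough back-of-the-envelope estimate of how much output one map task would produce if every context.write went through. The sketch below is an illustration only and is not from the thread. It assumes that NumberFragments counts every fragment that would be written, and that the average record size seen in the successful (sampled) run also holds at full volume. The class and method names are made up for this example.

```java
// Rough estimate from the counters quoted in the original post.
// Assumption: average bytes per record stays the same at full write volume.
final class MapOutputEstimate {

    /** Average serialized size of one map output record, in bytes. */
    static double bytesPerRecord(long mapOutputBytes, long mapOutputRecords) {
        return (double) mapOutputBytes / mapOutputRecords;
    }

    /** Projected map output bytes if every fragment were written. */
    static double projectedBytes(long totalFragments, double bytesPerRecord) {
        return totalFragments * bytesPerRecord;
    }

    public static void main(String[] args) {
        // Counters from the run that succeeded (only a fraction of writes kept).
        long sampledBytes = 662_354_413L;     // Map output bytes
        long sampledRecords = 2_066_588L;     // Map output records
        long totalFragments = 206_658_758L;   // NumberFragments
        long inputBytes = 67_245_607L;        // HDFS_BYTES_READ for this task

        double perRecord = bytesPerRecord(sampledBytes, sampledRecords);
        double fullBytes = projectedBytes(totalFragments, perRecord);

        System.out.printf("bytes per record: %.1f%n", perRecord);
        System.out.printf("projected map output: %.1f GB%n", fullBytes / 1e9);
        System.out.printf("output/input ratio: %.0fx%n", fullBytes / inputBytes);
    }
}
```

Under those assumptions a single map task that reads about 67 MB would emit on the order of tens of gigabytes, all of which has to pass through the map-side sort, combine and spill path on local disk. That is one plausible place for the time to go, although nothing in the thread confirms it.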
-
Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec
Steve Lewis 2012-01-18, 23:46
I KNOW it is a task timeout - what I do NOT know is WHY merely cutting the number of writes causes it to go away. It seems to imply that some context.write operation, or something downstream from that, is taking a huge amount of time, and that is all Hadoop internal code - not mine. So my question is: why should increasing the number and volume of writes cause a task to time out?
-
Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec
Alex Kozlov 2012-01-18, 23:51
Does it always fail at the same place? Does the task log show anything unusual?
-
Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec
Steve Lewis 2012-01-19, 00:00
It always fails with a task timeout, and that error gives me very little indication of where the error occurs. The one piece of data I have is that if I only call context.write 1 in 100 times it does not time out, suggesting that it is not MY code that is timing out. I could try to time the write statements and see if they get slow, although those might do something slow in another thread?? Or it might be in the internal Hadoop data handling code.
-
Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec
Leonardo Urbina 2012-01-19, 00:01
Perhaps you are not reporting progress throughout your task. If you happen to run a large enough job, you hit the default timeout, mapred.task.timeout (which defaults to 10 min). Perhaps you should consider reporting progress in your mapper/reducer by calling progress() on the Reporter object. Check tip 7 of this link:

http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/

Hope that helps,
-Leo
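One common way to act on the progress-reporting suggestion is a background heartbeat that calls the progress method at a fixed interval while a long step runs. The sketch below is a minimal, Hadoop-free illustration: ProgressHeartbeat is a hypothetical helper, and the Runnable it takes is only a stand-in for a call such as progress() on the Reporter (or on the context object in the newer API). Note that a heartbeat keeps a task from being killed; it does not explain why the task is slow.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Hadoop: invokes a progress callback on a
// background thread at a fixed interval until it is closed.
final class ProgressHeartbeat implements AutoCloseable {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "progress-heartbeat");
                t.setDaemon(true); // never keeps the task JVM alive on its own
                return t;
            });

    /**
     * @param reportProgress stand-in for the real progress call (assumption)
     * @param interval       delay before the first call and between calls
     * @param unit           time unit of the interval
     */
    ProgressHeartbeat(Runnable reportProgress, long interval, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(reportProgress, interval, interval, unit);
    }

    /** Stops the heartbeat; call this when the long-running step finishes. */
    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

A long step would be wrapped in a try-with-resources block that creates the heartbeat with an interval well below the 600-second timeout, for example one minute. Because the callback runs on a separate thread, this only makes sense if the progress call in use is safe to invoke from outside the mapper thread, which is an assumption to check against the Hadoop version in question.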
-
Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec
Steve Lewis 2012-01-19, 00:21
1) I do a lot of progress reporting
2) Why would the job succeed when the only change in the code is
    if(NumberWrites++ % 100 == 0)
        context.write(key,value);
Comment out the test, allowing full writes, and the job fails.
Since every write is a report, I assume that something in the write code, or other Hadoop code for dealing with output, is failing. I do increment a counter for every write or, in the case of the above code, every potential write.
What I am seeing is that wherever the timeout occurs, it is not in a place where I am capable of inserting more reporting.
-
Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec
Raj V 2012-01-19, 00:50
Steve

Does the timeout happen for all the map jobs? Are you using some kind of shared storage for map outputs? Any problems with the physical disks? If the shuffle phase has started, could the disks be I/O waiting between the read and write?

Raj
-
Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec
Steve Lewis 2012-01-19, 02:10
In my hands the problem occurs in all map jobs. An associate with a different cluster (mine has 8 nodes, his 40) reports that 80% of map tasks fail, with a few succeeding. I suspect some kind of I/O wait but fail to see how it gets to 600 sec.
-
Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec
Michael Segel 2012-01-19, 03:08
But Steve, it is your code... :-)

Here is a simple test... Set your code up where the run fails... Add a simple timer to see how long you spend in the Mapper.map() method. Only print out the time if it's greater than, let's say, 500 seconds...

The other thing is to update a dynamic counter in Mapper.map(). This would force a status update to be sent to the JT.

Also, you don't give a lot of detail... Are you writing out to an HBase table???

HTH
-Mike
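The timer test described above can be sketched without any Hadoop dependency. SlowCallTimer below is a hypothetical helper: it measures how long one unit of work takes and prints a line only when the elapsed time exceeds a threshold, such as the 500 seconds mentioned in the message. The injectable clock is an addition made here so the helper can be exercised without waiting; it is not something the thread proposes.

```java
import java.util.function.LongSupplier;

// Hypothetical helper: times one unit of work and reports only the slow ones.
final class SlowCallTimer {

    private final long thresholdMillis;
    private final LongSupplier clockMillis;
    private long slowCalls;
    private long maxMillis;

    /** Uses the wall clock. */
    SlowCallTimer(long thresholdMillis) {
        this(thresholdMillis, System::currentTimeMillis);
    }

    /** Clock is injectable so the timer can be checked with a fake clock. */
    SlowCallTimer(long thresholdMillis, LongSupplier clockMillis) {
        this.thresholdMillis = thresholdMillis;
        this.clockMillis = clockMillis;
    }

    /** Runs the body, returns elapsed milliseconds, logs if over the threshold. */
    long time(String label, Runnable body) {
        long start = clockMillis.getAsLong();
        body.run();
        long elapsed = clockMillis.getAsLong() - start;
        maxMillis = Math.max(maxMillis, elapsed);
        if (elapsed > thresholdMillis) {
            slowCalls++;
            System.err.println(label + " took " + elapsed + " ms");
        }
        return elapsed;
    }

    /** Number of timed calls that exceeded the threshold. */
    long slowCalls() {
        return slowCalls;
    }

    /** Longest single call seen so far, in milliseconds. */
    long maxMillis() {
        return maxMillis;
    }
}
```

In a real mapper the body of map(), or a single context.write call, throws checked exceptions and does not fit a Runnable, so the same two clock reads would be placed inline around it. Timing map() as a whole and the writes separately would show whether the stall sits inside a write or elsewhere in the method.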
-
Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 secRaj Vishwanthan 2012-01-19, 03:28
You can try the following:
- Make it a map-only job (for debugging purposes).
- Start your shuffle phase after all the maps are complete (there is a parameter for this).
- Characterize your disks for performance.

Raj

Steve Lewis <[EMAIL PROTECTED]> wrote:
> In my hands the problem occurs in all map jobs. An associate with a different cluster (mine has 8 nodes, his 40) reports that 80% of map tasks fail, with a few succeeding. I suspect some kind of I/O wait but fail to see how it gets to 600 sec.
>
> On Wed, Jan 18, 2012 at 4:50 PM, Raj V <[EMAIL PROTECTED]> wrote:
>> Steve
>> Does the timeout happen for all the map jobs? Are you using some kind of shared storage for map outputs? Any problems with the physical disks? If the shuffle phase has started, could the disks be I/O waiting between the read and write?
>> Raj
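Raj's first two suggestions can be expressed as job-property overrides. The sketch below only assembles the `-D` options; it does not touch Hadoop itself. It assumes a Hadoop 1.x style job whose driver goes through `ToolRunner`/`GenericOptionsParser` (otherwise `-D` options are not picked up), and it assumes the "parameter" Raj has in mind is `mapred.reduce.slowstart.completed.maps`. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects Hadoop 1.x property overrides for a debug run
// and renders them as -D options for a ToolRunner-based driver.
class DebugRunOptions {

    // Returns the overrides for one of the two debug variants.
    static Map<String, String> debugOverrides(boolean mapOnly) {
        Map<String, String> props = new LinkedHashMap<>();
        if (mapOnly) {
            // Zero reduce tasks turns the job into a map-only job.
            props.put("mapred.reduce.tasks", "0");
        } else {
            // Assumed to be the parameter Raj means: schedule reducers only
            // once 100% of the map tasks have completed.
            props.put("mapred.reduce.slowstart.completed.maps", "1.0");
        }
        return props;
    }

    // Renders the overrides as a space-separated list of -Dkey=value options.
    static String toCommandLine(Map<String, String> props) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : props.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("-D").append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCommandLine(debugOverrides(true)));
        System.out.println(toCommandLine(debugOverrides(false)));
    }
}
```

Running the two variants separately helps isolate the cause: if the map-only run still times out, the shuffle is probably not to blame.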
-
Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec
Michel Segel 2012-01-19, 12:08
Timeout errors don't usually occur outside of the Mapper.map() 'phase'.
When we've seen this error, it has had to do with M/R going against HBase. Since the OP sees the error when he does a bulk 'write', but it stops when he reduces the number of writes, that kind of suggests where the problem occurs. Unless of course I missed something...

Mike Segel
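To make concrete how much the OP's guard reduces the write volume, here is a stand-alone sketch (no Hadoop dependency) of the `if(NumberWrites++ % 100 == 0)` test. It assumes a single counter per map task and that the `NumberFragments` counter counts potential writes. The class and method names are hypothetical, and the increment stands in for `context.write(key, value)`.

```java
// Stand-alone illustration of the OP's write guard:
//     if (NumberWrites++ % 100 == 0) context.write(key, value);
class SampledWrites {

    // Counts how many of `potentialWrites` calls pass the guard.
    static long countActualWrites(long potentialWrites, int every) {
        long numberWrites = 0; // mirrors the OP's NumberWrites counter
        long actual = 0;
        for (long i = 0; i < potentialWrites; i++) {
            if (numberWrites++ % every == 0) {
                actual++; // stands in for context.write(key, value)
            }
        }
        return actual;
    }

    public static void main(String[] args) {
        // NumberFragments reported for the map task that succeeded.
        long fragments = 206_658_758L;
        System.out.println(countActualWrites(fragments, 100)); // prints 2066588
    }
}
```

The result matches the 2,066,588 map output records reported for the succeeding task, so that run emitted about 1% of the records the full job would. That is consistent with the timeout depending on output volume.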