Eran Kutner
2012-03-28, 11:28
Jimmy Xiang
2012-03-28, 14:17
Eran Kutner
2012-03-28, 15:09
Stack
2012-03-28, 15:20
Harsh J
2012-03-28, 15:21
Eran Kutner
2012-03-28, 15:25
Jean-Daniel Cryans
2012-03-28, 16:38
Eran Kutner
2012-03-28, 16:45
Jean-Daniel Cryans
2012-03-28, 16:48
Ted Yu
2012-03-28, 16:53
Eran Kutner
2012-03-28, 20:06
Eran Kutner
2012-04-05, 13:25
Ted Yu
2012-04-05, 13:52
Eran Kutner
2012-04-05, 14:35
-
Region server shutting down due to HDFS error
Eran Kutner 2012-03-28, 11:28
Hi,

We have region servers sporadically stopping under load, supposedly due to errors writing to HDFS. Things like:

2012-03-28 00:37:11,210 WARN org.apache.hadoop.hdfs.DFSClient: Error while syncing
java.io.IOException: All datanodes 10.1.104.10:50010 are bad. Aborting..

It's happening with a different region server and data node every time, so it's not a problem with one specific server, and there doesn't seem to be anything really wrong with either of them. I've already increased the file descriptor limit, datanode xceivers and data node handler count. Any idea what can be causing these errors?

A more complete log is here: http://pastebin.com/wC90xU2x

Thanks.

-eran
-
Re: Region server shutting down due to HDFS error
Jimmy Xiang 2012-03-28, 14:17
Which version of HDFS and HBase are you using?

When the problem happens, can you access HDFS, for example, from hadoop dfs?

Thanks,
Jimmy
-
Re: Region server shutting down due to HDFS error
Eran Kutner 2012-03-28, 15:09
Hi Jimmy,
HBase is built from the latest sources of the 0.90 branch (0.90.7-SNAPSHOT); I had the same problem with 0.90.4.
Hadoop is 0.20.2 from Cloudera CDH3u1.

This failure happens during large M/R jobs. I have 10 servers and usually no more than 1 would fail like this, sometimes none.
One thing worth mentioning is that the table it is trying to write to has over 5000 regions.

-eran
-
Re: Region server shutting down due to HDFS error
Stack 2012-03-28, 15:20
On Wed, Mar 28, 2012 at 8:09 AM, Eran Kutner <[EMAIL PROTECTED]> wrote:
> Hi Jimmy,
> HBase is built from latest sources of 0.90 branch (0.90.7-SNAPSHOT), I had
> the same problem with 0.90.4
> Hadoop 0.20.2 from Cloudera CDH3u1
>

Can you upgrade to CDH3u3, Eran? I don't remember if CDH3u1 had support for sync (the complaint in your log is coming up out of sync). If it didn't, that's a problem (see the reference guide). You should upgrade anyway because of the loads of fixes.

> This failure happens during large M/R jobs, I have 10 servers and usually
> no more than 1 would fail like this, sometimes none.
> One thing worth mentioning is that the table it is trying to write to has
> over 5000 regions.
>

Can you stop the table splitting more? It's a config setting.

Can I see more logs? In particular the bit HBase emits on startup.

Thanks,
St.Ack
-
Re: Region server shutting down due to HDFS error
Harsh J 2012-03-28, 15:21
Eran,

For 0.90.7 SNAPSHOT, set "hbase.regionserver.logroll.errors.tolerated" to a value > 0 (0 is the default). This will help the RS survive transient HLog sync failures (with the local DN) by retrying a few times before the RS decides to shut itself down.

Also worth investigating whether you had too much IO load etc. on the box that led to the DN throwing up an error during sync().

P.S. The fix from https://issues.apache.org/jira/browse/HBASE-4222 will also be in CDH3u4.

--
Harsh J
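
The idea behind a tolerated-errors setting can be pictured with a small sketch. This is only a generic illustration of a failure threshold, not HBase's actual HLog code; the function name `sync_with_tolerance` and its parameters are hypothetical and chosen for this example.

```python
def sync_with_tolerance(sync, errors_tolerated):
    """Call sync() until it succeeds.

    Gives up (raises RuntimeError) once the number of failed attempts
    exceeds errors_tolerated. With errors_tolerated=0 the very first
    failure is fatal, which mirrors the "abort immediately" behaviour
    described for the default value.
    """
    failures = 0
    while True:
        try:
            return sync()
        except IOError as exc:
            failures += 1
            if failures > errors_tolerated:
                # Too many failures in a row: stop retrying and abort.
                raise RuntimeError(
                    "aborting after %d failed syncs" % failures
                ) from exc
```

With a threshold of 3, a write path that fails twice and then recovers would survive, while the same two failures with a threshold of 0 would end in an abort on the first one.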
-
Re: Region server shutting down due to HDFS error
Eran Kutner 2012-03-28, 15:25
Thanks Stack and Harsh, I'll try both suggestions and update the list with the results.

-eran
-
Re: Region server shutting down due to HDFS error
Jean-Daniel Cryans 2012-03-28, 16:38
Any chance we can see what happened before that too? Usually you should see a lot more HDFS spam before getting the message that all the datanodes are bad.

J-D
-
Re: Region server shutting down due to HDFS error
Eran Kutner 2012-03-28, 16:45
I don't see any prior HDFS issues in the 15 minutes before this exception. The logs on the datanode reported as problematic are clean as well.
However, I now see the log is full of errors like this:

2012-03-28 00:15:05,358 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of gs_users,731481|Sn+xKryLzdodzMFK0CjKvA==,1331226388691.29929cb2200b3541ead85e17b836ade5.
2012-03-28 00:15:05,359 WARN org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Error getting node's version in CLOSING state, aborting close of gs_users,731481|Sn+xKryLzdodzMFK0CjKvA==,1331226388691.29929cb2200b3541ead85e17b836ade5.

-eran
-
Re: Region server shutting down due to HDFS error
Jean-Daniel Cryans 2012-03-28, 16:48
Can you look even further? Like a day?

J-D
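
One way to look back over a longer window is to filter the region server log for HDFS client warnings that precede the abort. The following is a rough sketch under assumptions: it expects log lines shaped like the ones quoted in this thread (timestamp, level, logger class, message), and the function name `hdfs_warnings_before` and the 24-hour default are made up for illustration.

```python
from datetime import datetime, timedelta

# Timestamp layout as it appears in the log lines quoted in this thread.
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def hdfs_warnings_before(lines, abort_time, window_hours=24):
    """Return WARN/ERROR/FATAL lines logged by org.apache.hadoop.hdfs.*
    classes within window_hours before abort_time (inclusive)."""
    start = abort_time - timedelta(hours=window_hours)
    hits = []
    for line in lines:
        parts = line.split(None, 3)  # date, time, level, rest of line
        if len(parts) < 4:
            continue
        try:
            stamp = datetime.strptime(parts[0] + " " + parts[1], TS_FORMAT)
        except ValueError:
            continue  # continuation line, e.g. part of a stack trace
        level, rest = parts[2], parts[3]
        if (
            level in ("WARN", "ERROR", "FATAL")
            and rest.startswith("org.apache.hadoop.hdfs.")
            and start <= stamp <= abort_time
        ):
            hits.append(line)
    return hits
```

Called with the lines of a log file and the time of the "All datanodes ... are bad" message, it returns the earlier HDFS complaints, if there are any, in the order they were logged.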
-
Re: Region server shutting down due to HDFS error
Ted Yu 2012-03-28, 16:53
Eran:
The error indicated some ZooKeeper-related issue.
Do you see a KeeperException after the error log?

I searched the 0.90 codebase but couldn't find the exact log phrase:

zhihyu$ find src/main -name '*.java' -exec grep "getting node's version in CLOSI" {} \; -print
zhihyu$ find src/main -name '*.java' -exec grep 'Error getting ' {} \;

Cheers
-
Re: Region server shutting down due to HDFS error
Eran Kutner 2012-03-28, 20:06
hmmm... I couldn't find it either, so I've looked at the history of that file and sure enough a few check-ins back it had that message.
I have no idea how something like this could happen. I know I had some merge issues when I first got the latest version and built that project, but I've then reverted all local changes and rebuilt. The only thing I can imagine is that the previously compiled class file was not modified and it was the one that got included in the JAR, although I don't really know how that can happen.

-eran
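
A stale class file in a deployed JAR can be checked for directly by searching the compiled classes for the old log phrase. The sketch below is an assumption-laden illustration: it relies on the phrase being stored as one plain-ASCII string constant in the class file (which may not hold if the message is assembled from several pieces), and the function name `classes_containing` is hypothetical.

```python
import zipfile


def classes_containing(jar_path, phrase):
    """List the .class entries of a JAR whose raw bytes contain phrase.

    jar_path may be a filesystem path or a binary file-like object.
    String constants in class files use a UTF-8 variant that is
    byte-identical to UTF-8 for plain ASCII text, so a simple byte
    search is enough for ASCII phrases.
    """
    needle = phrase.encode("utf-8")
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(
            name
            for name in jar.namelist()
            if name.endswith(".class") and needle in jar.read(name)
        )
```

For example, `classes_containing("hbase.jar", "getting node's version in CLOSING")` (the JAR name here is a placeholder) would return a non-empty list if the old message is still present in the JAR that the region servers run.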
-
Re: Region server shutting down due to HDFS error
Eran Kutner 2012-04-05, 13:25
As promised I'm writing back to update the list.
Seems that after upgrading to cdh3u3 of the hadoop cluster and zookeeper ensemble (hadoop alone wasn't enough) things are no operating well with no HDFS errors in the logs. I've also set hbase.regionserver.logroll.errors.tolerated to 3 just in case. Now that the log is clean a new exception shows up but I'll open a separate thread about it.

Thanks everyone.

-eran
-
Re: Region server shutting down due to HDFS error
Ted Yu 2012-04-05, 13:52
Thanks for writing back.

I guess you meant 'things are now operating well', below :-)

On Thu, Apr 5, 2012 at 6:25 AM, Eran Kutner <[EMAIL PROTECTED]> wrote:
> Seems that after upgrading to cdh3u3 of the hadoop cluster and zookeeper
> ensemble (hadoop alone wasn't enough) things are no operating well with no
> HDFS errors in the logs.
-
Re: Region server shutting down due to HDFS error
Eran Kutner 2012-04-05, 14:35
Freudian slip :)

-eran