HBase, mail # user - Region server shutting down due to HDFS error


Re: Region server shutting down due to HDFS error
Eran Kutner 2012-04-05, 14:35
Freudian slip :)

-eran

On Thu, Apr 5, 2012 at 16:52, Ted Yu <[EMAIL PROTECTED]> wrote:

> Thanks for writing back.
>
> I guess you meant 'things are now operating well', below :-)
>
> On Thu, Apr 5, 2012 at 6:25 AM, Eran Kutner <[EMAIL PROTECTED]> wrote:
>
> > As promised I'm writing back to update the list.
> > Seems that after upgrading to cdh3u3 of the hadoop cluster and zookeeper
> > ensemble (hadoop alone wasn't enough) things are no operating well with no
> > HDFS errors in the logs. I've also set
> > hbase.regionserver.logroll.errors.tolerated to 3 just in case. Now that the
> > log is clean a new exception shows up but I'll open a separate thread about
> > it.
> >
> > Thanks everyone.
> >
> > -eran
> >
> >
> >
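[The WAL-roll setting Eran mentions lives in hbase-site.xml; a minimal sketch of the change, using the value 3 from his message — the property name is real HBase configuration, the comment wording is ours:]

```xml
<!-- hbase-site.xml: tolerate up to 3 consecutive failed WAL (HLog) rolls
     before the region server aborts itself -->
<property>
  <name>hbase.regionserver.logroll.errors.tolerated</name>
  <value>3</value>
</property>
```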
> > On Wed, Mar 28, 2012 at 23:06, Eran Kutner <[EMAIL PROTECTED]> wrote:
> >
> > > hmmm... I couldn't find it either, so I've looked at the history of that
> > > file and sure enough a few check-ins back it had that message.
> > > I have no idea how something like this could happen. I know I had some
> > > merge issues when I first got the latest version and built that project but
> > > I've then reverted all local changes and rebuilt. The only thing I can
> > > imagine is that the previous compiled class file was not modified and it
> > > was the one that got included in the JAR, although I don't really know how
> > > that could happen.
> > >
> > > -eran
> > >
> > >
> > >
> > > On Wed, Mar 28, 2012 at 18:53, Ted Yu <[EMAIL PROTECTED]> wrote:
> > >
> > >> Eran:
> > >> The error indicated some zookeeper related issue.
> > >> Do you see KeeperException after the Error log ?
> > >>
> > >> I searched the 0.90 codebase but couldn't find the exact log phrase:
> > >>
> > >> zhihyu$ find src/main -name '*.java' -exec grep "getting node's version in CLOSI" {} \; -print
> > >> zhihyu$ find src/main -name '*.java' -exec grep 'Error getting ' {} \; -print
> > >>
> > >> Cheers
> > >>
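[The two find/-exec searches above can be collapsed into one recursive grep; a self-contained sketch under stated assumptions — the demo tree, file name, and log string below are made up for illustration, and GNU grep's `-r`/`--include` flags are assumed available:]

```shell
# Build a tiny throwaway source tree so the command below has something to
# match (in a real checkout you would just run the grep from the repo root).
mkdir -p demo/src/main/java
printf '%s\n' 'LOG.warn("Error getting node version in CLOSING state");' \
  > demo/src/main/java/CloseRegionHandler.java

# One recursive grep replaces the find -exec pipeline: -r recurses into the
# tree, -n prints line numbers, --include restricts matches to Java sources.
grep -rn --include='*.java' 'Error getting' demo/src/main
```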
> > >> On Wed, Mar 28, 2012 at 9:45 AM, Eran Kutner <[EMAIL PROTECTED]> wrote:
> > >>
> > >> > I don't see any prior HDFS issues in the 15 minutes before this exception.
> > >> > The logs on the datanode reported as problematic are clean as well.
> > >> > However, I now see the log is full of errors like this:
> > >> > 2012-03-28 00:15:05,358 DEBUG
> > >> > org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing
> > >> > close of gs_users,731481|Sn쒪㝨眳ԫ䂣⫰==,1331226388691.29929cb2200b3541ead85e17b836ade5.
> > >> > 2012-03-28 00:15:05,359 WARN
> > >> > org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Error
> > >> > getting node's version in CLOSING state, aborting close of
> > >> > gs_users,731481|Sn쒪㝨眳ԫ䂣⫰==,1331226388691.29929cb2200b3541ead85e17b836ade5.
> > >> >
> > >> > -eran
> > >> >
> > >> >
> > >> >
> > >> > On Wed, Mar 28, 2012 at 18:38, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:
> > >> >
> > >> > > Any chance we can see what happened before that too? Usually you
> > >> > > should see a lot more HDFS spam before getting that all the datanodes
> > >> > > are bad.
> > >> > >
> > >> > > J-D
> > >> > >
> > >> > > On Wed, Mar 28, 2012 at 4:28 AM, Eran Kutner <[EMAIL PROTECTED]> wrote:
> > >> > > > Hi,
> > >> > > >
> > >> > > > We have region servers sporadically stopping under load, due supposedly
> > >> > > > to errors writing to HDFS. Things like:
> > >> > > >
> > >> > > > 2012-03-28 00:37:11,210 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > >> > > > while syncing
> > >> > > > java.io.IOException: All datanodes 10.1.104.10:50010 are bad. Aborting..
> > >> > > >
> > >> > > > It's happening with a different region server and data node every time,
> > >> > > > so it's not a problem with one specific server and there doesn't seem to
> > >> > > > be anything really wrong with either of them. I've already increased the
> > >>