HBase >> mail # dev >> Region server blocked at waitForAckedSeqno


Re: Region server blocked at waitForAckedSeqno
I have seen this before.  The best guess at the time was a bug somewhere
in the HDFS client (one of my colleagues was looking into it).  The
client has missed an ack'd seq number and will probably never
recover without a dn and rs restart.  I'll try to dig up any of the
pertinent info.
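
For reference, waitForAckedSeqno is where the writer thread parks until the datanode pipeline acks the packet with that sequence number. A minimal sketch of that monitor-wait pattern (class and method names here are illustrative, not the actual DFSOutputStream code):

```java
// Sketch of the waitForAckedSeqno pattern: the flushing thread waits on a
// monitor until an ack-processing thread advances the acked-seqno watermark.
// Names are hypothetical; the real HDFS client is more involved.
public class AckWaitSketch {
    private long lastAckedSeqno = -1;

    // Writer side: block until seqno has been acknowledged. The timed wait
    // matches the TIMED_WAITING (on object monitor) state in the jstacks.
    public synchronized void waitForAckedSeqno(long seqno) throws InterruptedException {
        while (lastAckedSeqno < seqno) {
            wait(1000);
        }
    }

    // Ack-processing side: advance the watermark and wake any waiters.
    public synchronized void ackUpTo(long seqno) {
        if (seqno > lastAckedSeqno) {
            lastAckedSeqno = seqno;
        }
        notifyAll();
    }

    public synchronized long lastAcked() {
        return lastAckedSeqno;
    }
}
```

If an ack is dropped and the watermark never reaches the awaited seqno, the waiting thread loops in wait() indefinitely, which is what the jstacks quoted in this thread show.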

What version of HDFS are you running?
Same version of client and server?
Anything happening with the network at the time?
Was there an NN failover at the time?

On Tue, Sep 3, 2013 at 7:49 PM, Himanshu Vashishtha <[EMAIL PROTECTED]> wrote:
> Looking at the jstack, log roller and log syncer, both are blocked to get
> the sequence number:
> {code}
> "regionserver60020.logRoller" daemon prio=10 tid=0x00007f317007f800
> nid=0x27ee6 in Object.wait() [0x00007f318acd8000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at
> org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:1708)
>         - locked <0x00007f34ae7b3638> (a java.util.LinkedList)
>         at
> org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1609)
>
> .....
> "regionserver60020.logSyncer" daemon prio=10 tid=0x00007f317007e800
> nid=0x27ee5 in Object.wait() [0x00007f318add9000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at
> org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:1708)
>         - locked <0x00007f34ae7b3638> (a java.util.LinkedList)
>         at
> org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1609)
>
> {code}
>
>
> This blocks other append ops.
>
> What do you see in the NN logs, and in the logs of the DN that has this log
> file? Can you pastebin the NN and DN logs along with their jstacks?
>
> On another note, I don't see the above exception in the log you attached.
> Is that really the meta regionserver log? All I could see for the meta table
> is that it calls MetaEditor to update meta, like an ordinary client. You
> seem to have your own set of Handlers?
> blah... "COP IPC Server handler 87 on 60020:" blah....
>
>
> Thanks,
> Himanshu
>
>
> On Mon, Sep 2, 2013 at 8:30 PM, Mickey <[EMAIL PROTECTED]> wrote:
>
>> Hi Himanshu,
>> It lasted for more than one hour. In the end I tried to stop the region
>> server and failed. From the jstack it was still blocked by the HLog
>> syncer. So I killed the process with "kill -9" and then HBase recovered.
>>
>> hbase.regionserver.logroll.errors.tolerated is the default value 0.
>>
>> My HBase cluster is mainly based on 0.94.1.
>>
>> Attachment is the region server which contains the .META. and the jstack
>> when it is blocked.
>>
>> Thanks,
>> Mickey
>>
>>
>>
>> 2013/9/2 Himanshu Vashishtha <[EMAIL PROTECTED]>
>>
>>> Hey Mickey,
>>>
>>> I have a few follow-up questions:
>>>
>>> For how long were these threads blocked? What happened afterwards: did the
>>> regionserver resume, or abort?
>>> And could you pastebin the logs after the above exception?
>>> A sync failure causes a log roll, which is retried based on the value of
>>> hbase.regionserver.logroll.errors.tolerated.
>>> Which 0.94 version are you using?
>>>
>>> Thanks,
>>> Himanshu
>>>
>>>
>>>
>>> On Mon, Sep 2, 2013 at 5:16 AM, Mickey <[EMAIL PROTECTED]> wrote:
>>>
>>> > Hi, all
>>> >
>>> > I was testing HBase with HDFS QJM HA recently. The Hadoop version is
>>> > CDH 4.3.0 and HBase is based on 0.94 with some patches (including
>>> > HBASE-8211). In a test, I hit a blocking issue in HBase. I killed a
>>> > node which hosted the active namenode, plus a datanode and regionserver.
>>> >
>>> > HDFS failed over successfully. The master tried to re-assign the regions
>>> > after detecting the regionserver was down. But no region could come
>>> > online.
>>> >
>>> > From the log I found all operations to .META. failed. Printing the
>>> > jstack of the region server that contains .META., I found the info below:
>>> > "regionserver60020.logSyncer" daemon prio=10 tid=0x00007f317007e800