Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
HDFS >> mail # dev >> Probably a bug in FsEditLog


+
Anty 2013-04-07, 09:48
Copy link to this message
-
Re: Probably a bug in FsEditLog
Thanks for analyzing and reporting this Anty,

What version of Apache Hadoop 1.x are you encountering this on? If
you've spotted the code issue on branch-1, please do log a HDFS JIRA
with some NN logs and your other details.

On Sun, Apr 7, 2013 at 3:18 PM, Anty <[EMAIL PROTECTED]> wrote:
> Hi:ALL
>
> In our cluster, we configure the NameNode to write to both local  and NFS
> mounted directories. When the NFS mounted directory is inaccessible, the
> NameNode should keep running without error, but our namenode crash with
> following stack trace.
>
> 2013-04-02 23:35:21,535 WARN org.apache.hadoop.hdfs.server.common.Storage:
>> Removing storage dir /nfs2-mount/onest3/dfs/name
>> 2013-04-02 23:35:21,536 FATAL
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to find edits
>> stream with IO error
>> java.lang.Exception: Unable to find edits stream with IO error
>>         at
>> org.apache.hadoop.hdfs.server.namenode.FSEditLog.fatalExit(FSEditLog.java:430)
>>         at
>> org.apache.hadoop.hdfs.server.namenode.FSEditLog.removeEditsStreamsAndStorageDirs(FSEditLog.java:519)
>>         at
>> org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:1139)
>>         at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1641)
>>         at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:689)
>>         at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
>>         at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:396)
>>         at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
>>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
>> 2013-04-02 23:35:21,539 INFO
>> org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
>>
>
> According to the stack trace, When NameNode tries to sync edit log, it does
> identify the mounted NFS directory is inaccessible, and attempt to remove
> it from the FSEditLog#editStreams. However, it found the edit stream
> corresponding to the mounted NFS has already been removed. Under this
> circumstance, NameNode just kill itself, aborted!
>
>  After looking through the source code of HDFS, I found there is another
> code path of removing edit stream from FSEditLog#editStreams, which can
> cause above race condition. In method FSEditLog#logEdit
>
>
>>    if (getNumEditStreams() < 1)
>>         {
>>             throw new AssertionError("No edit streams to log to");
>>         }
>>         long start = FSNamesystem.now();
>>         for (int idx = 0; idx < editStreams.size(); idx++)
>>         {
>>             EditLogOutputStream eStream = editStreams.get(idx);
>>             try
>>             {
>>                 eStream.write(op, writables);
>>             }
>>             catch (IOException ioe)
>>             {
>>                 removeEditsAndStorageDir(idx);
>>                 idx--;
>>             }
>>         }
>>
>>
> The cause of this race condition lie in FSEditLog#logSync method, there are
> two steps in FSEditLog#logSync
>
> 1.    Do sync operation, if one edit stream is accessible, put it into
> error stream list.(un-synchronized)
> 2.    Delete error stream from FSEditLog#editStreams(synchronized)
>
> Step #1 isn’t synchronized, so there is a possibility that after step#1 and
> before step #2 the error stream has already been removed from other thread
> by invoking FSEditLog#logEdit
>
> If this was exactly a bug, my proposal fix is in method FSEditlog#logsync
> ignore or print some warning message instead of aborting the namenode when

Harsh J
+
Anty 2013-04-08, 09:14
+
Harsh J 2013-04-08, 09:24
+
Anty 2013-04-08, 14:22
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB