Re: Split brain - is it possible in hadoop?
Michael Segel 2012-06-19, 12:47
In your example, you only have one active Name Node. So how would you encounter a 'split brain' scenario?
Maybe it would be better if you defined what you mean by a split brain?
On Jun 18, 2012, at 8:30 PM, hdev ml wrote:
> All hadoop contributors/experts,
> I am trying to simulate split brain in our installation. There are a few
> things we want to know
> 1. Does data corruption happen?
> 2. If Yes in #1, how to recover from it.
> 3. What are the corrective steps to take in this situation e.g. killing one
> namenode etc
> So to simulate this I took the following steps.
> 1. We already have a healthy test cluster, consisting of 4 machines. One
> machine runs namenode and a datanode, other machine runs secondarynamenode
> and a datanode, 3rd runs jobtracker and a datanode, and 4th one just a
> datanode.
> 2. Copied the hadoop installation folder to a new location in the datanode.
> 3. Kept all configurations same in hdfs-site and core-site xmls, except
> renamed the fs.default.name to a different URI
> 4. The namenode directory - dfs.name.dir was pointing to the same shared
> NFS-mounted directory that the main namenode points to.
> I started this standby namenode using the following command
> bin/hadoop-daemon.sh --config conf --hosts slaves start namenode
> It errored out saying that "the directory is already locked", which is an
> expected behaviour. The directory has been locked by the original namenode.
> So I changed the dfs.name.dir to some other folder, and issued the same
> command. It fails with the message - "namenode has not been formatted", which
> is also expected.
> This makes me think - does a split-brain situation really occur in hadoop?
> My understanding is that split brain happens because of timeouts on the
> main namenode. The way it happens is, when the timeout occurs, the HA
> implementation - be it Linux HA, Veritas etc. - thinks that the main
> namenode has died and tries to start the standby namenode. The standby
> namenode starts up and then main namenode comes back from the timeout phase
> and starts functioning as if nothing happened, giving rise to 2 namenodes
> in the cluster - Split Brain.
> Considering the error messages and the above understanding, I cannot point
> 2 different namenodes to same directory, because the main namenode isn't
> responding but has locked the directory.
> So can I safely conclude that split brain does not occur in hadoop?
> Or am I missing any other situation where split brain happens and the
> namenode directory is not locked, thus allowing the standby namenode also
> to start up?
> Has anybody encountered this?
> Any help is really appreciated.
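
The "directory is already locked" error in the quoted message comes from an exclusive lock that the running namenode holds on a lock file (in_use.lock) inside its storage directory. The sketch below is only an illustration of that idea in Python, not Hadoop's actual code: the StorageDirLock class and its method names are hypothetical, and it uses flock() on a local temporary directory. Lock behaviour over NFS depends on the mount options and lock daemon, so this does not show what a shared NFS directory would do.

```python
import fcntl
import os
import tempfile


class StorageDirLock:
    """Hypothetical helper: exclusive advisory lock on <dir>/in_use.lock."""

    def __init__(self, storage_dir):
        self.path = os.path.join(storage_dir, "in_use.lock")
        self.fd = None

    def try_lock(self):
        """Return True if the lock was acquired, False if it is already held."""
        fd = os.open(self.path, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            # Non-blocking exclusive lock: fails at once if another holder exists.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            os.close(fd)
            return False
        self.fd = fd
        return True

    def release(self):
        """Drop the lock and close the file descriptor, if held."""
        if self.fd is not None:
            fcntl.flock(self.fd, fcntl.LOCK_UN)
            os.close(self.fd)
            self.fd = None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as name_dir:
        primary = StorageDirLock(name_dir)   # stands in for the main namenode
        standby = StorageDirLock(name_dir)   # stands in for the second namenode
        print("primary locked:", primary.try_lock())
        print("standby locked:", standby.try_lock())
        primary.release()
        print("standby after release:", standby.try_lock())
        standby.release()
```

The second holder is refused only while the first one still holds the lock. A lock of this kind protects a single directory; it says nothing about a second namenode that starts against a different, already formatted copy of the metadata.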