After weeks of struggle, numerous error debugging and the like I finally managed to set-up a fully distributed cluster. I decided to share my experience with the new comers.
In case the experts on here disagree with some of the facts mentioned here-in feel free to correct or add your comments.
Example Cluster Topology:
Node 1 – NameNode+JobTracker
Node 2 – SecondaryNameNode
Node 3, 4, .., N – DataNodes 1,2,..N+TaskTrackers 1,2,..N
Configuration set-up after you installed Hadoop:
Firstly, you will need to find every host address of your respective Node by running:
Your /etc/hadoop/ folder contains subfolders of your configuration files. Your installation will create a default folder conf.empty. Copy it to, say conf.cluster and make sure your soft link conf-> points to conf.cluster
You can see what it points now to by running:
$ alternatives --display hadoop-conf
Make a new link and set it to point to conf.cluster:
$ sudo alternatives --verbose --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.cluster 50
$ sudo alternatives --set hadoop-conf /etc/hadoop/conf.cluster
Run the display again to check proper configuration
$ alternatives --display hadoop-conf
Let’s go inside conf.cluster
As a minimum, we will need to modify the following files:
<value>hdfs://<host-name>/:8020/</value> # it is the host-name of your NameNode -Node1 which you found with “hostname –f” above
<!--<value><host-name>:8021</value> --> # it is host-name of your NameNode – Node 1 as well, since we intend to run NameNode and JobTracker on the same machine
3. masters # if this file doesn’t exist yet, create it and add one line:
<host-name> # it is the host-name of your Node2 – running SecondaryNameNode
4. slaves # if this file doesn’t exist yet, create it and add your host-names ( one per line):
<host-name> # it is the host-name of your Node3 – running DataNode1
<host-name> # it is the host-name of your Node4 – running DataNode2
<host-name> # it is the host-name of your NodeN – running DataNodeN
5. If you are not comfortable touching hdfs-site.xml, no problem, after you format your NameNode, it will create dfs/name dfs/data etc. folder structure in your local Linux default /tmp/hadoop-hdfs/directory. You could later change this to a different folder by specifying hdfs-site.xml but please learn on the file structure/permissions/owners of those directories /dfs/data dfs/name dfs/namesecondary etc that were created for you by default first.
Let’s format HDFS namespace: (note we format it as hdfs user)
$ sudo –u hdfs hadoop namenode –format
NOTE – that you only run this command ONCE on the NameNode only!
I only added the following property to my hdfs-site.xml on the NameNode- Node1 for the SecondaryNameNode to use:
<value>namenode.host.address:50070</value> # I change this to 0.0.0.0:50070 for EC2 environment
Needed for running SNN
The address and the base port on which the dfs NameNode Web UI will listen.
If the port is 0, the server will start on a free port.
</property>other SNN properties for hdfs-site.xml
6. Copy you /conf.cluster/folder to every Node in your cluster: Node2 (SNN) , Node3,4,..N (DNs+TTs). Make sure your conf soft link points to tis directory (see above).
7. Now we ready to start daemons:
Everytime you start a respective Daemon, a log report is written. This is the FIRST place to look for potential problems.
Unless you change the property in hadoop-env.sh, found in your /conf/conf.cluster/ directory, namely “#export HADOOP_LOG_DIR=/foor/bar/whatever” the default logs are written on each respective Node to:
NameNode, DataNode, SecondaryNameNode – “/var/log/hadoop-hdfs/” directory
JobTracker,TaskTracker- “/var/log/hadoop-mapreduce” or “/var/log/hadoop-0.20-mapreduce/” or else, depending on the version of your MR. Respective Daemon will have a respective filename ending with .log
I came across a lot of errors playing with this, as follows:
a. Error: connection refused
This is normally caused by your firewall. Try running “sudo /etc/init.d/iptables status”. I bet it is running. Solution: either add allowed ports or temporarily turn off iptables by running “sudo service iptables stop”
Try to restart your Daemon (that is refused connection) and check your respective /var/log/…. Datanode or TaskTracker or else .log file again.
This solved my problems with connections. You can test connection by running “telnet <ip-address> <port>” of the Node you are trying to connect to.
b. Binding exception.
This happens when you try to start a Daemon on the machine that is not supposed to run this Daemon. For example, trying to start JobTracker on a slave machine. This is a given. JobTracker is already running on your MasterNode - Node1 hence the binding Exception.
c. Java heap size or Java Child exception were thrown when I ran too small of an instance on EC2. Increasing it from tiny to small or from small to medium, solved the issue.
d. DataNode running on slave throws an Exception about DataNode id –mismatch. This happened when I tried to duplicate an instance on EC2, and as a result ended up with two different DataNodes with the same id. Deleting /tmp/hadoop-hdfs/dfs/data directory on the replicated Instance and restarting dataNode Daemon solved this issue.
Now, that you fixed your above errors and restarted respective Daemons your ..log files should be clean of any errors.
Let’s now check that all of our DataNodes1,2-N (Nodes3,4…N) are registered with the Master Namenode - Node1.
“$hadoop dfsadmin –printTopology”
Should display all y