MapReduce, mail # user - Hadoop - cluster set-up (for DUMMIES)... or how I did it


Re: Hadoop - cluster set-up (for DUMMIES)... or how I did it
Mohammad Tariq 2012-11-03, 08:58
Hello Andy,

        Thank you for sharing your experience with us. I would just like
to add that it is always good to include the "dfs.name.dir" and "dfs.data.dir"
properties in the hdfs-site.xml file to make sure that everything runs
smoothly, since /tmp gets emptied at each restart and there is always a chance
of losing the data and meta info. It is also good to add "hadoop.tmp.dir" to
core-site.xml, as it also defaults to /tmp.
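
For illustration, a minimal sketch of what that could look like (the
/data/hadoop paths below are just placeholders; pick directories on your own
disks):

In hdfs-site.xml:
<property>
  <name>dfs.name.dir</name>
  <value>/data/hadoop/dfs/name</value>   <!-- placeholder path, anywhere but /tmp -->
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/hadoop/dfs/data</value>   <!-- placeholder path, anywhere but /tmp -->
</property>

In core-site.xml:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop/tmp</value>        <!-- placeholder path, anywhere but /tmp -->
</property>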

Regards,
    Mohammad Tariq

On Fri, Nov 2, 2012 at 10:05 PM, Kartashov, Andy <[EMAIL PROTECTED]> wrote:

> Hello Hadoopers,
>
> After weeks of struggle and numerous rounds of error debugging, I finally
> managed to set up a fully distributed cluster. I decided to share my
> experience with the newcomers.
> In case the experts on here disagree with some of the facts mentioned
> herein, feel free to correct or add your comments.
>
> Example Cluster Topology:
> Node 1 – NameNode+JobTracker
> Node 2 – SecondaryNameNode
> Node 3, 4, .., N – DataNodes 1,2,..N+TaskTrackers 1,2,..N
>
> Configuration set-up after you installed Hadoop:
>
> Firstly, you will need to find the host name of each of your respective Nodes
> by running:
> $ hostname -f
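> For example, on an EC2 instance it prints the internal FQDN (the value shown
> here is just the host-name reused later in this mail, not a literal output):
> $ hostname -f
> ip-10-62-62-235.ec2.internal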
>
> Your /etc/hadoop/ folder contains subfolders holding your configuration files.
> Your installation will create a default folder, conf.empty. Copy it to, say,
> conf.cluster and make sure your soft link conf -> points to conf.cluster.
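>
> A minimal way to make that copy (assuming the packaged conf.empty location
> under /etc/hadoop/ mentioned above):
> $ sudo cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.cluster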
>
> You can see what it currently points to by running:
> $ alternatives --display hadoop-conf
>
> Make a new link and set it to point to conf.cluster:
> $ sudo alternatives --verbose --install /etc/hadoop/conf hadoop-conf
> /etc/hadoop/conf.cluster 50
> $ sudo alternatives --set hadoop-conf /etc/hadoop/conf.cluster
> Run the display command again to verify the configuration:
> $ alternatives --display hadoop-conf
>
> Let’s go inside conf.cluster
> $ cd conf.cluster/
>
> As a minimum, we will need to modify the following files:
> 1.      core-site.xml
> <property>
>   <name>fs.defaultFS</name>
>     <value>hdfs://<host-name>:8020/</value> # this is the host-name of your
> NameNode (Node 1), which you found with "hostname -f" above
>   </property>
>
> 2.      mapred-site.xml
>   <property>
>     <name>mapred.job.tracker</name>
>     <!-- <value><host-name>:8021</value> --> # this is the host-name of your
> NameNode (Node 1) as well, since we intend to run the NameNode and JobTracker
> on the same machine
>     <value>ip-10-62-62-235.ec2.internal:8021</value>
>   </property>
>
> 3.      masters # if this file doesn’t exist yet, create it and add one
> line:
> <host-name> # it is the host-name of your Node2 – running SecondaryNameNode
>
> 4.      slaves # if this file doesn’t exist yet, create it and add your
> host-names ( one per line):
> <host-name> # it is the host-name of your Node3 – running DataNode1
> <host-name> # it is the host-name of your Node4 – running DataNode2
> ….
> <host-name> # it is the host-name of your NodeN – running DataNodeN
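>
> Purely as an illustration (these internal host-names are made up), a slaves
> file for three DataNodes would simply read:
> ip-10-0-0-11.ec2.internal
> ip-10-0-0-12.ec2.internal
> ip-10-0-0-13.ec2.internal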
>
>
> 5.      If you are not comfortable touching hdfs-site.xml, no problem:
> after you format your NameNode, it will create the dfs/name, dfs/data, etc.
> folder structure in the local Linux default /tmp/hadoop-hdfs/ directory.
> You can later change this to a different folder by specifying the paths in
> hdfs-site.xml, but please first learn the file structure/permissions/owners
> of those directories (dfs/data, dfs/name, dfs/namesecondary, etc.) that were
> created for you by default.
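>
> A quick way to study that default layout and who owns what (the path assumes
> the /tmp/hadoop-hdfs default mentioned above):
> $ ls -lR /tmp/hadoop-hdfs/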
>
> Let's format the HDFS namespace (note we format it as the hdfs user):
> $ sudo -u hdfs hadoop namenode -format
> NOTE - you only run this command ONCE, and on the NameNode only!
>
> I only added the following property to my hdfs-site.xml on the NameNode -
> Node1, for the SecondaryNameNode to use:
>
> <property>
>   <name>dfs.namenode.http-address</name>
>   <value>namenode.host.address:50070</value>   # I changed this to
> 0.0.0.0:50070 for the EC2 environment
>   <description>
>     Needed for running SNN
>     The address and the base port on which the dfs NameNode Web UI will