Arv Mistry
2010-09-15, 15:50
Nan Zhu
2010-09-15, 19:51
David Rosenstrauch
2010-09-15, 20:19
Matthew Foley
2010-09-16, 00:45
Ranjib Dey
2010-09-16, 08:27
Arv Mistry
2010-09-16, 13:55
-
Multiple DataNodes on a single machine
Arv Mistry 2010-09-15, 15:50

Hi,

Is it possible to run multiple DataNodes on a single machine? I currently have a machine with multiple disks and enough disk capacity for replication across them. I don't need redundancy at the machine level, but I would like to be able to handle a single disk failure.

So I was thinking that if I can run multiple DataNodes on a single machine, each assigned a separate disk, that would give me the protection I need against disk failure.

Can anyone give me any insight into how I would set up multiple DataNodes to run on a single machine? Thanks in advance,

Cheers,
Arv
-
Re: Multiple DataNodes on a single machine
Nan Zhu 2010-09-15, 19:51

Hi Arv,

Several days ago I deployed a system similar to what you describe. In our cluster environment, since I have to run a modified Hadoop, we run two namenodes, two jobtrackers, and two trackers on each node and, as you mentioned, two datanodes on a single host.

What you have to do is make sure the ports used by the datanodes do not conflict (modify the configuration files). To assign separate disks, just set a different HDFS root directory for each datanode. I guess you will also have to run two namenodes, as we did.

Cheers,
Nan
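
One way to check for the port conflicts mentioned above is to compare the host:port values in the two configuration files. The following is only a sketch under stated assumptions: the helper names (list_ports, shared_ports) and the demo file names are made up for illustration, and the pattern only recognizes values that end in ":port".

```shell
#!/bin/sh
# Sketch: report ports that two Hadoop site files both try to bind.
# Helper names and demo file names are hypothetical, not from the thread.

# list_ports FILE: print the port of every host:port <value> in FILE.
# Port 0 is skipped because it means "pick any free port".
list_ports() {
  grep -o '<value>[^<]*:[0-9][0-9]*</value>' "$1" |
    sed 's/.*:\([0-9][0-9]*\)<\/value>/\1/' |
    grep -v '^0$' | sort -u
}

# shared_ports FILE1 FILE2: print ports configured in both files.
shared_ports() {
  tmp_a=$(mktemp) && tmp_b=$(mktemp)
  list_ports "$1" > "$tmp_a"
  list_ports "$2" > "$tmp_b"
  comm -12 "$tmp_a" "$tmp_b"   # lines common to both sorted lists
  rm -f "$tmp_a" "$tmp_b"
}

# Demo with two throwaway files: the second datanode was accidentally
# left on the same data-transfer port as the first.
demo=$(mktemp -d)
printf '<value>0.0.0.0:50010</value>\n<value>0.0.0.0:50020</value>\n' > "$demo/dn1.xml"
printf '<value>0.0.0.0:50010</value>\n<value>0.0.0.0:50022</value>\n' > "$demo/dn2.xml"
shared_ports "$demo/dn1.xml" "$demo/dn2.xml"
rm -rf "$demo"
```

An empty result means the two files do not name the same port; it does not prove that some unrelated process is not already listening on one of them.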
-
Re: Multiple DataNodes on a single machine
David Rosenstrauch 2010-09-15, 20:19

I guess you *could*, but it doesn't seem to me that it would make much sense in a production environment, since the two datanodes running on the machine would be competing with each other for CPU, network bandwidth, etc.

DR
-
Re: Multiple DataNodes on a single machine
Matthew Foley 2010-09-16, 00:45

Hello Arv,

It is possible to run multiple datanodes on a single machine, and this can be useful for small-scale test scenarios. Also, you mentioned in your previous message that you have a Hadoop implementation with only one physical datanode server and want to replicate within it, between spindles. This also makes sense, and will work. Of course, if you have two datanodes running you will get only order-2 replication, not order-3, even if the replication has been set to 3.

I will describe the config in a moment, but I would first like to point out that in clusters with even a few datanode servers, one is better off with cross-server replication. Without cross-server replication, losing the system disk will make ALL data volumes unavailable. And of course, multiple datanodes running on one server will compete for cores, NICs, bus, and memory access, even if not for spindles.

A previous responder suggested running two namenodes also, but it wasn't clear whether he meant two primaries or one primary and one secondary/checkpoint nameserver. The latter is fine, but running two primary namenodes is definitely not the thing to do!

Anyway, here's how you set it up. I have done this recently with v0.21.0, with two datanode processes in a single box (along with the namenode sharing the same box), and it did replicate correctly between the two. I haven't tried it with more than 2 datanodes, and I don't know what the impact on process efficiency would be, but that would probably work too.

1. In your HADOOP_HOME directory, copy the "conf" directory to, say, "conf2".

2. In the conf2 directory, edit as follows:

   a) In hadoop-env.sh, provide a unique non-default HADOOP_IDENT_STRING, e.g. ${USER}_02

   b) In hdfs-site.xml, change dfs.data.dir to show the desired targets/volumes for datanode#2, and of course make sure the corresponding target directories exist. Also remove these targets from the dfs.data.dir target list for datanode#1 in conf/hdfs-site.xml.

   c) In hdfs-site.xml, set the following four "address:port" strings to something non-conflicting with the other datanode and other processes running on this box:

      - dfs.datanode.address (default 0.0.0.0:50010)
      - dfs.datanode.ipc.address (default 0.0.0.0:50020)
      - dfs.datanode.http.address (default 0.0.0.0:50075)
      - dfs.datanode.https.address (default 0.0.0.0:50475)

   Note: the defaults above are what datanode#1 is probably running on. I added 2 to each port number for datanode#2 and it seemed to work okay. You might also wish to note the default ports associated with the namenode and job/task tracker processes, in case they are running on the same box:

      - fs.default.name 0.0.0.0:9000
      - dfs.http.address 0.0.0.0:50070
      - dfs.https.address 0.0.0.0:50470
      - dfs.secondary.http.address 0.0.0.0:50090
      - mapred.job.tracker.http.address 0.0.0.0:50030
      - mapred.task.tracker.report.address 127.0.0.1:0
      - mapred.task.tracker.http.address 0.0.0.0:50060

3. At this point, launching with:

      bin/hdfs --config $HADOOP_HOME/conf2 datanode

   will work. To make it convenient to launch as a service, you can add a couple of lines to the end of the bin/start-dfs.sh script, like:

      HADOOP_CONF_DIR2=$HADOOP_HOME/conf2
      "$HADOOP_COMMON_HOME"/bin/hadoop-daemons.sh --config $HADOOP_CONF_DIR2 --script "$bin"/hdfs start datanode $dataStartOpt

Hope this helps,
--Matt
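
As a rough illustration of steps 2b and 2c, the overrides for datanode#2 could be generated with a small script like the one below. This is a sketch, not a tested deployment recipe: the data directory /data/disk2/dfs/data and the helper names are assumptions made for the example, while the property names, default ports, and the +2 offset come from the message above. It only prints the <property> elements; merging them into conf2/hdfs-site.xml is left to the reader.

```shell
#!/bin/sh
# Sketch only: print the hdfs-site.xml overrides for a second DataNode.
# DATA_DIR2 is a hypothetical mount point; OFFSET follows the "+2" in the thread.

DATA_DIR2=/data/disk2/dfs/data
OFFSET=2

# emit_property NAME VALUE: print one Hadoop <property> element.
emit_property() {
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# datanode2_properties: print the overrides from steps 2b and 2c,
# shifting each default datanode port by OFFSET.
datanode2_properties() {
  emit_property dfs.data.dir "$DATA_DIR2"
  emit_property dfs.datanode.address       "0.0.0.0:$((50010 + OFFSET))"
  emit_property dfs.datanode.ipc.address   "0.0.0.0:$((50020 + OFFSET))"
  emit_property dfs.datanode.http.address  "0.0.0.0:$((50075 + OFFSET))"
  emit_property dfs.datanode.https.address "0.0.0.0:$((50475 + OFFSET))"
}

datanode2_properties
```

Printing the fragment rather than overwriting hdfs-site.xml keeps whatever other settings were copied from the original conf directory intact.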
-
Re: Multiple DataNodes on a single machine
Ranjib Dey 2010-09-16, 08:27

To me, the notion of replication is to provide a failsafe mechanism in case some nodes go down, so running two datanodes on a single host does not serve this basic purpose. As already mentioned, you can run two datanodes using two instances of Hadoop with different ports. If you do not want to replicate, you can simply set the replication factor to 1.
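
For reference, the replication factor is normally set through the dfs.replication property in hdfs-site.xml. The snippet below is a minimal sketch that prints such an entry; the helper name replication_property is made up for this example.

```shell
#!/bin/sh
# Sketch: print an hdfs-site.xml entry that sets the default replication factor.

# replication_property N: print a dfs.replication <property> with value N.
replication_property() {
  printf '<property>\n  <name>dfs.replication</name>\n  <value>%s</value>\n</property>\n' "$1"
}

replication_property 1
```

As far as I know, this default applies to files created after the change; files that already exist keep their replication factor unless it is changed explicitly, for example with hadoop fs -setrep.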
-
RE: Multiple DataNodes on a single machine
Arv Mistry 2010-09-16, 13:55

Thanks for the responses. I especially appreciate the details, Matthew!

Just for the record, I appreciate that having multiple DataNodes on a single machine defeats the purpose, or the advantages, of having them spread across machines and racks. I intend to go to that model as we grow.

Cheers,
Arv