-
Problems with HOD and HDFS
David Milne 2010-06-10, 02:56
Hi there,
I am trying to get Hadoop on Demand up and running, but am having problems with the ringmaster not being able to communicate with HDFS.

The output from the hod allocate command ends with this, with full verbosity:

[2010-06-10 14:40:22,650] CRITICAL/50 hadoop:298 - Failed to retrieve 'hdfs' service address.
[2010-06-10 14:40:22,654] DEBUG/10 hadoop:631 - Cleaning up cluster id 34029.symphony.cs.waikato.ac.nz, as cluster could not be allocated.
[2010-06-10 14:40:22,655] DEBUG/10 hadoop:635 - Calling rm.stop()
[2010-06-10 14:40:22,665] DEBUG/10 hadoop:637 - Returning from rm.stop()
[2010-06-10 14:40:22,666] CRITICAL/50 hod:401 - Cannot allocate cluster /home/dmilne/hadoop/cluster
[2010-06-10 14:40:23,090] DEBUG/10 hod:597 - return code: 7

I've attached the hodrc file below, but briefly HOD is supposed to provision an HDFS cluster as well as a Map/Reduce cluster, and seems to be failing to do so. The ringmaster log looks like this:

[2010-06-10 14:36:05,144] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
[2010-06-10 14:36:05,145] DEBUG/10 ringMaster:487 - getServiceAddr service: <hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
[2010-06-10 14:36:05,147] DEBUG/10 ringMaster:504 - getServiceAddr addr hdfs: not found
[2010-06-10 14:36:06,195] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
[2010-06-10 14:36:06,197] DEBUG/10 ringMaster:487 - getServiceAddr service: <hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
[2010-06-10 14:36:06,198] DEBUG/10 ringMaster:504 - getServiceAddr addr hdfs: not found

... and so on, until it gives up

Any ideas why? One red flag is that when running the allocate command, some of the variables echo-ed back look dodgy:

--gridservice-hdfs.fs_port 0
--gridservice-hdfs.host localhost
--gridservice-hdfs.info_port 0

These are not what I specified in the hodrc. Are the port numbers just set to 0 because I am not using an external HDFS, or is this a problem?

The software versions involved are:
- Hadoop 0.20.2
- Python 2.5.2 (no Twisted)
- Java 1.6.0_20
- Torque 2.4.5

The hodrc file looks like this:

[hod]
stream = True
java-home = /opt/jdk1.6.0_20
cluster = debian5
cluster-factor = 1.8
xrs-port-range = 32768-65536
debug = 3
allocate-wait-time = 3600
temp-dir = /scratch/local/dmilne/hod

[ringmaster]
register = True
stream = False
temp-dir = /scratch/local/dmilne/hod
log-dir = /scratch/local/dmilne/hod/log
http-port-range = 8000-9000
idleness-limit = 864000
work-dirs /scratch/local/dmilne/hod/1,/scratch/local/dmilne/hod/2
xrs-port-range = 32768-65536
debug = 4

[hodring]
stream = False
temp-dir = /scratch/local/dmilne/hod
log-dir = /scratch/local/dmilne/hod/log
register = True
java-home = /opt/jdk1.6.0_20
http-port-range = 8000-9000
xrs-port-range = 32768-65536
debug = 4

[resource_manager]
queue = express
batch-home = /opt/torque-2.4.5
id = torque
options = l:pmem=3812M,W:X="NACCESSPOLICY:SINGLEJOB"
#env-vars HOD_PYTHON_HOME=/foo/bar/python-2.5.1/bin/python

[gridservice-mapred]
external = False
pkgs = /opt/hadoop-0.20.2
tracker_port = 8030
info_port = 50080

[gridservice-hdfs]
external = False
pkgs = /opt/hadoop-0.20.2
fs_port = 8020
info_port = 50070

Cheers,
Dave
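Editor's sketch, not from the thread: a quick way to sanity-check a hodrc like the one above before handing it to HOD. It is a standalone Python 3 script and does not use any HOD code; the function name `check_hodrc` and the rule that every setting must look like "key = value" are assumptions made for illustration. It flags two things visible in the posted file: the work-dirs line as it appears in the archive has no "=" (which may simply be an artifact of how the message was rendered), and the xrs-port-range upper bound of 65536 is one past the highest valid TCP port, 65535. Nothing in the thread establishes that either is the cause of the failure.

```python
import re

# Hypothetical helper, not part of HOD: a standalone sanity check for
# hodrc-style text. Assumes every setting should look like "key = value".
PORT_RANGE_KEYS = ("xrs-port-range", "http-port-range")  # keys seen in the hodrc above


def check_hodrc(text):
    """Return a list of (line_number, message) warnings."""
    warnings = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        # Skip blanks, comments and [section] headers.
        if not line or line.startswith("#") or line.startswith("["):
            continue
        if "=" not in line:
            warnings.append((number, "no '=' found: %r" % line))
            continue
        key, value = [part.strip() for part in line.split("=", 1)]
        if key in PORT_RANGE_KEYS:
            match = re.match(r"^(\d+)-(\d+)$", value)
            if not match:
                warnings.append((number, "unparseable port range: %r" % value))
                continue
            low, high = int(match.group(1)), int(match.group(2))
            # Valid TCP ports are 1..65535, and the range must not be reversed.
            if low < 1 or high > 65535 or low > high:
                warnings.append(
                    (number, "port range %d-%d is reversed or outside 1-65535" % (low, high))
                )
    return warnings


sample = """[ringmaster]
http-port-range = 8000-9000
work-dirs /scratch/local/dmilne/hod/1,/scratch/local/dmilne/hod/2
xrs-port-range = 32768-65536
"""

for number, message in check_hodrc(sample):
    print("line %d: %s" % (number, message))
```

To check a real file, read it with `open(path).read()` and pass the text to `check_hodrc`; the path is whatever you normally give HOD as its configuration file.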
-
Re: Problems with HOD and HDFS
David Milne 2010-06-14, 02:33
Anybody? I am completely stuck here. I have no idea who else I can ask or where I can go for more information. Is there somewhere specific where I should be asking about HOD?

Thank you,
Dave
-
Re: Problems with HOD and HDFS
Jeff Hammerbacher 2010-06-14, 02:39
Hey Dave,
I can't speak for the folks at Yahoo!, but from watching the JIRA, I don't think HOD is actively used or developed anywhere these days. You're attempting to use a mostly deprecated project, and hence not receiving any support on the mailing list.

Thanks,
Jeff
-
Re: Problems with HOD and HDFS
David Milne 2010-06-14, 04:21
Ok, thanks Jeff.
This is pretty surprising though. I would have thought many people would be in my position, where they have to use Hadoop on a general purpose cluster, and need it to play nice with a resource manager? What do other people do in this position, if they don't use HOD? Deprecated normally means there is a better alternative.

- Dave
-
Re: Problems with HOD and HDFS
Vinod KV 2010-06-14, 05:52
On Monday 14 June 2010 08:03 AM, David Milne wrote:
> Anybody? I am completely stuck here. I have no idea who else I can ask
> or where I can go for more information. Is there somewhere specific
> where I should be asking about HOD?
>
> Thank you,
> Dave

In the ringmaster logs, you should see which node was supposed to run the Namenode. This can be found above the logs that you've printed. I can barely remember, but I guess it reads something like getCommand(). Once you find out the node, check the hodring logs there; something must have gone wrong on that node.

The return code was 7, indicating HDFS failure. See http://hadoop.apache.org/common/docs/r0.20.0/hod_user_guide.html#The+Exit+Codes+For+HOD+Are+Not+Getting+Into+Torque and check whether you are hitting one of the problems listed there.

HTH,
+vinod
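Editor's sketch, not from the thread: the advice above amounts to scanning a long, mostly DEBUG-level ringmaster log for the few entries that matter. A minimal Python 3 filter for that is shown below. The line layout in the regular expression is taken from the log excerpts quoted in this thread; the function name `interesting_entries` and its parameters are hypothetical, and the keyword to search for (Vinod recalls something like getCommand, without being sure) is left to the caller.

```python
import re

# Line shape taken from the logs quoted in this thread, for example:
#   [2010-06-10 14:40:22,650] CRITICAL/50 hadoop:298 - Failed to retrieve ...
LOG_LINE = re.compile(
    r"^\[(?P<time>[^\]]+)\]\s+(?P<level>[A-Z]+)/(?P<num>\d+)\s+"
    r"(?P<source>[\w.]+):(?P<lineno>\d+)\s+-\s+(?P<message>.*)$"
)


def interesting_entries(lines, min_level=20, keywords=()):
    """Yield parsed entries at or above min_level, or whose message has a keyword."""
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # blank and continuation lines are ignored in this sketch
        entry = match.groupdict()
        if int(entry["num"]) >= min_level or any(k in entry["message"] for k in keywords):
            yield entry


sample = [
    "[2010-06-10 14:36:05,144] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs",
    "[2010-06-10 14:36:05,147] DEBUG/10 ringMaster:504 - getServiceAddr addr hdfs: not found",
    "[2010-06-10 14:40:22,650] CRITICAL/50 hadoop:298 - Failed to retrieve 'hdfs' service address.",
]

for entry in interesting_entries(sample, keywords=("not found",)):
    print(entry["time"], entry["level"], entry["source"], entry["message"])
```

For a real log, pass an open file object instead of the sample list, since iterating a file yields its lines. The same filter can then be run against the hodring log on whichever node turns out to have been assigned the Namenode.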
-
Re: Problems with HOD and HDFS
Vinod KV 2010-06-14, 05:57
On Monday 14 June 2010 09:51 AM, David Milne wrote:
> Ok, thanks Jeff.
>
> This is pretty surprising though. I would have thought many people
> would be in my position, where they have to use Hadoop on a general
> purpose cluster, and need it to play nice with a resource manager?
> What do other people do in this position, if they don't use HOD?
> Deprecated normally means there is a better alternative.
>
> - Dave

It isn't formally deprecated, though. Maybe we'll need to do that explicitly; it would help us put up proper documentation about what to use instead.

The quick answer is that you start a static cluster on a set of nodes. A static cluster means bringing up the Hadoop daemons on a set of nodes using the startup scripts distributed in the bin/ directory.

That said, there are no changes to HOD in 0.21 and beyond. Deploying 0.21 clusters should mostly work out of the box. Beyond 0.21 it may not work, because HOD would need to be updated for the removed or changed Hadoop-specific configuration parameters and environment variables that it generates itself.

HTH,
+vinod
-
Re: Problems with HOD and HDFS
Amr Awadallah 2010-06-14, 12:37
Dave,
Yes, many others are in the same situation. The recommended solution is to use either the Fair Share Scheduler or the Capacity Scheduler. These schedulers are much better than HOD since they take data locality into consideration (they don't just spin up 20 TaskTracker (TT) nodes on machines that have nothing to do with your data). They also don't lock down the nodes just for you, so as TTs are freed other jobs can use them immediately (as opposed to nobody being able to use them until your entire job is done).

Also, if you are brave and want to try something spanking new, then I recommend you reach out to the Mesos guys; they have a scheduler layer under Hadoop that is data-locality aware:

http://mesos.berkeley.edu/

-- amr
-
Re: Problems with HOD and HDFS
Edward Capriolo 2010-06-14, 15:26
I have not used it much, but I think HOD is pretty cool. I guess most people who are looking to (spin up, run job, transfer off, spin down) are using EC2. HOD does something like making private Hadoop clouds on your own hardware, and many people probably do not have that use case. As schedulers advance and get better HOD becomes less attractive, but I can always see a place for it.
-
Re: Problems with HOD and HDFS
Steve Loughran 2010-06-14, 15:49
Edward Capriolo wrote:
> I have not used it much, but I think HOD is pretty cool. I guess most people
> who are looking to (spin up, run job ,transfer off, spin down) are using
> EC2. HOD does something like make private hadoop clouds on your hardware and
> many probably do not have that use case. As schedulers advance and get
> better HOD becomes less attractive, but I can always see a place for it.

I don't know who is using it, or maintaining it; we've been bringing up short-lived Hadoop clusters differently. I think I should write a little article on the topic; I presented about it at Berlin Buzzwords last week.

Short-lived Hadoop clusters on VMs are fine if you don't have enough data or CPU load to justify a set of dedicated physical machines, and are a good way of experimenting with Hadoop at scale. You can maybe lock down the network better too, though that depends on your VM infrastructure.

Where VMs are weak is in disk IO performance, but there's no reason why the VM infrastructure can't take a list of filenames/directories as a hint for VM placement (placement is the new scheduling, incidentally), and virtualized IO can only improve. If you can run Hadoop MapReduce directly against SAN-mounted storage then you can stop worrying about locality of data and still gain from parallelisation of the operations.

-steve
-
Re: Problems with HOD and HDFS
David Milne 2010-06-14, 22:49
Thanks everyone for your replies.
Even though HOD looks like a dead-end I would prefer to use it. I am just one user of the cluster among many, and currently the only one using Hadoop. The jobs I need to run are pretty much one-off: they are big jobs that I can't do without Hadoop, but I might need to run them once a month or less. The ability to provision MapReduce and HDFS when I need it sounds ideal.

Following Vinod's advice, I have rolled back to Hadoop 0.20.1 (the last version that HOD kept up with) and taken a closer look at the ringmaster logs. However, I am still getting the same problems as before, and I can't find anything in the logs to help me identify the NameNode.

The full ringmaster log is below. It's a pretty repetitive song, so I've identified the chorus.

[2010-06-15 10:07:40,236] DEBUG/10 ringMaster:569 - Getting service ID.
[2010-06-15 10:07:40,237] DEBUG/10 ringMaster:573 - Got service ID: 34350.symphony.cs.waikato.ac.nz
[2010-06-15 10:07:40,239] DEBUG/10 ringMaster:756 - Command to execute: /bin/cp /home/dmilne/hadoop/hadoop-0.20.1.tar.gz /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster
[2010-06-15 10:07:42,314] DEBUG/10 ringMaster:762 - Completed command execution. Exit Code: 0.
[2010-06-15 10:07:42,315] DEBUG/10 ringMaster:591 - Service registry @ http://symphony.cs.waikato.ac.nz:36372
[2010-06-15 10:07:47,503] DEBUG/10 ringMaster:726 - tarball name : /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1.tar.gz hadoop package name : hadoop-0.20.1/
[2010-06-15 10:07:47,505] DEBUG/10 ringMaster:716 - Returning Hadoop directory as: /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/
[2010-06-15 10:07:47,515] DEBUG/10 util:215 - Executing command /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/bin/hadoop version to find hadoop version
[2010-06-15 10:07:48,241] DEBUG/10 util:224 - Version from hadoop command: Hadoop 0.20.1
[2010-06-15 10:07:48,244] DEBUG/10 ringMaster:117 - Using max-connect value 30
[2010-06-15 10:07:48,246] INFO/20 ringMaster:61 - Twisted interface not found. Using hodXMLRPCServer.
[2010-06-15 10:07:48,257] DEBUG/10 ringMaster:73 - Ringmaster RPC Server at 33771
[2010-06-15 10:07:48,265] DEBUG/10 ringMaster:121 - registering: http://cn71:8030/hadoop-0.20.1.tar.gz
[2010-06-15 10:07:48,275] DEBUG/10 ringMaster:658 - dmilne 34350.symphony.cs.waikato.ac.nz cn71.symphony.cs.waikato.ac.nz ringmaster hod
[2010-06-15 10:07:48,307] DEBUG/10 ringMaster:670 - Registered with serivce registry: http://symphony.cs.waikato.ac.nz:36372.

//chorus start
[2010-06-15 10:07:48,393] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
[2010-06-15 10:07:48,394] DEBUG/10 ringMaster:487 - getServiceAddr service: <hodlib.GridServices.hdfs.Hdfs instance at 0xc9e050>
[2010-06-15 10:07:48,395] DEBUG/10 ringMaster:504 - getServiceAddr addr hdfs: not found
//chorus end

//chorus (3x)

[2010-06-15 10:07:51,461] DEBUG/10 ringMaster:726 - tarball name : /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1.tar.gz hadoop package name : hadoop-0.20.1/
[2010-06-15 10:07:51,463] DEBUG/10 ringMaster:716 - Returning Hadoop directory as: /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/
[2010-06-15 10:07:51,465] DEBUG/10 ringMaster:690 - hadoopdir=/scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/, java-home=/opt/jdk1.6.0_20
[2010-06-15 10:07:51,470] DEBUG/10 util:215 - Executing command /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/bin/hadoop version to find hadoop version

//chorus (1x)

[2010-06-15 10:07:52,448] DEBUG/10 util:224 - Version from hadoop command: Hadoop 0.20.1
[2010-06-15 10:07:52,450] DEBUG/10 ringMaster:697 - starting jt monitor
[2010-06-15 10:07:52,453] DEBUG/10 ringMaster:913 - Entered start method.
[2010-06-15 10:07:52,455] DEBUG/10 ringMaster:924 - /home/dmilne/hadoop/hadoop-0.20.1/contrib/hod/bin/hodring 8000-9000 --hodring.userid dmilne --hodring.java-home /opt/jdk1.6.0_20
[2010-06-15 10:07:52,456] DEBUG/10 ringMaster:479 - getServiceAddr name: mapred
[2010-06-15 10:07:52,458] DEBUG/10 ringMaster:487 - getServiceAddr service: <hodlib.GridServices.mapred.MapReduce instance at 0xc9e098>
[2010-06-15 10:07:52,460] DEBUG/10 ringMaster:504 - getServiceAddr addr mapred: not found
[2010-06-15 10:07:52,470] DEBUG/10 torque:147 - pbsdsh command: /opt/torque-2.4.5/bin/pbsdsh /home/dmilne/hadoop/hadoop-0.20.1/contrib/hod/bin/hodring 8000-9000 --hodring.userid dmilne --hodring.java-home /opt/jdk1.6.0_20
[2010-06-15 10:07:52,475] DEBUG/10 ringMaster:929 - Returned from runWorkers.

//chorus (many times)

[2010-06-15 10:12:02,852] DEBUG/10 ringMaster:530 - inside xml-rpc call to stop ringmaster
[2010-06-15 10:12:02,853] DEBUG/10 ringMaster:976 - RingMaster stop method invoked.
[2010-06-15 10:12:02,854] DEBUG/10 ringMaster:981 - finding exit code

//chorus (1x)

[2010-06-15 10:12:02,858] DEBUG/10 ringMaster:533 - returning from xml-rpc call to stop ringmaster
[2010-06-15 10:12:02,859] DEBUG/10 ringMaster:949 - exit code 7
[2010-06-15 10:12:02,859] DEBUG/10 ringMaster:983 - stopping ringmaster instance
[2010-06-15 10:12:03,420] DEBUG/10 ringMaster:479 - getServiceAddr name: mapred
[2010-06-15 10:12:03,421] DEBUG/10 ringMaster:487 - getServiceAddr service: <hodlib.GridServices.mapred.MapReduce instance at 0xc9e098>
[2010-06-15 10:12:03,422] DEBUG/10 ringMaster:504 - getServiceAddr addr mapred: not found
[2010-06-15 10:12:03,852] DEBUG/10 idleJobTracker:79 - Joining the monitoring thread.
[2010-06-15 10:12:03,853] DEBUG/10 idleJobTracker:83 - Joined the monitoring thread.
[2010-06-15 10:12:04,442] DEBUG/10 ringMaster:793 - Cleaned up temporary dir: /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster
[2010-06-15 10:12:04,477] DEBUG/10 ringMaster:976 - RingMas
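A log this repetitive can be collapsed mechanically rather than by hand. The following is only a rough sketch and not part of HOD: it assumes the entry layout visible in the log above (`[timestamp] LEVEL/num module:line - message`), and the names `parse_entries` and `summarize` are invented for illustration.

```python
import re
from collections import Counter

# Assumed entry layout, inferred from the ringmaster log above, e.g.
# [2010-06-15 10:07:48,393] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
ENTRY = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\] "
    r"(?P<level>[A-Z]+)/\d+ (?P<module>\w+):(?P<line>\d+) - "
)


def parse_entries(text):
    """Split raw log text into (timestamp, level, module, message) tuples."""
    matches = list(ENTRY.finditer(text))
    entries = []
    for i, m in enumerate(matches):
        # A message runs until the next entry header (or the end of the text),
        # so this also copes with entries that were flattened onto one line.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        message = " ".join(text[m.end():end].split())
        entries.append((m["ts"], m["level"], m["module"], message))
    return entries


def summarize(text):
    """Count failed service lookups and keep every non-lookup message."""
    not_found = Counter()
    others = []
    for ts, _level, _module, message in parse_entries(text):
        hit = re.fullmatch(r"getServiceAddr addr (\w+): not found", message)
        if hit:
            not_found[hit.group(1)] += 1
        elif not message.startswith("getServiceAddr"):
            others.append((ts, message))
    return not_found, others
```

Run over the full ringmaster log, `summarize` would report how many times each service lookup failed and leave only the entries between the choruses to read.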
-
Re: Problems with HOD and HDFS
David Milne 2010-06-14, 23:45
Unless I am missing something, the Fair Share and Capacity schedulers sound like a solution to a different problem: aren't they for a dedicated Hadoop cluster that needs to be shared by lots of people? I have a general purpose cluster that needs to be shared by lots of people. Only one of them (me) wants to run hadoop, and only wants to run it intermittently. I'm not concerned with data locality, as my workflow is:

1) upload data I need to process to cluster
2) run a chain of map-reduce tasks
3) grab processed data from cluster
4) clean up cluster

Mesos sounds good, but I am definitely NOT brave about this. As I said, I am just one user of the cluster among many. I would want to stick with Torque and Maui for resource management.

- Dave

On Tue, Jun 15, 2010 at 12:37 AM, Amr Awadallah <[EMAIL PROTECTED]> wrote:
> Dave,
>
> Yes, many others have the same situation, the recommended solution is
> either to use the Fair Share Scheduler or the Capacity Scheduler. These
> schedulers are much better than HOD since they take data locality into
> consideration (they don't just spin up 20 TT nodes on machines that have
> nothing to do with your data). They also don't lock down the nodes just for
> you, so as TT are freed other jobs can use them immediately (as opposed to
> no body can use them till your entire job is done).
>
> Also, if you are brave and want to try something spanking new, then I
> recommend you reach out to the Mesos guys, they have a scheduler layer under
> Hadoop that is data locality aware:
>
> http://mesos.berkeley.edu/
>
> -- amr
>
> On Sun, Jun 13, 2010 at 9:21 PM, David Milne <[EMAIL PROTECTED]> wrote:
>
>> Ok, thanks Jeff.
>>
>> This is pretty surprising though. I would have thought many people
>> would be in my position, where they have to use Hadoop on a general
>> purpose cluster, and need it to play nice with a resource manager?
>> What do other people do in this position, if they don't use HOD?
>> Deprecated normally means there is a better alternative.
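The four-step workflow in this message maps onto a handful of Hadoop command-line calls. Here is a minimal sketch, assuming a 0.20-era `hadoop` launcher on the PATH and an already allocated cluster; the paths, jar names and the `workflow_commands` helper are placeholders made up for illustration, and it assumes each job's main class takes an input and an output directory as its two arguments. The commands are only built here, not run.

```python
def workflow_commands(local_in, hdfs_work, local_out, jobs, conf_dir=None):
    """Build command lines for: upload, run a job chain, download, clean up.

    jobs is a list of (jar, main_class) pairs. Each job reads the previous
    step's output directory and writes a fresh one under hdfs_work.
    """
    # "hadoop --config <dir>" points the launcher at a specific cluster config.
    base = ["hadoop"] + (["--config", conf_dir] if conf_dir else [])

    current = f"{hdfs_work}/input"
    cmds = [base + ["fs", "-put", local_in, current]]        # 1) upload data

    for step, (jar, main_class) in enumerate(jobs, start=1):  # 2) run the chain
        output = f"{hdfs_work}/step{step}"
        cmds.append(base + ["jar", jar, main_class, current, output])
        current = output

    cmds.append(base + ["fs", "-get", current, local_out])   # 3) fetch results
    cmds.append(base + ["fs", "-rmr", hdfs_work])            # 4) clean up
    return cmds
```

Each returned list could be handed to `subprocess.run(cmd, check=True)` in order, so that a failing job stops the chain before the cleanup step.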
-
Re: Problems with HOD and HDFS
David Milne 2010-06-15, 00:04
Is there something else I could read about setting up short-lived Hadoop clusters on virtual machines? I have no experience with VMs at all. I see there is quite a bit of material about using them to get Hadoop up and running with a pseudo-cluster on a single machine, but I don't follow how this stretches out to using multiple machines allocated by Torque.

Thanks,
Dave

On Tue, Jun 15, 2010 at 3:49 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:
> Edward Capriolo wrote:
>
>> I have not used it much, but I think HOD is pretty cool. I guess most
>> people who are looking to (spin up, run job, transfer off, spin down)
>> are using EC2. HOD does something like make private hadoop clouds on
>> your hardware and many probably do not have that use case. As schedulers
>> advance and get better HOD becomes less attractive, but I can always see
>> a place for it.
>
> I don't know who is using it, or maintaining it; we've been bringing up
> short-lived Hadoop clusters differently.
>
> I think I should write a little article on the topic; I presented about it
> at Berlin Buzzwords last week.
>
> Short lived Hadoop clusters on VMs are fine if you don't have enough data or
> CPU load to justify a set of dedicated physical machines, and is a good way
> of experimenting with Hadoop at scale. You can maybe lock down the network
> better too, though that depends on your VM infrastructure.
>
> Where VMs are weak is in disk IO performance, but there's no reason why the
> VM infrastructure can't take a list of filenames/directories as a hint for
> VM placement (placement is the new scheduling, incidentally), and
> virtualized IO can only improve. If you can run Hadoop MapReduce directly
> against SAN-mounted storage then you can stop worrying about locality of
> data and still gain from parallelisation of the operations.
>
> -steve
-
Re: Problems with HOD and HDFS
Vinod KV 2010-06-15, 08:10
On Tuesday 15 June 2010 04:19 AM, David Milne wrote:
> [2010-06-15 10:07:52,470] DEBUG/10 torque:147 - pbsdsh command:
> /opt/torque-2.4.5/bin/pbsdsh
> /home/dmilne/hadoop/hadoop-0.20.1/contrib/hod/bin/hodring
> --hodring.tarball-retry-initial-time 1.0
> --hodring.cmd-retry-initial-time 2.0 --hodring.cmd-retry-interval 2.0
> --hodring.service-id 34350.symphony.cs.waikato.ac.nz
> --hodring.temp-dir /scratch/local/dmilne/hod --hodring.http-port-range
> 8000-9000 --hodring.userid dmilne --hodring.java-home /opt/jdk1.6.0_20
> --hodring.svcrgy-addr symphony.cs.waikato.ac.nz:36372
> --hodring.download-addr h:t --hodring.tarball-retry-interval 3.0
> --hodring.log-dir /scratch/local/dmilne/hod/log
> --hodring.mapred-system-dir-root /mapredsystem
> --hodring.xrs-port-range 32768-65536 --hodring.debug 4
> --hodring.ringmaster-xrs-addr cn71:33771 --hodring.register
> [2010-06-15 10:07:52,475] DEBUG/10 ringMaster:929 - Returned from runWorkers.
>
> //chorus (many times)

Did you mean the pbsdsh command itself was printed many times above? That should not happen. I previously thought the hodrings could not start the namenode, but it looks like the hodrings themselves failed to start up. You can do two things:

- Check the qstat output, log into the slave nodes where your job was supposed to start, and look at the hodring logs there.
- Run the above hodring command yourself directly on these slave nodes and see if it fails with some error.

+Vinod
-
Re: Problems with HOD and HDFS
Steve Loughran 2010-06-15, 10:08
David Milne wrote:
> Is there something else I could read about setting up short-lived
> Hadoop clusters on virtual machines? I have no experience with VMs at
> all. I see there is quite a bit of material about using them to get
> Hadoop up and running with a pseudo-cluster on a single machine, but I
> don't follow how this stretches out to using multiple machines
> allocated by Torque.

My slides are up here: http://www.slideshare.net/steve_l/farming-hadoop-inthecloud

We've been bringing up Hadoop in a virtual infrastructure. First you ask for the master node containing a NN, a JT and a DN with almost no storage (just enough for the filesystem to go live, to stop the JT blocking). If it comes up, you then have a stable hostname for the filesystem, which you can use for all the real worker nodes (DN + TT) you want.

Some nearby physicists are trying to get Hadoop to co-exist with the grid schedulers. I've added a feature request to make the reporting of task tracker slots something plugins can handle, so that you'd have a set of Hadoop workers which could be used by the grid apps or by Hadoop, with physical Hadoop storage. When they were doing work scheduled outside Hadoop, they'd report less availability to the Job Tracker, so as not to overload the machines.

Dan Templeton of Sun/Oracle has been working on getting Hadoop to coexist with his resource manager; he's worth contacting. Maybe we could persuade him to give a public online talk on the topic.

-steve
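The ordering described above (bring up the master, confirm it is reachable, then point the workers at its hostname) can be illustrated with a simple readiness check. This is a sketch under assumptions, not the tooling from the slides: `wait_for_port` is a hypothetical helper, and the host name and port 9000 in the usage note are placeholders.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once host:port accepts a TCP connection, False on timeout.

    An open port only shows that the process is listening. It does not prove
    that the NameNode has left safe mode, so treat this as a first gate only.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A deployment script might call `wait_for_port("master.example", 9000)` and only request the worker nodes (DN + TT) when it returns True.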
-
Re: Problems with HOD and HDFS
Jason Stowe 2010-06-15, 19:10
Hi David,
The original HOD project was integrated with Condor (http://bit.ly/CondorProject), which Yahoo! was using to schedule clusters.

A year or two ago, the Condor project, in addition to being open-source w/o costs for licensing, created close integration with Hadoop (as does SGE), as presented by me at a prior Hadoop World, and by the Condor team at Condor Week 2010:
http://bit.ly/Condor_Hadoop_CondorWeek2010

My company has solutions for deploying Hadoop clusters on shared infrastructure using CycleServer and schedulers like Condor/SGE/etc. The general deployment strategy is to deploy head nodes (Name/Job Tracker), then execute nodes, and to be careful about how you deal with data/sizing/replication counts.

If you're interested in this, please feel free to drop us a line at my e-mail or http://cyclecomputing.com/about/contact

Thanks,
Jason

On Mon, Jun 14, 2010 at 7:45 PM, David Milne <[EMAIL PROTECTED]> wrote:
> Unless I am missing something, the Fair Share and Capacity schedulers
> sound like a solution to a different problem: aren't they for a
> dedicated Hadoop cluster that needs to be shared by lots of people? I
> have a general purpose cluster that needs to be shared by lots of
> people. Only one of them (me) wants to run hadoop, and only wants to
> run it intermittently. I'm not concerned with data locality, as my
> workflow is:
>
> 1) upload data I need to process to cluster
> 2) run a chain of map-reduce tasks
> 3) grab processed data from cluster
> 4) clean up cluster
>
> Mesos sounds good, but I am definitely NOT brave about this. As I
> said, I am just one user of the cluster among many. I would want to
> stick with Torque and Maui for resource management.
>
> - Dave
>
> On Tue, Jun 15, 2010 at 12:37 AM, Amr Awadallah <[EMAIL PROTECTED]> wrote:
> > Dave,
> >
> > Yes, many others have the same situation, the recommended solution is
> > either to use the Fair Share Scheduler or the Capacity Scheduler.
=================================
Jason A. Stowe
cell: 607.227.9686
main: 888.292.5320

http://twitter.com/jasonastowe/
http://twitter.com/cyclecomputing/

Cycle Computing, LLC
Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com
-
Re: Problems with HOD and HDFS
Edward Capriolo 2010-06-15, 20:47
On Tue, Jun 15, 2010 at 3:10 PM, Jason Stowe <[EMAIL PROTECTED]> wrote:
> Hi David, > The original HOD project was integrated with Condor ( > http://bit.ly/CondorProject), which Yahoo! was using to schedule clusters. > > A year or two ago, the Condor project in addition to being open-source w/o > costs for licensing, created close integration with Hadoop (as does SGE), > as > presented by me at a prior Hadoop World, and the Condor team at Condor Week > 2010: > http://bit.ly/Condor_Hadoop_CondorWeek2010 > > My company has solutions for deploying Hadoop Clusters on shared > infrastructure using CycleServer and schedulers like Condor/SGE/etc. The > general deployment strategy is to deploy head nodes (Name/Job Tracker), > then > execute nodes, and to be careful about how you deal with > data/sizing/replication counts. > > If you're interested in this, please feel free to drop us a line at my > e-mail or http://cyclecomputing.com/about/contact > > Thanks, > Jason > > > On Mon, Jun 14, 2010 at 7:45 PM, David Milne <[EMAIL PROTECTED]> wrote: > > > Unless I am missing something, the Fair Share and Capacity schedulers > > sound like a solution to a different problem: aren't they for a > > dedicated Hadoop cluster that needs to be shared by lots of people? I > > have a general purpose cluster that needs to be shared by lots of > > people. Only one of them (me) wants to run hadoop, and only wants to > > run it intermittently. I'm not concerned with data locality, as my > > workflow is: > > > > 1) upload data I need to process to cluster > > 2) run a chain of map-reduce tasks > > 3) grab processed data from cluster > > 4) clean up cluster > > > > Mesos sounds good, but I am definitely NOT brave about this. As I > > said, I am just one user of the cluster among many. I would want to > > stick with Torque and Maui for resource management. 
> > > > - Dave > > > > On Tue, Jun 15, 2010 at 12:37 AM, Amr Awadallah <[EMAIL PROTECTED]> > wrote: > > > Dave, > > > > > > Yes, many others have the same situation, the recommended solution is > > > either to use the Fair Share Scheduler or the Capacity Scheduler. These > > > schedulers are much better than HOD since they take data locality into > > > consideration (they don't just spin up 20 TT nodes on machines that > have > > > nothing to do with your data). They also don't lock down the nodes just > > for > > > you, so as TT are freed other jobs can use them immediately (as opposed > > to > > > no body can use them till your entire job is done). > > > > > > Also, if you are brave and want to try something spanking new, then I > > > recommend you reach out to the Mesos guys, they have a scheduler layer > > under > > > Hadoop that is data locality aware: > > > > > > http://mesos.berkeley.edu/ > > > > > > -- amr > > > > > > On Sun, Jun 13, 2010 at 9:21 PM, David Milne <[EMAIL PROTECTED]> > > wrote: > > > > > >> Ok, thanks Jeff. > > >> > > >> This is pretty surprising though. I would have thought many people > > >> would be in my position, where they have to use Hadoop on a general > > >> purpose cluster, and need it to play nice with a resource manager? > > >> What do other people do in this position, if they don't use HOD? > > >> Deprecated normally means there is a better alternative. > > >> > > >> - Dave > > >> > > >> On Mon, Jun 14, 2010 at 2:39 PM, Jeff Hammerbacher < > [EMAIL PROTECTED] > > > > > >> wrote: > > >> > Hey Dave, > > >> > > > >> > I can't speak for the folks at Yahoo!, but from watching the JIRA, I > > >> don't > > >> > think HOD is actively used or developed anywhere these days. You're > > >> > attempting to use a mostly deprecated project, and hence not > receiving > > >> any > > >> > support on the mailing list. 
> > >> >
> > >> > Thanks,
> > >> > Jeff
> > >> >
> > >> > On Sun, Jun 13, 2010 at 7:33 PM, David Milne <[EMAIL PROTECTED]> wrote:
> > >> >
> > >> >> Anybody? I am completely stuck here. I have no idea who else I can ask
> > >> >> or where I can go for more information. Is there somewhere specific

allocated by Torque.

Hadoop does not have a concept of virtual hosting. The NameNode has a port, the JobTracker has a port, the DataNode uses a port and has a port for the web interface, and the TaskTracker is the same deal.

Running multiple copies of Hadoop on the same machine is "easy". All you have to do is make sure they do not step on each other. Make sure they do not write to the same folder locations, and make sure they do not use the same ports.

Single setup
NameNode: 9000 Web: 50070
JobTracker: 1000 Web: 50030
...

Multi setup
Setup 1
NameNode: 9001 Web: 50071
JobTracker: 1001 Web: 50031
...
Setup 2
NameNode: 9002 Web: 50072
JobTracker: 1002 Web: 50032
...

HOD is supposed to handle the "dirty" work for you: building configuration files, installing Hadoop on the nodes, and starting the Hadoop components. You could theoretically accomplish similar things with remote SSH keys and a boatload of scripting.

HOD is a deployment and management tool. It sounds like it may not meet your need. Is your goal to deploy and manage just one instance of Hadoop, or multiple instances? HOD is designed to install multiple instances of Hadoop on a single set of hardware. It sounds like you want to deploy one cluster per group of VMs, which is not really the same thing.
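The port scheme above amounts to adding a per-instance offset to each base port and giving every instance its own directories. A minimal sketch follows, assuming the Hadoop 0.20 property names `fs.default.name`, `dfs.http.address`, `mapred.job.tracker` and `mapred.job.tracker.http.address`. The base ports are the ones from the example above; `instance_config`, `check_no_clashes` and the directory layout are hypothetical.

```python
# Base ports taken from the "Single setup" example above. Note that ports
# below 1024 (such as 1000) normally require root privileges to bind.
BASE_PORTS = {
    "namenode": 9000,
    "namenode_web": 50070,
    "jobtracker": 1000,
    "jobtracker_web": 50030,
}


def instance_config(index, host="localhost", root="/tmp/hadoop-instances"):
    """Return config properties for instance number `index` (1, 2, ...).

    Only the four ports from the example are covered. DataNode and
    TaskTracker ports would need the same offset treatment.
    """
    ports = {name: base + index for name, base in BASE_PORTS.items()}
    return {
        "fs.default.name": f"hdfs://{host}:{ports['namenode']}",
        "dfs.http.address": f"0.0.0.0:{ports['namenode_web']}",
        "mapred.job.tracker": f"{host}:{ports['jobtracker']}",
        "mapred.job.tracker.http.address": f"0.0.0.0:{ports['jobtracker_web']}",
        # Hypothetical layout: one private working directory per instance.
        "hadoop.tmp.dir": f"{root}/instance{index}/tmp",
    }


def check_no_clashes(configs):
    """Raise ValueError if two instances share a port or a directory."""
    seen = set()
    for cfg in configs:
        for value in cfg.values():
            # Keep only the port of "host:port" values; paths have no colon.
            token = value.rsplit(":", 1)[-1]
            if token in seen:
                raise ValueError(f"clash on {token}")
            seen.add(token)
```

Calling `instance_config(1)` and `instance_config(2)` reproduces the "Setup 1" and "Setup 2" port numbers, and `check_no_clashes` is a cheap guard to run before writing the values into each instance's configuration files.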