Pradeep Fernando
2009-02-25, 09:11
Enis Soztutar
2009-02-25, 10:06
Steve Loughran
2009-02-25, 12:42
Pradeep Fernando
2009-02-25, 17:00
Raghu Angadi
2009-02-26, 00:00
Pradeep Fernando
2009-02-26, 03:53
Ajit Ratnaparkhi
2009-02-26, 07:23
lohit
2009-02-26, 08:10
Steve Loughran
2009-02-26, 11:53
Raghu Angadi
2009-02-26, 18:43
Ajit Ratnaparkhi
2009-03-01, 08:06
Bharat Jain
2009-03-02, 20:04
Raghu Angadi
2009-03-04, 19:08
Jakob Homan
2009-03-04, 19:41
-
Contributing to hadoop
Pradeep Fernando 2009-02-25, 09:11
hello devs,

I'm a newbie to the Hadoop world and still reading the documentation to get a better understanding of Hadoop. But I'm really interested in this project and its community, and I'm willing to contribute to it as a developer in the future.

I saw that Hadoop can be set up on a single node. My concern is: do you need a special kind of infrastructure in order to contribute to Hadoop (since it deals with a clustered environment)? If development needs an environment with more than one node, is there a workaround?

Can you devs enlighten me on this, please?

Thanks in advance,
Pradeep Fernando.
-
Re: Contributing to hadoop
Enis Soztutar 2009-02-25, 10:06
Please see below,

Pradeep Fernando wrote:
> I'm a newbie to the Hadoop world and still reading the documentation
> to get a better understanding of Hadoop. But I'm really interested in
> this project and its community, and I'm willing to contribute to it as
> a developer in the future.

Welcome.

> I saw that Hadoop can be set up on a single node. My concern is: do you
> need a special kind of infrastructure in order to contribute to Hadoop
> (since it deals with a clustered environment)? If development needs an
> environment with more than one node, is there a workaround?

Actually you can run Hadoop in local mode, distributed mode, or pseudo-distributed mode. In local mode, mapred jobs run in LocalJobRunner and the tasks run sequentially. In pseudo-distributed mode, you run the NN, JT, DN and TTs (NameNode, JobTracker, DataNode and TaskTrackers) on the same machine, so the jobs run as they do in distributed mode. Most of the mapred test cases run with this configuration. You can check the MiniDFSCluster and MiniMRCluster classes as a reference.
-
Re: Contributing to hadoop
Steve Loughran 2009-02-25, 12:42
Pradeep Fernando wrote:
> I saw that Hadoop can be set up on a single node. My concern is: do you
> need a special kind of infrastructure in order to contribute to Hadoop
> (since it deals with a clustered environment)? If development needs an
> environment with more than one node, is there a workaround?
>
> Can you devs enlighten me on this, please?

Everything works best on a Linux box on a network where DNS and reverse DNS work. The code is mostly in Java, with Ant for building and JUnit for testing.

The release/contribution process is more rigorous than anything I've seen in the OSS world: nobody gets to check in anything until Hudson is happy.

Have you worked on Apache projects before?
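One way to check the "DNS and reverse DNS" point on a Linux box is sketched below. It assumes `getent` is available (as on glibc-based systems); `dns_lookup` and `check_dns` are hypothetical helper names introduced here for illustration, not part of Hadoop.

```shell
#!/bin/sh
# Hypothetical helper: resolve a name or an address through the system resolver.
# `getent hosts` prints "<address> <canonical-name> [aliases...]".
dns_lookup() {
  getent hosts "$1"
}

# Hypothetical helper: check that a host name resolves to an address
# and that the address resolves back to some name.
check_dns() {
  host=$1
  ip=$(dns_lookup "$host" | awk 'NR==1 {print $1}')
  if [ -z "$ip" ]; then
    echo "forward lookup failed for $host"
    return 1
  fi
  name=$(dns_lookup "$ip" | awk 'NR==1 {print $2}')
  if [ -z "$name" ]; then
    echo "reverse lookup failed for $ip"
    return 1
  fi
  echo "$host -> $ip -> $name"
}
```

Something like `check_dns "$(hostname)"` would then report whether the local machine's name resolves in both directions.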
-
Re: Contributing to hadoop
Pradeep Fernando 2009-02-25, 17:00
Thanks Enis & Steve,

for your valuable guidelines. As I understand it, since Hadoop is an implementation of map-reduce, it is aimed at working in a clustered environment. So, having only a desktop and an Internet connection, I have doubts whether I can successfully contribute to the project.

> In local mode mapred jobs run in LocalJobRunner and the tasks run
> sequentially. In pseudo-distributed mode, you run NN, JT, DN and TTs on the
> same machine so the jobs run as they do in distributed mode. Most of
> the mapred test cases run with this configuration.

Although this sounds good, there are clouds like Amazon EC2 for setting up a cluster. Do you devs make use of that sort of infrastructure in development testing? I don't know whether this is the right question, or whether it is relevant at all. Please bear with me.

> Have you worked on Apache projects before?

Yes, I have contributed to the Apache Axis2 project, so I'm pretty familiar with Java, Ant, Maven, JUnit, etc.
-
Re: Contributing to hadoop
Raghu Angadi 2009-02-26, 00:00
Pradeep Fernando wrote:
> Although this sounds good, there are clouds like Amazon EC2 for setting up a
> cluster. Do you devs make use of that sort of infrastructure in development
> testing? I don't know whether this is the right question, or whether it is
> relevant at all. Please bear with me.

I guess you are asking if it would be more convenient if one had access to a larger cluster for development. I doubt it. At Y! I have access to many machines and clusters, but about 99% of my development happens using a single machine for testing. I would guess that is true for most of the Hadoop developers. It is just more convenient on a single machine. I run multiple datanodes on the same machine if required.

Real clusters are useful mainly to stress the system and to debug performance issues or hard-to-reproduce bugs.

Raghu.
-
Re: Contributing to hadoop
Pradeep Fernando 2009-02-26, 03:53
Raghu,

> I guess you are asking if it would be more convenient if one had access to a
> larger cluster for development.

Exactly.

> I have access to many machines and clusters, but about 99% of my
> development happens using a single machine for testing. I would guess that is
> true for most of the Hadoop developers.

Well, this is the answer I was looking for. :D
It seems I have enough resources to contribute to this project.
Thanks a lot, Raghu.

regards,
Pradeep Fernando.
-
Re: Contributing to hadoop
Ajit Ratnaparkhi 2009-02-26, 07:23
Raghu,

Can you please tell me how to run multiple datanodes on one machine?

thanks,
-Ajit.
-
Re: Contributing to hadoop
lohit 2009-02-26, 08:10
Ajit,

Another easy way to test your code is to write a test case. You could follow any of the existing HDFS test cases. For example, src/test/org/apache/hadoop/hdfs/server/namenode/TestFsck.java shows how to create a cluster with multiple DataNodes using MiniDFS, create files, and run fsck programmatically. If you want to test something, you make the changes and run the test. You can also run just one test on its own by executing 'ant -Dtestcase=TestFsck test'.

Lohit
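The single-test invocation above can be wrapped in a small helper. This is only a sketch: `run_one_test` and the `DRY_RUN` switch are hypothetical names added for illustration, and the only Ant syntax used is the `-Dtestcase=<name> test` form quoted in the message. It assumes you run it from the root of a Hadoop source checkout with Ant on the PATH.

```shell
#!/bin/sh
# Hypothetical wrapper around the command from the message:
#   ant -Dtestcase=<TestClassName> test
# With DRY_RUN set to a non-empty value it only prints the command.
run_one_test() {
  testcase=$1
  if [ -z "$testcase" ]; then
    echo "usage: run_one_test <TestClassName>" >&2
    return 2
  fi
  # Build the command as positional parameters so it can be printed or executed.
  set -- ant "-Dtestcase=$testcase" test
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

For example, `run_one_test TestFsck` would run only the fsck test, which keeps the edit-and-test loop short while working on a patch.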
-
Re: Contributing to hadoop
Steve Loughran 2009-02-26, 11:53
Pradeep Fernando wrote:
> As I understand it, since Hadoop is an implementation of map-reduce, it is
> aimed at working in a clustered environment. So, having only a desktop and
> an Internet connection, I have doubts whether I can successfully contribute
> to the project.

You can have some fun on a single machine; the algorithms are the same, just some of the problems involved in running jobs and managing machines are different.

> Although this sounds good, there are clouds like Amazon EC2 for setting up a
> cluster. Do you devs make use of that sort of infrastructure in development
> testing?

Tom White does. EC2 has some issues:
 * your test runs accrue debt, especially if you are setting up and tearing down machines regularly
 * the network is insecure

Hadoop could do with some improvement in network security, though there's no reason why AWS couldn't offer virtual VPNs.

> > Have you worked on Apache projects before?
>
> Yes, I have contributed to the Apache Axis2 project, so I'm pretty
> familiar with Java, Ant, Maven, JUnit, etc.

OK, if you worked on Axis2 then you'll know the basics, though be advised that the Hadoop commit process is much more rigorous.

I'd recommend you start playing with MR algorithms on any data you have to hand; if you want some interesting datasets, ask on the user mailing list and you will get some pointers. Go with a current release, and not SVN_HEAD, if you want stability in your life. Only if/when you want to make changes to the code should you go with SVN head.

-steve
-
Re: Contributing to hadoop
Raghu Angadi 2009-02-26, 18:43
You can run them with a small shell script. You need to override a couple of environment and config variables.

Something like:

    run_datanode () {
        DN=$2
        HADOOP_LOG_DIR=logs$DN
        HADOOP_PID_DIR=$HADOOP_LOG_DIR
        bin/hadoop-daemon.sh $1 datanode \
          -Dhadoop.tmp.dir=/some/dir/dfs$DN \
          -Ddfs.datanode.address=0.0.0.0:5001$DN \
          -Ddfs.datanode.http.address=0.0.0.0:5008$DN \
          -Ddfs.datanode.ipc.address=0.0.0.0:5002$DN
    }

You can start a second datanode like this: run_datanode start 2

Pretty useful for testing.

Raghu.

Ajit Ratnaparkhi wrote:
> Raghu,
>
> Can you please tell me how to run multiple datanodes on one machine?
-
Re: Contributing to hadoop
Ajit Ratnaparkhi 2009-03-01, 08:06
Hi,
thanks for your help.

I tried the script mentioned above (the one from Raghu), but whenever I execute it, the following message is displayed:
*datanode running as process <process_id>. Stop it first*.
I start the single-node cluster with the command bin/start-dfs.sh first, and then execute the script to start a second datanode.

I also tried supplying a separate, changed configuration from a separate config directory by executing the command
*bin/hadoop-daemons.sh --config <config-directory-path> start datanode*
It still gives the same message as above.

Also, earlier in this thread Ramya mentioned DataNodeCluster.java. This would help, but I cannot work out how to execute this class. Can you please help with this?

thanks,
-Ajit.
-
Re: Contributing to hadoop
Bharat Jain 2009-03-02, 20:04
Hi,

I have a question about how to go about contributing. I have 7+ exp and earlier worked on search-related tech like Lucene, Solr, crawlers, etc. while at AOL. I have set up Hadoop on a cluster before. Are there any issues or problems that I can start looking into to get hands-on? Basically, how do I go about doing some serious work?

Thanks
Bharat Jain
-
Re: Contributing to hadoop
Raghu Angadi 2009-03-04, 19:08
Ajit Ratnaparkhi wrote:
> I tried the script mentioned above (the one from Raghu), but whenever I
> execute it, the following message is displayed:
> *datanode running as process <process_id>. Stop it first*.
> I start the single-node cluster with the command bin/start-dfs.sh first,
> and then execute the script to start a second datanode.

Did you try to do what the error message asks you to? Better still, you should try to find where the message is coming from. I realize this is not a particularly useful reply for a user, but for a developer, I hope it is.

I just wrote the example script in the mail editor. I did not test it. Maybe 'export' is required before setting the HADOOP_* env variables in the script. Currently I use a different (a bit less elegant) method for starting multiple nodes. When I switch to this method, I will post the script.

Better still, post your script once you get it working.

Raghu.
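A minimal sketch of the 'export' variant hinted at above, assuming the rest of the posted function is kept as is. Without `export`, plain assignments inside the function stay local to the calling shell, so the child script would not see HADOOP_LOG_DIR or HADOOP_PID_DIR. That would probably make it fall back to its default pid directory and find the first datanode's pid file, which would explain the "Stop it first" message, though this has not been verified against a running cluster. The `DRY_RUN` switch is a hypothetical addition for inspecting the command without launching anything, and /some/dir is the placeholder path from the original script.

```shell
#!/bin/sh
# Sketch: start (or stop) an extra datanode with its own log/pid dirs and ports.
# Usage: run_datanode start 2
run_datanode() {
  action=$1
  dn=$2
  (
    # Exported inside a subshell: visible to the child script,
    # but not leaked into the calling shell.
    export HADOOP_LOG_DIR="logs$dn"
    export HADOOP_PID_DIR="$HADOOP_LOG_DIR"
    # Same options as in the posted script; /some/dir is a placeholder.
    set -- bin/hadoop-daemon.sh "$action" datanode \
      -Dhadoop.tmp.dir="/some/dir/dfs$dn" \
      -Ddfs.datanode.address="0.0.0.0:5001$dn" \
      -Ddfs.datanode.http.address="0.0.0.0:5008$dn" \
      -Ddfs.datanode.ipc.address="0.0.0.0:5002$dn"
    if [ -n "${DRY_RUN:-}" ]; then
      # Hypothetical dry-run mode: print the environment and command only.
      echo "HADOOP_LOG_DIR=$HADOOP_LOG_DIR HADOOP_PID_DIR=$HADOOP_PID_DIR $*"
    else
      "$@"
    fi
  )
}
```

With `DRY_RUN=1`, calling `run_datanode start 2` prints the per-node directories and ports (the node number is appended to each port prefix), which makes it easy to confirm that two nodes will not collide before starting them.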
-
Re: Contributing to hadoop
Jakob Homan 2009-03-04, 19:41
There is definitely something to be said for developing via TDD, as Lohit mentioned. Hadoop has an extensive set of tools for writing unit tests that run on simulated clusters (see http://www.cloudera.com/blog/2008/12/16/testing-hadoop/ for an excellent tutorial). This will save you time in the long run, because your tests can be contributed along with the actual patch and there's no need to muck about with configuring clusters, manually starting datanodes, etc. Actually needing a cluster to test or develop patches against is pretty rare and indicative of a problem somewhere else.

-Jakob