Marcel Mitsuto F. S. 2013-03-20, 17:58
Hi,
I'm starting a project to build a 10 node cluster grid.
I've already successfully built a 10 node grid with hadoop 1.0.4.
This next grid would preferably be on the 0.23.x branch, which I think would be the best version for a smooth transition to the 2.0.3 release (right?).
When I was working with the 1.0.4 proof of concept, I was scratching my head about the role of the 'clients' that submit jobs to the cluster; all the `hadoop fs -put` work I was doing directly from the namenode instance.
So the question: how do I set up a grid where clients can send jobs to the cluster in a queued fashion, and how do I set up the 'clients' so that they are properly recognized by the grid and able to submit jobs? Am I correct to think that a 'client' could be any machine (say, my laptop on a network that reaches the namenode) with access to the cluster and hadoop installed locally?
Thanks in advance.
Harsh J 2013-03-21, 00:59
You are correct about your idea of clients. To talk to HDFS, they need to be allowed to talk to the NN's ports as well as the DN's ports. To talk to YARN/MR, they need access to both RM and NM ports (as well as the JobHistoryServer's web port).
Aside from just a local install, they'll also need that install configured to point to the cluster's URLs.
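As a rough illustration of "configured to point to the cluster's URLs", here is a minimal Python sketch that renders the kind of client-side `*-site.xml` files a 0.23.x/2.x client would carry. The hostnames and port numbers are hypothetical placeholders, not values from this thread; the property names (`fs.defaultFS`, `yarn.resourcemanager.address`, `mapreduce.framework.name`) are the ones I'd expect on those branches, so check them against the defaults shipped with your release.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Render a dict of Hadoop properties as a *-site.xml document string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical hostnames and ports: substitute the cluster's real ones.
core_site = build_site_xml(
    {"fs.defaultFS": "hdfs://namenode.example.com:8020"}
)
yarn_site = build_site_xml(
    {"yarn.resourcemanager.address": "resourcemanager.example.com:8032"}
)
mapred_site = build_site_xml(
    {"mapreduce.framework.name": "yarn"}
)

print(core_site)
print(yarn_site)
print(mapred_site)
```

The three strings would be saved as `core-site.xml`, `yarn-site.xml` and `mapred-site.xml` in the client's Hadoop configuration directory. On a 1.x client the equivalent settings use the older names `fs.default.name` and `mapred.job.tracker`.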
Regarding 0.23 or 2.0.3, you can choose either. The 0.23 line is fast-moving right now, as stability improvements under YARN and MR2 are continuously being added and released (mainly by and for use at Yahoo!). The 2.0.x line has a slightly wider release period, with new features and possible incompatibilities still coming in (until it hits beta), and carries HDFS-HA features plus protobuf-based protocols (which 0.23 lacks). Eventually, the 0.23 line will stop and move over to 2.x once the latter finally stabilizes.
But upgrade-wise, you can do either 1.x -> 2.x or 1.x -> 0.23.x (for now, while it lasts) -> 2.x; both routes are supported.
-- Harsh J
Marcel Mitsuto F. S. 2013-04-02, 20:28
Thank you for your answer!
Sorry for the late response. I just got my hands on ten servers (hp 2950 iii) that were replaced by another set of servers, and these will be the production grid servers.
This is a grid to compute exographic metrics from webserver access logs, such as geolocation, ISP, and all kinds of metrics related to our portal's audience, to support our operations and content delivery teams with metrics complementary to what Google Analytics and Omniture already provide. The daily log rotation should be around 400GB of uncompressed Apache CustomLog output. We won't hold raw data in HDFS, as that would increase hardware requirements to a level we're not yet able to commit to. We're going to MapReduce these raw logs into meaningful metrics.
They all have 6 slots for 15K SAS HDDs, and I already asked the hardware guys to install CentOS on RAID1 using 2 disks of 73GB. The remaining 4 slots will be filled with 300GB 15K SAS HDDs, and I want them to be handled by hadoop, ending up with 8 x 1.2TB of total DataNode storage: 2 servers for NN, SNN and JobTracker, and 8 DN/TT servers.
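For a sense of scale, the storage figures above can be worked through in a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the HDFS replication factor (`dfs.replication`, which defaults to 3) simply divides raw capacity, and it ignores reserved non-DFS space, intermediate map output and filesystem overhead. The "days of input" column is just a yardstick against the 400GB daily volume, since the raw logs are not meant to live in HDFS.

```python
NODES = 8             # DataNode/TaskTracker servers
DISKS_PER_NODE = 4    # data disks per server
DISK_GB = 300         # size of each data disk
DAILY_INPUT_GB = 400  # uncompressed access logs per day


def raw_capacity_gb(nodes=NODES, disks=DISKS_PER_NODE, disk_gb=DISK_GB):
    """Total raw DataNode storage across the cluster, in GB."""
    return nodes * disks * disk_gb


def usable_capacity_gb(replication, raw_gb=None):
    """Approximate space left for unique data at a given replication factor."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    raw = raw_capacity_gb() if raw_gb is None else raw_gb
    return raw // replication


for replication in (1, 2, 3):
    usable = usable_capacity_gb(replication)
    days = usable // DAILY_INPUT_GB
    print(f"replication={replication}: ~{usable} GB usable, "
          f"~{days} days of raw input")
```

With these numbers the raw total is 9600 GB, which leaves roughly 4800 GB at replication 2 and 3200 GB at replication 3.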
Now come the questions:
#1: I'm following the list, and there are some questions regarding building the kernel for this hardware using different I/O scheduler approaches. I have yet to customize a kernel, but I could upgrade our stock CentOS 6 kernel with new I/O schedulers if that seems to enhance performance by maximizing throughput. Should I do it?
#2: With 400GB of raw input data, 9.6TB of total HDFS storage, and daily or maybe hourly batch jobs, what would be the optimal replication multiplier for HDFS blocks? Would the answer to #1 impact the value I'd configure for the multiplier in #2, to get optimal HDFS usage and meet the processing time requirements of our batch jobs?
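Regarding #1, one thing that can be checked before building anything is which I/O schedulers the running kernel already offers. On Linux this is exposed per block device under `/sys/block/<device>/queue/scheduler`, with the active scheduler shown in square brackets. The Python sketch below only parses that line; the device name `sda` is an assumption, and it says nothing about which scheduler actually performs better for a Hadoop workload.

```python
from pathlib import Path


def parse_scheduler_line(line):
    """Split text like 'noop deadline [cfq]' into (active, available)."""
    active = None
    available = []
    for name in line.split():
        # The kernel marks the scheduler currently in use with brackets.
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def read_scheduler(device="sda"):
    """Read the scheduler list for one block device (device name assumed)."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_scheduler_line(path.read_text())
```

Calling `read_scheduler("sda")` on one of the DataNodes would return the active scheduler and the alternatives compiled into the stock kernel. The active one can be changed at runtime by writing a name from that list to the same file as root, which allows a comparison between schedulers without a custom kernel.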
Thank you for your attention and time!
Best regards, Marcel