Re: New Production Cluster Criticisms/Advice
anil gupta 2012-08-15, 02:40
My 2 cents on Hadoop version in Production:
If you think you will be deploying your stuff in prod in 1-2 months, then you should note that cdh4 uses Hadoop-2.0.0-Alpha, and an "Alpha" release means Hadoop-2.0.0 is not production ready. So you might need to make a call on which cdh version to use (cdh3u3 or cdh4).

Personally, I have used both cdh3u2 and cdh4. Recently, I completed setting up a fully distributed cluster of cdh4 with HA for NameNode, ZooKeeper, and HBase Master. HA for NameNode is a big advantage with Hadoop-2.0.0.

HTH,
Anil Gupta

On Tue, Aug 14, 2012 at 6:36 PM, Eli Finkelshteyn <[EMAIL PROTECTED]> wrote:

> Hey Mohammad,
> Thanks for the reply. I've been using Hadoop and Pig for a while, and I've set up a pseudo-cluster before. I've just never set up anything production-scale yet and wanted advice on that.
>
> Cheers,
>
> On Tue, Aug 14, 2012 at 6:20 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>
>> Hello Eli,
>>
>> If this is your first time with Hadoop then I would suggest configuring a cluster locally just to get yourself familiar with Hadoop (a pseudo setup would do).
>>
>> For your analytical stuff you can have a look at Pig, another member of the Hadoop ecosystem. It's a dataflow language that makes analytics really easy.
>>
>> As a data store HBase would definitely be a good move.
>>
>> For data aggregation, you can also have a look at Flume and Chukwa, apart from Scribe.
>>
>> On Wednesday, August 15, 2012, Eli Finkelshteyn <[EMAIL PROTECTED]> wrote:
>> > Hey Folks,
>> > I'm going to be setting up my first new production cluster soon, and was hoping to get some advice and criticism on my current plan of action. Here's my current plan:
>> >
>> > Background/Requirements:
>> > I'm setting this up for a start-up that's not gathering very big data yet, but will be in the next few months (I hope, anyway). I'd like to use the cluster for a few things, at least at first:
>> > 1. logging stuff it doesn't make sense to write to a normal database (as well as duplicates of what I am throwing in my database so I can use that stuff from HDFS later on). Basically, just logging a ton of information I might want for analytics/model training later.
>> > 2. analytics processing.
>> > 3. model training (for machine learning). I'll primarily do this through Mahout.
>> > 4. will probably want HBase on there as well for real-time reading of some data. I'm not married to this, and haven't played around much with HBase yet, but wanted to leave the possibility open.
>> >
>> > The Plan:
>> > I'm thinking I'll set this up in Amazon. We have most of the rest of our hardware there, and I really like the option to be able to spin up a bunch of extra workers at will to have them train some ML model for me and then kill them off. For now, just to get things off the ground, I'm going to set up a small 4 machine cluster (1 NameNode, 1 SecondaryNameNode/JobTracker, 2 DataNode/TaskTrackers). I'll start playing around with that setup, and will add more to it as needed. Since everything will be puppetized, adding more machines shouldn't be too bad (I think). I've been using Cloudera so far, and I haven't seen any good reason to switch, so I'll use CDH4. For logging, I'll just use scribe and wind up storing stuff as lzos (a good tutorial on the best way to do this would be awesome).
>> >
>> > Thoughts?
>> > Eli
>>
>> --
>> Regards,
>> Mohammad Tariq

--
Thanks & Regards,
Anil Gupta