My 2 cents on Hadoop version in Production:
If you think you will be deploying to production in 1-2 months, then
you should note that cdh4 uses Hadoop-2.0.0-Alpha, and an "Alpha" release means
Hadoop-2.0.0 is not production ready. So you might need to make a call
on which cdh version to use (cdh3u3 or cdh4).
Personally, I have used both cdh3u2 and cdh4. Recently, I completed setting
up a fully distributed cluster of cdh4 with HA for Namenode, Zookeeper, and
HBase Master. HA for Namenode is a big advantage with Hadoop-2.0.0.
On Tue, Aug 14, 2012 at 6:36 PM, Eli Finkelshteyn <[EMAIL PROTECTED]> wrote:
> Hey Mohammad,
> Thanks for the reply. I've been using Hadoop and Pig for a while, and I've
> set up a pseudo-cluster before. I've just never set up anything
> production-scale yet and wanted advice on that.
> On Tue, Aug 14, 2012 at 6:20 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>> Hello Eli,
>> If this is your first time with Hadoop, then I would suggest
>> configuring a cluster locally just to get yourself familiar with Hadoop (a
>> pseudo-distributed setup would do).
>> For your analytical stuff you can have a look at Pig, another member of
>> the Hadoop ecosystem. It's a dataflow language that makes analytics really
>> As a data store, HBase would definitely be a good move.
>> For data aggregation, you can also have a look at Flume and Chukwa, apart
>> from Scribe.
>> On Wednesday, August 15, 2012, Eli Finkelshteyn <[EMAIL PROTECTED]>
>> > Hey Folks,
>> > I'm going to be setting up my first new production cluster soon, and
>> was hoping to get some advice and criticism on my current plan of action.
>> Here's my current plan:
>> > Background/Requirements:
>> > I'm setting this up for a start-up that's not gathering very big data
>> yet, but will be in the next few months (I hope, anyway). I'd like to use
>> the cluster for a few things, at least at first:
>> > 1. logging stuff it doesn't make sense to write to a normal database
>> (as well as duplicates of what I am throwing in my database so I can use
>> that stuff from HDFS later on). Basically, just logging a ton
>> of information I might want for analytics/model training later.
>> > 2. analytics processing.
>> > 3. model training (for machine learning). I'll primarily do this
>> through Mahout.
>> > 4. will probably want hbase on there as well for real time reading of
>> some data. I'm not married to this, and haven't played around much with
>> hbase yet, but wanted to leave the possibility open.
>> > The Plan:
>> > I'm thinking I'll set this up in Amazon. We have most of the rest of
>> our hardware there, and I really like the option to be able to spin up a
>> bunch of extra workers at will to have them train some ML model for me and
>> then kill them off. For now, just to get things off the ground, I'm going
>> to set up a small 4-machine cluster (1 NameNode, 1
>> SecondaryNameNode/JobTracker, 2 DataNode/TaskTrackers). I'll start playing
>> around with that setup, and will add more to it as needed. Since everything
>> will be puppetized, adding more machines shouldn't be too bad (I think).
>> I've been using Cloudera so far, and I haven't seen any good reason to
>> switch, so I'll use CDH4. For logging, I'll just use scribe and wind up
>> storing stuff as lzos (a good tutorial on the best way to do this would be
>> > Thoughts?
>> > Eli
>> Mohammad Tariq
Thanks & Regards,