Thanks for all your help and replies. Though I am leaning towards option 1
or 2, I looked up BigTop, an Incubator project at Apache, but could not
find enough information about it on its website. I have a few more questions
and hope they are appropriate for these mailing lists.
1. Cos: Can you please point me to a link that talks about BigTop & EC2?
2. Regarding Whirr, can I just choose an Ubuntu EBS-backed AMI? Would that
be any different from choosing a normal Hadoop AMI and (later) trying to
attach and mount an EBS volume on this instance?
3. John: I like your idea of using S3 to store input and output. But, say I
start a Hadoop cluster, configure Sqoop and Hive, and run my jobs. Then, after
I get my output in S3, I either stop or terminate the cluster (since I do not
have EBS, I don't care which). Now, after a while, I want to bring up a similar
cluster, run Hive and Sqoop, and do more experiments. In this case, will
I have to reconfigure all my Sqoop settings, Hive table schemas etc.?
Because I think once I "stop" an instance, I will lose the configs, and
when I restart a Hadoop AMI, I will only have Hadoop nicely running in that
instance and nothing else.
Ideally, I want everything to persist, even configs and newly installed
tools (Hive, Sqoop). Or should I create a custom Ubuntu AMI with Hadoop,
Sqoop, Hive etc. "pre-cooked" in it? This is probably the ideal way to
proceed, even if it is a little painful. I think I really want an EBS-backed
instance, as it maintains its internal state when stopped and restarted.
Please let me know your opinion. This discussion is deviating from what I
originally started with.
A little Googling turns up similar posts:
https://forums.aws.amazon.com/message.jspa?messageID=131157
I know I can find out by trying these out, but I want to lessen my
burden in the trial-and-error process.
Thanks very much,
PD.
On Tue, Nov 29, 2011 at 12:40 PM, Konstantin Boudnik <[EMAIL PROTECTED]> wrote:
> I'd suggest you use BigTop (cross-posting to bigtop-dev@ list) produced
> bits, which also possess Puppet recipes allowing for fully automated
> deployment and configuration. BigTop also uses the Jenkins EC2 plugin for
> the deployment part, and it seems to work really great!
>
> Cos
>
> On Tue, Nov 29, 2011 at 12:28PM, Periya.Data wrote:
> > Hi All,
> > I am just beginning to learn how to deploy a small cluster (a 3
> > node cluster) on EC2. After some quick Googling, I see the following
> > approaches:
> >
> > 1. Use Whirr for quick deployment and tearing down. Uses CDH3. Does it
> > have features for persisting (EBS)?
> > 2. CDH Cloud Scripts - has EC2 AMI - again for temp Hadoop clusters/POC
> > etc. Good stuff - I can persist using EBS snapshots. But, this uses CDH2.
> > 3. Install hadoop manually and related stuff like Hive...on each cluster
> > node...on EC2 (or use some automation tool like Chef). I do not prefer it.
> > 4. Hadoop distribution comes with EC2 (under src/contrib) and there are
> > several Hadoop EC2 AMIs available. I have not studied enough to know if
> > that is easy for a beginner like me.
> > 5. Anything else??
> >
> > 1 and 2 look promising as a beginner. If any of you have any thoughts
> > about this, I would like to know (like what to keep in mind, what to take
> > care of, caveats etc). I want my data/config to persist (using EBS) and
> > continue from where I left off...(after a few days). Also, I want to have
> > HIVE and SQOOP installed. Can this be done using 1 or 2? Or, will
> > installation of them have to be done manually after I set up the cluster?
> >
> > Thanks very much,
> >
> > PD.
>