Re: Distributed cluster filesystem on EC2
Tom White 2011-08-31, 15:50
You might consider Apache Whirr (http://whirr.apache.org/) for
bringing up Hadoop clusters on EC2.
On Wed, Aug 31, 2011 at 8:22 AM, Robert Evans <[EMAIL PROTECTED]> wrote:
> It sounds like an interesting idea, but I have not really heard of anyone doing it before. It would make for a good feature to have tiered file systems all mapped into the same namespace, but that would be a lot of work and complexity.
> The quick solution would be to know what data you want to process beforehand and then run distcp to copy it from S3 into HDFS before launching the other map/reduce jobs. I don't think there is anything automatic out there.
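The distcp step suggested above might look something like this. This is a sketch, not from the thread: the bucket, paths, and namenode address are placeholders, and it assumes the `s3n://` scheme that Hadoop used for S3 access at the time.

```shell
# Copy input data from S3 into HDFS before launching MapReduce jobs.
# Bucket, paths, and namenode host are placeholders -- substitute your own.
hadoop distcp \
  s3n://my-bucket/input/ \
  hdfs://namenode:8020/user/hadoop/input/
```

Note that distcp itself runs as a MapReduce job, so the copy is parallelized across the cluster's task slots rather than bottlenecking on a single node.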
> --Bobby Evans
> On 8/29/11 4:56 PM, "Dmitry Pushkarev" <[EMAIL PROTECTED]> wrote:
> Dear hadoop users,
> Sorry for the off-topic question. We're slowly migrating our Hadoop cluster to EC2,
> and one thing that I'm trying to explore is whether we can use alternative
> scheduling systems like SGE with a shared FS for non-data-intensive tasks,
> since they are easier to work with for lay users.
> One problem for now is how to create a shared cluster filesystem similar to
> HDFS: distributed, high-performance, somewhat POSIX-compliant (symlinks
> and permissions), and backed by Amazon EC2 local non-persistent storage.
> The idea is to keep the original data on S3, then as needed fire up a bunch of
> nodes, start the shared filesystem, quickly copy data from S3 to that FS,
> run the analysis with SGE, save the results, and shut the filesystem down.
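One way to make the S3-to-cluster copy fast is to shard the object list across nodes so each node pulls its own slice in parallel, rather than funneling everything through one machine. A minimal sketch of the sharding logic (the function name and structure are illustrative, not from the thread; the actual per-node download would be done with whatever S3 tool you have):

```python
def shard_keys(keys, num_nodes):
    """Assign S3 keys to nodes round-robin so each node
    downloads roughly the same number of objects in parallel."""
    shards = [[] for _ in range(num_nodes)]
    for i, key in enumerate(sorted(keys)):
        shards[i % num_nodes].append(key)
    return shards

# Example: 5 objects spread across 2 download nodes.
print(shard_keys(["a", "b", "c", "d", "e"], 2))
# → [['a', 'c', 'e'], ['b', 'd']]
```

Each node then fetches only its shard from S3 into the shared filesystem, so aggregate download bandwidth scales with the number of nodes instead of saturating a single link.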
> I tried things like S3FS and similar native S3 implementations, but the speed
> is too poor. Currently I just have a FS on my master node that is shared via
> NFS with all the rest, but I pretty much saturate the 1 Gb link as soon as I
> start more than 10 nodes.
> Thank you. I'd appreciate any suggestions and links to relevant resources!