On 29.08.2012 at 22:58, Steve Sonnenberg <[EMAIL PROTECTED]> wrote:
> Is there any way to import data into HDFS without copying it in? (kind of like by reference)
> I'm pretty sure the answer to this is no.
> What I'm looking for is something that will take existing NFS data and access it as an HDFS filesystem.
> Use case: I have existing data in a warehouse that I would like to run MapReduce etc. on without copying it into HDFS.
> If the data were in S3, could I run MapReduce on it?
Hadoop has a filesystem abstraction layer that supports many physical filesystem implementations: HDFS of course, but also the local filesystem, S3, FTP, and others.
You simply lose data locality if you're running MapReduce on data that is, well, not local to where it's being processed.
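
To make the abstraction a bit more concrete, here is a toy Python sketch of the general idea: the scheme of a path's URI selects the backend, and a path without a scheme falls back to a configured default. This is not Hadoop code. The table, the labels in it, and the function name resolve_backend are all made up for illustration, and the example URIs are placeholders.

```python
from urllib.parse import urlparse

# Hypothetical scheme-to-backend table. The labels are illustrative only
# and are not Hadoop class names.
BACKENDS = {
    "hdfs": "distributed (HDFS)",
    "file": "local filesystem",
    "s3": "Amazon S3",
    "ftp": "FTP server",
}


def resolve_backend(uri, default_scheme="hdfs"):
    """Return the backend label for a URI.

    A URI without a scheme (a bare path) uses default_scheme, which stands
    in for a cluster-wide default filesystem setting.
    """
    scheme = urlparse(uri).scheme or default_scheme
    if scheme not in BACKENDS:
        raise ValueError(f"no filesystem registered for scheme {scheme!r}")
    return BACKENDS[scheme]


if __name__ == "__main__":
    # Placeholder paths: the same job input could point at any of these.
    for uri in [
        "hdfs://namenode/data/in",
        "file:///mnt/nfs/warehouse",
        "s3://bucket/logs",
        "/user/steve/in",
    ]:
        print(uri, "->", resolve_backend(uri))
```

The point of the sketch is only that the job sees one interface while the scheme decides where the bytes actually live. Locality is a separate matter: only a backend that stores blocks on the compute nodes themselves can offer it.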
With data stored in S3, a common solution is to fire up an EMR (Elastic MapReduce) cluster inside Amazon's datacenter to work on your S3 data. It's not real data locality, but at least the processing happens in the same datacenter as your data. And once you're done processing the data, you can take down the EMR cluster.