Generally I do not see a problem with your plan of using HDFS to store
these files, assuming they are updated rarely if ever. Hadoop is
traditionally a batch system, and MapReduce largely remains one: minimum
job latencies are in the "seconds" range. HDFS, however, has real-time
systems built on top of it, like HBase. The main issue to be concerned
with when using HDFS purely as storage is file size. Because HDFS stores
its metadata in RAM on the NameNode, you don't want to create tremendous
numbers of "small" files. With 50-100MB files you should be fine.
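To put rough numbers on that, here is a back-of-envelope sketch of NameNode heap usage for an archive like the one described below (100TB of ~50-100MB files). The ~150 bytes of heap per file/block object is a commonly cited rule of thumb, not an exact figure, and the 128MB block size is just the common default:

```java
// Rough NameNode heap estimate: metadata for every file and block
// lives in RAM, so the object count, not the byte count, is what matters.
public class NameNodeMemoryEstimate {
    public static void main(String[] args) {
        long archiveBytes   = 100L * 1024 * 1024 * 1024 * 1024; // ~100 TB archive
        long avgFileBytes   = 75L * 1024 * 1024;                // ~75 MB per TIFF (midpoint)
        long blockSize      = 128L * 1024 * 1024;               // common default HDFS block size
        long bytesPerObject = 150;                              // rule-of-thumb heap cost per object

        long files         = archiveBytes / avgFileBytes;       // ~1.4 million files
        long blocksPerFile = (avgFileBytes + blockSize - 1) / blockSize; // 1 block each here
        long blocks        = files * blocksPerFile;
        long heapBytes     = (files + blocks) * bytesPerObject;

        // On these assumptions: on the order of a few hundred MB of heap,
        // i.e. comfortably small for a single NameNode.
        System.out.printf("files=%d blocks=%d heapMB=%d%n",
                files, blocks, heapBytes / (1024 * 1024));
    }
}
```

The same arithmetic shows why millions of tiny files would be a problem: cut the average file size to 100KB and the object count (and heap) grows by nearly three orders of magnitude.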
On Mon, Oct 15, 2012 at 2:47 PM, Matt Painter <[EMAIL PROTECTED]> wrote:
> I am a new Hadoop user, and would really appreciate your opinions on whether
> Hadoop is the right tool for what I'm thinking of using it for.
> I am investigating options for scaling an archive of around 100TB of image
> data. These images are typically TIFF files of around 50-100MB each and need
> to be made available online in realtime. Access to the files will be
> sporadic and occasional, but writing the files will be a daily activity.
> Speed of write is not particularly important.
> Our previous solution was a monolithic, expensive - and very full - SAN so I
> am excited by Hadoop's distributed, extensible, redundant architecture.
> My concern is that a lot of the discussion on and use cases for Hadoop is
> regarding data processing with MapReduce and - from what I understand -
> using HDFS for the purpose of input for MapReduce jobs. My other concern is
> vague indication that it's not a 'real-time' system. We may be using
> MapReduce in small components of the application, but it will most likely be
> in file access analysis rather than any processing on the files themselves.
> In other words, what I really want is a distributed, resilient, scalable
> filesystem.
> Is Hadoop suitable if we just use this facility, or would I be misusing it
> and inviting grief?
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/