MapReduce, mail # user - Re: Suitability of HDFS for live file store

Re: Suitability of HDFS for live file store
Ted Dunning 2012-10-15, 22:17
If you are going to mention commercial distros, you should include MapR as
well.  Hadoop compatible, very scalable and handles very large numbers of
files in a Posix-ish environment.

On Mon, Oct 15, 2012 at 1:35 PM, Brian Bockelman <[EMAIL PROTECTED]> wrote:

> Hi,
> We use HDFS to process data for the LHC - somewhat similar case here.  Our
> files are a bit larger, our total local data size is ~1PB logical, and we
> "bring our own" batch system, so no MapReduce.  We perform many random
> reads, so we are quite sensitive to underlying latency.
> I don't see any mismatch between your requirements and HDFS capabilities
> obvious enough to eliminate it as a candidate without an evaluation.  Do
> note that HDFS does not provide complete POSIX semantics - but you don't
> appear to need them?
> IMHO, if you have the following requirements:
> 1) Proven petascale data store (never want to be on the bleeding edge of
> your filesystem's scaling!).
> 2) Has self-healing semantics (can recover from the loss of RAIDs or
> entire storage targets).
> 3) Open source (but do consider commercial companies - your time is worth
> something!).
> You end up looking at a very small number of candidates.  Other
> filesystems that should be on your list:
> 1) Gluster.  A quite viable alternative.  Like HDFS, you can buy commercial
> support.  I personally don't know enough to provide a pros/cons list, but
> we keep it on our radar.
> 2) Ceph.  Not as proven IMHO.  I don't know of multiple petascale deploys.
>  Requires a quite recent kernel.  Quite good on-paper design.
> 3) Lustre.  I think you'd be disappointed with the self-healing.  A very
> "traditional" HPC/clustered filesystem design.
> For us, HDFS wins.  I think it has the possibility of being a winner in
> your case too.
> Brian
> On Oct 15, 2012, at 3:21 PM, Jay Vyas <[EMAIL PROTECTED]> wrote:
> Seems like a heavyweight solution unless you are actually processing the
> images?
> Wow, no MapReduce, no streaming writes, and relatively small files.  I'm
> surprised that you are considering Hadoop at all?
> I'm surprised there isn't a simpler solution that uses redundancy without
> all the daemons and name nodes and task trackers and stuff.
> Might make it kind of awkward as a normal file system.
> On Mon, Oct 15, 2012 at 4:08 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>> Hey Matt,
>> What do you mean by 'real-time' though? While HDFS has pretty good
>> contiguous data read speeds (and you get N x replicas to read from),
>> if you're looking to "cache" frequently accessed files into memory
>> then HDFS does not natively have support for that. Otherwise, I agree
>> with Brock, seems like you could make it work with HDFS (sans
>> MapReduce - no need to run it if you don't need it).
>> The presence of NameNode audit logging will help your file access
>> analysis requirement.
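
To illustrate the audit-log point above, here is a minimal sketch (not from the original thread) that counts file opens per path from NameNode audit log lines. It assumes each audit line contains "audit:" followed by tab-separated key=value fields that include cmd and src; the exact layout varies by Hadoop version and logging configuration, so check it against your own logs. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumed line shape (verify against your own NameNode audit log):
#   <timestamp> INFO FSNamesystem.audit: ugi=<user>\tip=/<addr>\tcmd=<op>\tsrc=<path>\tdst=<path>\tperm=<perm>
# Values run up to the next tab, so paths containing spaces or '=' stay intact.
AUDIT_FIELD = re.compile(r"(\w+)=([^\t\n]*)")


def parse_audit_line(line):
    """Return the key=value fields of one audit line as a dict, or None."""
    marker = "audit:"
    pos = line.find(marker)
    if pos < 0:
        return None  # not an audit line
    fields = dict(AUDIT_FIELD.findall(line[pos + len(marker):]))
    if "cmd" not in fields or "src" not in fields:
        return None
    return fields


def count_opens(lines):
    """Count how many times each path was opened for reading."""
    counts = Counter()
    for line in lines:
        fields = parse_audit_line(line)
        if fields is not None and fields["cmd"] == "open":
            counts[fields["src"]] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "2012-10-15 22:17:01,000 INFO FSNamesystem.audit: "
        "ugi=matt\tip=/10.0.0.5\tcmd=open\tsrc=/archive/a.tif\tdst=null\tperm=null",
        "2012-10-15 22:17:02,000 INFO FSNamesystem.audit: "
        "ugi=matt\tip=/10.0.0.5\tcmd=create\tsrc=/archive/b.tif\tdst=null\tperm=rw-r--r--",
    ]
    for path, n in count_opens(sample).most_common():
        print(n, path)
```

Feeding it the real log file (for example, iterating over an open file handle) gives a per-file access count that can be sorted to find hot and cold images.
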
>> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <[EMAIL PROTECTED]> wrote:
>> > Hi,
>> >
>> > I am a new Hadoop user, and would really appreciate your opinions on
>> > whether Hadoop is the right tool for what I'm thinking of using it for.
>> >
>> > I am investigating options for scaling an archive of around 100TB of
>> > image data. These images are typically TIFF files of around 50-100MB each
>> > and need to be made available online in realtime. Access to the files will
>> > be sporadic and occasional, but writing the files will be a daily activity.
>> > Speed of write is not particularly important.
>> >
>> > Our previous solution was a monolithic, expensive - and very full - SAN,
>> > so I am excited by Hadoop's distributed, extensible, redundant architecture.
>> >
>> > My concern is that a lot of the discussion on and use cases for Hadoop
>> > is regarding data processing with MapReduce and - from what I understand -
>> > using HDFS for the purpose of input for MapReduce jobs. My other concern
>> > is the vague indication that it's not a 'real-time' system. We may be using