Re: HDFS without Hadoop: Why?
Nathan,

Great references. There is a good place to put them:
http://wiki.apache.org/hadoop/HDFS_Publications
GPFS and Lustre papers are not there yet, I believe.

Thanks,
--Konstantin

On Thu, Feb 3, 2011 at 10:48 AM, Nathan Rutman <[EMAIL PROTECTED]> wrote:

>
> On Feb 2, 2011, at 6:42 PM, Konstantin Shvachko wrote:
>
> Thanks for the link, Stu.
> More details on the limitations are here:
> http://www.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf
>
> I think Nathan raised an interesting question, and his assessment of
> HDFS use cases is generally right.
> Some of the assumptions, though, are outdated at this point,
> as people have mentioned in the thread.
> We have an append implementation, which allows reopening files for updates.
> We also have symbolic links and quotas (space and name-space).
> The API to HDFS is not POSIX, true. But in addition to FUSE, people
> also use Thrift to access HDFS.
> Most of these features are explained in the HDFS overview paper:
> http://storageconference.org/2010/Papers/MSST/Shvachko.pdf
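>
> For illustration, a minimal sketch of that kind of plain filesystem
> access through the Java client (no map-reduce involved). This assumes
> the Hadoop client jars on the classpath and an append-enabled cluster;
> the namenode address and file path are hypothetical:
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FSDataOutputStream;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>
>     public class HdfsAppendExample {
>       public static void main(String[] args) throws Exception {
>         Configuration conf = new Configuration();
>         // Hypothetical namenode address.
>         conf.set("fs.default.name", "hdfs://namenode:8020");
>         FileSystem fs = FileSystem.get(conf);
>
>         // Reopen an existing file for update -- the append
>         // support mentioned above.
>         FSDataOutputStream out = fs.append(new Path("/data/events.log"));
>         out.write("another record\n".getBytes("UTF-8"));
>         out.close();
>         fs.close();
>       }
>     }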
>
> Stand-alone HDFS is actually used in several places. I like what
> Brian Bockelman at the University of Nebraska does.
> They store CERN data in their cluster, and physicists access the data
> with Fortran, not map-reduce, as I heard.
> http://storageconference.org/2010/Presentations/MSST/3.Bockelman.pdf
>
> This doesn't seem to mention what storage they're using.
>
>
> With respect to other distributed file systems: HDFS performance has
> been compared to PVFS, GPFS, and Lustre. The results were in favor of
> HDFS. See e.g.
>
> PVFS
>
> http://www.cs.cmu.edu/~wtantisi/files/hadooppvfs-pdl08.pdf
>
>
> Some other references for those interested: HDFS vs
> GPFS
> "Cloud analytics: Do we really need to reinvent the storage stack?"
> http://www.usenix.org/event/hotcloud09/tech/full_papers/ananthanarayanan.pdf
> Lustre
> http://wiki.lustre.org/images/1/1b/Hadoop_wp_v0.4.2.pdf
> Ceph
> http://www.usenix.org/publications/login/2010-08/openpdfs/maltzahn.pdf
>
> These GPFS and Lustre papers were both favorable toward HDFS because
> they missed a fundamental issue: for the former filesystems, network
> speed is critical. HDFS (ideally) needs no network on reads, so it is
> simultaneously immune to network speed and unable to take advantage
> of it. For slow networks (1GigE) this plays into HDFS's strength, but
> for fast networks (10GigE, Infiniband) the balance tips the other
> way. (My testing: on a heavily loaded network, a 3-4x read speed
> advantage for Lustre. For writes the difference is even more extreme
> (10x), since HDFS has to hop all write data over the network twice.)
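>
> (Back-of-envelope on that last point, assuming replication factor 3
> and a client co-located with a datanode: the first replica is written
> to the local disk, and the write pipeline then forwards every byte
> over the network twice, DN1 -> DN2 -> DN3. Cluster-wide, write
> traffic therefore consumes twice the payload in network bandwidth,
> which on 1GigE quickly becomes the bottleneck. Reads, by contrast,
> are served from a local replica when possible and may use no network
> at all.)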
>
> Let me say clearly that your choice of FS should depend on which of
> many factors are most important to you -- there is no "one size fits
> all", although that sadly makes our decisions more complex. For those
> using Hadoop who put a high weighting on IO performance (as well as
> some other factors I listed in my original mail), I suggest you at
> least think about spending money on a fast network and using a FS
> that can utilize it.
>
>
> So I agree with Nathan that HDFS was designed and optimized as a
> storage layer for map-reduce type tasks, but it performs well as a
> general-purpose FS too.
>
> Thanks,
> --Konstantin
>
>
>
>
> On Wed, Feb 2, 2011 at 6:08 PM, Stuart Smith <[EMAIL PROTECTED]> wrote:
>
>>
>> This is the best coverage I've seen from a source that would know:
>>
>>
>> http://developer.yahoo.com/blogs/hadoop/posts/2010/05/scalability_of_the_hadoop_dist/
>>
>> One relevant quote:
>>
>> To store 100 million files (referencing 200 million blocks), a name-node
>> should have at least 60 GB of RAM.
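>>
>> (Back-of-envelope from those figures: 60 GB for 100 million files
>> plus 200 million blocks works out to roughly 200 bytes of namenode
>> heap per namespace object, so a rough sizing rule is
>> heap ~= (files + blocks) x 200 bytes -- e.g., 10 million files of
>> about 2 blocks each is ~30 million objects, or about 6 GB.)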
>>
>> But, honestly, if you're just building out your cluster, you'll probably
>> run into a lot of other limits first: hard drive space, regionserver memory,
>> the infamous ulimit/xciever :), etc...