HDFS >> mail # user >> Re: HDFS without Hadoop: Why?


Bharath Mundlapudi  2011-02-05, 06:52
Nathan Rutman  2011-01-25, 20:37
Sean Bigdatafun  2011-02-01, 02:34
Nathan Rutman  2011-02-01, 03:51
Jeff Hammerbacher  2011-02-02, 23:31
Dhodapkar, Chinmay  2011-02-03, 00:28
Ian Holsman  2011-02-03, 00:38
Stuart Smith  2011-02-03, 00:40
Dhodapkar, Chinmay  2011-02-03, 01:11
Dhruba Borthakur  2011-02-03, 02:00
Stuart Smith  2011-02-03, 02:08
Gaurav Sharma  2011-02-03, 02:31
Stuart Smith  2011-02-03, 03:32
Re: HDFS without Hadoop: Why?
Thanks for the link, Stu.
More details on the limitations are here:
http://www.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf

I think Nathan raised an interesting question, and his assessment of HDFS
use cases is generally right. Some of his assumptions, though, are outdated
at this point, as others have mentioned in the thread. We now have an
append implementation, which allows reopening files for updates. We also
have symbolic links and quotas (both space and name-space). The API to HDFS
is not POSIX, true, but in addition to FUSE people also use Thrift to
access HDFS.
Most of these features are explained in the HDFS overview paper:
http://storageconference.org/2010/Papers/MSST/Shvachko.pdf
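
Just to illustrate, a minimal sketch of reopening a file for append
through the Java FileSystem API (the cluster address and path below are
made up):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class AppendExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      conf.set("fs.default.name", "hdfs://namenode:8020"); // hypothetical
      FileSystem fs = FileSystem.get(conf);
      // Reopen an existing file for updates and add a record at the end.
      Path file = new Path("/data/events.log");            // hypothetical
      FSDataOutputStream out = fs.append(file);
      out.write("one more record\n".getBytes());
      out.close();
      fs.close();
    }
  }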

Stand-alone HDFS is actually used in several places. I like what
Brian Bockelman at the University of Nebraska does. They store CERN data
in their cluster, and, as I heard, physicists access the data from
Fortran, not through map-reduce.
http://storageconference.org/2010/Presentations/MSST/3.Bockelman.pdf

With respect to other distributed file systems: HDFS performance was
compared to PVFS, GPFS, and Lustre, and the results were in favor of HDFS.
See, e.g.,
http://www.cs.cmu.edu/~wtantisi/files/hadooppvfs-pdl08.pdf

So I agree with Nathan that HDFS was designed and optimized as a storage
layer for map-reduce type tasks, but it performs well as a general-purpose
file system as well.

Thanks,
--Konstantin
On Wed, Feb 2, 2011 at 6:08 PM, Stuart Smith <[EMAIL PROTECTED]> wrote:

>
> This is the best coverage I've seen from a source that would know:
>
>
> http://developer.yahoo.com/blogs/hadoop/posts/2010/05/scalability_of_the_hadoop_dist/
>
> One relevant quote:
>
> To store 100 million files (referencing 200 million blocks), a name-node
> should have at least 60 GB of RAM.
>
> But, honestly, if you're just building out your cluster, you'll probably
> run into a lot of other limits first: hard drive space, regionserver memory,
> the infamous ulimit/xciever :), etc...
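>
> (For reference: the transceiver limit is raised in hdfs-site.xml; the
> property name below is the real, historically misspelled one, but the
> value is just a commonly used setting, not an official recommendation.
> The OS open-file ulimit for the datanode user has to go up as well.)
>
>   <property>
>     <name>dfs.datanode.max.xcievers</name>
>     <value>4096</value>
>   </property>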
>
> Take care,
>   -stu
>
> --- On Wed, 2/2/11, Dhruba Borthakur <[EMAIL PROTECTED]> wrote:
>
>
> From: Dhruba Borthakur <[EMAIL PROTECTED]>
> Subject: Re: HDFS without Hadoop: Why?
> To: [EMAIL PROTECTED]
> Date: Wednesday, February 2, 2011, 9:00 PM
>
> The Namenode uses around 160 bytes/file and 150 bytes/block in HDFS. This
> is a very rough calculation.
>
> dhruba
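>
> (A quick back-of-the-envelope check of those numbers against the "at
> least 60 GB for 100 million files" figure quoted above; a rough sketch,
> not an exact sizing:)
>
>   public class NamenodeHeapEstimate {
>     public static void main(String[] args) {
>       long files  = 100000000L;                   // 100M files
>       long blocks = 200000000L;                   // 200M blocks
>       long bytes  = files * 160L + blocks * 150L; // 16 GB + 30 GB = ~46 GB
>       System.out.println(bytes / 1e9 + " GB before overhead");
>       // Block replicas, transient objects, and general JVM overhead
>       // push the real requirement toward the quoted 60 GB.
>     }
>   }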
>
> On Wed, Feb 2, 2011 at 5:11 PM, Dhodapkar, Chinmay <[EMAIL PROTECTED]> wrote:
>
> What you describe is pretty much my use case as well. Since I don't know
> how big the number of files could get, I am trying to figure out if there
> is a theoretical design limitation in HDFS...
>
> From what I have read, the name node will store all metadata of all files
> in RAM. Assuming (in my case) that a file is less than the configured
> block size... there should be a very rough formula that can be used to
> calculate the max number of files that HDFS can serve based on the
> configured RAM on the name node?
>
> Can any of the implementers comment on this? Am I even thinking on the
> right track...?
>
> Thanks, Ian, for the haystack link... very informative indeed.
>
> -Chinmay
>
> From: Stuart Smith [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, February 02, 2011 4:41 PM
> To: [EMAIL PROTECTED]
> Subject: RE: HDFS without Hadoop: Why?
>
> Hello,
>    I'm actually using hbase/hadoop/hdfs for lots of small files (with a
> long tail of larger files). Well, millions of small files - I don't know
> what you mean by lots :)
>
> Facebook probably knows better, but what I do is (see the sketch below):
>
>   - store metadata in HBase
>   - store files smaller than 10 MB or so in HBase
>   - store larger files in an HDFS directory tree.
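>
> A minimal sketch of that split, assuming the 0.90-era HBase client API
> (table, family, and path names below are made up):
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FSDataOutputStream;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.hbase.HBaseConfiguration;
>   import org.apache.hadoop.hbase.client.HTable;
>   import org.apache.hadoop.hbase.client.Put;
>
>   public class SmallFileStore {
>     private static final long SMALL = 10L * 1024 * 1024; // ~10 MB cutoff
>
>     // Small payloads go into an HBase cell; large ones go to HDFS,
>     // with the metadata row in HBase pointing at the file.
>     public static void store(String key, byte[] data) throws Exception {
>       Configuration conf = HBaseConfiguration.create();
>       HTable table = new HTable(conf, "files");      // hypothetical table
>       Put put = new Put(key.getBytes());
>       if (data.length < SMALL) {
>         put.add("content".getBytes(), "data".getBytes(), data);
>       } else {
>         FileSystem fs = FileSystem.get(conf);
>         Path path = new Path("/blobs/" + key);       // hypothetical layout
>         FSDataOutputStream out = fs.create(path);
>         out.write(data);
>         out.close();
>         put.add("content".getBytes(), "path".getBytes(),
>                 path.toString().getBytes());
>       }
>       table.put(put);
>       table.close();
>     }
>   }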
>
> I started storing 64 MB files and smaller in hbase (chunk size), but that
> causes issues with regionservers when running M/R jobs. This is related to
Nathan Rutman  2011-02-03, 18:48
Konstantin Shvachko  2011-02-03, 20:24
Scott Golby  2011-01-25, 22:05
Gerrit Jansen van Vuuren  2011-01-25, 23:56
Nathan Rutman  2011-01-26, 00:32
stu24mail@...  2011-01-26, 01:08
Nathan Rutman  2011-01-26, 01:31
stu24mail@...  2011-01-26, 03:58
Dhruba Borthakur  2011-01-26, 05:54
Gerrit Jansen van Vuuren  2011-01-26, 09:59
Gerrit Jansen van Vuuren  2011-01-26, 15:26
Nathan Rutman  2011-01-26, 17:41
stu24mail@...  2011-01-27, 03:04
Friso van Vollenhoven  2011-01-26, 09:55
Gerrit Jansen van Vuuren  2011-01-27, 11:09