Bharath Mundlapudi
2011-02-05, 06:52
Nathan Rutman
2011-01-25, 20:37
Sean Bigdatafun
2011-02-01, 02:34
Nathan Rutman
2011-02-01, 03:51
Jeff Hammerbacher
2011-02-02, 23:31
Dhodapkar, Chinmay
2011-02-03, 00:28
Ian Holsman
2011-02-03, 00:38
Stuart Smith
2011-02-03, 00:40
Dhodapkar, Chinmay
2011-02-03, 01:11
Dhruba Borthakur
2011-02-03, 02:00
Stuart Smith
2011-02-03, 02:08
Gaurav Sharma
2011-02-03, 02:31
Stuart Smith
2011-02-03, 03:32
Konstantin Shvachko
2011-02-03, 02:42
Nathan Rutman
2011-02-03, 18:48
Konstantin Shvachko
2011-02-03, 20:24
Scott Golby
2011-01-25, 22:05
Gerrit Jansen van Vuuren
2011-01-25, 23:56
Nathan Rutman
2011-01-26, 00:32
stu24mail@...
2011-01-26, 01:08
Nathan Rutman
2011-01-26, 01:31
stu24mail@...
2011-01-26, 03:58
Dhruba Borthakur
2011-01-26, 05:54
Gerrit Jansen van Vuuren
2011-01-26, 09:59
Gerrit Jansen van Vuuren
2011-01-26, 15:26
Nathan Rutman
2011-01-26, 17:41
stu24mail@...
2011-01-27, 03:04
Friso van Vollenhoven
2011-01-26, 09:55
Gerrit Jansen van Vuuren
2011-01-27, 11:09
-
Re: HDFS without Hadoop: Why? Bharath Mundlapudi 2011-02-05, 06:52
Note that the Namenode keeps other data structures in memory as well, such as the BlockMap and the Directory. Counting only the bytes per file and per block is not sufficient. It is true, though, that the File and Block structures occupy most of the memory; I would say these two are in the top 10 list of high-memory objects. Konstantin's paper will probably give you more holistic information.

Also, quite a few memory optimizations have gone into the Namenode. With the recent optimizations, you can expect > 60 million files (with 1 block each) on a 32GB RAM machine. I am being conservative here. You can work your way up from these numbers. The assumption here is that the Namenode is running on a 64-bit JVM.

-Bharath

From: Stuart Smith <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Wednesday, February 2, 2011 7:32 PM
Subject: Re: HDFS without Hadoop: Why?

> Stuart - if Dhruba is giving hdfs file and block sizes used by the namenode, you really cannot get a more authoritative number elsewhere :)

Yes - very true! :) I spaced out on the name there ... ;)

One more thing - I believe that if you're storing a lot of your smaller files in hbase, you'll end up with far fewer files on hdfs, since several of your smaller files will end up in one HFile?

I'm storing 5-7 million files, with at least 70-80% ending up in hbase. I only have 16 GB of RAM for my name-node, and it's very far from overloading the memory. Off the top of my head, I think it's << 8 GB of RAM used...

Take care,
 -stu
-
HDFS without Hadoop: Why? Nathan Rutman 2011-01-25, 20:37
I have a very general question on the usefulness of HDFS for purposes other than running distributed compute jobs for Hadoop. Hadoop and HDFS seem very popular these days, but the use of HDFS for other purposes (database backend, records archiving, etc) confuses me, since there are other free distributed filesystems out there (I personally work on Lustre), with significantly better general-purpose performance.

So please tell me if I'm wrong about any of this. Note I've gathered most of my info from documentation rather than reading the source code.

As I understand it, HDFS was written specifically for Hadoop compute jobs, with the following design factors in mind:

- write-once-read-many (WORM) access model
- use commodity hardware with relatively high failure rates (i.e. assumptive failures)
- long, sequential streaming data access
- large files
- hardware/OS agnostic
- moving computation is cheaper than moving data

While appropriate for processing many large-input Hadoop data-processing jobs, there are significant penalties to be paid when trying to use these design factors for more general-purpose storage:

- Commodity hardware requires data replication for safety. The HDFS implementation has three penalties: storage redundancy, network loading, and blocking writes. By default, HDFS blocks are replicated 3x: local, "nearby", and "far away" to minimize the impact of data center catastrophe. In addition to the obvious 3x cost for storage, the result is that every data block must be written "far away" - exactly the opposite of the "Move Computation to Data" mantra. Furthermore, these over-network writes are synchronous; the client write blocks until all copies are complete on disk, with the longest latency path of 2 network hops plus a disk write gating the overall write speed. Note that while this would be disastrous for a general-purpose filesystem, with true WORM usage it may be acceptable to penalize writes this way.
- Large block size implies fewer files. HDFS reaches limits in the tens of millions of files.
- Large block size wastes space for small files. The minimum file size is 1 block.
- There is no data caching. When delivering large contiguous streaming data, this doesn't matter. But when the read load is random, seeky, or partial, this is a missing high-impact performance feature.
- In a WORM model, changing a small part of a file requires all the file data to be copied, so e.g. database record modifications would be very expensive.
- There are no hardlinks, softlinks, or quotas.
- HDFS isn't directly mountable, and therefore requires a non-standard API to use. (FUSE workaround exists.)
- Java source code is very portable and easy to install, but not very quick.
- Moving computation is cheaper than moving data. But the data nonetheless always has to be moved: either read off of a local hard drive or read over the network into the compute node's memory. It is not necessarily the case that reading a local hard drive is faster than reading a distributed (striped) file over a fast network. Commodity network (e.g. 1GigE), probably yes. But a fast (and expensive) network (e.g. 4xDDR Infiniband) can deliver data significantly faster than a local commodity hard drive.

If I'm missing other points, pro- or con-, I would appreciate hearing them. Note again I'm not questioning the success of HDFS in achieving those stated design choices, but rather trying to understand HDFS's applicability to other storage domains beyond Hadoop.

Thanks for your time.
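The last point, and the 3x storage cost, can be put into rough numbers. The sketch below is only a back-of-envelope aid, not a benchmark: the 100 MB/s disk rate is a placeholder assumption, the network figures are idealized line rates (1 Gbit/s for 1GigE, 16 Gbit/s of data for 4x DDR InfiniBand), and protocol overhead and contention are ignored. The names `RATES_MB_PER_S`, `read_seconds`, and `raw_storage_mb` are made up for illustration.

```python
# Idealized peak rates in MB/s. Real deployments will see lower numbers.
RATES_MB_PER_S = {
    "local_disk": 100.0,              # ASSUMPTION: placeholder rate for one commodity drive
    "1GigE": 1_000 / 8,               # 1 Gbit/s line rate -> 125 MB/s
    "4xDDR_InfiniBand": 16_000 / 8,   # 16 Gbit/s data rate -> 2000 MB/s
}


def read_seconds(size_mb, source, rates=RATES_MB_PER_S):
    """Seconds to stream size_mb megabytes from the named source at its peak rate."""
    return size_mb / rates[source]


def raw_storage_mb(data_mb, replication=3):
    """Raw capacity consumed once every block is stored `replication` times."""
    return data_mb * replication
```

With the placeholder disk rate, 1GigE and a single drive land in the same range, so which one wins depends on the actual drive and on network contention; the idealized 4x DDR link is well over ten times faster than either.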
-
Re: HDFS without Hadoop: Why? Sean Bigdatafun 2011-02-01, 02:34
I feel this is a great discussion, so let's think of HDFS' customers.

(1) MapReduce --- definitely a perfect fit, as Nathan has pointed out.

(2) HBase --- it seems HBase (Bigtable's log structured file) did a great job on this. The solution comes out of Google, it must be right. But would Google necessarily have chosen this approach in its Bigtable system had GFS not existed in the first place? I.e., can we have an alternative 'best' approach?

Anything else? I do not think HDFS is a good file system choice for enterprise applications.

On Tue, Jan 25, 2011 at 12:37 PM, Nathan Rutman <[EMAIL PROTECTED]> wrote:
> [...]
> - Commodity hardware requires data replication for safety. The HDFS
> implementation has three penalties: storage redundancy, network loading, and
> blocking writes. By default, HDFS blocks are replicated 3x: local,
> "nearby", and "far away" to minimize the impact of data center catastrophe.
> In addition to the obvious 3x cost for storage, the result is that every
> data block must be written "far away" - exactly the opposite of the "Move
> Computation to Data" mantra. Furthermore, these over-network writes are
> synchronous; the client write blocks until all copies are complete on disk,
> with the longest latency path of 2 network hops plus a disk write gating the
> overall write speed. Note that while this would be disastrous for a
> general-purpose filesystem, with true WORM usage it may be acceptable to
> penalize writes this way.

Facebook seems to have a more cost-effective way to do replication, but I am not sure about its MapReduce performance -- at the end of the day, there are only two 'proper' map slot machines that can host a 'cheap' mapper operation.

> - Large block size implies fewer files. HDFS reaches limits in the
> tens of millions of files.
> - Large block size wastes space for small file. The minimum file size
> is 1 block.
> - There is no data caching. When delivering large contiguous streaming
> data, this doesn't matter. But when the read load is random, seeky, or
> partial, this is a missing high-impact performance feature.

Yes, can anyone answer this question? -- I want to ask the same question as well.

> - In a WORM model, changing a small part of a file requires all the
> file data to be copied, so e.g. database record modifications would be very
> expensive.

Yes, can anyone answer this question?

> - There are no hardlinks, softlinks, or quotas.
> - HDFS isn't directly mountable, and therefore requires a non-standard
> API to use. (FUSE workaround exists.)
> - Java source code is very portable and easy to install, but not very
> quick.
> - Moving computation is cheaper than moving data. But the data [...]
> local hard drive is faster than reading a distributed (striped) file over a
> fast network.

[...] probably Infiniband as well as 10GigE network. And this is why I feel it might not be a good strategy for HBase to tie its design entirely to HDFS.
-
Re: HDFS without Hadoop: Why? Nathan Rutman 2011-02-01, 03:51
On Mon, Jan 31, 2011 at 6:34 PM, Sean Bigdatafun <[EMAIL PROTECTED]> wrote:
> I feel this is a great discussion, so let's think of HDFS' customers.
> (1) MapReduce --- definitely a perfect fit as Nathan has pointed out

I would add the caveat that this depends on your particular weighting factors of performance, ease of setup, hardware type, sysadmin sophistication, failure scenarios, and total cost of ownership. And that cost is a non-linear function of scale. It's not true that HDFS is always the best choice even for MapReduce.

> (2) HBase --- it seems HBase (Bigtable's log structured file) did a great
> job on this. The solution comes out of Google, it must be right.

I think this attitude is a major factor in why people choose HBase (and HDFS). But Google also sits at a particular point on the many-dimensional factor space I alluded to above. Best for Google does not mean best for everyone.

> But would Google necessarily have chosen this approach in its Bigtable
> system had GFS not existed in the first place? I.e., can we have an
> alternative 'best' approach?

I bet you can guess my answer :)

> Anything else? I do not think HDFS is a good file system choice for
> enterprise applications.

Certainly not for most.

> On Tue, Jan 25, 2011 at 12:37 PM, Nathan Rutman <[EMAIL PROTECTED]> wrote:
>> [...]
>> Large block size implies fewer files. HDFS reaches limits in the tens of
>> millions of files.
>> Large block size wastes space for small file. The minimum file size is 1
>> block.
>> There is no data caching. When delivering large contiguous streaming
>> data, this doesn't matter. But when the read load is random, seeky, or
>> partial, this is a missing high-impact performance feature.

I talked to one of the principal HDFS designers, and he agreed with me on all these points...

>> There are no hardlinks, softlinks, or quotas.

... except that one. HDFS now does softlinks and quotas.

>> [...]

I've proved this to my own satisfaction with a simple TestDFSIO benchmark on HDFS and Lustre. I posted the results in another thread here.
-
Re: HDFS without Hadoop: Why? Jeff Hammerbacher 2011-02-02, 23:31
> - Large block size wastes space for small file. The minimum file size
> is 1 block.

That's incorrect. If a file is smaller than the block size, it will only consume as much space as there is data in the file.

> - There are no hardlinks, softlinks, or quotas.

That's incorrect; there are quotas and softlinks.
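The first correction can be illustrated with a little arithmetic. This is a minimal sketch under stated assumptions, not HDFS code: it assumes a 64 MB block size and the 3x replication mentioned earlier in the thread, and the function names `raw_bytes_on_disk` and `block_count` are hypothetical.

```python
BLOCK_BYTES = 64 * 2**20   # ASSUMPTION: 64 MB configured block size
REPLICATION = 3            # 3x replication, as described earlier in the thread


def raw_bytes_on_disk(file_bytes, replication=REPLICATION):
    """Disk space consumed: the data actually written, once per replica.

    A file smaller than a block is NOT padded out to the block size.
    """
    return file_bytes * replication


def block_count(file_bytes, block_bytes=BLOCK_BYTES):
    """Number of blocks the file is split into (ceiling division)."""
    return -(-file_bytes // block_bytes)
```

Under these assumptions a 1 MB file costs 3 MB of raw disk rather than 192 MB. What a small file does still cost is one file entry and one block entry in Namenode memory, which is the limit discussed later in the thread.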
-
RE: HDFS without Hadoop: Why? Dhodapkar, Chinmay 2011-02-03, 00:28
Hello,

I have been following this thread for some time now. I am very comfortable with the advantages of hdfs, but still have lingering questions about the usage of hdfs for general purpose storage (no mapreduce/hbase etc).

Can somebody shed light on what the limitations are on the number of files that can be stored? Is it limited in any way by the namenode? The use case I am interested in is to store a very large number of relatively small files (1MB to 25MB).

Interestingly, I saw a facebook presentation on how they use hbase/hdfs internally. They seem to store all metadata in hbase and the actual images/files/etc in something called "haystack" (why not use hdfs since they already have it?). Anybody know what "haystack" is?

Thanks!
Chinmay
-
Re: HDFS without Hadoop: Why? Ian Holsman 2011-02-03, 00:38
Haystack is described here:
http://www.facebook.com/note.php?note_id=76191543919

Regards
Ian

---
Ian Holsman
AOL Inc
[EMAIL PROTECTED]
(703) 879-3128 / AIM:ianholsman

it's just a technicality
-
RE: HDFS without Hadoop: Why? Stuart Smith 2011-02-03, 00:40
Hello,

I'm actually using hbase/hadoop/hdfs for lots of small files (with a long tail of larger files). Well, millions of small files - I don't know what you mean by lots :)

Facebook probably knows better, but what I do is:

- store metadata in hbase
- files smaller than 10 MB or so in hbase
- larger files in a hdfs directory tree.

I started storing 64 MB files and smaller in hbase (chunk size), but that causes issues with regionservers when running M/R jobs. This is related to the fact that I'm running a cobbled-together cluster & my region servers don't have that much memory. I would play with the size to see what works for you.

Take care,
 -stu
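The size-based split described above could be sketched roughly as follows. This is only an illustration of the routing rule, not the poster's actual code: `choose_store` and `HBASE_MAX_BYTES` are hypothetical names, the 10 MB cutoff is taken from the "10 MB or so" in the message, and the real writes to HBase or HDFS are left out entirely.

```python
HBASE_MAX_BYTES = 10 * 2**20   # "10 MB or so"; tune this for your own cluster


def choose_store(file_bytes, hbase_max_bytes=HBASE_MAX_BYTES):
    """Return which store a file body should go to, based only on its size.

    Metadata is assumed to go to HBase in every case; this decides
    where the file contents themselves are written.
    """
    if file_bytes < hbase_max_bytes:
        return "hbase"
    return "hdfs"
```

Keeping the threshold as a parameter matches the advice in the message: the right cutoff depends on regionserver memory, so it is worth experimenting with rather than hard-coding.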
-
RE: HDFS without Hadoop: Why? Dhodapkar, Chinmay 2011-02-03, 01:11
What you describe is pretty much my use case as well. Since I don’t know how big the number of files could get, I am trying to figure out if there is a theoretical design limitation in hdfs…

From what I have read, the name node will store all metadata of all files in RAM. Assuming (in my case) that a file is less than the configured block size, there should be a very rough formula that can be used to calculate the max number of files that hdfs can serve based on the configured RAM on the name node?

Can any of the implementers comment on this? Am I even thinking on the right track…?

Thanks Ian for the haystack link…very informative indeed.

-Chinmay
-
Re: HDFS without Hadoop: Why? Dhruba Borthakur 2011-02-03, 02:00
The Namenode uses around 160 bytes/file and 150 bytes/block in HDFS. This is a very rough calculation.

dhruba

--
Connect to me at http://www.facebook.com/dhruba
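Those two figures give the "very rough formula" asked for earlier in the thread. The sketch below simply multiplies them out; it counts only file and block objects and ignores every other Namenode structure and any JVM overhead, so treat the results as a lower bound on memory and an upper bound on file count. The function names are made up for illustration.

```python
BYTES_PER_FILE = 160    # rough per-file figure quoted above
BYTES_PER_BLOCK = 150   # rough per-block figure quoted above


def namenode_bytes(num_files, blocks_per_file=1):
    """Approximate Namenode memory for file and block objects only."""
    return num_files * (BYTES_PER_FILE + blocks_per_file * BYTES_PER_BLOCK)


def max_files(memory_bytes, blocks_per_file=1):
    """Largest file count whose file and block objects fit in memory_bytes."""
    return memory_bytes // (BYTES_PER_FILE + blocks_per_file * BYTES_PER_BLOCK)
```

For example, with every file smaller than one block, 10 million files come to about 3.1 GB of file and block objects under these figures.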
-
Re: HDFS without Hadoop: Why? Stuart Smith 2011-02-03, 02:08
This is the best coverage I've seen from a source that would know:

http://developer.yahoo.com/blogs/hadoop/posts/2010/05/scalability_of_the_hadoop_dist/

One relevant quote:

To store 100 million files (referencing 200 million blocks), a name-node should have at least 60 GB of RAM.

But, honestly, if you're just building out your cluster, you'll probably run into a lot of other limits first: hard drive space, regionserver memory, the infamous ulimit/xciever :), etc...

Take care,
 -stu
Stuart Smith 2011-02-03, 02:08
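For anyone wanting to try the split Stuart describes (small files in hbase, larger ones in an hdfs directory tree), here is a minimal sketch of the routing decision only. The function name `choose_store` and the exact 10 MB constant are assumptions made up for illustration; the thread only says "10 MB or so" and suggests tuning the size for your own cluster. Nothing here talks to hbase or hdfs.

```python
# Hypothetical size-based routing, loosely following the thresholds in the thread.
# The cutoff is an assumption ("10 MB or so"); tune it for your region servers.
SMALL_FILE_LIMIT_BYTES = 10 * 1024 * 1024


def choose_store(size_bytes: int, limit: int = SMALL_FILE_LIMIT_BYTES) -> str:
    """Return "hbase" for small payloads and "hdfs" for larger ones."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    return "hbase" if size_bytes < limit else "hdfs"


if __name__ == "__main__":
    for size in (4 * 1024, 9 * 1024 * 1024, 64 * 1024 * 1024):
        print(size, "->", choose_store(size))
```

Keeping the cutoff as a parameter makes it easy to experiment, which matches the advice above to play with the size until the regionservers are comfortable.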
-
Re: HDFS without Hadoop: Why? Gaurav Sharma 2011-02-03, 02:31
Stuart - if Dhruba is giving hdfs file and block sizes used by the namenode, you really cannot get a more authoritative number elsewhere :) I would do the back-of-envelope with ~160 bytes/file and ~150 bytes/block.
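The back-of-envelope arithmetic can be sketched as follows. This is only an illustration using the two rough per-object figures quoted in this thread (~160 bytes/file, ~150 bytes/block); the function names are made up, and the result ignores the other namenode structures and JVM headroom, so treat it as a lower bound rather than a sizing rule.

```python
# Rough namenode memory estimate from the per-object figures quoted in the thread.
BYTES_PER_FILE = 160   # "very rough" figure from the thread
BYTES_PER_BLOCK = 150  # "very rough" figure from the thread


def namenode_heap_bytes(num_files: int, blocks_per_file: float = 1.0) -> float:
    """Estimate bytes used by file and block objects only."""
    num_blocks = num_files * blocks_per_file
    return num_files * BYTES_PER_FILE + num_blocks * BYTES_PER_BLOCK


def max_files_for_heap(heap_bytes: float, blocks_per_file: float = 1.0) -> int:
    """Invert the estimate: how many files fit in a given amount of memory."""
    per_file = BYTES_PER_FILE + blocks_per_file * BYTES_PER_BLOCK
    return int(heap_bytes // per_file)


if __name__ == "__main__":
    gb = 10**9
    # The case from the quoted article: 100 million files, 200 million blocks.
    estimate = namenode_heap_bytes(100_000_000, blocks_per_file=2)
    print(f"{estimate / gb:.1f} GB")  # 46.0 GB
    # Small files, one block each, on a 16 GB namenode.
    print(max_files_for_heap(16 * gb))
```

For 100 million files with two blocks each this gives about 46 GB, which is in the same range as, but below, the "at least 60 GB" quoted above. The thread does not say how that figure was derived, so the gap is best read as margin for everything the two constants leave out.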
-
Re: HDFS without Hadoop: Why? Stuart Smith 2011-02-03, 03:32
> Stuart - if Dhruba is giving hdfs file and block sizes used by the namenode, you really cannot get a more authoritative number elsewhere :)

Yes - very true! :) I spaced out on the name there ... ;)

One more thing - I believe that if you're storing a lot of your smaller files in hbase, you'll end up with a lot fewer files on hdfs, since several of your smaller files will end up in one HFile??

I'm storing 5-7 million files, with at least 70-80% ending up in hbase. I only have 16 GB of RAM for my name-node, and it's very far from overloading the memory. Off the top of my head, I think it's << 8 GB of RAM used...

Take care,
-stu
-
Re: HDFS without Hadoop: Why? Konstantin Shvachko 2011-02-03, 02:42
Thanks for the link Stu.
More details on the limitations are here:
http://www.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf

I think that Nathan raised an interesting question and his assessment of HDFS use cases is generally right. Some assumptions, though, are outdated at this point, as people have mentioned in the thread. We have an append implementation, which allows reopening files for updates. We also have symbolic links and quotas (space and name-space). The API to HDFS is not POSIX, true. But in addition to FUSE, people also use Thrift to access HDFS. Most of these features are explained in the HDFS overview paper:
http://storageconference.org/2010/Papers/MSST/Shvachko.pdf

Stand-alone HDFS is actually used in several places. I like what Brian Bockelman at the University of Nebraska does. They store CERN data in their cluster, and physicists use Fortran to access the data, not map-reduce, as I heard.
http://storageconference.org/2010/Presentations/MSST/3.Bockelman.pdf

With respect to other distributed file systems: HDFS performance was compared to PVFS, GPFS and Lustre. The results were in favor of HDFS. See e.g.
http://www.cs.cmu.edu/~wtantisi/files/hadooppvfs-pdl08.pdf

So I agree with Nathan that HDFS was designed and optimized as a storage layer for map-reduce type tasks, but it performs well as a general purpose fs as well.

Thanks,
--Konstantin
-
Re: HDFS without Hadoop: Why? Nathan Rutman 2011-02-03, 18:48
On Feb 2, 2011, at 6:42 PM, Konstantin Shvachko wrote:

> Thanks for the link Stu.
> More details are on limitations are here:
> http://www.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf
>
> I think that Nathan raised an interesting question and his assessment of HDFS use
> cases are generally right.
> Some assumptions though are outdated at this point.
> And people mentioned about it in the thread.
> We have append implementation, which allows reopening files for updates.
> We also have symbolic links and quotas (space and name-space).
> The api to HDFS is not posix, true. But in addition to Fuse people also use
> Thrift to access hdfs.
> Most of these features are explained in HDFS overview paper:
> http://storageconference.org/2010/Papers/MSST/Shvachko.pdf
>
> Stand-alone HDFS is actually used in several places. I like what
> Brian Bockelman at University of Nebraska does.
> They store CERN data in their cluster, and physicists use Fortran to access the data,
> not map-reduce, as I heard.
> http://storageconference.org/2010/Presentations/MSST/3.Bockelman.pdf

This doesn't seem to mention what storage they're using.

> With respect to other distributed file systems. HDFS performance was compared to
> PVFS, GPFS and Lustre. The results were in favor of HDFS. See e.g.

PVFS
> http://www.cs.cmu.edu/~wtantisi/files/hadooppvfs-pdl08.pdf

Some other references for those interested:

HDFS vs GPFS
Cloud analytics: Do we really need to reinvent the storage stack?
http://www.usenix.org/event/hotcloud09/tech/full_papers/ananthanarayanan.pdf

Lustre
http://wiki.lustre.org/images/1/1b/Hadoop_wp_v0.4.2.pdf

Ceph
http://www.usenix.org/publications/login/2010-08/openpdfs/maltzahn.pdf

These GPFS and Lustre papers were both favorable toward HDFS because they missed a fundamental issue: for the former FS's, network speed is critical. HDFS doesn't need network on reads (ideally), and so is simultaneously immune to network speed, but also cannot take advantage of network speed. For slow networks (1GigE) this plays into HDFS's strength, but for fast networks (10GigE, Infiniband), the balance tips the other way. (My testing: for a heavily loaded network, a 3-4x read speed factor for Lustre. For writes, the difference is even more extreme (10x), since HDFS has to hop all write data over the network twice.)

Let me say clearly that your choice of FS should depend on which of many factors are most important to you -- there is no "one size fits all", although that sadly makes our decisions more complex. For those using Hadoop that have a high weighting on IO performance (as well as some other factors I listed in my original mail), I suggest you at least think about spending money on a fast network and using a FS that can utilize it.

> So I agree with Nathan HDFS was designed and optimized as a storage layer for
> map-reduce type tasks, but it performs well as a general purpose fs as well.
>
> Thanks,
> --Konstantin
-
Re: HDFS without Hadoop: Why? Konstantin Shvachko 2011-02-03, 20:24
Nathan,

Great references. There is a good place to put them:
http://wiki.apache.org/hadoop/HDFS_Publications
The GPFS and Lustre papers are not there yet, I believe.

Thanks,
--Konstantin
-
RE: HDFS without Hadoop: Why? Scott Golby 2011-01-25, 22:05
Hi Nathan,

> I have a very general question on the usefulness of HDFS for purposes other than running distributed compute jobs for Hadoop. Hadoop and HDFS seem very popular these days, but the use of
> HDFS for other purposes (database backend, records archiving, etc) confuses me, since there are other free distributed filesystems out there (I personally work on Lustre), with significantly better general-purpose performance.

Coming at it from a SysAdmin point of view, I could say these things about HDFS:

- It's very easy to set up, basically 3 steps: Build Box, Format NameNode, Start. It's easy to be up and running in Production mode HDFS in <4 hours with a 5+ machine cluster. (How big that "+" is depends on your node deployment method, Kickstart, VMware, EC2 images, more than on the complexity of HDFS setup.)
- At least in Hadoop 0.20 it lacks a lot of "Ops" features, such as getting a simple "ls" through an API/hadoop, as you mention.
- The 3x Data seems very wasteful to me.
- Multidisk deployments are difficult for monitoring to deal with. Adding a disk is easy: one xml file and a ",/new/disk_path". But the 1st disk will go to 100% and stay there, while the new disk is perhaps 10% full (if you added the new disk when the old one was 90% full). They fill at equal rates. There is no easy way to balance the disks within the machine (the balance command works only across machines), so your Ops guys will get the "Disk 100% full" page 24x7 until they kill someone. :)

I haven't done any benchmarking, but my expectations for speed aren't high with HDFS.

Scott
-
Re: HDFS without Hadoop: Why? Gerrit Jansen van Vuuren 2011-01-25, 23:56
Hi,

Why would 3x data seem wasteful? This is exactly what you want. I would never store any serious business data without some form of replication. What happens if you store a single file on a single server without replicas and that server goes, or just the disk that the file is on goes? HDFS and any decent distributed file system use replication to prevent data loss. As a side effect, having replicas of the same piece of data on separate servers means that more than one task can work on that data in parallel.

I do agree that balancing data between disks on the same node is lacking currently. Filling up a single disk in hdfs and adding a new one is not that big a problem. Each DataNode will save blocks on the new disk and not just keep on filling the already full disk. For data durability this is also not a problem, because with 3x replication you will have 1 replica on the local node, 1 in the same rack plus another off rack.

With hadoop fs -ls <file> you get a listing of files. Yes, it's lacking, but I've found that if you use awk you can get the columns you want. It's a bit tedious, I know.

Cheers,
Gerrit
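As an illustration of pulling columns out of a listing, here is a small Python sketch that does roughly what the awk approach does. It assumes the usual eight-column layout of `hadoop fs -ls` output (permissions, replication, owner, group, size, date, time, path); check that against the output of your own Hadoop version. The sample listing and the names `LsEntry` and `parse_ls` are made up for the example.

```python
from typing import Iterator, NamedTuple


class LsEntry(NamedTuple):
    permissions: str
    size_bytes: int
    path: str


def parse_ls(output: str) -> Iterator[LsEntry]:
    """Yield one entry per listing line, assuming an eight-column layout."""
    for line in output.splitlines():
        # Split into at most 8 fields so a path containing spaces stays whole.
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # skips blank lines and the "Found N items" header
        perms, _repl, _owner, _group, size, _date, _time, path = fields
        yield LsEntry(perms, int(size), path)


# Illustrative sample only; not captured from a real cluster.
SAMPLE = """Found 2 items
-rw-r--r--   3 hadoop supergroup       1234 2011-01-25 22:05 /user/hadoop/a.txt
drwxr-xr-x   - hadoop supergroup          0 2011-01-25 22:05 /user/hadoop/logs
"""

if __name__ == "__main__":
    for entry in parse_ls(SAMPLE):
        print(entry.size_bytes, entry.path)
```

In practice the text would come from running the command and capturing its standard output; the parsing is kept separate here so it can be tried without a cluster.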
-
Re: HDFS without Hadoop: Why? Nathan Rutman 2011-01-26, 00:32
On Jan 25, 2011, at 3:56 PM, Gerrit Jansen van Vuuren wrote:

> Hi,
>
> Why would 3x data seem wasteful?
> This is exactly what you want. I would never store any serious business data without some form of replication.

I agree that you want data backup, but 3x replication is the least efficient / most expensive (space-wise) way to do it. This is what RAID was invented for: RAID 6 gives you fault tolerance against loss of any two drives, for only 20% disk space overhead. (Sorry, I see I forgot to note this in my original email, but that's what I had in mind.) RAID is also not necessarily $ expensive either; Linux MD RAID is free and effective.

> What happens if you store a single file on a single server without replicas and that server goes, or just the disk on that the file is on goes ? HDFS and any decent distributed file system uses replication to prevent data loss. As a side affect having the same replica of a data piece on separate servers means that more than one task can work on the server in parallel.

Indeed, replicated data does mean Hadoop could work on the same block on separate nodes. But outside of Hadoop compute jobs, I don't think this is useful in general. And in any case, a distributed filesystem would let you work on the same block of data from however many nodes you wanted.
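The space comparison above can be put into numbers with a short sketch. It assumes a RAID 6 layout of 8 data drives plus 2 parity drives per group, which is one layout that yields the 20% figure (2 of every 10 raw drives hold parity); the thread does not say which layout was meant, and the function names are made up. It compares raw drive counts only and says nothing about node failure or performance.

```python
def replication_raw_drives(data_drives: int, replicas: int = 3) -> int:
    """Raw drives needed to hold data_drives worth of data with N-way replication."""
    return data_drives * replicas


def raid6_raw_drives(data_drives: int, group_data_drives: int = 8) -> int:
    """Raw drives for RAID 6, assuming two parity drives per group of data drives."""
    groups = -(-data_drives // group_data_drives)  # ceiling division
    return data_drives + 2 * groups


def overhead_fraction(raw_drives: int, data_drives: int) -> float:
    """Share of raw capacity that does not hold unique data."""
    return (raw_drives - data_drives) / raw_drives


if __name__ == "__main__":
    data = 8
    for name, raw in (
        ("3x replication", replication_raw_drives(data)),
        ("RAID 6 (8+2, assumed)", raid6_raw_drives(data)),
    ):
        print(f"{name}: {raw} drives, {overhead_fraction(raw, data):.0%} overhead")
```

With 8 drives' worth of data this works out to 24 raw drives for 3x replication against 10 for the assumed 8+2 RAID 6 group.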
-
Re: HDFS without Hadoop: Why? stu24mail@... 2011-01-26, 01:08
I don't think, as a recovery strategy, RAID scales to large amounts of data. Even as some kind of attached storage device (e.g. Vtrack), you're only talking about a few terabytes of data, and it doesn't tolerate node failure.

A key part of hdfs is the distributed part.

Best,
-stu
-
Re: HDFS without Hadoop: Why? Nathan Rutman 2011-01-26, 01:31
On Jan 25, 2011, at 5:08 PM, [EMAIL PROTECTED] wrote:

> I don't think, as a recovery strategy, RAID scales to large amounts of data. Even as some kind of attached storage device (e.g. Vtrack), you're only talking about a few terabytes of data, and it doesn't tolerate node failure.

When talking about large amounts of data, 3x redundancy absolutely doesn't scale. Nobody is going to pay for 3 petabytes worth of disk if they only need 1 PB worth of data. This is where dedicated high-end raid systems come in (this is in fact what my company, Xyratex, builds). Redundant controllers, battery backup, etc. The incremental cost for an additional drive in such systems is negligible.

> A key part of hdfs is the distributed part.

Granted, single-point-of-failure arguments are valid when concentrating all the storage together, but can be generally dealt with using hardware and software failover techniques.

The scale argument in my mind is exactly reversed -- HDFS works fine for smaller installations that can't afford RAID hardware overhead and access redundancy, and where buying 30 drives instead of 10 is an acceptable cost for the simplicity of HDFS setup.

> Best,
> -stu
-
Re: HDFS without Hadoop: Why? stu24mail@... 2011-01-26, 03:58
My point was it's not RAID or whatever versus HDFS. HDFS is a distributed file system that solves different problems.

HDFS is a file system. It's like asking "NTFS or RAID?"

>but can be generally dealt with using hardware and software failover techniques.

Like hdfs.

Best,
-stu
-
Re: HDFS without Hadoop: Why? Dhruba Borthakur 2011-01-26, 05:54
Hi Nathan,

we are using HDFS-RAID for our 30 PB cluster. Most datasets have a replication factor of 2.2 and a few datasets have a replication factor of 1.4. Some details here:

http://wiki.apache.org/hadoop/HDFS-RAID
http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html

thanks,
dhruba

--
Connect to me at http://www.facebook.com/dhruba
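Fractional replication factors such as 2.2 and 1.4 arise when parity blocks are stored alongside data blocks. The sketch below shows the arithmetic. The stripe parameters are assumptions picked because they reproduce the two quoted figures (a 10-block stripe with one parity block and everything kept twice, and a 10-block stripe with four parity blocks and everything kept once); the linked pages are the authority on what the cluster actually runs.

```python
def effective_replication(stripe_len, parity_blocks, data_repl, parity_repl):
    """Raw blocks stored per logical data block for one parity-coded stripe."""
    raw_blocks = stripe_len * data_repl + parity_blocks * parity_repl
    return raw_blocks / stripe_len


if __name__ == "__main__":
    # Assumed layout A: 10 data blocks + 1 parity block, each stored twice.
    print(effective_replication(10, 1, data_repl=2, parity_repl=2))
    # Assumed layout B: 10 data blocks + 4 parity blocks, each stored once.
    print(effective_replication(10, 4, data_repl=1, parity_repl=1))
    # Plain 3x replication, for comparison (no parity blocks).
    print(effective_replication(10, 0, data_repl=3, parity_repl=0))
```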
-
Re: HDFS without Hadoop: Why? Gerrit Jansen van Vuuren 2011-01-26, 09:59
Hi,

For true data durability RAID is not enough.
The conditions I operate on are the following:

(1) Data loss is not acceptable under any terms.
(2) Data unavailability is not acceptable under any terms for any period of time.
(3) Data loss for certain data sets becomes a legal issue and is again not acceptable, and might lead to loss of my employment.
(4) Having 2 nodes fail in a month on average is to be expected for the volumes we operate, i.e. 100 to 400 nodes per cluster.
(5) Having a data centre outage once a year is to be expected. (We've already had one this year.)

A word on node failure: Nodes do not just fail because of disks; any component can fail, e.g. RAM, network card, SCSI controller, CPU etc.

Now data loss or unavailability can happen under the following conditions:
(1) Single or multiple disk failure
(2) Node failure (a whole U goes down)
(3) Rack failure
(4) Data centre failure

RAID covers (1) but I do not know of any RAID setup that will cover the rest.
HDFS with 3-way replication covers 1, 2, and 3 but not 4.
HDFS 3-way replication with replication (via distcp) across data centres covers 1-4.

The question to ask the business is how valuable the data in question is to them. If they go RAID and only cover (1), they should be asked if it's acceptable to have data unavailable, with the possibility of permanent data loss, at any point in time for any amount of data for any amount of time.
If they come back to you and say yes, we accept that if a node fails we lose data or that it becomes unavailable for any period of time, then yes, go for RAID. If the answer is NO, you need replication. Even DBAs understand this, and that's why for DBs we back up, replicate and load-balance/fail over. Why should we not do the same for critical business data on file storage?

We run all of our nodes non-RAIDed (JBOD), because having 3 replicas means you don't require extra replicas on the same disk or node.

Yes, it's true that any distributed file system will make data available to any number of nodes, but this was not my point earlier. Having data replicas on multiple nodes means that data can be worked on in parallel on multiple physical nodes without needing to read/copy the data from a single node.

Cheers,
Gerrit
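The coverage claims in Gerrit's message can be written down as a small lookup table, which makes it easy to ask what a given strategy leaves exposed. This is only a restatement of the message's own list; the dictionary keys and the `uncovered` helper are illustrative names, not part of any tool.

```python
# Failure domains, in the order listed in the message.
FAILURES = ["disk", "node", "rack", "data centre"]

# Which failure domains each strategy covers, as claimed in the message.
COVERAGE = {
    "RAID": {"disk"},
    "HDFS 3x replication": {"disk", "node", "rack"},
    "HDFS 3x + cross-DC distcp": {"disk", "node", "rack", "data centre"},
}


def uncovered(strategy):
    """Return the failure domains a strategy does not protect against."""
    return [f for f in FAILURES if f not in COVERAGE[strategy]]


if __name__ == "__main__":
    for name in COVERAGE:
        print(f"{name}: exposed to {uncovered(name) or 'nothing listed'}")
```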
-
Re: HDFS without Hadoop: Why? Gerrit Jansen van Vuuren 2011-01-26, 15:26
The smallest size in HDFS is not the blocksize. The blocksize is an upper limit, but if you store smaller files they will not take up extra space. HDFS is not meant for fast random access but is built specifically for large files and sequential access.
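To make the block-size point concrete, here is a minimal sketch of how a file maps onto blocks when the block size is only an upper limit. The 64 MB block size is an assumed setting for illustration. The per-file and per-block namenode figures (~160 and ~150 bytes) are the back-of-envelope numbers given elsewhere in this thread, not measured values, and the function names are hypothetical.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size for this sketch


def block_layout(file_size, block_size=BLOCK_SIZE):
    """Return (number of blocks, bytes stored per replica) for one file.

    The last block holds only the remaining bytes, so a small file
    occupies its own size on disk, not a full block.
    """
    if file_size == 0:
        return 0, 0
    n_blocks = -(-file_size // block_size)  # ceiling division
    return n_blocks, file_size


def namenode_bytes(n_files, blocks_per_file=1,
                   bytes_per_file=160, bytes_per_block=150):
    """Rough namenode heap estimate using the thread's per-object figures."""
    return n_files * (bytes_per_file + blocks_per_file * bytes_per_block)


if __name__ == "__main__":
    print(block_layout(1024))            # a 1 KB file
    print(block_layout(BLOCK_SIZE + 1))  # just over one block
    print(namenode_bytes(10_000_000) / 1e9, "GB for 10 million 1-block files")
```

On these assumptions the cost of many small files shows up in namenode memory rather than in wasted disk space.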
-
Re: HDFS without Hadoop: Why? Nathan Rutman 2011-01-26, 17:41
Ok. Is your statement, "I use HDFS for general-purpose data storage because it does this replication well", or is it more, "the most important benefit of using HDFS as the Map-Reduce or HBase backend fs is data safety." In other words, I'd like to relate this back to my original question of the broader usage of HDFS - does it make sense to use HDFS outside of the special application space for which it was designed?
-
Re: HDFS without Hadoop: Why? stu24mail@... 2011-01-27, 03:04
I believe for most people, the answer is "Yes"
-
Re: HDFS without Hadoop: Why? Friso van Vollenhoven 2011-01-26, 09:55
HBase is a database that runs on top of HDFS. So that's another one. It has an append-only usage pattern, which makes it a good fit.

I don't see how not-so-commodity hardware could go without replication to achieve the same as HDFS. It's not only about data safety, but also about availability. HDFS can survive complete machines dying, RAM banks going bad, motherboards going haywire and network partitions going off line because of switch failures. Anything that needs to survive a single box or single rack failure needs replication, I guess.

That said, I think when you have two boxes doing 2PC for writes and each box is in itself set up with redundant storage (RAID or otherwise), you get a faster filesystem that is fully redundant. It would not make a nice fit for MapReduce though.

Friso

On 25 jan 2011, at 21:37, Nathan Rutman wrote:

I have a very general question on the usefulness of HDFS for purposes other than running distributed compute jobs for Hadoop. Hadoop and HDFS seem very popular these days, but the use of HDFS for other purposes (database backend, records archiving, etc) confuses me, since there are other free distributed filesystems out there (I personally work on Lustre), with significantly better general-purpose performance.

So please tell me if I'm wrong about any of this. Note I've gathered most of my info from documentation rather than reading the source code.

As I understand it, HDFS was written specifically for Hadoop compute jobs, with the following design factors in mind:
* write-once-read-many (worm) access model
* use commodity hardware with relatively high failure rates (i.e. assumptive failures)
* long, sequential streaming data access
* large files
* hardware/OS agnostic
* moving computation is cheaper than moving data

While appropriate for processing many large-input Hadoop data-processing jobs, there are significant penalties to be paid when trying to use these design factors for more general-purpose storage:
* Commodity hardware requires data replication for safety. The HDFS implementation has three penalties: storage redundancy, network loading, and blocking writes. By default, HDFS blocks are replicated 3x: local, "nearby", and "far away" to minimize the impact of data center catastrophe. In addition to the obvious 3x cost for storage, the result is that every data block must be written "far away" - exactly the opposite of the "Move Computation to Data" mantra. Furthermore, these over-network writes are synchronous; the client write blocks until all copies are complete on disk, with the longest latency path of 2 network hops plus a disk write gating the overall write speed. Note that while this would be disastrous for a general-purpose filesystem, with true WORM usage it may be acceptable to penalize writes this way.
* Large block size implies fewer files. HDFS reaches limits in the tens of millions of files.
* Large block size wastes space for small files. The minimum file size is 1 block.
* There is no data caching. When delivering large contiguous streaming data, this doesn't matter. But when the read load is random, seeky, or partial, this is a missing high-impact performance feature.
* In a WORM model, changing a small part of a file requires all the file data to be copied, so e.g. database record modifications would be very expensive.
* There are no hardlinks, softlinks, or quotas.
* HDFS isn't directly mountable, and therefore requires a non-standard API to use. (FUSE workaround exists.)
* Java source code is very portable and easy to install, but not very quick.
* Moving computation is cheaper than moving data. But the data nonetheless always has to be moved: either read off of a local hard drive or read over the network into the compute node's memory. It is not necessarily the case that reading a local hard drive is faster than reading a distributed (striped) file over a fast network. Commodity network (e.g. 1GigE), probably yes. But a fast (and expensive) network (e.g. 4xDDR Infiniband) can deliver data significantly faster than a local commodity hard drive.

If I'm missing other points, pro- or con-, I would appreciate hearing them. Note again I'm not questioning the success of HDFS in achieving those stated design choices, but rather trying to understand HDFS's applicability to other storage domains beyond Hadoop.

Thanks for your time.
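The last point in the quoted list, about local disks versus fast networks, comes down to simple transfer-time arithmetic. A minimal sketch follows, using nominal line rates for the two networks named in the message (1 Gbit/s for 1GigE, 16 Gbit/s of data for 4x DDR InfiniBand after encoding overhead). The 100 MB/s disk rate and the 64 MB transfer size are assumptions for illustration only; real sustained rates will be lower than line rate and vary by hardware.

```python
# Nominal data rates in MB/s (decimal megabytes); these are upper bounds.
LINK_MB_PER_S = {
    "1GigE": 1e9 / 8 / 1e6,               # 1 Gbit/s
    "4x DDR InfiniBand": 16e9 / 8 / 1e6,  # 16 Gbit/s data rate
}

ASSUMED_DISK_MB_PER_S = 100.0  # placeholder for one commodity drive


def seconds_to_move(size_mb, rate_mb_per_s):
    """Time to move size_mb at a constant rate, ignoring latency."""
    return size_mb / rate_mb_per_s


if __name__ == "__main__":
    size_mb = 64  # assumed transfer size
    for name, rate in LINK_MB_PER_S.items():
        print(f"{name}: {seconds_to_move(size_mb, rate):.3f} s at line rate")
    print(f"local disk (assumed): "
          f"{seconds_to_move(size_mb, ASSUMED_DISK_MB_PER_S):.3f} s")
```

The sketch only illustrates the order-of-magnitude gap between the two networks; whether a particular network beats a particular local disk depends on sustained rates the thread does not give.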
-
Re: HDFS without Hadoop: Why? Gerrit Jansen van Vuuren 2011-01-27, 11:09
For me it depends on the requirements for the data, but if I'm responsible for it and the data is deemed critical, my choice would be HDFS.

One more example: Currently where I work we are thinking of backing up data for at least 5-7 years, and suddenly all sorts of issues come into play, like disk bit rot with offline backups. Tape?? Not a choice where I'm working. We're evaluating using big Us with 48 x 2 TB disks and 2-core CPU(s) in an HDFS setup (no MapReduce) to just store the data. HDFS will take care of disks failing, and because block checksums get verified on a 3-week cycle, the issue of bit rot is eliminated also.
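The bit-rot protection described in the message rests on storing checksums at write time and re-verifying them periodically. A minimal sketch of that idea follows, using CRC32 over fixed-size chunks. The 512-byte chunk size and the helper names are assumptions of this sketch; it is not HDFS code and leaves out the re-replication step that would follow a detected mismatch.

```python
import zlib

CHUNK = 512  # bytes covered by each checksum; an assumed value


def chunk_checksums(data, chunk=CHUNK):
    """Compute one CRC32 per fixed-size chunk of a block."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def find_corrupt_chunks(data, expected, chunk=CHUNK):
    """Return indexes of chunks whose current CRC32 differs from the stored one."""
    actual = chunk_checksums(data, chunk)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


if __name__ == "__main__":
    block = bytes(range(256)) * 8   # 2048 bytes, i.e. 4 chunks
    stored = chunk_checksums(block)  # saved when the block is written

    rotted = bytearray(block)
    rotted[700] ^= 0x01              # flip a single bit inside chunk 1
    print(find_corrupt_chunks(bytes(rotted), stored))  # -> [1]
```

A periodic scan of this kind only detects damage; recovery still depends on having another good replica to copy from.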