-
using HDFS for a distributed storage system
Amit Chandel 2009-02-09, 04:06
Hi Group,

I am planning to use HDFS as a reliable, distributed file system for batch operations. There are no plans as of now to run any map reduce jobs on top of it, but in the future we will run map reduce operations on files in HDFS.

The current (test) system has 3 machines:
NameNode: dual core CPU, 2GB RAM, 500GB HDD
2 DataNodes: both with a dual core CPU, 2GB of RAM and 1.5TB of space on an ext3 filesystem.

I just need to put and retrieve files from this system. The files I will put in HDFS vary from a few bytes to around 100MB, with an average file size of 5MB, and the number of files will grow to around 20-50 million. To avoid hitting the limit on the number of files in a directory, I store each file at a path derived from the SHA1 hash of its content (which is 20 bytes long, and I create a 10-level-deep path using 2 bytes for each level). When I started the cluster a month back, I set the default block size to 1MB.

The hardware specs mentioned at http://wiki.apache.org/hadoop/MachineScaling consider running map reduce operations, so I am not sure if my setup is good enough. I would like to get input on this setup. The final cluster would have each DataNode with 8GB RAM, a quad core CPU, and 25TB of attached storage.

I played with this setup a little and then planned to increase the disk space on both DataNodes. I started by increasing the disk capacity of the first DataNode to 15TB and changing the underlying filesystem to XFS (which made it a clean DataNode), and put it back in the system. Before performing this operation, I had inserted around 70000 files in HDFS. NameNode:50070/dfshealth.jsp showed 677323 files and directories, 332419 blocks = 1009742 total. I guess the way I create a 10-level-deep path for each file results in ~10 times the number of actual files in HDFS. Please correct me if I am wrong.

I then ran the rebalancer on the cleaned-up DataNode. It was too slow to begin with (writing 2 blocks per second, i.e. 2MBps) and died after a few hours saying "too many open files". I checked all the machines, and the DataNode and NameNode processes were running fine on their respective machines, but dfshealth.jsp showed both DataNodes as dead. Restarting the cluster brought both of them up. I guess this has to do with RAM requirements. My question is how to figure out the RAM requirements of the DataNode and NameNode in this situation. The documentation states that both the DataNode and the NameNode store the block index. It's not quite clear whether the whole index is kept in memory. Once I have figured that out, how can I instruct Hadoop to rebalance with high priority?

Another question is regarding the "Non DFS used:" statistic shown on dfshealth.jsp: is it the space used to store the file and directory metadata (apart from the actual file content blocks)? Right now it is 1/4th of the total space used by HDFS.

Some points I have thought of over the last month to improve this model are:
1. I should keep very small files (let's say smaller than 1KB) out of HDFS.
2. Reduce the directory depth of the file path created from the SHA1 hash (instead of 10 levels, I can keep 3).
3. I should increase the block size to reduce the number of blocks in HDFS ( http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200805.mbox/<[EMAIL PROTECTED]> says it won't result in wasted disk space).

More advice on improvements is appreciated.

Thanks,
Amit
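The path scheme and the reported object count can be sketched in a few lines of Python. This is only an illustration under assumptions: the message does not say how the 2-byte components are encoded or whether the file itself counts as one of the levels, so the hypothetical `sha1_path` helper below hex-encodes each component and treats the last component as the file name.

```python
import hashlib


def sha1_path(content, levels=10, bytes_per_level=2):
    """Derive a nested path from the SHA-1 digest of the content.

    Assumption (not stated in the thread): the first levels-1 components
    are directories named by the hex form of 2 digest bytes each, and the
    remaining digest bytes form the file name.
    """
    digest = hashlib.sha1(content).digest()  # always 20 bytes
    dir_count = levels - 1
    if levels < 1 or dir_count * bytes_per_level >= len(digest):
        raise ValueError("too many levels for a 20-byte digest")
    dirs = [
        digest[i * bytes_per_level:(i + 1) * bytes_per_level].hex()
        for i in range(dir_count)
    ]
    name = digest[dir_count * bytes_per_level:].hex()
    return "/" + "/".join(dirs + [name])


deep = sha1_path(b"example content")               # 10 components
shallow = sha1_path(b"example content", levels=3)  # 2 directories + file name
print(deep)
print(shallow)

# Namespace objects per file, using the figures quoted in the message.
files_and_dirs = 677323
files = 70000  # "around 70000" in the message, so the ratio is approximate
print(round(files_and_dirs / files, 1))
```

With roughly 9.7 namespace objects per stored file, the figures are consistent with the guess that the deep path multiplies the object count by about ten; reducing the depth to 3 would cut that to at most 3 objects per file.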
-
Re: using HDFS for a distributed storage system
Brian Bockelman 2009-02-09, 17:39
Hey Amit,

Your current thoughts on keeping the block size larger and removing the very small files are along the right lines. Why not choose the default size of 64MB or larger? You don't seem too concerned about the number of replicas.

However, you're still fighting against the tide. You've got enough files that you'll be pushing against block report and NameNode limitations, especially with 20-50 million files. We find that about 500k blocks per node is a good stopping point right now.

You really, really need to figure out how to organize your files in such a way that the average file size is above 64MB. Is there a "primary key" for each file? If so, maybe consider HBase? If you are just going to be sequentially scanning through all your files, consider archiving them all to a single sequence file.

Your individual data nodes are quite large ... I hope you're not expecting to measure throughput in tens of Gbps?

It's hard to give advice without knowing more about your application. I can tell you that you're going to run into a significant wall if you can't figure out a means of making your average file size at least greater than 64MB.

Brian
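To put rough numbers on the "500k blocks per node" guideline, here is a back-of-the-envelope estimate in Python. It is a sketch under assumptions: every file is treated as having the 5MB average size, block replicas are assumed to be spread evenly over the DataNodes, and the replication factor of 2 is a placeholder rather than something stated in the messages above. The function names are made up for this illustration.

```python
import math

MB = 1024 * 1024
GUIDELINE_BLOCKS_PER_NODE = 500_000  # figure quoted in the reply above


def estimated_blocks(num_files, avg_file_size, block_size):
    """Rough block count, treating every file as having the average size."""
    blocks_per_file = max(1, math.ceil(avg_file_size / block_size))
    return num_files * blocks_per_file


def replicas_per_datanode(total_blocks, replication, datanodes):
    """Block replicas per DataNode, assuming an even spread."""
    return total_blocks * replication / datanodes


for block_mb in (1, 64):
    for num_files in (20_000_000, 50_000_000):
        blocks = estimated_blocks(num_files, 5 * MB, block_mb * MB)
        # replication=2 and datanodes=2 are assumptions for the test cluster
        per_node = replicas_per_datanode(blocks, replication=2, datanodes=2)
        ratio = per_node / GUIDELINE_BLOCKS_PER_NODE
        print(f"block={block_mb}MB files={num_files:,} "
              f"blocks={blocks:,} per-node/guideline={ratio:.0f}x")
```

Under these assumptions a larger block size alone reduces the block count by a factor of five but still leaves each node far above the guideline, which is why the average file size has to go up as well.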
-
Re: using HDFS for a distributed storage system
Amit Chandel 2009-02-09, 22:27
Thanks Brian for your inputs.
I am eventually aiming to store 200k directories, each containing 75 files on average, with the average directory size being 300MB (ranging from 50MB to 650MB), in this storage system.
It will mostly be archival storage from which I should be able to access any of the old files easily. But the recent directories would be accessed frequently for a day or 2 as they are being added. They are added in batches of 500-1000 per week, and there can be rare bursts of adding 50k directories once in 3 months. One such burst is about to come in a month, and I want to test the current test setup against that burst. We have upgraded our test hardware a little bit from what I last mentioned. The test setup will have 3 DataNodes with 15TB space on each, 6G RAM, and a dual core processor, and a NameNode with 500G storage, 6G RAM, and a dual core processor.
I am planning to add the individual files initially, and after a while (let's say 2 days after insertion) I will make a SequenceFile out of each directory (I am currently looking into SequenceFile) and delete the previous files of that directory from HDFS. That way, in the future, I can access any file given its directory without much effort. Now that SequenceFile is in the picture, I can set the default block size to 64MB or even 128MB. For replication, I am just replicating each file at 1 extra location (i.e. replication factor = 2, since a replication factor of 3 will leave me with only 33% of the raw storage as usable). Regarding reading back from HDFS, if I can read at ~50MBps (for recent files), that would be enough.
Let me know if you see any more pitfalls in this setup, or have more suggestions. I really appreciate it. Once I test this setup, I will put the results back to the list.
Thanks,
Amit
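The effect of the revised plan can be estimated with the same kind of arithmetic. The sketch below assumes every directory has the 300MB average size and becomes exactly one SequenceFile, and it ignores any container overhead; the helper names are invented for this example.

```python
import math


def seqfile_blocks(directories, avg_dir_mb, block_mb):
    """Blocks needed if each directory becomes one file of average size."""
    return directories * math.ceil(avg_dir_mb / block_mb)


def usable_tb(datanodes, tb_per_node, replication):
    """Usable capacity when every block is stored `replication` times."""
    return datanodes * tb_per_node / replication


directories = 200_000
loose_files = directories * 75  # individual files before packing
print("loose files:", loose_files)

for block_mb in (64, 128):
    blocks = seqfile_blocks(directories, 300, block_mb)
    print(f"one SequenceFile per directory, {block_mb}MB blocks: {blocks:,} blocks")

for replication in (2, 3):
    print(f"replication {replication}: "
          f"{usable_tb(3, 15, replication)} TB usable of 45 TB raw")
```

Packing brings the namespace down from about 15 million files to 200k files with roughly 0.6 to 1 million blocks, and the capacity figures match the 33% remark about a replication factor of 3.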
-
Re: using HDFS for a distributed storage system
Brian Bockelman 2009-02-10, 00:00
Hey Amit,
That plan sounds much better. I think you will find the system much more scalable.
From our experience, it takes a while to get the right amount of monitoring and infrastructure in place to have a very dependable system with 2 replicas. I would recommend using 3 replicas until you feel you've mastered the setup.
Brian
-
Re: using HDFS for a distributed storage system
lohit 2009-02-10, 02:14
> I am planning to add the individual files initially, and after a while (lets
> say 2 days after insertion) will make a SequenceFile out of each directory
> (I am currently looking into SequenceFile) and delete the previous files of
> that directory from HDFS. That way in future, I can access any file given
> its directory without much effort.

Have you considered Hadoop archives?
http://hadoop.apache.org/core/docs/current/hadoop_archives.html
Depending on your access pattern, you could store the files in an archive in the first place.
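The idea shared by the SequenceFile-per-directory plan and the archive suggestion is to pack many small files into one large file while keeping each one retrievable by name. The Python sketch below illustrates only that idea, using the standard `tarfile` module in memory; it is not the Hadoop Archive or SequenceFile format, and `pack` and `read_member` are hypothetical helper names.

```python
import io
import tarfile


def pack(files):
    """Pack a {name: bytes} mapping into a single in-memory tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)  # size must be set before adding
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_member(archive, name):
    """Read one file back out of the packed archive by its name."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        member = tar.extractfile(name)  # raises KeyError for unknown names
        if member is None:
            raise KeyError(name)
        return member.read()


directory = {"report.txt": b"small file", "data.bin": bytes(range(16))}
archive = pack(directory)
print(len(archive), "bytes in one container")
print(read_member(archive, "report.txt"))
```

One container per directory means the file system tracks a single large object instead of dozens of small ones, at the cost of an extra lookup step when reading an individual file.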
-
Re: using HDFS for a distributed storage system
Jeff Hammerbacher 2009-02-10, 02:35
Yo,

I don't want to sound all spammy, but Tom White wrote a pretty nice blog post about small files in HDFS recently that you might find helpful. The post covers some potential solutions, including Hadoop Archives: http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.

Later,
Jeff
-
Re: using HDFS for a distributed storage system
Mark Kerzner 2009-02-10, 02:48
It is a good and useful overview, thank you. It also mentions Stuart Sierra's post, where Stuart mentions that the process is slow. Does anybody know why? I have written code to write from the PC file system to HDFS, and I also noticed that it is very slow. Instead of 40M/sec, as promised by Tom White's book, it seems to be 40 sec/Meg. Stuart's tars would work about 5 times faster. But still, why is it so slow? Is there a way to speed this up?

Thanks!
Mark
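For scale, the two rates mentioned above can be compared directly. This is plain arithmetic on the figures in the message, with "M" and "Meg" both read as megabytes (an assumption); it says nothing about the cause of the slowdown, and the 300MB size is borrowed from the average directory size given earlier in the thread.

```python
def throughput_mb_per_s(megabytes, seconds):
    """Convert an amount copied and the time taken into MB/s."""
    return megabytes / seconds


def seconds_to_copy(megabytes, rate_mb_per_s):
    """Time needed to copy a given amount at a given rate."""
    return megabytes / rate_mb_per_s


expected = throughput_mb_per_s(40, 1)  # 40 MB in one second
observed = throughput_mb_per_s(1, 40)  # 1 MB in forty seconds
tar_based = observed * 5               # "about 5 times faster"

print(f"observed rate: {observed} MB/s")
print(f"slowdown versus expected: about {expected / observed:.0f}x")
print(f"300MB at the observed rate: {seconds_to_copy(300, observed) / 3600:.1f} hours")
print(f"300MB at the tar-based rate: {seconds_to_copy(300, tar_based) / 3600:.1f} hours")
```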