Hadoop, mail # user - using HDFS for a distributed storage system


Earlier messages in this thread:
  Amit Chandel      2009-02-09, 04:06
  Brian Bockelman   2009-02-09, 17:39
  Amit Chandel      2009-02-09, 22:27
  Brian Bockelman   2009-02-10, 00:00
  lohit             2009-02-10, 02:14
Re: using HDFS for a distributed storage system
Jeff Hammerbacher 2009-02-10, 02:35
Yo,

I don't want to sound all spammy, but Tom White wrote a pretty nice blog
post about small files in HDFS recently that you might find helpful. The
post covers some potential solutions, including Hadoop Archives:
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.

Later,
Jeff

On Mon, Feb 9, 2009 at 6:14 PM, lohit <[EMAIL PROTECTED]> wrote:

> > I am planning to add the individual files initially, and after a while
> > (let's say 2 days after insertion) will make a SequenceFile out of each
> > directory (I am currently looking into SequenceFile) and delete the
> > previous files of that directory from HDFS. That way, in the future, I
> > can access any file given its directory without much effort.
>
> Have you considered Hadoop archive?
> http://hadoop.apache.org/core/docs/current/hadoop_archives.html
> Depending on your access pattern, you could store the files in an archive
> in the first place.
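>
> For reference, the usage from that doc is roughly (the archive and path
> names here are just made up):
>
>   hadoop archive -archiveName feb-batch.har /user/amit/2009-02 /user/amit/archives
>
> The packed files can then be read back in place through har:// paths,
> without unpacking the archive.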
>
>
>
> ----- Original Message ----
> From: Brian Bockelman <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Monday, February 9, 2009 4:00:42 PM
> Subject: Re: using HDFS for a distributed storage system
>
> Hey Amit,
>
> That plan sounds much better.  I think you will find the system much more
> scalable.
>
> From our experience, it takes a while to get the right amount of monitoring
> and infrastructure in place to have a very dependable system with 2
> replicas.  I would recommend using 3 replicas until you feel you've mastered
> the setup.
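>
> (Replication can also be dialed back later without rewriting anything:
>
>   hadoop fs -setrep -R 2 /data
>
> asks the namenode to drop the extra replicas asynchronously.)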
>
> Brian
>
> On Feb 9, 2009, at 4:27 PM, Amit Chandel wrote:
>
> > Thanks Brian for your inputs.
> >
> > I am eventually aiming to store 200k directories in this system, each
> > containing 75 files on average, with the average directory size being
> > 300MB (ranging from 50MB to 650MB).
> >
> > It will mostly be archival storage from which I should be able to access
> > any of the old files easily. But the recent directories would be accessed
> > frequently for a day or two as they are being added. They are added in
> > batches of 500-1000 per week, and there can be rare bursts of 50k
> > directories being added once in 3 months. One such burst is about to come
> > in a month, and I want to test the current setup against that burst. We
> > have upgraded our test hardware a little from what I last mentioned. The
> > test setup will have 3 DataNodes, each with 15TB of space, 6G RAM, and a
> > dual-core processor, and a NameNode with 500G of storage, 6G RAM, and a
> > dual-core processor.
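> >
> > (That works out to roughly 200k x 75 = 15M files and 200k x 300MB = ~60TB
> > of data in total, before replication.)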
> >
> > I am planning to add the individual files initially, and after a while
> > (let's say 2 days after insertion) will make a SequenceFile out of each
> > directory (I am currently looking into SequenceFile) and delete the
> > previous files of that directory from HDFS. That way, in the future, I
> > can access any file given its directory without much effort.
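> >
> > Roughly the packing step I have in mind, as an untested sketch (the class
> > name and the Text/BytesWritable key/value layout are just my assumptions):
> >
> >   import java.io.IOException;
> >   import org.apache.hadoop.conf.Configuration;
> >   import org.apache.hadoop.fs.*;
> >   import org.apache.hadoop.io.*;
> >
> >   public class PackDir {
> >     public static void main(String[] args) throws IOException {
> >       Configuration conf = new Configuration();
> >       FileSystem fs = FileSystem.get(conf);
> >       Path dir = new Path(args[0]);      // directory to pack
> >       Path packed = new Path(args[1]);   // output SequenceFile
> >       SequenceFile.Writer writer = SequenceFile.createWriter(
> >           fs, conf, packed, Text.class, BytesWritable.class);
> >       try {
> >         for (FileStatus stat : fs.listStatus(dir)) {
> >           byte[] buf = new byte[(int) stat.getLen()];
> >           FSDataInputStream in = fs.open(stat.getPath());
> >           try {
> >             in.readFully(0, buf);        // our files are small enough to buffer
> >           } finally {
> >             in.close();
> >           }
> >           writer.append(new Text(stat.getPath().getName()),
> >                         new BytesWritable(buf));
> >         }
> >       } finally {
> >         writer.close();
> >       }
> >     }
> >   }
> >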
> > Now that SequenceFile is in the picture, I can make the default block
> > size 64MB or even 128MB. For replication, I am just replicating a file
> > at 1 extra location (i.e. replication factor = 2, since a replication
> > factor of 3 would leave me with only 33% of the usable storage).
> > Regarding reading back from HDFS, if I can read at ~50MBps (for recent
> > files), that would be enough.
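> >
> > (Both of those knobs can also be set per file at write time with the
> > FileSystem API; something like
> >
> >   fs.create(path, true, 4096, (short) 2, 128L * 1024 * 1024);
> >
> > creates a file with replication 2 and a 128MB block size, with 4096 as
> > the io buffer size. Otherwise the dfs.replication and dfs.block.size
> > defaults from the site configuration apply.)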
> >
> > Let me know if you see any more pitfalls in this setup, or have more
> > suggestions. I really appreciate it. Once I test this setup, I will
> > report the results back to the list.
> >
> > Thanks,
> > Amit
> >
> >
> > On Mon, Feb 9, 2009 at 12:39 PM, Brian Bockelman <[EMAIL PROTECTED]> wrote:
> >
> >> Hey Amit,
> >>
> >> Your current thoughts on keeping the block size larger and removing the
> >> very small files are along the right lines.  Why not choose the default
> >> size of 64MB or larger?  You don't seem too concerned about the number
> >> of replicas.
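> >>
> >> (For instance,
> >>
> >>   <property>
> >>     <name>dfs.block.size</name>
> >>     <value>134217728</value>
> >>   </property>
> >>
> >> in the site configuration makes 128MB the default for newly written
> >> files; block size is fixed per file at creation, so existing files
> >> keep whatever they were written with.)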
> >>
> >> However, you're still fighting against the tide.  You've got enough
> >> files that you'll be pushing against block report and namenode
> >> limitations, especially with 20-50 million files.  We find that about
> >> 500k blocks

Later message in this thread:
  Mark Kerzner      2009-02-10, 02:48