Re: using HDFS for a distributed storage system
Yo,

I don't want to sound all spammy, but Tom White wrote a pretty nice blog
post about small files in HDFS recently that you might find helpful. The
post covers some potential solutions, including Hadoop Archives:
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.
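For a concrete flavor of the SequenceFile approach that post describes, here is a minimal,
untested Java sketch that packs every file of a local directory into one SequenceFile on
HDFS, keyed by file name (the class name and paths are just illustrative):

  import java.io.File;
  import java.io.FileInputStream;
  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.IOUtils;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class SmallFilePacker {
    // Pack every file in a local directory into one SequenceFile on HDFS:
    // key = original file name, value = raw bytes of the file.
    public static void pack(File localDir, Path seqFile, Configuration conf)
        throws IOException {
      FileSystem fs = FileSystem.get(conf);
      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, seqFile, Text.class, BytesWritable.class);
      try {
        for (File f : localDir.listFiles()) {
          byte[] data = new byte[(int) f.length()];
          FileInputStream in = new FileInputStream(f);
          try {
            IOUtils.readFully(in, data, 0, data.length);
          } finally {
            in.close();
          }
          writer.append(new Text(f.getName()), new BytesWritable(data));
        }
      } finally {
        writer.close();
      }
    }
  }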

Later,
Jeff

On Mon, Feb 9, 2009 at 6:14 PM, lohit <[EMAIL PROTECTED]> wrote:

> > I am planning to add the individual files initially, and after a while
> > (let's say 2 days after insertion) will make a SequenceFile out of each
> > directory (I am currently looking into SequenceFile) and delete the
> > previous files of that directory from HDFS. That way, in the future, I
> > can access any file given its directory without much effort.
>
> Have you considered Hadoop archive?
> http://hadoop.apache.org/core/docs/current/hadoop_archives.html
> Depending on your access pattern, you could store the files in an archive
> in the first place.
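For what it's worth, once an archive has been built with the hadoop archive command
described at that link, the files inside it can be read back through the har filesystem
like any other path. A rough, untested sketch (the archive and file names are made up,
and the exact in-archive path depends on how the archive was created):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HarReadExample {
    public static void main(String[] args) throws IOException {
      Configuration conf = new Configuration();
      // har:/// resolves the archive against the default (HDFS) filesystem.
      Path inHar = new Path("har:///user/amit/photos.har/dir0001/img042.jpg");
      FileSystem harFs = inHar.getFileSystem(conf);
      FSDataInputStream in = harFs.open(inHar);
      try {
        // ... stream the original file contents from 'in' ...
      } finally {
        in.close();
      }
    }
  }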
>
>
>
> ----- Original Message ----
> From: Brian Bockelman <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Monday, February 9, 2009 4:00:42 PM
> Subject: Re: using HDFS for a distributed storage system
>
> Hey Amit,
>
> That plan sounds much better.  I think you will find the system much more
> scalable.
>
> From our experience, it takes a while to get the right amount of monitoring
> and infrastructure in place to have a very dependable system with 2
> replicas.  I would recommend using 3 replicas until you feel you've mastered
> the setup.
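One practical note on that: replication is a per-file attribute in HDFS, so it is
possible to start at the default of 3 and dial individual files down later once the
cluster has proven itself. A small, untested sketch (the path is made up):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class LowerReplication {
    public static void main(String[] args) throws IOException {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // Drop an already-written file from 3 replicas to 2.
      fs.setReplication(new Path("/archive/2009-02/batch-0001.seq"), (short) 2);
    }
  }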
>
> Brian
>
> On Feb 9, 2009, at 4:27 PM, Amit Chandel wrote:
>
> > Thanks Brian for your inputs.
> >
> > I am eventually targeting storing 200k directories, each containing 75
> > files on average, with the average directory size being 300MB (ranging
> > from 50MB to 650MB), in this storage system.
> >
> > It will mostly be archival storage from which I should be able to access
> > any of the old files easily. But the recent directories would be accessed
> > frequently for a day or 2 as they are being added. They are added in
> > batches of 500-1000 per week, and there can be rare bursts of adding 50k
> > directories once every 3 months. One such burst is about to come in a
> > month, and I want to test the current test setup against that burst. We
> > have upgraded our test hardware a little from what I last mentioned. The
> > test setup will have 3 DataNodes, each with 15TB of space, 6G RAM, and a
> > dual core processor, and a NameNode with 500G storage, 6G RAM, and a dual
> > core processor.
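For rough scale, working just from those numbers: 200k directories x 75 files is about
15 million individual files, and 200k directories x 300MB is roughly 60TB of data at the
eventual target, while the upcoming burst of 50k directories is about 50k x 300MB ~= 15TB,
i.e. roughly 30TB at replication factor 2 against the 3 x 15TB = 45TB of raw capacity in
the test setup.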
> >
> > I am planning to add the individual files initially, and after a while
> > (let's say 2 days after insertion) will make a SequenceFile out of each
> > directory (I am currently looking into SequenceFile) and delete the
> > previous files of that directory from HDFS. That way, in the future, I
> > can access any file given its directory without much effort.
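To make the "access any file" part concrete, here is a rough, untested Java sketch of
pulling one file back out of such a per-directory SequenceFile by scanning for its name
(a MapFile would allow a direct keyed lookup instead of a scan; names below are
illustrative):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class SeqFileLookup {
    // Scan the directory's SequenceFile and return the bytes of one file.
    public static byte[] lookup(Path seqFile, String wantedName,
        Configuration conf) throws IOException {
      FileSystem fs = FileSystem.get(conf);
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, seqFile, conf);
      try {
        Text name = new Text();
        BytesWritable data = new BytesWritable();
        while (reader.next(name, data)) {
          if (name.toString().equals(wantedName)) {
            byte[] bytes = new byte[data.getSize()];
            System.arraycopy(data.get(), 0, bytes, 0, data.getSize());
            return bytes;
          }
        }
      } finally {
        reader.close();
      }
      return null;  // not found in this SequenceFile
    }
  }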
> > Now that SequenceFile is in the picture, I can make the default block size
> > 64MB or even 128MB. For replication, I am just replicating a file at 1
> > extra location (i.e. replication factor = 2, since a replication factor of
> > 3 would leave me with only 33% of the raw storage as usable space).
> > Regarding reading back from HDFS, if I can read at ~50MBps (for recent
> > files), that would be enough.
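For reference, both of those can be set per client rather than cluster-wide; a small,
untested sketch using the 0.x-era property names (the values just mirror the numbers
above):

  import org.apache.hadoop.conf.Configuration;

  public class WriteSettings {
    public static Configuration archiveConf() {
      Configuration conf = new Configuration();
      conf.setLong("dfs.block.size", 128L * 1024 * 1024);  // 128MB blocks
      conf.setInt("dfs.replication", 2);                   // 2 replicas per block
      return conf;
    }
  }

New files created through a FileSystem obtained with this Configuration should pick up
both settings at create time.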
> >
> > Let me know if you see any more pitfalls in this setup, or have more
> > suggestions. I really appreciate it. Once I test this setup, I will put
> > the results back to the list.
> >
> > Thanks,
> > Amit
> >
> >
> > On Mon, Feb 9, 2009 at 12:39 PM, Brian Bockelman <[EMAIL PROTECTED]> wrote:
> >
> >> Hey Amit,
> >>
> >> Your current thoughts on keeping block size larger and removing the very
> >> small files are along the right lines.  Why not choose the default size
> >> of 64MB or larger?  You don't seem too concerned about the number of
> >> replicas.
> >>
> >> However, you're still fighting against the tide.  You've got enough
> >> files that you'll be pushing against block report and namenode
> >> limitations, especially with 20 - 50 million files.  We find that about
> >> 500k blocks