Re: using HDFS for a distributed storage system
> I am planning to add the individual files initially, and after a while (let's
> say 2 days after insertion) will make a SequenceFile out of each directory
> (I am currently looking into SequenceFile) and delete the previous files of
> that directory from HDFS. That way, in the future, I can access any file given
> its directory without much effort.

Have you considered Hadoop archive?
http://hadoop.apache.org/core/docs/current/hadoop_archives.html
Depending on your access pattern, you could store the files as an archive in the first place.
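
For reference, a minimal sketch of what the archive route could look like: the archive is built with the hadoop archive command-line tool, and its contents are then read back through the normal FileSystem API via the har:// scheme. The archive name, paths, and hostname below are hypothetical.

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HarReadExample {
  public static void main(String[] args) throws Exception {
    // The archive itself is created offline with the archive tool, e.g.:
    //   hadoop archive -archiveName batch-2009-02.har /user/amit/incoming /user/amit/archived
    // (archive name and paths are made up for illustration)
    Configuration conf = new Configuration();

    // har:// wraps the underlying hdfs:// URI: har://scheme-hostname:port/archive/file
    Path fileInHar = new Path(
        "har://hdfs-namenode:8020/user/amit/archived/batch-2009-02.har/dir00042/part-0001");

    FileSystem harFs = fileInHar.getFileSystem(conf);
    InputStream in = harFs.open(fileInHar);
    try {
      // Stream the archived file's bytes to stdout.
      IOUtils.copyBytes(in, System.out, conf, false);
    } finally {
      in.close();
    }
  }
}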

----- Original Message ----
From: Brian Bockelman <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Monday, February 9, 2009 4:00:42 PM
Subject: Re: using HDFS for a distributed storage system

Hey Amit,

That plan sounds much better.  I think you will find the system much more scalable.

From our experience, it takes a while to get the right amount of monitoring and infrastructure in place to have a very dependable system with 2 replicas.  I would recommend using 3 replicas until you feel you've mastered the setup.
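
For reference, a small sketch of where the replication factor gets set: client-wide through the dfs.replication property, or per file through the FileSystem API (the path below is hypothetical). One possible compromise with the space concern Amit mentions below is to write new data with 3 replicas and drop older, rarely-read files back to 2.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Default replication for files written by this client; the same
    // dfs.replication property can also be set in the site configuration.
    conf.setInt("dfs.replication", 3);
    FileSystem fs = FileSystem.get(conf);

    // Replication can also be changed per file after the fact, e.g. dropping
    // an already-packed, cold batch to 2 copies (hypothetical path).
    Path oldBatch = new Path("/user/amit/archived/dir00042.seq");
    fs.setReplication(oldBatch, (short) 2);
  }
}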

Brian

On Feb 9, 2009, at 4:27 PM, Amit Chandel wrote:

> Thanks Brian for your inputs.
>
> I am eventually targeting to store 200k directories in this storage system,
> each containing 75 files on average, with an average directory size of 300MB
> (ranging from 50MB to 650MB).
>
> It will mostly be archival storage from which I should be able to access
> any of the old files easily. But the recent directories will be accessed
> frequently for a day or 2 as they are being added. They are added in batches
> of 500-1000 per week, and there can be rare bursts of adding 50k directories
> once every 3 months. One such burst is about to come in a month, and I want to
> test the current test setup against that burst. We have upgraded our test
> hardware a little from what I last mentioned. The test setup will have 3
> DataNodes, each with 15TB of space, 6G RAM, and a dual-core processor, and a
> NameNode with 500G storage, 6G RAM, and a dual-core processor.
>
> I am planning to add the individual files initially, and after a while (let's
> say 2 days after insertion) will make a SequenceFile out of each directory
> (I am currently looking into SequenceFile) and delete the previous files of
> that directory from HDFS. That way, in the future, I can access any file given
> its directory without much effort.
> Now that SequenceFile is in the picture, I can make the default block size
> 64MB or even 128MB. For replication, I am just replicating each file at 1 extra
> location (i.e. replication factor = 2, since a replication factor of 3 would
> leave me with only 33% of the raw storage as usable). Regarding reading back
> from HDFS, if I can read at ~50MBps (for recent files), that would be enough.
>
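
As a concrete illustration of that plan, below is a minimal sketch of packing one directory's files into a SequenceFile keyed by file name, with the raw bytes as values. The paths, the Text/BytesWritable choice, and the 128MB block size are assumptions for illustration only, using the plain FileSystem and SequenceFile APIs.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackDirectory {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // 128MB blocks for files written by this client (dfs.block.size).
    conf.setLong("dfs.block.size", 128L * 1024 * 1024);
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical paths: one ingest directory in, one packed SequenceFile out.
    Path srcDir = new Path("/user/amit/incoming/dir00042");
    Path packed = new Path("/user/amit/archived/dir00042.seq");

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, packed, Text.class, BytesWritable.class);
    try {
      for (FileStatus stat : fs.listStatus(srcDir)) {
        if (stat.isDir()) {
          continue; // this sketch only packs plain files, one level deep
        }
        byte[] buf = new byte[(int) stat.getLen()];
        FSDataInputStream in = fs.open(stat.getPath());
        try {
          in.readFully(0, buf); // the individual files are small, one read is fine
        } finally {
          in.close();
        }
        // Key = original file name, value = raw file bytes.
        writer.append(new Text(stat.getPath().getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
    // Only after the SequenceFile is written would the per-file originals be
    // deleted, as described above.
  }
}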
> Let me know if you see any more pitfalls in this setup, or have more
> suggestions. I really appreciate it. Once I test this setup, I will post the
> results back to the list.
>
> Thanks,
> Amit
>
>
> On Mon, Feb 9, 2009 at 12:39 PM, Brian Bockelman <[EMAIL PROTECTED]> wrote:
>
>> Hey Amit,
>>
>> Your current thoughts on keeping the block size larger and removing the very
>> small files are along the right lines.  Why not choose the default size of
>> 64MB or larger?  You don't seem too concerned about the number of replicas.
>>
>> However, you're still fighting against the tide.  You've got enough files
>> that you'll be pushing against block report and NameNode limitations,
>> especially with 20-50 million files.  We find that about 500k blocks per
>> node is a good stopping point right now.
>>
>> You really, really need to figure out how to organize your files in such a
>> way that the average file size is above 64MB.  Is there a "primary key" for
>> each file?  If so, maybe consider HBase?  If you are just going to be
>> sequentially scanning through all your files, consider archiving them all into
>> a single sequence file.
>>
>> Your individual data nodes are quite large ... I hope you're not expecting
>> to measure throughput in 10's of Gbps?
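
Tying Brian's sequence-file suggestion back to the access pattern described above, here is a companion sketch that reads such a packed file back: a plain sequential scan over the records, optionally stopping at a single file name. It assumes the same hypothetical layout as the packing sketch earlier (one SequenceFile per original directory, Text file-name keys, BytesWritable values).

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class UnpackFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Same hypothetical layout as the packing sketch above.
    Path packed = new Path("/user/amit/archived/dir00042.seq");
    String wanted = "part-0001"; // file to recover (hypothetical name)

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, packed, conf);
    try {
      Text key = new Text();
      BytesWritable value = new BytesWritable();
      // A plain sequential scan; fine when a whole directory's worth of
      // records is read back anyway, or the packed file is only ~300MB.
      while (reader.next(key, value)) {
        if (wanted.equals(key.toString())) {
          FSDataOutputStream out = fs.create(new Path("/tmp/" + wanted));
          try {
            out.write(value.getBytes(), 0, value.getLength());
          } finally {
            out.close();
          }
          break;
        }
      }
    } finally {
      reader.close();
    }
  }
}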