The issue with using lots of small files is that your processing
overhead increases (repeated, avoidable open-read(little)-close
calls for every file). HDFS is also used by those who want to heavily
process the data they've stored, and with a huge number of files such
processing is not going to cut through the data quickly. RAM is
another factor: the NameNode keeps the metadata for every file and
block in memory, so each extra file costs heap there. But ideally you
do not want to end up having to go through millions of files when you
wish to process them all, as they can be stored more efficiently for
those purposes via several tools and formats.
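To put a rough number on the RAM point, here is a minimal Python sketch of the arithmetic. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (each file and each block counting as one object) and a 64 MB block size. Both figures are assumptions for illustration, and the helper name is hypothetical.

```python
import math

BYTES_PER_OBJECT = 150          # assumed rule of thumb, not a measured value
BLOCK_SIZE = 64 * 1024 * 1024   # assumed HDFS block size (64 MB)

def estimate_namenode_heap(num_files, avg_file_size):
    """Rough NameNode heap estimate in bytes (hypothetical helper)."""
    # Every file occupies at least one block, however small it is.
    blocks_per_file = max(1, math.ceil(avg_file_size / BLOCK_SIZE))
    # One object for the file itself plus one per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

# 10 million files of 10 MB each.
small = estimate_namenode_heap(10_000_000, 10 * 1024 * 1024)

# The same total volume stored as 1 GB files instead.
total_bytes = 10_000_000 * 10 * 1024 * 1024
gb = 1024 ** 3
large = estimate_namenode_heap(math.ceil(total_bytes / gb), gb)

print(f"small files: {small / 1e9:.2f} GB of heap")
print(f"large files: {large / 1e9:.2f} GB of heap")
```

Under these assumptions the same data costs the NameNode roughly an order of magnitude more heap when kept as small files, before counting any per-file processing overhead.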
You can probably use HBase for such storage. It will allow you to
store large amounts of data in compact files while still allowing
random access to them, if that's needed by your use case as well.
Check out this previous discussion on the topic at
http://search-hadoop.com/m/j95CxojSOC which was about storing image
files. It should apply to your question as well. Head over to
[EMAIL PROTECTED] if you have further questions on Apache HBase.
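The general idea of packing many small records into one large file while keeping random access can be sketched in plain Python. This is not HBase and not its API, only an illustration of the principle. The function names are hypothetical, and an in-memory buffer stands in for the single large file.

```python
import io

def pack(records, out):
    """Append each (key, data) record to `out`.

    Returns an index mapping key -> (offset, length).
    """
    index = {}
    for key, data in records:
        index[key] = (out.tell(), len(data))
        out.write(data)
    return index

def read(key, index, src):
    """Fetch one record by key without scanning the whole container."""
    offset, length = index[key]
    src.seek(offset)
    return src.read(length)

# One container instead of one file per record.
container = io.BytesIO()
index = pack(
    [("a.jpg", b"\x01\x02"), ("b.jpg", b"\x03\x04\x05")],
    container,
)

picture = read("b.jpg", index, container)
```

The file system then sees a single large file, and a lookup costs one seek plus one read rather than an open-read-close cycle per record.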
On Tue, May 22, 2012 at 3:09 PM, Brendan cheng <[EMAIL PROTECTED]> wrote:
> I read the HDFS architecture doc and it said HDFS is tuned for storing large files, typically gigabytes to terabytes. What is the downside of storing millions of small files like <10MB? Or what setting of HDFS is suitable for storing small files?
> Actually, I plan to find a distributed file system for storing multiple millions of files.