Re: how to improve the Hadoop's capability of dealing with small files
Jeff Hammerbacher 2009-05-07, 07:41
Hey,
You can read more about why small files are difficult for HDFS at http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.

Regards,
Jeff

2009/5/7 Piotr Praczyk <[EMAIL PROTECTED]>

> If you want to use many small files, they probably have the same
> purpose and structure?
> Why not use HBase instead of raw HDFS? Many small files would be packed
> together and the problem would disappear.
>
> cheers
> Piotr
>
> 2009/5/7 Jonathan Cao <[EMAIL PROTECTED]>
>
> > There are at least two design choices in Hadoop that have implications
> > for your scenario.
> >
> > 1. All the HDFS metadata is stored in name node memory -- the memory
> > size is one limitation on how many "small" files you can have.
> >
> > 2. The efficiency of the map/reduce paradigm dictates that each
> > mapper/reducer job has enough work to offset the overhead of spawning
> > the job. It relies on each task reading a contiguous chunk of data
> > (typically 64MB); your small-file situation will change those
> > efficient sequential reads into a larger number of inefficient
> > random reads.
> >
> > Of course, small is a relative term?
> >
> > Jonathan
> >
> > 2009/5/6 陈桂芬 <[EMAIL PROTECTED]>
> >
> > > Hi:
> > >
> > > In my application, there are many small files. But Hadoop is
> > > designed to deal with many large files.
> > >
> > > I want to know why Hadoop doesn't support small files very well and
> > > where the bottleneck is. And what can I do to improve Hadoop's
> > > capability of dealing with small files?
> > >
> > > Thanks.
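
To put rough numbers on the two points raised in the thread (name node memory and per-task overhead), here is a back-of-the-envelope sketch in Python. It is only an estimate under stated assumptions: the figure of about 150 bytes per namespace object (file or block) is a commonly quoted rule of thumb rather than a measured value, the one-map-task-per-block count assumes no split combining, and the function names and the 10 GB data set are made up for illustration.

```python
# Rough estimate of what many small files cost in HDFS and map/reduce.
# ASSUMPTION: ~150 bytes of name node heap per namespace object
# (one object per file, one per block). Rule of thumb, not a measurement.
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical block size cited in the thread


def blocks_per_file(file_size, block_size=BLOCK_SIZE):
    """Blocks needed for one file; even a tiny file occupies one block entry."""
    return max(1, -(-file_size // block_size))  # ceiling division


def namenode_bytes(num_files, file_size, block_size=BLOCK_SIZE):
    """Estimated name node memory for num_files files of file_size bytes each."""
    objects = num_files * (1 + blocks_per_file(file_size, block_size))
    return objects * BYTES_PER_OBJECT


def map_tasks(num_files, file_size, block_size=BLOCK_SIZE):
    """ASSUMPTION: one map task per block and no combining of small inputs."""
    return num_files * blocks_per_file(file_size, block_size)


if __name__ == "__main__":
    total = 10 * 1024**3          # hypothetical 10 GB data set
    small_size = 100 * 1024       # stored as 100 KB files ...
    large_size = BLOCK_SIZE       # ... or as 64 MB files

    for label, size in (("100 KB files", small_size), ("64 MB files", large_size)):
        n = total // size
        print(label, "-> files:", n,
              "name node bytes:", namenode_bytes(n, size),
              "map tasks:", map_tasks(n, size))
```

Under these assumptions the same data stored as 100 KB files needs several hundred times more name node memory and map tasks than the 64 MB layout, which is why packing small files together (as Piotr suggests with HBase) helps.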