Hadoop >> mail # user >> how to improve the Hadoop's capability of  dealing with small files


Re: how to improve the Hadoop's capability of dealing with small files
Hey,

You can read more about why small files are difficult for HDFS at
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.

Regards,
Jeff

2009/5/7 Piotr Praczyk <[EMAIL PROTECTED]>

> If you want to use many small files, they probably share the same
> purpose and structure.
> Why not use HBase instead of raw HDFS? Many small files would be packed
> together and the problem would disappear.
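
The packing idea Piotr describes can be illustrated without any Hadoop or HBase API. The sketch below is plain Python, not HBase; the names `pack` and `read_one` are hypothetical and exist only for this example. It concatenates many small payloads into one blob and keeps an index of where each one lives, so the storage layer sees one large object instead of many tiny ones.

```python
import io

def pack(files):
    """Concatenate small payloads into one blob.

    Returns (blob, index), where index maps each name to the
    (offset, length) of its payload inside the blob.
    """
    blob = io.BytesIO()
    index = {}
    for name, payload in files.items():
        index[name] = (blob.tell(), len(payload))
        blob.write(payload)
    return blob.getvalue(), index

def read_one(blob, index, name):
    """Return the payload stored under `name`."""
    offset, length = index[name]
    return blob[offset:offset + length]

# Three tiny "files" end up as a single 11-byte blob plus a small index.
small_files = {"a.txt": b"alpha", "b.txt": b"bravo!", "c.txt": b""}
blob, index = pack(small_files)
print(len(blob), index)
```

In Hadoop itself, the same role is usually played by a container format or a store such as HBase rather than by hand-rolled code like this.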
>
> cheers
> Piotr
>
> 2009/5/7 Jonathan Cao <[EMAIL PROTECTED]>
>
> > There are at least two design choices in Hadoop that have implications for
> > your scenario.
> > 1. All HDFS metadata is stored in NameNode memory, so the memory size
> > limits how many "small" files you can have.
> >
> > 2. The efficiency of the map/reduce paradigm requires that each mapper/reducer
> > task has enough work to offset the overhead of spawning it. It relies
> > on each task reading a contiguous chunk of data (typically 64MB); your small
> > file situation will turn those efficient sequential reads into a larger
> > number of inefficient random reads.
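
Both points can be put in rough numbers. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes the commonly quoted rule of thumb of about 150 bytes of name node memory per file or block object, one map task per block with no input combining, and non-empty files. The function names and the example workload (10 million files of 100KB each) are made up for illustration.

```python
import math

BYTES_PER_OBJECT = 150         # assumed rule of thumb per file or block object
BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, the typical chunk size mentioned above

def namenode_bytes(num_files, file_size):
    """Estimate name node memory: one file object plus one object per block."""
    blocks_per_file = math.ceil(file_size / BLOCK_SIZE)
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

def map_tasks_unpacked(num_files, file_size):
    """Assume one map task per block, and at least one task per file."""
    return num_files * max(1, math.ceil(file_size / BLOCK_SIZE))

def map_tasks_packed(num_files, file_size):
    """Same data packed into full blocks: one task per 64MB of input."""
    return max(1, math.ceil(num_files * file_size / BLOCK_SIZE))

num_files = 10_000_000   # hypothetical workload
file_size = 100 * 1024   # 100KB per file

print("name node memory (GB):", namenode_bytes(num_files, file_size) / 1e9)
print("map tasks, one per small file:", map_tasks_unpacked(num_files, file_size))
print("map tasks, data packed into 64MB blocks:", map_tasks_packed(num_files, file_size))
```

Under these assumptions the metadata alone takes about 3 GB of name node memory, and a job over the raw files spawns 10 million tasks where the packed layout would need roughly fifteen thousand.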
> >
> > Of course, small is a relative term?
> >
> > Jonathan
> >
> > 2009/5/6 陈桂芬 <[EMAIL PROTECTED]>
> >
> > > Hi:
> > >
> > > In my application, there are many small files, but Hadoop is designed
> > > to deal with large files.
> > >
> > > I want to know why Hadoop doesn’t support small files very well and where
> > > the bottleneck is, and what I can do to improve Hadoop’s capability of
> > > dealing with small files.
> > >
> > > Thanks.
> > >
> > >
> >
>