Hadoop >> mail # user >> Handling of small files in hadoop


Re: Handling of small files in hadoop
Hi Naveen,

> I use hadoop-0.21.0 distribution. I have a large number of small files (KB).

Word of warning: 0.21 is not a stable release. The recommended version
is in the 0.20.x range.

> Is there any efficient way of handling it in hadoop?
>
> I have heard that solution for that problem is using:
>            1. HAR (hadoop archives)
>            2. cat on files
>
> I would like to know if there are any other solutions for processing large
> number of small files.

You could also stick each file as a record in a sequence file. The
name of the file becomes the key, the bytes of the file the value.
That gives you compression and splittability, but not random access.
You already noted HAR, which does give you random access.
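To make the SequenceFile idea concrete, here is a minimal Python sketch of the packing scheme Joey describes: each small file becomes one length-prefixed key/value record, with the filename as the key and the file's bytes as the value. Note this is an illustration of the record layout only, not the actual on-disk SequenceFile format; in practice you would use Hadoop's `org.apache.hadoop.io.SequenceFile` writer (with `Text` keys and `BytesWritable` values) so you also get compression and split points.

```python
import io
import struct

def pack_records(records):
    """Pack (filename, bytes) pairs into one stream of length-prefixed
    key/value records -- the filename is the key, the file contents the
    value. Illustrative format only, NOT the real SequenceFile format."""
    out = io.BytesIO()
    for name, data in records:
        key = name.encode("utf-8")
        out.write(struct.pack(">I", len(key)))   # 4-byte key length
        out.write(key)
        out.write(struct.pack(">I", len(data)))  # 4-byte value length
        out.write(data)
    return out.getvalue()

def unpack_records(blob):
    """Read the records back sequentially. Like a SequenceFile, this
    container supports streaming scans but no random access by key."""
    buf = io.BytesIO(blob)
    while True:
        header = buf.read(4)
        if not header:
            return
        klen = struct.unpack(">I", header)[0]
        key = buf.read(klen).decode("utf-8")
        vlen = struct.unpack(">I", buf.read(4))[0]
        yield key, buf.read(vlen)

blob = pack_records([("a.txt", b"hello"), ("b.txt", b"world")])
print(list(unpack_records(blob)))  # [('a.txt', b'hello'), ('b.txt', b'world')]
```

For the HAR route, the archive is built with the `hadoop archive` command, e.g. `hadoop archive -archiveName files.har -p /input/dir /output/dir`, and the result can then be addressed with `har://` URIs, which is what gives you the random access mentioned above.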

-Joey

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434