Re: Handling of small files in hadoop
Hi Naveen,

> I use hadoop-0.21.0 distribution. I have a large number of small files (KB).

A word of warning: 0.21 is not a stable release. The recommended
versions are in the 0.20.x line.

> Is there any efficient way of handling it in hadoop?
>
> I have heard that solution for that problem is using:
>            1. HAR (hadoop archives)
>            2. cat on files
>
> I would like to know if there are any other solutions for processing large
> number of small files.

You could also store each file as a record in a sequence file: the
file name becomes the key and the file's bytes become the value.
That gives you compression and splittability, but not random access.
You already noted HAR, which does give you random access.
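
To make the key/value layout concrete, here is a minimal sketch in plain
Java. It is not the SequenceFile on-disk format and it does not use the
Hadoop API; the class SmallFilePacker and its pack/unpack methods are
hypothetical names made up for this example. With Hadoop itself you would
typically write the records through SequenceFile.Writer, using a Text key
and a BytesWritable value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: packs many small files into one byte stream of
// (file name, file bytes) records. Illustration only, not Hadoop code.
class SmallFilePacker {

    // Writes each entry as: name (UTF), content length (int), content bytes.
    static byte[] pack(Map<String, byte[]> files) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                out.writeUTF(file.getKey());          // key: the file name
                out.writeInt(file.getValue().length); // length prefix for the value
                out.write(file.getValue());           // value: the raw file bytes
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the records back in order. There is no index, so finding one
    // file means scanning from the start (no random access).
    static Map<String, byte[]> unpack(byte[] packed) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed))) {
            while (in.available() > 0) {
                String name = in.readUTF();
                byte[] content = new byte[in.readInt()];
                in.readFully(content);
                files.put(name, content);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }
}
```

Calling SmallFilePacker.pack(...) on a map of names to contents gives one
combined byte array, and SmallFilePacker.unpack(...) returns the same
entries in the same order. The sequential scan in unpack is the reason a
layout like this has no random access by file name.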

-Joey

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434