Re: Handling of small files in hadoop
Joey Echeverria 2011-09-14, 12:02
> I use hadoop-0.21.0 distribution. I have a large number of small files (KB).
A word of warning: 0.21 is not a stable release. The recommended version
is in the 0.20.x range.
> Is there any efficient way of handling it in hadoop?
> I have heard that solution for that problem is using:
> 1. HAR (hadoop archives)
> 2. cat on files
> I would like to know if there are any other solutions for processing large
> number of small files.
You could also store each file as a record in a sequence file. The
name of the file becomes the key, the bytes of the file the value.
That gives you compression and splittability, but not random access:
to get at one particular file you have to scan through the records.
You already noted HAR, which does give you random access.
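
To make the record layout concrete, here is a rough sketch in plain Java
with no Hadoop dependency. It is not the SequenceFile on-disk format, just
the same idea: one container holding (file name, file bytes) pairs that
are read back in order. The class name SmallFilePacker and its methods
are made up for illustration. In Hadoop itself you would write the same
pairs through SequenceFile.Writer, typically with Text keys and
BytesWritable values.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: packs many small files into one
// container as (file name, file bytes) records.
class SmallFilePacker {

    // Each record is written as: name (UTF), content length (int), raw bytes.
    static byte[] pack(Map<String, byte[]> files) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> entry : files.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeInt(entry.getValue().length);
                out.write(entry.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Records can only be read front to back; there is no index, so finding
    // one file means scanning past every record before it.
    static Map<String, byte[]> unpack(byte[] packed) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(packed))) {
            while (in.available() > 0) {
                String name = in.readUTF();
                byte[] contents = new byte[in.readInt()];
                in.readFully(contents);
                files.put(name, contents);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("a.txt", "hello".getBytes(StandardCharsets.UTF_8));
        files.put("b.txt", "small file".getBytes(StandardCharsets.UTF_8));

        byte[] packed = pack(files);
        for (Map.Entry<String, byte[]> entry : unpack(packed).entrySet()) {
            System.out.println(entry.getKey() + " -> "
                + entry.getValue().length + " bytes");
        }
    }
}
```

The sequential read loop in unpack is the part that matches the "no random
access" caveat above; an archive format with an index, like HAR, avoids
that scan.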