Hadoop, mail # user - A new way to merge up those small files!

Edward Capriolo 2010-09-25, 06:41
Many times a Hadoop job produces one file per reducer and the job has
many reducers. Or a map-only job produces one output file per input
file and you have many input files. Or you just have many small files
from some external process. Hadoop has suboptimal handling of small
files. There are some ways to handle this inside a MapReduce program,
IdentityMapper + IdentityReducer for example, or multiple outputs.
However, we wanted a tool that could be used by people using Hive, or
Pig, or MapReduce. We wanted to allow people to combine a directory
with multiple files, or a hierarchy of directories like the root of a
Hive partitioned table. We also wanted to be able to combine text or
sequence files.
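
To make the goal concrete, here is a rough local-filesystem sketch of
that kind of merge: walk a directory hierarchy (for example the root of
a partitioned table) and concatenate the small text files in each
directory into one file under a separate output root. This is only an
illustration. It does not touch HDFS, it is not how any particular
Hadoop tool is implemented, and the names merge_small_files and
merged.txt are made up for the example.

```python
import os
import tempfile


def merge_small_files(src_root, dest_root, output_name="merged.txt"):
    """Concatenate the text files of every directory under src_root into one
    file at the same relative path under dest_root.

    dest_root must live outside src_root. Returns a dict mapping each
    relative directory to the number of files merged from it.
    """
    merged = {}
    for dirpath, dirnames, filenames in os.walk(src_root):
        dirnames.sort()  # visit subdirectories in a stable order
        if not filenames:
            continue  # nothing to merge at this level
        rel = os.path.relpath(dirpath, src_root)
        out_dir = os.path.join(dest_root, rel)
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, output_name), "w") as out:
            for name in sorted(filenames):
                with open(os.path.join(dirpath, name)) as part:
                    for line in part:
                        # Keep records line-separated even if a file
                        # lacks a trailing newline.
                        out.write(line if line.endswith("\n") else line + "\n")
        merged[rel] = len(filenames)
    return merged


if __name__ == "__main__":
    # Hypothetical layout: two partitions with a few tiny part files each.
    src = tempfile.mkdtemp()
    dest = tempfile.mkdtemp()
    for partition, count in (("dt=2010-09-24", 3), ("dt=2010-09-25", 2)):
        os.makedirs(os.path.join(src, partition))
        for i in range(count):
            path = os.path.join(src, partition, "part-%05d" % i)
            with open(path, "w") as f:
                f.write("row %d\n" % i)
    print(merge_small_files(src, dest))
```

On a real cluster the same bookkeeping (which inputs go into which
output, and preserving the directory layout) has to be done against
HDFS and spread across tasks, which is why a ready-made tool is handy.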

What we came up with is the filecrusher.

/usr/bin/hadoop jar filecrush.jar crush.Crush /directory/to/compact \
    /user/edward/backup 50 SEQUENCE
(50 is the number of mappers here)

Code is Apache V2 and you can get it here: