

Re: A new way to merge up those small files!
Edward:
Thanks for the tool.

I think the last parameter (SEQUENCE in your example) could be omitted if
the tool followed what hadoop fs -text does.
It reads the magic number at the start of a file so that it can attempt to
*detect* the type of the file.
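
A minimal sketch of that kind of detection, for illustration only. It is not
the filecrush code and not the actual hadoop fs -text implementation. It
assumes only that a SequenceFile starts with the bytes 'S', 'E', 'Q' and that
a gzip stream starts with 0x1f 0x8b; the names MagicSniffer, Kind and detect
are hypothetical.

```java
// Hypothetical helper: guess a file's type from its first few bytes.
final class MagicSniffer {

    // The three cases this sketch distinguishes.
    enum Kind { SEQUENCE, GZIP, TEXT }

    // header holds the leading bytes of the file (three bytes are enough).
    static Kind detect(byte[] header) {
        // SequenceFile header: 'S', 'E', 'Q', followed by a version byte.
        if (header.length >= 3
                && header[0] == 'S' && header[1] == 'E' && header[2] == 'Q') {
            return Kind.SEQUENCE;
        }
        // gzip magic number: 0x1f 0x8b (mask to compare as unsigned values).
        if (header.length >= 2
                && (header[0] & 0xff) == 0x1f && (header[1] & 0xff) == 0x8b) {
            return Kind.GZIP;
        }
        // Anything else is treated as plain text in this sketch.
        return Kind.TEXT;
    }
}
```

A tool could read the first three bytes of one input file, call detect on
them, and fall back to an explicit TEXT/SEQUENCE argument only when the
caller wants to override the guess.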

Cheers

On Fri, Sep 24, 2010 at 11:41 PM, Edward Capriolo <[EMAIL PROTECTED]> wrote:

> Many times a Hadoop job produces a file per reducer and the job has
> many reducers. Or a map-only job produces one output file per input file
> and you have many input files. Or you just have many small files from some
> external process. Hadoop handles small files suboptimally.
> There are some ways to handle this inside a MapReduce program,
> IdentityMapper + IdentityReducer for example, or multiple outputs. However,
> we wanted a tool that could be used by people using Hive, or Pig, or
> MapReduce. We wanted to allow people to combine a directory with
> multiple files or a hierarchy of directories like the root of a Hive
> partitioned table. We also wanted to be able to combine text or
> sequence files.
>
> What we came up with is the filecrusher.
>
> Usage:
> /usr/bin/hadoop jar filecrush.jar crush.Crush /directory/to/compact
> /user/edward/backup 50 SEQUENCE
> (50 is the number of mappers here)
>
> Code is Apache V2 and you can get it here:
> http://www.jointhegrid.com/hadoop_filecrush/index.jsp
>
> Enjoy,
> Edward
>