Hadoop >> mail # user >> A new way to merge up those small files!


Re: A new way to merge up those small files!
Ted,

Good point. Patches are welcome :) I will add it onto my to-do list.

Edward

On Sat, Sep 25, 2010 at 12:05 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> Edward:
> Thanks for the tool.
>
> I think the last parameter can be omitted if you follow what hadoop fs -text
> does.
> It looks at a file's magic number so that it can attempt to *detect* the
> type of the file.
>
> Cheers
>
> On Fri, Sep 24, 2010 at 11:41 PM, Edward Capriolo <[EMAIL PROTECTED]> wrote:
>
>> Many times a Hadoop job produces a file per reducer and the job has
>> many reducers. Or a map-only job produces one output file per input
>> file and you have many input files. Or you just have many small files
>> from some external process. Hadoop has suboptimal handling of small
>> files. There are some ways to handle this inside a MapReduce program,
>> IdentityMapper + IdentityReducer for example, or multiple outputs.
>> However, we wanted a tool that could be used by people using Hive, or
>> Pig, or MapReduce. We wanted to allow people to combine a directory
>> with multiple files or a hierarchy of directories like the root of a
>> Hive partitioned table. We also wanted to be able to combine text or
>> sequence files.
>>
>> What we came up with is the filecrusher.
>>
>> Usage:
>> /usr/bin/hadoop jar filecrush.jar crush.Crush /directory/to/compact
>> /user/edward/backup 50 SEQUENCE
>> (50 is the number of mappers here)
>>
>> Code is Apache V2 and you can get it here:
>> http://www.jointhegrid.com/hadoop_filecrush/index.jsp
>>
>> Enjoy,
>> Edward
>>
>
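
Ted's suggestion of detecting the file type from its magic number could look roughly like the sketch below. It is only an illustration, not code from filecrush or from hadoop fs -text. It assumes that a Hadoop SequenceFile starts with the three bytes "SEQ" and that a gzip file starts with 0x1f 0x8b; the names detect_from_header and detect_file_type and the returned labels are hypothetical. It reads a local file, so real use against HDFS would need the header bytes fetched through an HDFS client instead.

```python
# Assumed magic numbers: SequenceFile headers begin with b"SEQ",
# gzip streams begin with the two bytes 0x1f 0x8b.
SEQUENCE_MAGIC = b"SEQ"
GZIP_MAGIC = b"\x1f\x8b"


def detect_from_header(header):
    """Guess a file type from its first few bytes (hypothetical helper)."""
    if header.startswith(SEQUENCE_MAGIC):
        return "SEQUENCE"
    if header.startswith(GZIP_MAGIC):
        return "GZIP"
    # Anything unrecognised, including an empty file, is treated as plain text.
    return "TEXT"


def detect_file_type(path):
    """Read the first three bytes of a local file and classify them."""
    with open(path, "rb") as f:
        return detect_from_header(f.read(3))
```

With a check like this, the trailing SEQUENCE argument in the usage line could in principle be inferred from the first file in the input directory rather than passed by hand. Falling back to TEXT for unknown headers is a design choice of this sketch; a stricter tool might refuse to guess.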