Re: Merging files
How big are your 50 files?  How long are the reducers taking?

- HC

On Jul 30, 2013, at 10:26 PM, Something Something <[EMAIL PROTECTED]> wrote:

> Hello,
>
> One of our Pig scripts creates over 500 small part files.  To save on
> namespace, we need to cut down the # of files, so instead of saving 500
> small files we need to merge them into 50.  We tried the following:
>
> 1)  When we set the parallel number to 50, the Pig script takes a long time -
> for obvious reasons.
> 2)  If we use Hadoop Streaming, it puts some garbage values into the key
> field.
> 3)  We wrote our own MapReduce program that reads these 500 small part
> files & uses 50 reducers.  Basically, the Mappers simply write the line &
> the reducers loop through the values & write them out.  We set
> job.setOutputKeyClass(NullWritable.class) so that the key is not written to
> the output file.  This is performing better than Pig.  The Mappers run
> very fast, but the Reducers take some time to complete; still, this approach
> seems to be working well.
>
> Is there a better way to do this?  What strategies can you think of to
> increase the speed of the reducers?
>
> Any help in this regard will be greatly appreciated.  Thanks.
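
A minimal sketch of the kind of identity merge job described in (3) above, assuming plain-text lines and a random bucket as the map output key (the post does not say which map output key was actually used, so that part is a guess), might look like this:

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeSmallFiles {

    // Mapper: emit each line unchanged under a random bucket key so lines
    // spread roughly evenly across the reducers. The bucket key is a
    // hypothetical choice, not taken from the original post.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, IntWritable, Text> {
        private static final int NUM_BUCKETS = 50;
        private final Random random = new Random();
        private final IntWritable bucket = new IntWritable();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            bucket.set(random.nextInt(NUM_BUCKETS));
            context.write(bucket, line);
        }
    }

    // Reducer: loop through the values and write each line with a NullWritable
    // key, so TextOutputFormat emits only the line itself.
    public static class PassThroughReducer
            extends Reducer<IntWritable, Text, NullWritable, Text> {
        @Override
        protected void reduce(IntWritable bucket, Iterable<Text> lines, Context context)
                throws IOException, InterruptedException {
            for (Text line : lines) {
                context.write(NullWritable.get(), line);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "merge small part files");
        job.setJarByClass(MergeSmallFiles.class);

        job.setMapperClass(PassThroughMapper.class);
        job.setReducerClass(PassThroughReducer.class);
        job.setNumReduceTasks(50);                  // 500 small inputs -> 50 output files

        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);  // key is not written to the output
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

In a sketch like this the random bucket spreads lines roughly evenly across the 50 reducers, but the shuffle still copies every line, so the reducers stay the slow stage regardless of how fast the mappers finish.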