Pig user mailing list: Merging files


Thread (9 messages):
Something Something    2013-07-31, 05:26
Ben Juhn               2013-07-31, 05:34
Something Something    2013-07-31, 06:40
John Meagher           2013-07-31, 13:28
Something Something    2013-07-31, 16:21
j.barrett Strausser    2013-07-31, 16:42
John Meagher           2013-07-31, 17:28
Something Something    2013-07-31, 20:39
j.barrett Strausser    2013-07-31, 21:01

Re: Merging files
How big are your 50 files?  How long are the reducers taking?

- HC

On Jul 30, 2013, at 10:26 PM, Something Something <[EMAIL PROTECTED]> wrote:

> Hello,
>
> One of our Pig scripts creates over 500 small part files.  To save on
> HDFS namespace, we need to cut down the number of files, so instead of
> saving 500 small files we need to merge them into 50.  We tried the following:
>
> 1)  When we set the parallelism to 50, the Pig script takes a long time -
> for obvious reasons.
> 2)  If we use Hadoop Streaming, it puts some garbage values into the key
> field.
> 3)  We wrote our own MapReduce program that reads these 500 small part
> files & uses 50 reducers.  Basically, the Mappers simply write each line &
> the Reducers loop through the values & write them out.  We set
> job.setOutputKeyClass(NullWritable.class) so that the key is not written to
> the output file.  This is performing better than Pig.  The Mappers run
> very fast and the Reducers take some time to complete, but the approach
> seems to be working well.  [A sketch of this job follows the quoted message.]
>
> Is there a better way to do this?  What strategies can you think of to
> increase the speed of the reducers?
>
> Any help in this regard will be greatly appreciated.  Thanks.
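[Editor's note: for reference, here is a minimal sketch of the pass-through
job described in option 3, assuming plain-text input and the newer
org.apache.hadoop.mapreduce API.  The original post does not say what map
key was used; the round-robin bucket key below is one simple choice, and
the class and field names (MergePartFiles, LineMapper, LineReducer) are
illustrative, not taken from the poster's code.]

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergePartFiles {

    // Mapper: pass each input line through under a small integer key.
    // The key only decides which of the 50 reducers receives the line;
    // a per-mapper round-robin counter spreads the lines evenly.
    public static class LineMapper
            extends Mapper<LongWritable, Text, IntWritable, Text> {
        private final IntWritable bucket = new IntWritable();
        private int next = 0;

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            bucket.set(next++ % 50);
            ctx.write(bucket, line);
        }
    }

    // Reducer: loop through the lines in each bucket and write them out.
    // With a NullWritable output key, TextOutputFormat emits only the line
    // text, which matches the job.setOutputKeyClass(NullWritable.class)
    // trick from the post.
    public static class LineReducer
            extends Reducer<IntWritable, Text, NullWritable, Text> {
        @Override
        protected void reduce(IntWritable bucket, Iterable<Text> lines,
                              Context ctx)
                throws IOException, InterruptedException {
            for (Text line : lines) {
                ctx.write(NullWritable.get(), line);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "merge part files");
        job.setJarByClass(MergePartFiles.class);
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(LineReducer.class);
        job.setNumReduceTasks(50);                 // 500 part files in, 50 out
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class); // key is not written
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

One thing the sketch makes visible: every byte of the 500 input files
travels through the shuffle and gets sorted and merged on the reduce side,
so reducer time grows with total data size rather than with file count.
That is presumably what the "How long are the reducers taking?" question
above is probing.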