Ben Juhn 2013-07-31, 05:34
How big are your 50 files? How long are the reducers taking?
On Jul 30, 2013, at 10:26 PM, Something Something <[EMAIL PROTECTED]> wrote:
> One of our Pig scripts creates over 500 small part files. To save
> NameNode namespace, we need to cut down the number of files, so instead of
> saving 500 small files we want to merge them into 50. We tried the following:
> 1) When we set the parallel number to 50, the Pig script takes a long time -
> for obvious reasons.
> 2) If we use Hadoop Streaming, it puts some garbage values into the key.
> 3) We wrote our own MapReduce program that reads these 500 small part
> files and uses 50 reducers. Basically, the mappers simply write each line,
> and the reducers loop through the values and write them out. We set
> job.setOutputKeyClass(NullWritable.class) so that the key is not written to
> the output file. This performs better than Pig: the mappers run very fast,
> and while the reducers take some time to complete, this approach seems to
> be working well.
> Is there a better way to do this? What strategies can you think of to
> increase the speed of the reducers?
> Any help in this regard will be greatly appreciated. Thanks.
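
For reference, here is a minimal sketch of the kind of merge job described in (3), written against the newer org.apache.hadoop.mapreduce API. The original post does not say what key the mappers emit to spread lines across the 50 reducers, so the random bucket key, the class names, and the hard-coded 50 below are illustrative assumptions, not the poster's actual code:

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeFiles {

  // Mapper tags each line with a random bucket in [0, 50) so the lines
  // spread evenly over the 50 reducers. (Bucket choice is an assumption;
  // the original post does not show its map output key.)
  public static class LineMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    private final IntWritable bucket = new IntWritable();
    private final Random rand = new Random();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      bucket.set(rand.nextInt(50));
      context.write(bucket, line);
    }
  }

  // Reducer copies every line through unchanged; the NullWritable key
  // keeps TextOutputFormat from writing a key and tab before each line.
  public static class LineReducer
      extends Reducer<IntWritable, Text, NullWritable, Text> {
    @Override
    protected void reduce(IntWritable bucket, Iterable<Text> lines,
        Context context) throws IOException, InterruptedException {
      for (Text line : lines) {
        context.write(NullWritable.get(), line);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "merge-small-files");
    job.setJarByClass(MergeFiles.class);
    job.setMapperClass(LineMapper.class);
    job.setReducerClass(LineReducer.class);
    job.setNumReduceTasks(50);                 // one output part file per reducer
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(NullWritable.class); // key is not written to the output
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // dir with ~500 parts
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // dir with 50 parts
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Since the reducers do nothing but copy lines, their runtime is dominated by shuffling every byte of the input across the network; assigning lines to random buckets at least keeps the 50 reducers evenly loaded during that shuffle.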