Pig, mail # user - Large Bag (100GB of Data) in Reduce Step


Re: Large Bag (100GB of Data) in Reduce Step
Pradeep Gollakota 2013-07-22, 14:12
There's only one thing that comes to mind for this particular toy example.

From the "Programming Pig" book, the "pig.cached.bag.memusage" property is
the "Percentage of the heap that Pig
will allocate for all of the bags in a map or reduce task. Once the bags
fill up this amount, the data is spilled to disk. Setting this to a higher
value will reduce spills to disk during execution but increase the
likelihood of a task running out of heap."
The default value of this property is 0.1.

So, you can try setting this to a higher value to see if it can improve
performance.
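
For example, a minimal sketch of raising it (the 0.3 value is just an
illustration, not a recommendation; tune it against your heap size). At the
top of your Pig script:

    set pig.cached.bag.memusage 0.3;    -- default is 0.1

or on the command line when launching the script:

    pig -Dpig.cached.bag.memusage=0.3 myscript.pig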

Other than the above setting, I can only quote the basic patterns for
optimizing performance (also from Programming Pig); a small sketch of the
first two follows below:
- Filter early and often
- Project early and often
- Set up your joins properly
- etc.
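
For instance, a minimal sketch of "filter early" and "project early" applied
to a script shaped like yours (the $2 IS NOT NULL condition and the $0, $1
projection are placeholders; substitute whatever filter and columns your data
actually needs):

    A  = LOAD '/tmp/data';
    A1 = FILTER A BY $2 IS NOT NULL;     -- filter early: drop rows you don't need before the GROUP
    A2 = FOREACH A1 GENERATE $0, $1;     -- project early: keep only the columns used downstream
    B  = GROUP A2 BY $0;
    C  = FOREACH B GENERATE FLATTEN($1);

The less data that reaches the GROUP, the smaller each bag gets and the less
spilling you should see.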

On Mon, Jul 22, 2013 at 9:31 AM, Jerry Lam <[EMAIL PROTECTED]> wrote:

> Hi Pig users,
>
> I have a question regarding how to handle a large bag of data in reduce
> step.
> It happens that after I do the following (see below), each group has about
> 100GB of data to process. The bag is spilled continuously and the job is
> very slow. What is your recommendation for speeding up the processing when
> you find yourself with a large bag of data (over 100GB) to process?
>
> A = LOAD '/tmp/data';
> B = GROUP A by $0;
> C = FOREACH B generate FLATTEN($1); -- this takes very very long because of
> a large bag
>
> Best Regards,
>
> Jerry
>