Pig >> mail # user >> Large Bag (100GB of Data) in Reduce Step

Jerry Lam 2013-07-22, 13:31
Re: Large Bag (100GB of Data) in Reduce Step
There's only one thing that comes to mind for this particular toy example.

From the "Programming Pig" book,
"pig.cached.bag.memusage" property is the "Percentage of the heap that Pig
will allocate for all of the bags in a map or reduce task. Once the bags
fill up this amount, the data is spilled to disk. Setting this to a higher
value will reduce spills to disk during execution but increase the
likelihood of a task running out of heap."
The default value of this property is 0.1.

So, you can try setting this to a higher value to see if it improves performance.
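One way to raise it, assuming the property name quoted from the book matches your Pig version, is the `set` command at the top of your script:

```pig
-- Let bags use 40% of the task heap instead of the default 10%.
-- 0.4 is an illustrative value, not a recommendation; tune it against
-- your task heap size, since a higher value raises the risk of OOM.
set pig.cached.bag.memusage 0.4;
```

The same property should also be settable at launch, e.g. `pig -Dpig.cached.bag.memusage=0.4 yourscript.pig`.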

Other than the above setting, I can only quote the basic patterns for
optimizing performance (also from Programming Pig):
Filter early and often
Project early and often
Set up your joins properly
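Applied to a script shaped like yours, "filter early and often" and "project early and often" mean cutting rows and columns before the GROUP, so the bags that reach the reducers are as small as possible. A rough sketch (the column positions and the filter condition are made up for illustration):

```pig
A = LOAD '/tmp/data';
-- Project early: keep only the columns the rest of the script needs
-- ($0 and $1 are assumed here; substitute your real columns).
A_narrow = FOREACH A GENERATE $0, $1;
-- Filter early: drop rows you will never use before they are grouped.
A_kept = FILTER A_narrow BY $1 IS NOT NULL;
B = GROUP A_kept BY $0;
```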

On Mon, Jul 22, 2013 at 9:31 AM, Jerry Lam <[EMAIL PROTECTED]> wrote:

> Hi Pig users,
> I have a question regarding how to handle a large bag of data in reduce
> step.
> It happens that after I do the following (see below), each group has about
> 100GB of data to process. The bag is spilled continuously and the job is
> very slow. What is your recommendation for speeding up the processing when
> you find yourself with a large bag of data (over 100GB) to process?
> A = LOAD '/tmp/data';
> B = GROUP A by $0;
> C = FOREACH B generate FLATTEN($1); -- this takes very very long because of
> a large bag
> Best Regards,
> Jerry