Pig >> mail # user >> Large Bag (100GB of Data) in Reduce Step


Re: Large Bag (100GB of Data) in Reduce Step
There's only one thing that comes to mind for this particular toy example.

From the "Programming Pig" book, the
"pig.cached.bag.memusage" property is the "Percentage of the heap that Pig
will allocate for all of the bags in a map or reduce task. Once the bags
fill up this amount, the data is spilled to disk. Setting this to a higher
value will reduce spills to disk during execution but increase the
likelihood of a task running out of heap."
The default value of this property is 0.1

So, you can try setting this to a higher value to see if it can improve
performance.
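
As a rough sketch of what that could look like, the property can be set at
the top of the script with Pig's SET command. The value 0.3 here is only an
illustrative guess, not a recommendation; the right number depends on how
much heap your reduce tasks have, and a value that is too high risks the
out-of-heap failures the book warns about.

```pig
-- Assumption: 0.3 is an arbitrary example value; tune it for your heap size.
SET pig.cached.bag.memusage '0.3';

A = LOAD '/tmp/data';
B = GROUP A BY $0;
C = FOREACH B GENERATE FLATTEN($1);
```

The same property should also be settable from the command line as a Java
system property (for example -Dpig.cached.bag.memusage=0.3) if you prefer not
to change the script.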

Other than the above setting, I can only quote the basic patterns for
optimizing performance (also from Programming Pig):
Filter early and often
Project early and often
Set up your joins properly
etc.
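
Applied to your script, the first two patterns might look something like the
sketch below. It assumes (hypothetically, since I don't know your data) that
only the first two fields are needed downstream and that records with a null
key can be dropped; neither may be true for your job, in which case adjust or
skip those steps.

```pig
A  = LOAD '/tmp/data';

-- Filter early: drop records you do not need before the shuffle.
-- Assumption: records with a null group key are not needed.
A1 = FILTER A BY $0 IS NOT NULL;

-- Project early: keep only the fields used later, so each bag is smaller.
-- Assumption: only $0 and $1 are needed downstream.
A2 = FOREACH A1 GENERATE $0, $1;

B  = GROUP A2 BY $0;
C  = FOREACH B GENERATE FLATTEN($1);
```

Neither step removes the need to spill if a single group is still far larger
than the heap, but smaller records mean less data to write and read back.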

On Mon, Jul 22, 2013 at 9:31 AM, Jerry Lam <[EMAIL PROTECTED]> wrote:

> Hi Pig users,
>
> I have a question regarding how to handle a large bag of data in reduce
> step.
> It happens that after I do the following (see below), each group has about
> 100GB of data to process. The bag is spilled continuously and the job is
> very slow. What is your recommendation for speeding up the processing when
> you find yourself with a large bag of data (over 100GB) to process?
>
> A = LOAD '/tmp/data';
> B = GROUP A by $0;
> C = FOREACH B generate FLATTEN($1); -- this takes very very long because of a large bag
>
> Best Regards,
>
> Jerry
>