Re: Large Bag (100GB of Data) in Reduce Step
Hi Pradeep,

Although this query looks simplistic, it is very close to the real one. :)
The actual one looks like:

A = LOAD '/tmp/data';
C = FOREACH (GROUP A by $0) {
          generate FLATTEN(A.$1); -- this takes very, very long because of a large bag
}

I did try increasing pig.cached.bag.memusage to 0.5, but it is still very
slow. I followed all the recommendations, but they didn't help much.
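
For reference, a minimal way to set it from the script itself (assuming the
SET command propagates this property to the job configuration; it can also be
passed with -D on the pig command line):

set pig.cached.bag.memusage 0.5;
-- ... rest of the script as above ...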

The above query can run for 8 hours, bottlenecked by a single reducer that
receives 100GB of data. The data bag for a group in that particular reducer
spills continuously.

I changed the above query to something like the one below:

A = LOAD '/tmp/data';
D = FOREACH A generate FLATTEN($1); -- notice that I moved the FLATTEN operation to an earlier stage (from reduce-side to map-side flattening).
B = GROUP D by $0;
STORE B into 'tmp/out';

This query finishes in 2 hours. Contrary to the usual best practice, it is
better not to flatten the data in the reduce step when the data size is very
large, because of the spill-to-disk behavior.

I wonder if this is a performance issue in the spill-to-disk algorithm?

Best Regards,

Jerry
On Mon, Jul 22, 2013 at 10:12 AM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:

> There's only one thing that comes to mind for this particular toy example.
>
> From the "Programming Pig" book, the "pig.cached.bag.memusage" property is
> the "Percentage of the heap that Pig will allocate for all of the bags in a
> map or reduce task. Once the bags fill up this amount, the data is spilled
> to disk. Setting this to a higher value will reduce spills to disk during
> execution but increase the likelihood of a task running out of heap."
> The default value of this property is 0.1.
>
> So, you can try setting this to a higher value to see if it can improve
> performance.
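>
> For example, one way to pass a higher value on the command line (assuming
> the standard pig launcher, which forwards -D properties to the JVM;
> "myscript.pig" is just a placeholder):
>
> pig -Dpig.cached.bag.memusage=0.3 myscript.pig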
>
> Other than the above setting, I can only quote the basic patterns for
> optimizing performance (also from Programming Pig):
> Filter early and often
> Project early and often
> Set up your joins properly
> etc. (a rough sketch of the first two is below)
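>
> A rough illustration of the first two patterns (the schema, the filter, and
> the column names are all made up):
>
> A = LOAD '/tmp/data' AS (key, val, unused);
> A1 = FILTER A BY val IS NOT NULL; -- filter early, before any GROUP
> A2 = FOREACH A1 GENERATE key, val; -- project early: drop unused columns
> B = GROUP A2 BY key;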
>
>
>
> On Mon, Jul 22, 2013 at 9:31 AM, Jerry Lam <[EMAIL PROTECTED]> wrote:
>
> > Hi Pig users,
> >
> > I have a question regarding how to handle a large bag of data in the
> > reduce step.
> > It happens that after I do the following (see below), each group has
> > about 100GB of data to process. The bag is spilled continuously and the
> > job is very slow. What is your recommendation for speeding up the
> > processing when you find yourself with a large bag of data (over 100GB)
> > to process?
> >
> > A = LOAD '/tmp/data';
> > B = GROUP A by $0;
> > C = FOREACH B generate FLATTEN($1); -- this takes very, very long because of a large bag
> >
> > Best Regards,
> >
> > Jerry
> >
>