Pig >> mail # user >> Large Bag (100GB of Data) in Reduce Step

Large Bag (100GB of Data) in Reduce Step
Hi Pig users,

I have a question about handling a large bag of data in the reduce step.
After I run the script below, each group has about 100GB of data to
process. The bag spills continuously and the job is very slow. What would
you recommend for speeding up processing when you find yourself with a
large bag of data (over 100GB) to process?

A = LOAD '/tmp/data';
B = GROUP A BY $0;
-- this takes very, very long because of the large bag
C = FOREACH B GENERATE FLATTEN($1);
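
For comparison, here is a variant that avoids materializing the full bag on the reducer. This is only a sketch: it assumes the bag is ultimately consumed by an algebraic aggregate (COUNT stands in for whatever the real per-group computation is), in which case Pig can use the combiner and never needs to hold 100GB per group in memory:

```pig
A = LOAD '/tmp/data';
B = GROUP A BY $0;
-- COUNT is algebraic, so Pig runs it partially in the combiner
-- and the reducer only merges partial counts instead of
-- materializing the whole bag
C = FOREACH B GENERATE group, COUNT(A);
```

If the FLATTEN is really all that is needed (i.e. the grouped bag is only flattened back out, with no per-group computation), the GROUP/FLATTEN pair just reproduces the rows of A and could be dropped entirely.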

Best Regards,