sonia gehlot 2012-07-02, 23:59
You can set different levels of parallelism for different parts of your script by attaching a parallel clause to the individual operations. For example:
Y = join W by a, X by b parallel 100;
Z = order Y by a parallel 1;
store Z into 'onefile';
If your output is big, I would suggest also trying the order in parallel and then concatenating the resulting part files with HDFS's cat command in a separate pass, to see whether that is faster. The data gets written twice, but no single reducer is flooded with all of it.
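A rough sketch of that two-pass variant, reusing the relation and field names from the example above. The output path 'sorted_parts' and the parallel level of 100 on the order are placeholders I made up for illustration, not tuned values:

```pig
-- Sketch only: W, X, a and b come from the example above.
-- 'sorted_parts' is a hypothetical output directory.
Y = join W by a, X by b parallel 100;
-- Sort across many reducers instead of funnelling everything into one.
Z = order Y by a parallel 100;
-- This writes one part file per reducer rather than a single file.
store Z into 'sorted_parts';
```

The second pass would then be something along the lines of hadoop fs -cat 'sorted_parts/part-*' | hadoop fs -put - onefile (or hadoop fs -getmerge if a local copy is enough). This assumes the part files are picked up in partition order, which is worth checking on a small sample before relying on it.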
On Jul 2, 2012, at 4:59 PM, sonia gehlot wrote:
> Hi Guys,
> I have a use case where I need to generate a data feed using a Pig script. The
> data feed is about 12 GB in total.
> I want the Pig script to generate one file, and the data in that file should be
> sorted as well. I know I can run it with one reducer, but as the dataset is big,
> with a lot of joins, it takes forever to finish.
> What are the other options to get one sorted file with better performance?
> Thanks in advance,
sonia gehlot 2012-07-03, 19:18
Duckworth, Will 2012-07-03, 01:57