One file with sorted results.
sonia gehlot 2012-07-02, 23:59
Hi Guys,
I have a use case where I need to generate a data feed using a Pig script. The data feed is about 12 GB in total.
I want the Pig script to generate one file, and the data in that file should be sorted as well. I know I can run it with one reducer, but because the dataset is big and the script has a lot of joins, it takes forever to finish.
What are the other options to get one sorted file with better performance?
Thanks in advance,
Sonia
Re: One file with sorted results.
Alan Gates 2012-07-03, 14:56
You can set different parallel levels at different parts of your script by attaching parallel to the different operations. For example:
Y = join W by a, X by b parallel 100;
Z = order Y by a parallel 1;
store Z into 'onefile';
If your output is big I would suggest trying out ordering in parallel as well and then using HDFS's cat command in a separate pass to see if it is faster. It will write twice but it won't flood one reducer with all of the data.
Alan.
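To illustrate why ordering in parallel and then concatenating can still give one sorted file, here is a minimal Python sketch. It assumes the parallel order assigns each reducer a contiguous key range, so the part files are sorted both internally and relative to each other; that is my simplified reading of the suggestion above, not a description of Pig's internals. The function names (range_partition, parallel_order, cat_parts) are hypothetical, and the boundaries are picked from the full sorted input purely for simplicity.

```python
from bisect import bisect_right


def range_partition(records, boundaries):
    """Place each record in the bucket whose key range contains it."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        # bisect_right sends equal keys to the same bucket every time.
        buckets[bisect_right(boundaries, rec)].append(rec)
    return buckets


def parallel_order(records, num_reducers):
    """Simulate an order-by spread over several reducers."""
    # Assumption: boundaries come from the whole sorted input; a real
    # system would estimate them from a sample instead.
    sample = sorted(records)
    step = max(1, len(sample) // num_reducers)
    boundaries = sample[step::step][: num_reducers - 1]
    # Each "reducer" sorts only its own bucket.
    return [sorted(bucket) for bucket in range_partition(records, boundaries)]


def cat_parts(parts):
    """Concatenate the parts in order, like cat over the part files."""
    return [rec for part in parts for rec in part]


if __name__ == "__main__":
    data = [5, 3, 9, 1, 7, 2, 8, 4, 6, 0]
    parts = parallel_order(data, 3)
    print(parts)
    print(cat_parts(parts))
```

Because every key in one part is less than or equal to every key in the next, concatenating the parts in order needs no merge step, which is why the extra write pass can be cheaper than pushing all the data through one reducer.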
Re: One file with sorted results.
sonia gehlot 2012-07-03, 19:18
Thanks Alan,
I will try this.
-Sonia
RE: One file with sorted results.
Duckworth, Will 2012-07-03, 01:57
Have you tried breaking it into 2 jobs? The first does the pre-sort work, then a final job does the sort with a single reducer.
Will Duckworth
Senior Vice President, Software Engineering | comScore, Inc. (NASDAQ:SCOR)