Pig >> mail # user >> One file with sorted results.


Re: One file with sorted results.
You can set a different level of parallelism for each part of your script by attaching a parallel clause to the individual operations.  For example:

Y = join W by a, X by b parallel 100;
Z = order Y by a parallel 1;
store Z into 'onefile';

If your output is big, I would suggest also trying the order in parallel and then concatenating the resulting part files with HDFS's cat command in a separate pass, to see if that is faster.  The data gets written twice, but no single reducer is flooded with all of it.
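
A minimal sketch of that two-pass variant, reusing the relations W, X, Y, Z and the fields a and b from the example above. The output directory name 'sorted_parts' and the parallel value of 100 on the order are assumptions chosen for illustration, not settings tuned for this data:

```pig
-- Sketch only: 'sorted_parts' and the parallel values are illustrative placeholders.
Y = join W by a, X by b parallel 100;
-- Sort with many reducers instead of one. Each reducer writes its own part file,
-- and the part files taken in name order are expected to form one globally sorted sequence.
Z = order Y by a parallel 100;
store Z into 'sorted_parts';
```

The second pass then runs outside Pig. Assuming the part files follow the usual zero-padded part-* naming so that they list in sorted order, something like hadoop fs -cat 'sorted_parts/part-*' | hadoop fs -put - onefile should concatenate them into a single HDFS file. It is worth checking the first and last few records of the result to confirm the order before relying on it.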

Alan.

On Jul 2, 2012, at 4:59 PM, sonia gehlot wrote:

> Hi Guys,
>
> I have a use case where I need to generate a data feed using a Pig script. The
> data feed is about 12 GB in total.
>
> I want the Pig script to generate one file, and the data in that file should be
> sorted as well. I know I can run it with one reducer, but because the dataset is
> big and has a lot of joins, it takes forever to finish.
>
> What are the other options for getting one sorted file with better performance?
>
> Thanks in advance,
>
> Sonia