Pig >> mail # user >> Is that possible to use Pig to do an optimized secondary sort.

Re: Is that possible to use Pig to do an optimized secondary sort.
I'd love to see an example of a secondary sort in a nested foreach.
Does anyone have one?

Russell Jurney http://datasyndrome.com

On Oct 31, 2012, at 8:22 AM, Alan Gates <[EMAIL PROTECTED]> wrote:

> Seeing your Pig Latin script will help us determine whether this will work in your case.  But in general Pig uses secondary sort when you do an order by in a nested foreach.  So if you are grouping you could order within that group and then pass it to your UDF.
> Alan.
> On Oct 31, 2012, at 1:20 AM, Stanley Xu wrote:
>> Dear buddies,
>> We are trying to write some UDFs to do machine learning work. We
>> ran a simple experiment that calculates the AUC through a UDF, as in the
>> code in this gist:
>> https://gist.github.com/3985764
>> The map-reduce job itself takes only a few minutes, but then it waits
>> for hours doing the cleanup.
>> I guess the reason is that the sort inside the foreach spills a lot
>> of data to the local fs, and cleaning that up takes a long time.
>> In a Java map-reduce program, we could do this as a secondary sort: we
>> make model + ctr the key so that each model's ctr values are sorted, and
>> group by only the model name part, so the sort is already done after shuffling.
>> I am wondering whether we could do that kind of optimization in Pig as well?
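
For reference, here is a minimal sketch of the nested-foreach pattern Alan describes, not the script from the gist. The input path, the field names (model, ctr, clicked), the jar name, and the UDF class com.example.AUC are all assumptions made up for illustration.

```pig
-- Hypothetical jar and UDF class; substitute your own.
REGISTER 'myudfs.jar';

-- Hypothetical input path and schema: one row per prediction.
raw = LOAD 'predictions' AS (model:chararray, ctr:double, clicked:int);

-- Group on the model name only.
grouped = GROUP raw BY model;

-- The ORDER inside the nested FOREACH is what lets Pig use a secondary
-- sort: ctr is ordered during the shuffle, so each model's bag arrives
-- at the reducer already sorted.
result = FOREACH grouped {
    sorted = ORDER raw BY ctr DESC;
    GENERATE group AS model, com.example.AUC(sorted) AS auc;
};

STORE result INTO 'auc_by_model';
```

With this shape the UDF can assume its input bag is already ordered by ctr and skip any sorting of its own. Running EXPLAIN on result should show whether the secondary-key optimization was actually applied to your script.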