Re: Is that possible to use Pig to do an optimized secondary sort.
Seeing your Pig Latin script would help us determine whether this will work in your case. In general, though, Pig uses a secondary sort when you do an ORDER BY inside a nested foreach. So if you are grouping, you can order the records within each group and then pass the sorted bag to your UDF.
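
A minimal sketch of that pattern follows. Since the original script is only linked and not shown here, everything specific in it is an assumption made for illustration: the input path, the field names (model, ctr, clicked), the jar name, and the UDF class com.example.AUC are all hypothetical.

```pig
-- Hypothetical jar and UDF class; substitute your own.
REGISTER 'my-udfs.jar';
DEFINE AUC com.example.AUC();

-- Hypothetical input: one row per prediction.
data = LOAD 'predictions' AS (model:chararray, ctr:double, clicked:int);

-- Group on the model name only.
grouped = GROUP data BY model;

-- ORDER inside the nested foreach is what lets Pig apply its
-- secondary-sort optimization: the bag reaches the UDF already sorted by ctr.
result = FOREACH grouped {
    sorted = ORDER data BY ctr DESC;
    GENERATE group AS model, AUC(sorted) AS auc;
};

STORE result INTO 'auc_by_model';
```

The UDF should then consume the bag in the order it arrives rather than sorting it again itself. Whether the optimization actually kicks in for a given script is something to confirm by running EXPLAIN on the result alias.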


On Oct 31, 2012, at 1:20 AM, Stanley Xu wrote:

> Dear buddies,
> We are trying to write some UDFs to do machine learning work. We ran a
> simple experiment that calculates the AUC through a UDF, as in the
> following gist:
> https://gist.github.com/3985764
> The map-reduce job itself takes only a few minutes, but then waits for
> hours doing the cleanup.
> I guess the reason is that the sort inside the foreach spills lots of
> data to the local fs, which takes a long time to clean up.
> In a Java map-reduce program, we could do this as a secondary sort. We
> make model + ctr the key so that each model's ctr values are sorted, and
> group by only the model name part, so the sort is already done after
> shuffling.
> I am wondering if we could do that kind of optimization in Pig as well?