Stanley Xu 2012-10-31, 08:20
Alan Gates 2012-10-31, 15:21
Stanley Xu 2012-11-01, 05:03
Re: Is that possible to use Pig to do an optimized secondary sort.
Russell Jurney 2012-10-31, 15:55
I'd love to see an example of a secondary sort in a nested foreach.
Does anyone have one?

Russell Jurney http://datasyndrome.com

On Oct 31, 2012, at 8:22 AM, Alan Gates <[EMAIL PROTECTED]> wrote:

> Seeing your Pig Latin script will help us determine whether this will work in your case. But in general, Pig uses a secondary sort when you do an order by in a nested foreach. So if you are grouping, you could order within that group and then pass the result to your UDF.
>
> Alan.
>
> On Oct 31, 2012, at 1:20 AM, Stanley Xu wrote:
>
>> Dear buddies,
>>
>> We are trying to write some UDFs to do some machine learning work. We
>> ran a simple experiment to calculate the AUC through a UDF, as in the
>> following gist:
>>
>> https://gist.github.com/3985764
>>
>> The map-reduce job itself takes only a couple of minutes, but then waits
>> for hours doing the cleanup.
>>
>> I guess the reason is that the sort inside the foreach generates a lot
>> of data spilled to the local fs, which takes a long time to clean up.
>>
>> In a Java map-reduce program, we could do this as a secondary sort: we
>> make model + ctr the key so that the same model's ctr values are sorted,
>> and group by only the model name part, so the sort is already done after
>> shuffling.
>>
>> I am wondering whether we could do that kind of optimization in Pig as well?
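
The composite-key idea described in the quoted message can be sketched outside Hadoop. The following is only a small Python illustration of the concept, not Pig Latin and not the code from the gist: the record layout `(model, ctr)`, the sample values, and the function name `secondary_sort` are all assumptions made for the example. Sorting on the full `(model, ctr)` key stands in for the shuffle sort, and grouping on the model part alone stands in for the grouping step, so each group arrives already ordered by ctr.

```python
from itertools import groupby
from operator import itemgetter


def secondary_sort(records):
    """Yield (model, [ctr, ...]) with each model's ctr values in ascending order.

    records: iterable of (model, ctr) tuples (hypothetical layout).
    """
    # Sort once on the composite key (model, ctr); this plays the role
    # of the sort that happens during the shuffle.
    ordered = sorted(records, key=itemgetter(0, 1))
    # Group on the model part only; values inside each group keep the
    # order established by the composite-key sort.
    for model, rows in groupby(ordered, key=itemgetter(0)):
        yield model, [ctr for _, ctr in rows]


# Made-up sample data for illustration only.
records = [("m2", 0.30), ("m1", 0.20), ("m2", 0.10), ("m1", 0.05)]
result = dict(secondary_sort(records))
```

No per-group sort is needed after the grouping step, which is the point of the optimization asked about above.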