Is it possible to use Pig to do an optimized secondary sort?
Stanley Xu 2012-10-31, 08:20
Dear buddies,
We are trying to write some UDFs to do machine learning work. As a simple experiment, we calculated the AUC through a UDF, as in the code in this gist: https://gist.github.com/3985764

The map-reduce job itself takes only a few minutes, but then waits for hours doing the cleanup. My guess is that the sort inside the foreach spills a lot of data to the local file system, and cleaning that up takes a long time.

In a Java map-reduce program, we could implement this as a secondary sort: make model + ctr the key, so that each model's ctr values arrive sorted, and group by only the model name part. The sort is then done during the shuffle.

I am wondering if we could do that kind of optimization in Pig as well?
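To make the secondary-sort idea above concrete, here is a minimal sketch in plain Python rather than Pig or Hadoop. It is not the code from the gist. The record layout `(model, score, clicked)` and the function name `auc_per_model` are assumptions made for illustration, and the in-memory `sorted` call only stands in for the sort that the shuffle would perform on the composite key. Tied scores are not handled specially.

```python
from itertools import groupby
from operator import itemgetter


def auc_per_model(records):
    """Compute AUC per model from (model, score, clicked) tuples.

    Hypothetical record layout: model name, predicted score (e.g. ctr),
    and a 0/1 click label.
    """
    # Sort on the composite key (model, score). In a real job the
    # shuffle would do this; here it is simulated in memory.
    ordered = sorted(records, key=itemgetter(0, 1))

    result = {}
    # Group on the model part of the key only, so each group arrives
    # already ordered by score and can be consumed in one pass.
    for model, group in groupby(ordered, key=itemgetter(0)):
        negatives_seen = 0
        positives = 0
        correctly_ranked_pairs = 0
        for _, _, clicked in group:
            if clicked:
                positives += 1
                # Every negative seen so far has a lower (or equal) score.
                correctly_ranked_pairs += negatives_seen
            else:
                negatives_seen += 1
        total_pairs = positives * negatives_seen
        # AUC is undefined when a model has only one class of label.
        result[model] = (
            correctly_ranked_pairs / total_pairs if total_pairs else None
        )
    return result
```

Because each group is consumed as a stream in score order, the reducer side never needs to buffer and re-sort a whole model's records, which is the point of pushing the sort into the shuffle.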
Alan Gates 2012-10-31, 15:21
Stanley Xu 2012-11-01, 05:03
Russell Jurney 2012-10-31, 15:55