Pig >> mail # user >> Is that possible to use Pig to do an optimized secondary sort.


Is it possible to use Pig to do an optimized secondary sort?
Dear buddies,

We are trying to write some UDFs to do machine learning work. As a
simple experiment, we calculated the AUC through a UDF, as in the
following gist:

https://gist.github.com/3985764

The map-reduce job itself takes only a few minutes, but then it waits
for hours in the cleanup phase.

My guess is that the sort inside the foreach spills a lot of data to the
local fs, and cleaning up those spill files is what takes so long.

In a Java map-reduce program, we could implement this as a secondary sort:
use model + ctr as the key, so that each model's ctr values arrive sorted,
and group by the model name part only. The sort is then done as part of the
shuffle.
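
To make the idea concrete, here is a minimal sketch in plain Python that only simulates the pattern; it is not Pig or Hadoop code. The record layout (model, ctr, label) and the function name secondary_sort are assumptions for illustration. In a real Hadoop job, the sort on the composite key would be done by the shuffle, with a partitioner and a grouping comparator that look at the model part only.

```python
from itertools import groupby
from operator import itemgetter


def secondary_sort(records):
    """Simulate a secondary sort over (model, ctr, label) tuples.

    The tuple layout is a hypothetical example, not taken from the gist.
    """
    # Stand-in for the shuffle: sort on the composite key (model, ctr).
    shuffled = sorted(records, key=itemgetter(0, 1))
    # Stand-in for the grouping comparator: group on the model part only,
    # so each group receives its values already ordered by ctr.
    for model, group in groupby(shuffled, key=itemgetter(0)):
        yield model, [(ctr, label) for _, ctr, label in group]


# Made-up sample records, for illustration only.
records = [
    ("model_b", 0.30, 1),
    ("model_a", 0.75, 1),
    ("model_a", 0.10, 0),
    ("model_b", 0.05, 0),
    ("model_a", 0.40, 1),
]

result = dict(secondary_sort(records))
```

With this arrangement the per-model code never sorts anything itself; it just streams through values that are already in ctr order, which is what avoids the extra sort inside the foreach.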

I am wondering whether we could do that kind of optimization in Pig as well?
Alan Gates 2012-10-31, 15:21
Stanley Xu 2012-11-01, 05:03
Russell Jurney 2012-10-31, 15:55