Pig >> mail # user >> Is that possible to use Pig to do an optimized secondary sort.


Stanley Xu 2012-10-31, 08:20
Alan Gates 2012-10-31, 15:21
Re: Is that possible to use Pig to do an optimized secondary sort.
I posted the code as a gist link in my original mail. I simplified the real
code to keep it short; will that script trigger a secondary sort automatically?

If so, are there any other places I should check to understand why the
cleanup of the map-reduce job takes so long?

Thanks.

Best wishes,
Stanley Xu

On Wed, Oct 31, 2012 at 11:21 PM, Alan Gates <[EMAIL PROTECTED]> wrote:

> Seeing your Pig Latin script will help us determine whether this will work
> in your case.  But in general Pig uses secondary sort when you do an order
> by in a nested foreach.  So if you are grouping you could order within that
> group and then pass it to your UDF.
>
> Alan.
>
> On Oct 31, 2012, at 1:20 AM, Stanley Xu wrote:
>
> > Dear buddies,
> >
> > We are trying to write some UDFs to do some machine learning work. We
> > did a simple experiment to calculate the AUC through a UDF, like the
> > code in the following gist:
> >
> > https://gist.github.com/3985764
> >
> > The map-reduce job itself takes only a few minutes, but then it waits
> > for hours doing the cleanup.
> >
> > I guess the reason is that the sort inside the foreach spills a lot of
> > data to the local fs, which then takes a long time to clean up.
> >
> > In a Java map-reduce program, we could do this as a secondary sort: we
> > make model + ctr the key so that each model's ctr values are sorted, and
> > group by only the model name part, so the sort is already done after
> > shuffling.
> >
> > I am wondering if we could do that kind of optimization in Pig as well?
>
>
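
For readers following the thread, the secondary-sort idea from the original mail (a composite key of model + ctr, sorted in full but grouped on the model part only) can be sketched in plain Python. This is only a simulation of the pattern, not Pig or Hadoop code and not the code from the gist; the sample records and the names `map_phase`, `shuffle_sort` and `reduce_phase` are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample records: (model, ctr, clicked). Not taken from the gist.
records = [
    ("model_a", 0.30, 1),
    ("model_b", 0.10, 0),
    ("model_a", 0.05, 0),
    ("model_b", 0.70, 1),
    ("model_a", 0.20, 1),
]

def map_phase(rows):
    """Emit a composite key (model, ctr) so the framework sort also orders ctr."""
    for model, ctr, clicked in rows:
        yield (model, ctr), clicked

def shuffle_sort(pairs):
    """Stand-in for the shuffle: sort by the full composite key.

    In a real job the partitioner would look only at the model part of the
    key, so that every record of one model reaches the same reducer.
    """
    return sorted(pairs, key=itemgetter(0))

def reduce_phase(sorted_pairs):
    """Group on the model part only, as a grouping comparator would."""
    for model, group in groupby(sorted_pairs, key=lambda kv: kv[0][0]):
        # Values arrive already ordered by ctr, so the reducer does no sort.
        yield model, [(key[1], clicked) for key, clicked in group]

result = dict(reduce_phase(shuffle_sort(map_phase(records))))
for model, values in result.items():
    print(model, values)
```

The point of the pattern is that the reducer (or UDF) receives each model's values already in ctr order, so it never has to buffer and sort them itself. As Alan notes, Pig generally does this for you when the order by sits inside a nested foreach over the group.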
Russell Jurney 2012-10-31, 15:55