Pig user mailing list: nested order limit by percentage of overall records


Re: nested order limit by percentage of overall records
Distributed quantiles aren't an easy problem to solve (as you can see from
LinkedIn's source), but perhaps in time they'll be brought into the core
functions. It wasn't until 0.11.0 that date/time functions were made
built-ins; before that you had to use a combination of Piggybank and custom
UDFs.
On Mon, Mar 18, 2013 at 5:13 PM, Marco Cadetg <[EMAIL PROTECTED]> wrote:

> Thanks a lot Mike. This seems to be what I'm looking for ;)
>
> I'm a bit disappointed that what I wanted to achieve isn't possible
> without using a UDF.
>
> Cheers,
> -Marco
>
>
> On Mon, Mar 18, 2013 at 9:40 PM, Mike Sukmanowsky <[EMAIL PROTECTED]>
> wrote:
>
> > You should check out the quantile libraries in LinkedIn's DataFu UDFs
> > (https://github.com/linkedin/datafu), specifically
> > https://github.com/linkedin/datafu/blob/master/src/java/datafu/pig/stats/Quantile.java
> > for relatively small inputs, and
> > https://github.com/linkedin/datafu/blob/master/src/java/datafu/pig/stats/StreamingQuantile.java
> > for larger inputs.
> >
> > You can use these to get the cutoff value for the top x% of any given
> > field and then use that within a filter.
> >
> >
> > On Mon, Mar 18, 2013 at 6:23 AM, Marco Cadetg <[EMAIL PROTECTED]> wrote:
> >
> > > Hi there,
> > >
> > > I would like to do something very similar to a nested foreach using
> > > order by and then limit, but I would like the limit to be relative to
> > > the total number of records in the group.
> > >
> > > users = load 'users' as (userid:chararray, money:long, region:chararray);
> > > grouped_region = group users by region;
> > > top_10_percent = foreach grouped_region {
> > >     sorted = order users by money desc;
> > >     -- e.g. for the top 10% this would be total users/10 in that region
> > >     top    = limit sorted $UNKNOWN_HOWTO_SET;
> > >     generate group, flatten(top);
> > > };
> > >
> > > Thanks a lot for any help on this.
> > >
> > > Cheers,
> > > -Marco
> > >
> >
> >
> >
> > --
> > Mike Sukmanowsky
> >
> > Product Lead, http://parse.ly
> > 989 Avenue of the Americas, 3rd Floor
> > New York, NY  10018
> > p: +1 (416) 953-4248
> > e: [EMAIL PROTECTED]
> >
>

--
Mike Sukmanowsky

Product Lead, http://parse.ly
989 Avenue of the Americas, 3rd Floor
New York, NY  10018
p: +1 (416) 953-4248
e: [EMAIL PROTECTED]