Pig, mail # user - nested order limit by percentage of overall records


Marco Cadetg 2013-03-18, 10:23
Re: nested order limit by percentage of overall records
Mike Sukmanowsky 2013-03-18, 20:40
You should check out the quantile UDFs in LinkedIn's DataFu library:
https://github.com/linkedin/datafu, specifically
https://github.com/linkedin/datafu/blob/master/src/java/datafu/pig/stats/Quantile.java
for relatively small inputs, and
https://github.com/linkedin/datafu/blob/master/src/java/datafu/pig/stats/StreamingQuantile.java
for larger inputs.

You can use this to get the cutoff value for the top x% of any given field,
and then use that value within a filter.

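A rough sketch of that idea, applied to the script quoted below. It is untested; the jar path and the aliases Quantile90, amounts, with_cutoff, cutoff and max_money are placeholders chosen for illustration. Quantile expects a sorted bag, so the money values are ordered inside the nested foreach. Two quantiles are requested ('0.9' and '1.0') on the assumption that this keeps the constructor arguments unambiguous; only the first one is used.

```pig
-- Assumption: the DataFu jar is available at this (hypothetical) path.
register 'datafu.jar';

-- 0.9 quantile = cutoff for the top 10%; 1.0 = maximum (not used further).
define Quantile90 datafu.pig.stats.Quantile('0.9', '1.0');

users = load 'users' as (userid:chararray, money:long, region:chararray);
grouped_region = group users by region;

-- For each region, compute the cutoff and attach it to every user row.
with_cutoff = foreach grouped_region {
    amounts = users.money;              -- project the field to rank on
    sorted  = order amounts by money;   -- Quantile needs sorted input
    generate flatten(users),
             flatten(Quantile90(sorted)) as (cutoff:double, max_money:double);
};

-- Keep only the users at or above their region's cutoff.
top_10_percent = filter with_cutoff by money >= cutoff;
result = foreach top_10_percent generate region, userid, money;
```

Because the filter compares against a threshold rather than counting rows, ties at the cutoff can return slightly more than 10% of a region's users. For large regions, StreamingQuantile may be the better fit, since Quantile processes each group's full sorted bag.
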
On Mon, Mar 18, 2013 at 6:23 AM, Marco Cadetg <[EMAIL PROTECTED]> wrote:

> Hi there,
>
> I would like to do something very similar to a nested foreach that uses
> order by and then limit, but I would like the limit to depend on the
> total number of records.
>
> users = load 'users' as (userid:chararray, money:long, region:chararray);
> grouped_region = group users by region;
> top_10_percent = foreach grouped_region {
>             sorted = order users by money desc;
>             -- e.g. for the top 10% it would be total users/10 in that region.
>             top    = limit sorted $UKNOWN_HOWTO_SET;
>             generate group, flatten(top);
> };
>
> Thanks a lot for any help on this.
>
> Cheers,
> -Marco
>

--
Mike Sukmanowsky

Product Lead, http://parse.ly
989 Avenue of the Americas, 3rd Floor
New York, NY  10018
p: +1 (416) 953-4248
e: [EMAIL PROTECTED]
Marco Cadetg 2013-03-18, 21:13
Mike Sukmanowsky 2013-03-18, 23:23
Marco Cadetg 2013-03-19, 07:49