Jonathan Holloway
20110310, 14:57
Jonathan Holloway
20110310, 15:24
Kris Coward
20110310, 18:38
Josh Devins
20110310, 19:49
Edward Capriolo
20110310, 20:09
Jonathan Holloway
20110310, 21:01
Jonathan Holloway
20110311, 17:10
Dmitriy Ryaboy
20110311, 18:18
Josh Devins
20110311, 23:05
Kaluskar, Sanjay
20110312, 00:40


Percentile UDF
Hi all,
Does anybody know if a Percentile UDF exists at all, I've searched through the manual and the piggybank project, but can't seem to see one there. Many thanks, Jon. 
Percentile UDF
HI all,
Does anybody have a UDF for calculating the percentile (median, 99%) at all? I took a look at the builtins and the piggybank project, but couldn't seem to see anything. Is there a reason why there isn't a builtin for this? Many thanks, Jon. 
Re: Percentile UDF
On Thu, Mar 10, 2011 at 03:24:52PM +0000, Jonathan Holloway wrote:
> HI all, > > Does anybody have a UDF for calculating the percentile (median, 99%) at all? > I took a look at the builtins > and the piggybank project, but couldn't seem to see anything. Is there a > reason why there isn't a builtin for > this? I think it would be because there's an inherently serial problem in there (i.e. numbering each entry based on its place in the ordered list). Cheers, Kris  Kris Coward http://unripe.melon.org/ GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3 
Re: Percentile UDF
Hey Jon,
I wrote one not long ago that just relies on Apache Commons Math underlying the UDF. It's pretty straightforward as the Percentile implementation will sort your numbers before doing percentile calculations. The UDF then just needs to take a bag/tuple, pull out all the fields as double[] and pass into Percentile. http://commons.apache.org/math/apidocs/org/apache/commons/math/stat/descriptive/rank/Percentile.html Josh On 10 March 2011 19:38, Kris Coward <[EMAIL PROTECTED]> wrote: > On Thu, Mar 10, 2011 at 03:24:52PM +0000, Jonathan Holloway wrote: > > HI all, > > > > Does anybody have a UDF for calculating the percentile (median, 99%) at > all? > > I took a look at the builtins > > and the piggybank project, but couldn't seem to see anything. Is there a > > reason why there isn't a builtin for > > this? > > I think it would be because there's an inherently serial problem in > there (i.e. numbering each entry based on its place in the ordered > list). > > Cheers, > Kris > >  > Kris Coward http://unripe.melon.org/ > GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3 > 
Re: Percentile UDF
On Thu, Mar 10, 2011 at 2:49 PM, Josh Devins <[EMAIL PROTECTED]> wrote:
> Hey Jon, > > I wrote one not long ago that just relies on Apache Commons Math underlying > the UDF. It's pretty straightforward as the Percentile implementation will > sort your numbers before doing percentile calculations. The UDF then just > needs to take a bag/tuple, pull out all the fields as double[] and pass into > Percentile. > > http://commons.apache.org/math/apidocs/org/apache/commons/math/stat/descriptive/rank/Percentile.html > > Josh > > > On 10 March 2011 19:38, Kris Coward <[EMAIL PROTECTED]> wrote: > >> On Thu, Mar 10, 2011 at 03:24:52PM +0000, Jonathan Holloway wrote: >> > HI all, >> > >> > Does anybody have a UDF for calculating the percentile (median, 99%) at >> all? >> > I took a look at the builtins >> > and the piggybank project, but couldn't seem to see anything. Is there a >> > reason why there isn't a builtin for >> > this? >> >> I think it would be because there's an inherently serial problem in >> there (i.e. numbering each entry based on its place in the ordered >> list). >> >> Cheers, >> Kris >> >>  >> Kris Coward http://unripe.melon.org/ >> GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3 >> > You can get an approximated percentile without completely ordering the list. Caveat: Assuming your values have a small distribution. You can use rounding to shrink the distribution. In SQL terms. 'select value as val,count(1) as count from X group by value order by value' This should result in a small result set with very large counts. Get the total count. You can now find the percentile of this smaller list. 
Re: Percentile UDF
Hey Josh,
That's the path I started down today, I don't suppose the UDF you wrote is in the public domain at all  would you consider contributing it to piggybank.jar at all? How does it fare with large datasets as well? Many thanks, Jon. On 10 March 2011 19:49, Josh Devins <[EMAIL PROTECTED]> wrote: > Hey Jon, > > I wrote one not long ago that just relies on Apache Commons Math underlying > the UDF. It's pretty straightforward as the Percentile implementation will > sort your numbers before doing percentile calculations. The UDF then just > needs to take a bag/tuple, pull out all the fields as double[] and pass > into > Percentile. > > > http://commons.apache.org/math/apidocs/org/apache/commons/math/stat/descriptive/rank/Percentile.html > > Josh > > > On 10 March 2011 19:38, Kris Coward <[EMAIL PROTECTED]> wrote: > > > On Thu, Mar 10, 2011 at 03:24:52PM +0000, Jonathan Holloway wrote: > > > HI all, > > > > > > Does anybody have a UDF for calculating the percentile (median, 99%) at > > all? > > > I took a look at the builtins > > > and the piggybank project, but couldn't seem to see anything. Is there > a > > > reason why there isn't a builtin for > > > this? > > > > I think it would be because there's an inherently serial problem in > > there (i.e. numbering each entry based on its place in the ordered > > list). > > > > Cheers, > > Kris > > > >  > > Kris Coward http://unripe.melon.org/ > > GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3 > 
Re: Percentile UDF
This was my attempt at the Percentile function today:
http://pastebin.com/C883rwbJ Any tips on improving the efficiency of it would be welcome. Dmitriy mentioned using a histogram as an approach for this and looking at the way GROUP BY works. If anybody has any further info on that approach I'd be interested in hearing about it. Cheers, Jon. On 10 March 2011 21:01, Jonathan Holloway <[EMAIL PROTECTED]>wrote: > Hey Josh, > > That's the path I started down today, I don't suppose the UDF you wrote is > in the public domain > at all  would you consider contributing it to piggybank.jar at all? How > does it fare with large datasets > as well? > > Many thanks, > Jon. > > On 10 March 2011 19:49, Josh Devins <[EMAIL PROTECTED]> wrote: > >> Hey Jon, >> >> I wrote one not long ago that just relies on Apache Commons Math >> underlying >> the UDF. It's pretty straightforward as the Percentile implementation will >> sort your numbers before doing percentile calculations. The UDF then just >> needs to take a bag/tuple, pull out all the fields as double[] and pass >> into >> Percentile. >> >> >> http://commons.apache.org/math/apidocs/org/apache/commons/math/stat/descriptive/rank/Percentile.html >> >> Josh >> >> >> On 10 March 2011 19:38, Kris Coward <[EMAIL PROTECTED]> wrote: >> >> > On Thu, Mar 10, 2011 at 03:24:52PM +0000, Jonathan Holloway wrote: >> > > HI all, >> > > >> > > Does anybody have a UDF for calculating the percentile (median, 99%) >> at >> > all? >> > > I took a look at the builtins >> > > and the piggybank project, but couldn't seem to see anything. Is >> there a >> > > reason why there isn't a builtin for >> > > this? >> > >> > I think it would be because there's an inherently serial problem in >> > there (i.e. numbering each entry based on its place in the ordered >> > list). >> > >> > Cheers, >> > Kris >> > >> >  >> > Kris Coward >> http://unripe.melon.org/ >> > GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3 >> > 
Re: Percentile UDF
1) change log.info to log.debug... you don't want that spamming your logs
for a billion tuples. 2) instanceof DefaultDataBag should probably be instanceof DataBag (other databags are ok too) 3) the somewhat more scalable version of this would group by value (or value range, for histograms) and provide a series of (value, numentries) pairs  assuming numentries tends to be above 2, you can get substantial savings in terms of memory there 4) if you use an ArrayList instead of an array, you don't need to precalculate the size of the bag. But then a copy to array is required for Percentile to work. So you can implement your own Percentile that wouldn't require you to triplebuffer... D On Fri, Mar 11, 2011 at 9:10 AM, Jonathan Holloway < [EMAIL PROTECTED]> wrote: > This was my attempt at the Percentile function today: > > http://pastebin.com/C883rwbJ > > Any tips on improving the efficiency of it would be welcome. Dmitriy > mentioned using a histogram as an approach for this and looking at the way > GROUP BY works. If anybody has any further info on that approach I'd be > interested in hearing about it. > > Cheers, > Jon. > > On 10 March 2011 21:01, Jonathan Holloway <[EMAIL PROTECTED] > >wrote: > > > Hey Josh, > > > > That's the path I started down today, I don't suppose the UDF you wrote > is > > in the public domain > > at all  would you consider contributing it to piggybank.jar at all? How > > does it fare with large datasets > > as well? > > > > Many thanks, > > Jon. > > > > On 10 March 2011 19:49, Josh Devins <[EMAIL PROTECTED]> wrote: > > > >> Hey Jon, > >> > >> I wrote one not long ago that just relies on Apache Commons Math > >> underlying > >> the UDF. It's pretty straightforward as the Percentile implementation > will > >> sort your numbers before doing percentile calculations. The UDF then > just > >> needs to take a bag/tuple, pull out all the fields as double[] and pass > >> into > >> Percentile. > >> > >> > >> > http://commons.apache.org/math/apidocs/org/apache/commons/math/stat/descriptive/rank/Percentile.html > >> > >> Josh > >> > >> > >> On 10 March 2011 19:38, Kris Coward <[EMAIL PROTECTED]> wrote: > >> > >> > On Thu, Mar 10, 2011 at 03:24:52PM +0000, Jonathan Holloway wrote: > >> > > HI all, > >> > > > >> > > Does anybody have a UDF for calculating the percentile (median, 99%) > >> at > >> > all? > >> > > I took a look at the builtins > >> > > and the piggybank project, but couldn't seem to see anything. Is > >> there a > >> > > reason why there isn't a builtin for > >> > > this? > >> > > >> > I think it would be because there's an inherently serial problem in > >> > there (i.e. numbering each entry based on its place in the ordered > >> > list). > >> > > >> > Cheers, > >> > Kris > >> > > >> >  > >> > Kris Coward > >> http://unripe.melon.org/ > >> > GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3 > >> > > > 
Re: Percentile UDF
Jon,
I haven't tried with any super large DataBags but would definitely pursue Dmitriy's grouping suggestion if memory usage/size is of concern and you have a very large distribution. Something I will have to try as well! Cheers, Josh On 11 March 2011 19:18, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote: > 1) change log.info to log.debug... you don't want that spamming your logs > for a billion tuples. > 2) instanceof DefaultDataBag should probably be instanceof DataBag (other > databags are ok too) > 3) the somewhat more scalable version of this would group by value (or > value > range, for histograms) and provide a series of (value, numentries) pairs  > assuming numentries tends to be above 2, you can get substantial savings in > terms of memory there > 4) if you use an ArrayList instead of an array, you don't need to > precalculate the size of the bag. But then a copy to array is required for > Percentile to work. So you can implement your own Percentile that wouldn't > require you to triplebuffer... > > D > > On Fri, Mar 11, 2011 at 9:10 AM, Jonathan Holloway < > [EMAIL PROTECTED]> wrote: > > > This was my attempt at the Percentile function today: > > > > http://pastebin.com/C883rwbJ > > > > Any tips on improving the efficiency of it would be welcome. Dmitriy > > mentioned using a histogram as an approach for this and looking at the > way > > GROUP BY works. If anybody has any further info on that approach I'd be > > interested in hearing about it. > > > > Cheers, > > Jon. > > > > On 10 March 2011 21:01, Jonathan Holloway <[EMAIL PROTECTED] > > >wrote: > > > > > Hey Josh, > > > > > > That's the path I started down today, I don't suppose the UDF you wrote > > is > > > in the public domain > > > at all  would you consider contributing it to piggybank.jar at all? > How > > > does it fare with large datasets > > > as well? > > > > > > Many thanks, > > > Jon. > > > > > > On 10 March 2011 19:49, Josh Devins <[EMAIL PROTECTED]> wrote: > > > > > >> Hey Jon, > > >> > > >> I wrote one not long ago that just relies on Apache Commons Math > > >> underlying > > >> the UDF. It's pretty straightforward as the Percentile implementation > > will > > >> sort your numbers before doing percentile calculations. The UDF then > > just > > >> needs to take a bag/tuple, pull out all the fields as double[] and > pass > > >> into > > >> Percentile. > > >> > > >> > > >> > > > http://commons.apache.org/math/apidocs/org/apache/commons/math/stat/descriptive/rank/Percentile.html > > >> > > >> Josh > > >> > > >> > > >> On 10 March 2011 19:38, Kris Coward <[EMAIL PROTECTED]> wrote: > > >> > > >> > On Thu, Mar 10, 2011 at 03:24:52PM +0000, Jonathan Holloway wrote: > > >> > > HI all, > > >> > > > > >> > > Does anybody have a UDF for calculating the percentile (median, > 99%) > > >> at > > >> > all? > > >> > > I took a look at the builtins > > >> > > and the piggybank project, but couldn't seem to see anything. Is > > >> there a > > >> > > reason why there isn't a builtin for > > >> > > this? > > >> > > > >> > I think it would be because there's an inherently serial problem in > > >> > there (i.e. numbering each entry based on its place in the ordered > > >> > list). > > >> > > > >> > Cheers, > > >> > Kris > > >> > > > >> >  > > >> > Kris Coward > > >> http://unripe.melon.org/ > > >> > GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3 > > >> > > > > > > 
RE: Percentile UDF
It's not very clear how you are doing this in a single eval func efficiently, or are you scanning all the tuples within a single invocation of the eval func?
For this to be scalable and O(n), wouldn't you need at least 2 mapreduce jobs? The first job to calculate histograms (which can be scaled up by the map phase assigning buckets in parallel) and the reduce computing the counts for each bucket; the second job to calculate the percentile rank of each tuple in a map phase using the histograms (which again can be done in parallel). Original Message From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Josh Devins Sent: 12 March 2011 04:36 To: [EMAIL PROTECTED] Cc: Dmitriy Ryaboy; Jonathan Holloway Subject: Re: Percentile UDF Jon, I haven't tried with any super large DataBags but would definitely pursue Dmitriy's grouping suggestion if memory usage/size is of concern and you have a very large distribution. Something I will have to try as well! Cheers, Josh On 11 March 2011 19:18, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote: > 1) change log.info to log.debug... you don't want that spamming your > logs for a billion tuples. > 2) instanceof DefaultDataBag should probably be instanceof DataBag > (other databags are ok too) > 3) the somewhat more scalable version of this would group by value (or > value range, for histograms) and provide a series of (value, > numentries) pairs  assuming numentries tends to be above 2, you can > get substantial savings in terms of memory there > 4) if you use an ArrayList instead of an array, you don't need to > precalculate the size of the bag. But then a copy to array is required > for Percentile to work. So you can implement your own Percentile that > wouldn't require you to triplebuffer... > > D > > On Fri, Mar 11, 2011 at 9:10 AM, Jonathan Holloway < > [EMAIL PROTECTED]> wrote: > > > This was my attempt at the Percentile function today: > > > > http://pastebin.com/C883rwbJ > > > > Any tips on improving the efficiency of it would be welcome. > > Dmitriy mentioned using a histogram as an approach for this and > > looking at the > way > > GROUP BY works. If anybody has any further info on that approach > > I'd be interested in hearing about it. > > > > Cheers, > > Jon. > > > > On 10 March 2011 21:01, Jonathan Holloway > > <[EMAIL PROTECTED] > > >wrote: > > > > > Hey Josh, > > > > > > That's the path I started down today, I don't suppose the UDF you > > > wrote > > is > > > in the public domain > > > at all  would you consider contributing it to piggybank.jar at all? > How > > > does it fare with large datasets > > > as well? > > > > > > Many thanks, > > > Jon. > > > > > > On 10 March 2011 19:49, Josh Devins <[EMAIL PROTECTED]> wrote: > > > > > >> Hey Jon, > > >> > > >> I wrote one not long ago that just relies on Apache Commons Math > > >> underlying the UDF. It's pretty straightforward as the Percentile > > >> implementation > > will > > >> sort your numbers before doing percentile calculations. The UDF > > >> then > > just > > >> needs to take a bag/tuple, pull out all the fields as double[] > > >> and > pass > > >> into > > >> Percentile. > > >> > > >> > > >> > > > http://commons.apache.org/math/apidocs/org/apache/commons/math/stat/de > scriptive/rank/Percentile.html > > >> > > >> Josh > > >> > > >> > > >> On 10 March 2011 19:38, Kris Coward <[EMAIL PROTECTED]> wrote: > > >> > > >> > On Thu, Mar 10, 2011 at 03:24:52PM +0000, Jonathan Holloway wrote: > > >> > > HI all, > > >> > > > > >> > > Does anybody have a UDF for calculating the percentile > > >> > > (median, > 99%) > > >> at > > >> > all? > > >> > > I took a look at the builtins and the piggybank project, but > > >> > > couldn't seem to see anything. Is > > >> there a > > >> > > reason why there isn't a builtin for this? > > >> > > > >> > I think it would be because there's an inherently serial > > >> > problem in there (i.e. numbering each entry based on its place 
