Marco Cadetg
20111027, 09:56
Bill Graham
20111027, 15:05
Norbert Burger
20111027, 16:03
Marco Cadetg
20111027, 16:23
Guy Bayes
20111027, 20:05
pablomar
20111028, 01:59
Norbert Burger
20111028, 13:12
Guy Bayes
20111028, 15:02
Marco Cadetg
20111031, 15:55
Guy Bayes
20111031, 16:58
Jonathan Coveney
20111031, 17:15
Marco Cadetg
20111101, 13:26
Jonathan Coveney
20111101, 17:44
Ashutosh Chauhan
20111102, 18:03
Jonathan Coveney
20111102, 18:52
Marco Cadetg
20111104, 11:33
Stan Rosenberg
20111105, 19:15
Jonathan Coveney
20111114, 18:10


creating a graph over time
I have a problem where I don't know how or if pig is even suitable to solve
it. I have a schema like this: studentid,studentname,starttime,duration,course 1,marco,1319708213,500,math 2,ralf,1319708111,112,english 3,greg,1319708321,333,french 4,diva,1319708444,80,english 5,susanne,1319708123,2000,math 1,marco,1319708564,500,french 2,ralf,1319708789,123,french 7,fred,1319708213,5675,french 8,laura,1319708233,123,math 10,sab,1319708999,777,math 11,fibo,1319708789,565,math 6,dan,1319708456,50,english 9,marco,1319708123,60,english 12,bo,1319708456,345,math 1,marco,1319708789,673,math ... ... I would like to retrieve a graph (interpolation) over time grouped by course. Meaning how many students are learning for a course based on a 30 sec interval. The grouping by course is easy but from there I've no clue how I would achieve the rest. I guess the rest needs to be achieved via some UDF or is there any way how to this in pig? I often think that I need a "for loop" or something similar in pig. Thanks for your help! Marco 
Re: creating a graph over time
You can pass your time to a udf that rounds it down to the nearest 30 second
interval and then group by course, interval to get counts for each course, interval. On Thursday, October 27, 2011, Marco Cadetg <[EMAIL PROTECTED]> wrote: > I have a problem where I don't know how or if pig is even suitable to solve > it. > > I have a schema like this: > > studentid,studentname,starttime,duration,course > 1,marco,1319708213,500,math > 2,ralf,1319708111,112,english > 3,greg,1319708321,333,french > 4,diva,1319708444,80,english > 5,susanne,1319708123,2000,math > 1,marco,1319708564,500,french > 2,ralf,1319708789,123,french > 7,fred,1319708213,5675,french > 8,laura,1319708233,123,math > 10,sab,1319708999,777,math > 11,fibo,1319708789,565,math > 6,dan,1319708456,50,english > 9,marco,1319708123,60,english > 12,bo,1319708456,345,math > 1,marco,1319708789,673,math > ... > ... > > I would like to retrieve a graph (interpolation) over time grouped by > course. Meaning how many students are learning for a course based on a 30 > sec interval. > The grouping by course is easy but from there I've no clue how I would > achieve the rest. I guess the rest needs to be achieved via some UDF > or is there any way how to this in pig? I often think that I need a "for > loop" or something similar in pig. > > Thanks for your help! > Marco > 
Re: creating a graph over time
In case what you're looking for is an analysis over the full learning
duration, and not just the start interval, then one further insight is that each original record can be transformed into a sequence of records, where the size of the sequence corresponds to the session duration. In other words, you can use a UDF to "explode" the original record: 1,marco,1319708213,500,math into: 1,marco,1319708190,500,math 1,marco,1319708220,500,math 1,marco,1319708250,500,math 1,marco,1319708280,500,math 1,marco,1319708310,500,math 1,marco,1319708340,500,math 1,marco,1319708370,500,math 1,marco,1319708400,500,math 1,marco,1319708430,500,math 1,marco,1319708460,500,math 1,marco,1319708490,500,math 1,marco,1319708520,500,math 1,marco,1319708550,500,math 1,marco,1319708580,500,math 1,marco,1319708610,500,math 1,marco,1319708640,500,math 1,marco,1319708670,500,math 1,marco,1319708700,500,math and then use Bill's suggestion to group by course, interval. Norbert On Thu, Oct 27, 2011 at 11:05 AM, Bill Graham <[EMAIL PROTECTED]> wrote: > You can pass your time to a udf that rounds it down to the nearest 30 second > interval and then group by course, interval to get counts for each course, > interval. > > On Thursday, October 27, 2011, Marco Cadetg <[EMAIL PROTECTED]> wrote: >> I have a problem where I don't know how or if pig is even suitable to > solve >> it. >> >> I have a schema like this: >> >> studentid,studentname,starttime,duration,course >> 1,marco,1319708213,500,math >> 2,ralf,1319708111,112,english >> 3,greg,1319708321,333,french >> 4,diva,1319708444,80,english >> 5,susanne,1319708123,2000,math >> 1,marco,1319708564,500,french >> 2,ralf,1319708789,123,french >> 7,fred,1319708213,5675,french >> 8,laura,1319708233,123,math >> 10,sab,1319708999,777,math >> 11,fibo,1319708789,565,math >> 6,dan,1319708456,50,english >> 9,marco,1319708123,60,english >> 12,bo,1319708456,345,math >> 1,marco,1319708789,673,math >> ... >> ... >> >> I would like to retrieve a graph (interpolation) over time grouped by >> course. Meaning how many students are learning for a course based on a 30 >> sec interval. >> The grouping by course is easy but from there I've no clue how I would >> achieve the rest. I guess the rest needs to be achieved via some UDF >> or is there any way how to this in pig? I often think that I need a "for >> loop" or something similar in pig. >> >> Thanks for your help! >> Marco >> > 
Re: creating a graph over time
Thanks Bill and Norbert that seems like what I was looking for. I'm a bit
worried about how much data/io this could create. But I'll see ;) Cheers Marco On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger <[EMAIL PROTECTED]>wrote: > In case what you're looking for is an analysis over the full learning > duration, and not just the start interval, then one further insight is > that each original record can be transformed into a sequence of > records, where the size of the sequence corresponds to the session > duration. In other words, you can use a UDF to "explode" the original > record: > > 1,marco,1319708213,500,math > > into: > > 1,marco,1319708190,500,math > 1,marco,1319708220,500,math > 1,marco,1319708250,500,math > 1,marco,1319708280,500,math > 1,marco,1319708310,500,math > 1,marco,1319708340,500,math > 1,marco,1319708370,500,math > 1,marco,1319708400,500,math > 1,marco,1319708430,500,math > 1,marco,1319708460,500,math > 1,marco,1319708490,500,math > 1,marco,1319708520,500,math > 1,marco,1319708550,500,math > 1,marco,1319708580,500,math > 1,marco,1319708610,500,math > 1,marco,1319708640,500,math > 1,marco,1319708670,500,math > 1,marco,1319708700,500,math > > and then use Bill's suggestion to group by course, interval. > > Norbert > > On Thu, Oct 27, 2011 at 11:05 AM, Bill Graham <[EMAIL PROTECTED]> > wrote: > > You can pass your time to a udf that rounds it down to the nearest 30 > second > > interval and then group by course, interval to get counts for each > course, > > interval. > > > > On Thursday, October 27, 2011, Marco Cadetg <[EMAIL PROTECTED]> wrote: > >> I have a problem where I don't know how or if pig is even suitable to > > solve > >> it. > >> > >> I have a schema like this: > >> > >> studentid,studentname,starttime,duration,course > >> 1,marco,1319708213,500,math > >> 2,ralf,1319708111,112,english > >> 3,greg,1319708321,333,french > >> 4,diva,1319708444,80,english > >> 5,susanne,1319708123,2000,math > >> 1,marco,1319708564,500,french > >> 2,ralf,1319708789,123,french > >> 7,fred,1319708213,5675,french > >> 8,laura,1319708233,123,math > >> 10,sab,1319708999,777,math > >> 11,fibo,1319708789,565,math > >> 6,dan,1319708456,50,english > >> 9,marco,1319708123,60,english > >> 12,bo,1319708456,345,math > >> 1,marco,1319708789,673,math > >> ... > >> ... > >> > >> I would like to retrieve a graph (interpolation) over time grouped by > >> course. Meaning how many students are learning for a course based on a > 30 > >> sec interval. > >> The grouping by course is easy but from there I've no clue how I would > >> achieve the rest. I guess the rest needs to be achieved via some UDF > >> or is there any way how to this in pig? I often think that I need a "for > >> loop" or something similar in pig. > >> > >> Thanks for your help! > >> Marco > >> > > > 
Re: creating a graph over time
how big is your dataset?
On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <[EMAIL PROTECTED]> wrote: > Thanks Bill and Norbert that seems like what I was looking for. I'm a bit > worried about > how much data/io this could create. But I'll see ;) > > Cheers > Marco > > On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger <[EMAIL PROTECTED] > >wrote: > > > In case what you're looking for is an analysis over the full learning > > duration, and not just the start interval, then one further insight is > > that each original record can be transformed into a sequence of > > records, where the size of the sequence corresponds to the session > > duration. In other words, you can use a UDF to "explode" the original > > record: > > > > 1,marco,1319708213,500,math > > > > into: > > > > 1,marco,1319708190,500,math > > 1,marco,1319708220,500,math > > 1,marco,1319708250,500,math > > 1,marco,1319708280,500,math > > 1,marco,1319708310,500,math > > 1,marco,1319708340,500,math > > 1,marco,1319708370,500,math > > 1,marco,1319708400,500,math > > 1,marco,1319708430,500,math > > 1,marco,1319708460,500,math > > 1,marco,1319708490,500,math > > 1,marco,1319708520,500,math > > 1,marco,1319708550,500,math > > 1,marco,1319708580,500,math > > 1,marco,1319708610,500,math > > 1,marco,1319708640,500,math > > 1,marco,1319708670,500,math > > 1,marco,1319708700,500,math > > > > and then use Bill's suggestion to group by course, interval. > > > > Norbert > > > > On Thu, Oct 27, 2011 at 11:05 AM, Bill Graham <[EMAIL PROTECTED]> > > wrote: > > > You can pass your time to a udf that rounds it down to the nearest 30 > > second > > > interval and then group by course, interval to get counts for each > > course, > > > interval. > > > > > > On Thursday, October 27, 2011, Marco Cadetg <[EMAIL PROTECTED]> wrote: > > >> I have a problem where I don't know how or if pig is even suitable to > > > solve > > >> it. > > >> > > >> I have a schema like this: > > >> > > >> studentid,studentname,starttime,duration,course > > >> 1,marco,1319708213,500,math > > >> 2,ralf,1319708111,112,english > > >> 3,greg,1319708321,333,french > > >> 4,diva,1319708444,80,english > > >> 5,susanne,1319708123,2000,math > > >> 1,marco,1319708564,500,french > > >> 2,ralf,1319708789,123,french > > >> 7,fred,1319708213,5675,french > > >> 8,laura,1319708233,123,math > > >> 10,sab,1319708999,777,math > > >> 11,fibo,1319708789,565,math > > >> 6,dan,1319708456,50,english > > >> 9,marco,1319708123,60,english > > >> 12,bo,1319708456,345,math > > >> 1,marco,1319708789,673,math > > >> ... > > >> ... > > >> > > >> I would like to retrieve a graph (interpolation) over time grouped by > > >> course. Meaning how many students are learning for a course based on a > > 30 > > >> sec interval. > > >> The grouping by course is easy but from there I've no clue how I would > > >> achieve the rest. I guess the rest needs to be achieved via some UDF > > >> or is there any way how to this in pig? I often think that I need a > "for > > >> loop" or something similar in pig. > > >> > > >> Thanks for your help! > > >> Marco > > >> > > > > > > 
Re: creating a graph over time
you can loop from python. I've never tried it but you have a pretty good
explanation here ( http://ofps.oreilly.com/titles/9781449302641/embedding.html ) recently, I have to analyze some log files and I needed to loop (to calculate some stats) and I used an UDF in your case, I would go with Bill proposal On Thu, Oct 27, 2011 at 5:56 AM, Marco Cadetg <[EMAIL PROTECTED]> wrote: > I have a problem where I don't know how or if pig is even suitable to solve > it. > > I have a schema like this: > > studentid,studentname,starttime,duration,course > 1,marco,1319708213,500,math > 2,ralf,1319708111,112,english > 3,greg,1319708321,333,french > 4,diva,1319708444,80,english > 5,susanne,1319708123,2000,math > 1,marco,1319708564,500,french > 2,ralf,1319708789,123,french > 7,fred,1319708213,5675,french > 8,laura,1319708233,123,math > 10,sab,1319708999,777,math > 11,fibo,1319708789,565,math > 6,dan,1319708456,50,english > 9,marco,1319708123,60,english > 12,bo,1319708456,345,math > 1,marco,1319708789,673,math > ... > ... > > I would like to retrieve a graph (interpolation) over time grouped by > course. Meaning how many students are learning for a course based on a 30 > sec interval. > The grouping by course is easy but from there I've no clue how I would > achieve the rest. I guess the rest needs to be achieved via some UDF > or is there any way how to this in pig? I often think that I need a "for > loop" or something similar in pig. > > Thanks for your help! > Marco > 
Re: creating a graph over time
Perhaps another way to approach this problem is to visualize it
geometrically. You have a long series of class session instances, where each class session is like 1D line segment, beginning/stopping at some start/end time. These segments naturally overlap, and I think the question you're asking is equivalent to finding the number of overlaps at every subsegment. To answer this, you want to first break every class session into a full list of subsegments, where a subsegment is created by "breaking" each class session/segment into multiple parts at the start/end point of any other class session. You can create this full set of subsegments in one pass by comparing pairwise (CROSS) each start/end point with your original list of class sessions. Once you have the full list of "broken" segments, then a final GROUP BY/COUNT(*) will you give you the number of overlaps. Seems like approach would be faster than the previous approach if your class sessions are very long, or there are many overlaps. Norbert On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes <[EMAIL PROTECTED]> wrote: > how big is your dataset? > > On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <[EMAIL PROTECTED]> wrote: > > > Thanks Bill and Norbert that seems like what I was looking for. I'm a bit > > worried about > > how much data/io this could create. But I'll see ;) > > > > Cheers > > Marco > > > > On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger < > [EMAIL PROTECTED] > > >wrote: > > > > > In case what you're looking for is an analysis over the full learning > > > duration, and not just the start interval, then one further insight is > > > that each original record can be transformed into a sequence of > > > records, where the size of the sequence corresponds to the session > > > duration. In other words, you can use a UDF to "explode" the original > > > record: > > > > > > 1,marco,1319708213,500,math > > > > > > into: > > > > > > 1,marco,1319708190,500,math > > > 1,marco,1319708220,500,math > > > 1,marco,1319708250,500,math > > > 1,marco,1319708280,500,math > > > 1,marco,1319708310,500,math > > > 1,marco,1319708340,500,math > > > 1,marco,1319708370,500,math > > > 1,marco,1319708400,500,math > > > 1,marco,1319708430,500,math > > > 1,marco,1319708460,500,math > > > 1,marco,1319708490,500,math > > > 1,marco,1319708520,500,math > > > 1,marco,1319708550,500,math > > > 1,marco,1319708580,500,math > > > 1,marco,1319708610,500,math > > > 1,marco,1319708640,500,math > > > 1,marco,1319708670,500,math > > > 1,marco,1319708700,500,math > > > > > > and then use Bill's suggestion to group by course, interval. > > > > > > Norbert > > > > > > On Thu, Oct 27, 2011 at 11:05 AM, Bill Graham <[EMAIL PROTECTED]> > > > wrote: > > > > You can pass your time to a udf that rounds it down to the nearest 30 > > > second > > > > interval and then group by course, interval to get counts for each > > > course, > > > > interval. > > > > > > > > On Thursday, October 27, 2011, Marco Cadetg <[EMAIL PROTECTED]> > wrote: > > > >> I have a problem where I don't know how or if pig is even suitable > to > > > > solve > > > >> it. > > > >> > > > >> I have a schema like this: > > > >> > > > >> studentid,studentname,starttime,duration,course > > > >> 1,marco,1319708213,500,math > > > >> 2,ralf,1319708111,112,english > > > >> 3,greg,1319708321,333,french > > > >> 4,diva,1319708444,80,english > > > >> 5,susanne,1319708123,2000,math > > > >> 1,marco,1319708564,500,french > > > >> 2,ralf,1319708789,123,french > > > >> 7,fred,1319708213,5675,french > > > >> 8,laura,1319708233,123,math > > > >> 10,sab,1319708999,777,math > > > >> 11,fibo,1319708789,565,math > > > >> 6,dan,1319708456,50,english > > > >> 9,marco,1319708123,60,english > > > >> 12,bo,1319708456,345,math > > > >> 1,marco,1319708789,673,math > > > >> ... > > > >> ... > > > >> > > > >> I would like to retrieve a graph (interpolation) over time grouped > by > > > >> course. Meaning how many students are learning for a course based on > a > > > 30 
Re: creating a graph over time
if it fits in R, it's trivial, draw a density plot or a histogram, about
three lines of R code why I was wondering about the data volume. His example is students attending classes, if that is really the data hard to believe it's super huge? Guy On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger <[EMAIL PROTECTED]>wrote: > Perhaps another way to approach this problem is to visualize it > geometrically. You have a long series of class session instances, where > each class session is like 1D line segment, beginning/stopping at some > start/end time. > > These segments naturally overlap, and I think the question you're asking is > equivalent to finding the number of overlaps at every subsegment. > > To answer this, you want to first break every class session into a full > list > of subsegments, where a subsegment is created by "breaking" each class > session/segment into multiple parts at the start/end point of any other > class session. You can create this full set of subsegments in one pass by > comparing pairwise (CROSS) each start/end point with your original list of > class sessions. > > Once you have the full list of "broken" segments, then a final GROUP > BY/COUNT(*) will you give you the number of overlaps. Seems like approach > would be faster than the previous approach if your class sessions are very > long, or there are many overlaps. > > Norbert > > On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes <[EMAIL PROTECTED]> wrote: > > > how big is your dataset? > > > > On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <[EMAIL PROTECTED]> wrote: > > > > > Thanks Bill and Norbert that seems like what I was looking for. I'm a > bit > > > worried about > > > how much data/io this could create. But I'll see ;) > > > > > > Cheers > > > Marco > > > > > > On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger < > > [EMAIL PROTECTED] > > > >wrote: > > > > > > > In case what you're looking for is an analysis over the full learning > > > > duration, and not just the start interval, then one further insight > is > > > > that each original record can be transformed into a sequence of > > > > records, where the size of the sequence corresponds to the session > > > > duration. In other words, you can use a UDF to "explode" the > original > > > > record: > > > > > > > > 1,marco,1319708213,500,math > > > > > > > > into: > > > > > > > > 1,marco,1319708190,500,math > > > > 1,marco,1319708220,500,math > > > > 1,marco,1319708250,500,math > > > > 1,marco,1319708280,500,math > > > > 1,marco,1319708310,500,math > > > > 1,marco,1319708340,500,math > > > > 1,marco,1319708370,500,math > > > > 1,marco,1319708400,500,math > > > > 1,marco,1319708430,500,math > > > > 1,marco,1319708460,500,math > > > > 1,marco,1319708490,500,math > > > > 1,marco,1319708520,500,math > > > > 1,marco,1319708550,500,math > > > > 1,marco,1319708580,500,math > > > > 1,marco,1319708610,500,math > > > > 1,marco,1319708640,500,math > > > > 1,marco,1319708670,500,math > > > > 1,marco,1319708700,500,math > > > > > > > > and then use Bill's suggestion to group by course, interval. > > > > > > > > Norbert > > > > > > > > On Thu, Oct 27, 2011 at 11:05 AM, Bill Graham <[EMAIL PROTECTED]> > > > > wrote: > > > > > You can pass your time to a udf that rounds it down to the nearest > 30 > > > > second > > > > > interval and then group by course, interval to get counts for each > > > > course, > > > > > interval. > > > > > > > > > > On Thursday, October 27, 2011, Marco Cadetg <[EMAIL PROTECTED]> > > wrote: > > > > >> I have a problem where I don't know how or if pig is even suitable > > to > > > > > solve > > > > >> it. > > > > >> > > > > >> I have a schema like this: > > > > >> > > > > >> studentid,studentname,starttime,duration,course > > > > >> 1,marco,1319708213,500,math > > > > >> 2,ralf,1319708111,112,english > > > > >> 3,greg,1319708321,333,french > > > > >> 4,diva,1319708444,80,english > > > > >> 5,susanne,1319708123,2000,math > > > > >> 1,marco,1319708564,500,french > > 
Re: creating a graph over time
The data is not about students but about television ;) Regarding the size.
The raw input data size is about 150m although when I 'explode' the timeseries it will be around 80x bigger. I guess the average user duration will be around 40 Minutes which means when sampling it at a 30s interval will increase the size by ~12GB. I think that is a size which my hadoop cluster with five 8core x 8GB x 2TB HD should be able to cope with. I don't know about R. Are you able to handle 12Gb files well in R (off course it depends on your computer so assume an average business computer e.g. 2core 2GHz 4GB ram)? Cheers Marco On Fri, Oct 28, 2011 at 5:02 PM, Guy Bayes <[EMAIL PROTECTED]> wrote: > if it fits in R, it's trivial, draw a density plot or a histogram, about > three lines of R code > > why I was wondering about the data volume. > > His example is students attending classes, if that is really the data hard > to believe it's super huge? > > Guy > > On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger <[EMAIL PROTECTED] > >wrote: > > > Perhaps another way to approach this problem is to visualize it > > geometrically. You have a long series of class session instances, where > > each class session is like 1D line segment, beginning/stopping at some > > start/end time. > > > > These segments naturally overlap, and I think the question you're asking > is > > equivalent to finding the number of overlaps at every subsegment. > > > > To answer this, you want to first break every class session into a full > > list > > of subsegments, where a subsegment is created by "breaking" each class > > session/segment into multiple parts at the start/end point of any other > > class session. You can create this full set of subsegments in one pass > by > > comparing pairwise (CROSS) each start/end point with your original list > of > > class sessions. > > > > Once you have the full list of "broken" segments, then a final GROUP > > BY/COUNT(*) will you give you the number of overlaps. Seems like > approach > > would be faster than the previous approach if your class sessions are > very > > long, or there are many overlaps. > > > > Norbert > > > > On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes <[EMAIL PROTECTED]> > wrote: > > > > > how big is your dataset? > > > > > > On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <[EMAIL PROTECTED]> > wrote: > > > > > > > Thanks Bill and Norbert that seems like what I was looking for. I'm a > > bit > > > > worried about > > > > how much data/io this could create. But I'll see ;) > > > > > > > > Cheers > > > > Marco > > > > > > > > On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger < > > > [EMAIL PROTECTED] > > > > >wrote: > > > > > > > > > In case what you're looking for is an analysis over the full > learning > > > > > duration, and not just the start interval, then one further insight > > is > > > > > that each original record can be transformed into a sequence of > > > > > records, where the size of the sequence corresponds to the session > > > > > duration. In other words, you can use a UDF to "explode" the > > original > > > > > record: > > > > > > > > > > 1,marco,1319708213,500,math > > > > > > > > > > into: > > > > > > > > > > 1,marco,1319708190,500,math > > > > > 1,marco,1319708220,500,math > > > > > 1,marco,1319708250,500,math > > > > > 1,marco,1319708280,500,math > > > > > 1,marco,1319708310,500,math > > > > > 1,marco,1319708340,500,math > > > > > 1,marco,1319708370,500,math > > > > > 1,marco,1319708400,500,math > > > > > 1,marco,1319708430,500,math > > > > > 1,marco,1319708460,500,math > > > > > 1,marco,1319708490,500,math > > > > > 1,marco,1319708520,500,math > > > > > 1,marco,1319708550,500,math > > > > > 1,marco,1319708580,500,math > > > > > 1,marco,1319708610,500,math > > > > > 1,marco,1319708640,500,math > > > > > 1,marco,1319708670,500,math > > > > > 1,marco,1319708700,500,math > > > > > > > > > > and then use Bill's suggestion to group by course, interval. > > > > > > > > > > Norbert > 
Re: creating a graph over time
ahh TV that explains it
12G data file is a bit too big for R unless you sample, not sure if the use case is conducive to sampling? If it is, could sample it down and structure in pig/hadoop and then load it into the analytical/visualization tool of choice... Guy On Mon, Oct 31, 2011 at 8:55 AM, Marco Cadetg <[EMAIL PROTECTED]> wrote: > The data is not about students but about television ;) Regarding the size. > The raw input data size is about 150m although when I 'explode' the > timeseries > it will be around 80x bigger. I guess the average user duration will be > around > 40 Minutes which means when sampling it at a 30s interval will increase the > size by ~12GB. > > I think that is a size which my hadoop cluster with five 8core x 8GB x 2TB > HD > should be able to cope with. > > I don't know about R. Are you able to handle 12Gb > files well in R (off course it depends on your computer so assume an > average business computer e.g. 2core 2GHz 4GB ram)? > > Cheers > Marco > > On Fri, Oct 28, 2011 at 5:02 PM, Guy Bayes <[EMAIL PROTECTED]> wrote: > > > if it fits in R, it's trivial, draw a density plot or a histogram, about > > three lines of R code > > > > why I was wondering about the data volume. > > > > His example is students attending classes, if that is really the data > hard > > to believe it's super huge? > > > > Guy > > > > On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger < > [EMAIL PROTECTED] > > >wrote: > > > > > Perhaps another way to approach this problem is to visualize it > > > geometrically. You have a long series of class session instances, > where > > > each class session is like 1D line segment, beginning/stopping at some > > > start/end time. > > > > > > These segments naturally overlap, and I think the question you're > asking > > is > > > equivalent to finding the number of overlaps at every subsegment. > > > > > > To answer this, you want to first break every class session into a full > > > list > > > of subsegments, where a subsegment is created by "breaking" each class > > > session/segment into multiple parts at the start/end point of any other > > > class session. You can create this full set of subsegments in one pass > > by > > > comparing pairwise (CROSS) each start/end point with your original list > > of > > > class sessions. > > > > > > Once you have the full list of "broken" segments, then a final GROUP > > > BY/COUNT(*) will you give you the number of overlaps. Seems like > > approach > > > would be faster than the previous approach if your class sessions are > > very > > > long, or there are many overlaps. > > > > > > Norbert > > > > > > On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes <[EMAIL PROTECTED]> > > wrote: > > > > > > > how big is your dataset? > > > > > > > > On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <[EMAIL PROTECTED]> > > wrote: > > > > > > > > > Thanks Bill and Norbert that seems like what I was looking for. > I'm a > > > bit > > > > > worried about > > > > > how much data/io this could create. But I'll see ;) > > > > > > > > > > Cheers > > > > > Marco > > > > > > > > > > On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger < > > > > [EMAIL PROTECTED] > > > > > >wrote: > > > > > > > > > > > In case what you're looking for is an analysis over the full > > learning > > > > > > duration, and not just the start interval, then one further > insight > > > is > > > > > > that each original record can be transformed into a sequence of > > > > > > records, where the size of the sequence corresponds to the > session > > > > > > duration. In other words, you can use a UDF to "explode" the > > > original > > > > > > record: > > > > > > > > > > > > 1,marco,1319708213,500,math > > > > > > > > > > > > into: > > > > > > > > > > > > 1,marco,1319708190,500,math > > > > > > 1,marco,1319708220,500,math > > > > > > 1,marco,1319708250,500,math > > > > > > 1,marco,1319708280,500,math > > > > > > 1,marco,1319708310,500,math > > > > > > 1,marco,1319708340,500,math > > > > > > 1,marco,1319708370,500,math 
Re: creating a graph over time
Perhaps I'm misunderstanding your use case, and this depends on the amount
of data, but you could consider something like this (to avoid exploding the data, which could perhaps be inavoidable but I hate resorting to that if I don't have to). a = foreach yourdata generate student_id, start_time, start_time+duration as end_time, course; b = group a by course; c = foreach b { ord = order a by start_time; generate yourudf.process(ord); } Here is generally what process could do. It would be an accumulator UDF that expected tuples sorted on start_time. Now you basically need a way to know who the distinct users are. Now, since you want 30s windows, your first window will presumably be 30s after the first start_time in your data, and you would just tick ahead in 1s and write to a bag which would have second, # of distinct student_ids. To know when to eject people, you could have any number of data structures... perhaps a min heap based on end_time, and of course instead of "ticking" ahead, you would grab a new tuple (since this is the only thing that would change the state of the # of distinct ids), and then do all of the ticking ahead as you adjust the heap and write the seconds in between the current time pointer and the start_time of the new tuple, making sure in each step to check against the min heap to eject any users that expired. That was a little rambly, I could quickly put together some more reasonable pseudocode if that would help. I think the general idea is clear though... 2011/10/31 Guy Bayes <[EMAIL PROTECTED]> > ahh TV that explains it > > 12G data file is a bit too big for R unless you sample, not sure if the use > case is conducive to sampling? > > If it is, could sample it down and structure in pig/hadoop and then load it > into the analytical/visualization tool of choice... > > Guy > > On Mon, Oct 31, 2011 at 8:55 AM, Marco Cadetg <[EMAIL PROTECTED]> wrote: > > > The data is not about students but about television ;) Regarding the > size. > > The raw input data size is about 150m although when I 'explode' the > > timeseries > > it will be around 80x bigger. I guess the average user duration will be > > around > > 40 Minutes which means when sampling it at a 30s interval will increase > the > > size by ~12GB. > > > > I think that is a size which my hadoop cluster with five 8core x 8GB x > 2TB > > HD > > should be able to cope with. > > > > I don't know about R. Are you able to handle 12Gb > > files well in R (off course it depends on your computer so assume an > > average business computer e.g. 2core 2GHz 4GB ram)? > > > > Cheers > > Marco > > > > On Fri, Oct 28, 2011 at 5:02 PM, Guy Bayes <[EMAIL PROTECTED]> > wrote: > > > > > if it fits in R, it's trivial, draw a density plot or a histogram, > about > > > three lines of R code > > > > > > why I was wondering about the data volume. > > > > > > His example is students attending classes, if that is really the data > > hard > > > to believe it's super huge? > > > > > > Guy > > > > > > On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger < > > [EMAIL PROTECTED] > > > >wrote: > > > > > > > Perhaps another way to approach this problem is to visualize it > > > > geometrically. You have a long series of class session instances, > > where > > > > each class session is like 1D line segment, beginning/stopping at > some > > > > start/end time. > > > > > > > > These segments naturally overlap, and I think the question you're > > asking > > > is > > > > equivalent to finding the number of overlaps at every subsegment. > > > > > > > > To answer this, you want to first break every class session into a > full > > > > list > > > > of subsegments, where a subsegment is created by "breaking" each > class > > > > session/segment into multiple parts at the start/end point of any > other > > > > class session. You can create this full set of subsegments in one > pass > > > by > > > > comparing pairwise (CROSS) each start/end point with your original > list > > > of 
Re: creating a graph over time
Thanks again for all your comments.
Jonathan, would you mind to enlighten me on the way you would keep track of the people you need to "eject". I don't get the min heap based tuple... Cheers Marco On Mon, Oct 31, 2011 at 6:15 PM, Jonathan Coveney <[EMAIL PROTECTED]>wrote: > Perhaps I'm misunderstanding your use case, and this depends on the amount > of data, but you could consider something like this (to avoid exploding the > data, which could perhaps be inavoidable but I hate resorting to that if I > don't have to). > > a = foreach yourdata generate student_id, start_time, start_time+duration > as end_time, course; > b = group a by course; > c = foreach b { > ord = order a by start_time; > generate yourudf.process(ord); > } > > Here is generally what process could do. It would be an accumulator UDF > that expected tuples sorted on start_time. Now you basically need a way to > know who the distinct users are. Now, since you want 30s windows, your > first window will presumably be 30s after the first start_time in your > data, and you would just tick ahead in 1s and write to a bag which would > have second, # of distinct student_ids. To know when to eject people, you > could have any number of data structures... perhaps a min heap based on > end_time, and of course instead of "ticking" ahead, you would grab a new > tuple (since this is the only thing that would change the state of the # of > distinct ids), and then do all of the ticking ahead as you adjust the heap > and write the seconds in between the current time pointer and the > start_time of the new tuple, making sure in each step to check against the > min heap to eject any users that expired. > > That was a little rambly, I could quickly put together some more reasonable > pseudocode if that would help. I think the general idea is clear though... > > 2011/10/31 Guy Bayes <[EMAIL PROTECTED]> > > > ahh TV that explains it > > > > 12G data file is a bit too big for R unless you sample, not sure if the > use > > case is conducive to sampling? > > > > If it is, could sample it down and structure in pig/hadoop and then load > it > > into the analytical/visualization tool of choice... > > > > Guy > > > > On Mon, Oct 31, 2011 at 8:55 AM, Marco Cadetg <[EMAIL PROTECTED]> wrote: > > > > > The data is not about students but about television ;) Regarding the > > size. > > > The raw input data size is about 150m although when I 'explode' the > > > timeseries > > > it will be around 80x bigger. I guess the average user duration will be > > > around > > > 40 Minutes which means when sampling it at a 30s interval will increase > > the > > > size by ~12GB. > > > > > > I think that is a size which my hadoop cluster with five 8core x 8GB x > > 2TB > > > HD > > > should be able to cope with. > > > > > > I don't know about R. Are you able to handle 12Gb > > > files well in R (off course it depends on your computer so assume an > > > average business computer e.g. 2core 2GHz 4GB ram)? > > > > > > Cheers > > > Marco > > > > > > On Fri, Oct 28, 2011 at 5:02 PM, Guy Bayes <[EMAIL PROTECTED]> > > wrote: > > > > > > > if it fits in R, it's trivial, draw a density plot or a histogram, > > about > > > > three lines of R code > > > > > > > > why I was wondering about the data volume. > > > > > > > > His example is students attending classes, if that is really the > data > > > hard > > > > to believe it's super huge? > > > > > > > > Guy > > > > > > > > On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger < > > > [EMAIL PROTECTED] > > > > >wrote: > > > > > > > > > Perhaps another way to approach this problem is to visualize it > > > > > geometrically. You have a long series of class session instances, > > > where > > > > > each class session is like 1D line segment, beginning/stopping at > > some > > > > > start/end time. > > > > > > > > > > These segments naturally overlap, and I think the question you're > > > asking > > > > is > > > > > equivalent to finding the number of overlaps at every subsegment. 
Re: creating a graph over time
Okie dokie. So first, let's clarify and simplify the problem a little,
especially to ensure that I know what is going on. Let's first just focus on a particular class. This is ok since presumably each class is independent. Now, we have user_id, start_time, and end_time (start_time+duration). If I understand correctly, a user_id should be included up to end_time+30s, since this is a 30s moving window. As such, we'll just ignore that side of things for now, because you can just transform people's start times accordingly. Further, the assumption is that for a given user_id, you will not have overlapping start and end times...you can have multiple entries, ie "user 1, start 1, end 3; user 1, start 5, end 7;" but you can't have them in this form: "user 1, start 1, end 3; user 1, start 2, end 4." So we have simplified the question to this: given: user_id, start_time, and end_time (which never overlap), how can I get a count of unique users for every second? So now we will design a UDF to generate that output as a bag of (time, # of people) pairs, for every second from min(start_time) to max(end_time). The UDF will accept a bag sorted on the start time. Now, as I write it it's going to be a simple evalfunc, but it should be an accumulator. It's easy to make the transition. Here is what you do. Initialize a PriorityQueue. The natural ordering for int and long is fine, as it will ensure that when we poll it, we'll get the earliest end time, which is what we want. So step one is to pull the first tuple, and get the start_time and end_time. The start time will set our time to start_time (which is min(start_time) since it was sorted on start_time), and we add the end_time to the priority queue. We have a counter "uniques" which we increment. Now, before we actually do increment, we grab the next tuple. Why do you do this instead of go to the next end time? Because we don't know if someone starts in between now and the next end time. So we grab the tuple and get its start and end time. Now there are two cases. Case 1: the start time is less than the head of the priority queue, via a peek. If this is the case, then we can safely increment up to the start_time we just got, and then go from there. This is because it's impossible for there to be a new end_time less than the start_time we just got, because they are ordered by start_time and end_time>start_time. So we add the new end_time, and then we increment our timer until we get to the new start_time we just got, and add (timer,unique) at each step. When we get to start_time, we unique++. Now we get the next tuple and repeat. Case 2: the start time comes after the head of the priority queue, via a peek. If this is the case, then we need to increment up to the current head, emitting (timer,unique). Then when we get to the time_value equal to that end_time, we unique, and check again if the start_time comes before than the head of the priority queue. Until it does, we repeat step 2. Once it does, we do step 1. I've attached a crude, untested UDF that does this. Buyer beware. But it shows the general flow, and should be better than exploding the data (I really hate exploding data like that unless it's absolutely necessary). To use, generate some data, then... register window.jar; define window com.jcoveney.Window('30'); a = load 'data' using PigStorage(',') as (uid:long,start:long,end:long); b = foreach (group a all) { ord = order a by start asc; generate flatten(window(ord)); } dump b; to generate data, I first did just a small subsample just to think about it, then I did (in python) import random f=open("data","w") for i in range(0,1000000): v1=random.randint(1,10000000) v2=random.randint(1,10000000) start=min(v1,v2) stop=max(v1,v2) print >>f,"%i,%i,%i" % (i,start,stop) If this function is at all useful, I can clean it up and put in in the piggybank. Let me know if the logic doesn't make sense, or if it isn't quite what you had in mind. Jon 2011/11/1 Marco Cadetg <[EMAIL PROTECTED]> 
Re: creating a graph over time
Hey Jon,
Your windowing udf will be very useful outside of this particular usecase. It will be great if you can contribute it to PiggyBank. Thanks, Ashutosh On Tue, Nov 1, 2011 at 10:44, Jonathan Coveney <[EMAIL PROTECTED]> wrote: > Okie dokie. So first, let's clarify and simplify the problem a little, > especially to ensure that I know what is going on. > > Let's first just focus on a particular class. This is ok since presumably > each class is independent. Now, we have user_id, start_time, and end_time > (start_time+duration). If I understand correctly, a user_id should be > included up to end_time+30s, since this is a 30s moving window. As such, > we'll just ignore that side of things for now, because you can just > transform people's start times accordingly. Further, the assumption is that > for a given user_id, you will not have overlapping start and end > times...you can have multiple entries, ie "user 1, start 1, end 3; user 1, > start 5, end 7;" but you can't have them in this form: "user 1, start 1, > end 3; user 1, start 2, end 4." > > So we have simplified the question to this: given: user_id, start_time, > and end_time (which never overlap), how can I get a count of unique users > for every second? So now we will design a UDF to generate that output as a > bag of (time, # of people) pairs, for every second from min(start_time) to > max(end_time). The UDF will accept a bag sorted on the start time. Now, as > I write it it's going to be a simple evalfunc, but it should be an > accumulator. It's easy to make the transition. > > Here is what you do. Initialize a PriorityQueue. The natural ordering for > int and long is fine, as it will ensure that when we poll it, we'll get the > earliest end time, which is what we want. > > So step one is to pull the first tuple, and get the start_time and > end_time. The start time will set our time to start_time (which is > min(start_time) since it was sorted on start_time), and we add the end_time > to the priority queue. We have a counter "uniques" which we increment. > > Now, before we actually do increment, we grab the next tuple. Why do you > do this instead of go to the next end time? Because we don't know if > someone starts in between now and the next end time. So we grab the tuple > and get its start and end time. Now there are two cases. > > Case 1: the start time is less than the head of the priority queue, via a > peek. If this is the case, then we can safely increment up to the > start_time we just got, and then go from there. This is because it's > impossible for there to be a new end_time less than the start_time we just > got, because they are ordered by start_time and end_time>start_time. So we > add the new end_time, and then we increment our timer until we get to the > new start_time we just got, and add (timer,unique) at each step. When we > get to start_time, we unique++. Now we get the next tuple and repeat. > > Case 2: the start time comes after the head of the priority queue, via a > peek. If this is the case, then we need to increment up to the current > head, emitting (timer,unique). Then when we get to the time_value equal to > that end_time, we unique, and check again if the start_time comes before > than the head of the priority queue. Until it does, we repeat step 2. Once > it does, we do step 1. > > I've attached a crude, untested UDF that does this. Buyer beware. But it > shows the general flow, and should be better than exploding the data (I > really hate exploding data like that unless it's absolutely necessary). > > To use, generate some data, then... > > register window.jar; > define window com.jcoveney.Window('30'); > a = load 'data' using PigStorage(',') as (uid:long,start:long,end:long); > b = foreach (group a all) { > ord = order a by start asc; > generate flatten(window(ord)); > } > dump b; > > to generate data, I first did just a small subsample just to think about > it, then I did (in python) > > import random > f=open("data","w") 
Re: creating a graph over time
I'll make it less hideous and submit a patch this weekend, then :)
2011/11/2 Ashutosh Chauhan <[EMAIL PROTECTED]> > Hey Jon, > > Your windowing udf will be very useful outside of this particular usecase. > It will be great if you can contribute it to PiggyBank. > > Thanks, > Ashutosh > > On Tue, Nov 1, 2011 at 10:44, Jonathan Coveney <[EMAIL PROTECTED]> wrote: > > > Okie dokie. So first, let's clarify and simplify the problem a little, > > especially to ensure that I know what is going on. > > > > Let's first just focus on a particular class. This is ok since presumably > > each class is independent. Now, we have user_id, start_time, and end_time > > (start_time+duration). If I understand correctly, a user_id should be > > included up to end_time+30s, since this is a 30s moving window. As such, > > we'll just ignore that side of things for now, because you can just > > transform people's start times accordingly. Further, the assumption is > that > > for a given user_id, you will not have overlapping start and end > > times...you can have multiple entries, ie "user 1, start 1, end 3; user > 1, > > start 5, end 7;" but you can't have them in this form: "user 1, start 1, > > end 3; user 1, start 2, end 4." > > > > So we have simplified the question to this: given: user_id, start_time, > > and end_time (which never overlap), how can I get a count of unique users > > for every second? So now we will design a UDF to generate that output as > a > > bag of (time, # of people) pairs, for every second from min(start_time) > to > > max(end_time). The UDF will accept a bag sorted on the start time. Now, > as > > I write it it's going to be a simple evalfunc, but it should be an > > accumulator. It's easy to make the transition. > > > > Here is what you do. Initialize a PriorityQueue. The natural ordering for > > int and long is fine, as it will ensure that when we poll it, we'll get > the > > earliest end time, which is what we want. > > > > So step one is to pull the first tuple, and get the start_time and > > end_time. The start time will set our time to start_time (which is > > min(start_time) since it was sorted on start_time), and we add the > end_time > > to the priority queue. We have a counter "uniques" which we increment. > > > > Now, before we actually do increment, we grab the next tuple. Why do you > > do this instead of go to the next end time? Because we don't know if > > someone starts in between now and the next end time. So we grab the tuple > > and get its start and end time. Now there are two cases. > > > > Case 1: the start time is less than the head of the priority queue, via a > > peek. If this is the case, then we can safely increment up to the > > start_time we just got, and then go from there. This is because it's > > impossible for there to be a new end_time less than the start_time we > just > > got, because they are ordered by start_time and end_time>start_time. So > we > > add the new end_time, and then we increment our timer until we get to the > > new start_time we just got, and add (timer,unique) at each step. When we > > get to start_time, we unique++. Now we get the next tuple and repeat. > > > > Case 2: the start time comes after the head of the priority queue, via a > > peek. If this is the case, then we need to increment up to the current > > head, emitting (timer,unique). Then when we get to the time_value equal > to > > that end_time, we unique, and check again if the start_time comes > before > > than the head of the priority queue. Until it does, we repeat step 2. > Once > > it does, we do step 1. > > > > I've attached a crude, untested UDF that does this. Buyer beware. But it > > shows the general flow, and should be better than exploding the data (I > > really hate exploding data like that unless it's absolutely necessary). > > > > To use, generate some data, then... > > > > register window.jar; > > define window com.jcoveney.Window('30'); > > a = load 'data' using PigStorage(',') as (uid 
Re: creating a graph over time
Yeha, that is awesome. Thank you very much Jonathan.
Marco On Wed, Nov 2, 2011 at 7:52 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote: > I'll make it less hideous and submit a patch this weekend, then :) > > 2011/11/2 Ashutosh Chauhan <[EMAIL PROTECTED]> > > > Hey Jon, > > > > Your windowing udf will be very useful outside of this particular > usecase. > > It will be great if you can contribute it to PiggyBank. > > > > Thanks, > > Ashutosh > > > > On Tue, Nov 1, 2011 at 10:44, Jonathan Coveney <[EMAIL PROTECTED]> > wrote: > > > > > Okie dokie. So first, let's clarify and simplify the problem a little, > > > especially to ensure that I know what is going on. > > > > > > Let's first just focus on a particular class. This is ok since > presumably > > > each class is independent. Now, we have user_id, start_time, and > end_time > > > (start_time+duration). If I understand correctly, a user_id should be > > > included up to end_time+30s, since this is a 30s moving window. As > such, > > > we'll just ignore that side of things for now, because you can just > > > transform people's start times accordingly. Further, the assumption is > > that > > > for a given user_id, you will not have overlapping start and end > > > times...you can have multiple entries, ie "user 1, start 1, end 3; user > > 1, > > > start 5, end 7;" but you can't have them in this form: "user 1, start > 1, > > > end 3; user 1, start 2, end 4." > > > > > > So we have simplified the question to this: given: user_id, start_time, > > > and end_time (which never overlap), how can I get a count of unique > users > > > for every second? So now we will design a UDF to generate that output > as > > a > > > bag of (time, # of people) pairs, for every second from min(start_time) > > to > > > max(end_time). The UDF will accept a bag sorted on the start time. Now, > > as > > > I write it it's going to be a simple evalfunc, but it should be an > > > accumulator. It's easy to make the transition. > > > > > > Here is what you do. Initialize a PriorityQueue. The natural ordering > for > > > int and long is fine, as it will ensure that when we poll it, we'll get > > the > > > earliest end time, which is what we want. > > > > > > So step one is to pull the first tuple, and get the start_time and > > > end_time. The start time will set our time to start_time (which is > > > min(start_time) since it was sorted on start_time), and we add the > > end_time > > > to the priority queue. We have a counter "uniques" which we increment. > > > > > > Now, before we actually do increment, we grab the next tuple. Why do > you > > > do this instead of go to the next end time? Because we don't know if > > > someone starts in between now and the next end time. So we grab the > tuple > > > and get its start and end time. Now there are two cases. > > > > > > Case 1: the start time is less than the head of the priority queue, > via a > > > peek. If this is the case, then we can safely increment up to the > > > start_time we just got, and then go from there. This is because it's > > > impossible for there to be a new end_time less than the start_time we > > just > > > got, because they are ordered by start_time and end_time>start_time. So > > we > > > add the new end_time, and then we increment our timer until we get to > the > > > new start_time we just got, and add (timer,unique) at each step. When > we > > > get to start_time, we unique++. Now we get the next tuple and repeat. > > > > > > Case 2: the start time comes after the head of the priority queue, via > a > > > peek. If this is the case, then we need to increment up to the current > > > head, emitting (timer,unique). Then when we get to the time_value equal > > to > > > that end_time, we unique, and check again if the start_time comes > > before > > > than the head of the priority queue. Until it does, we repeat step 2. > > Once > > > it does, we do step 1. > > > > > > I've attached a crude, untested UDF that does this. Buyer beware. But 
Re: creating a graph over time
Hi Guys,
Sorry for joining this discussion so late. I would suggest using interval trees for dealing with overlapping time intervals. There is a fairly nice treatment of interval trees in CLR, sect. 14.3. The data structure is essentially a redblack tree, and I surmise that one could extend java.util.TreeMap to implement it. Cheers, stan On Tue, Nov 1, 2011 at 1:44 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote: > Okie dokie. So first, let's clarify and simplify the problem a little, > especially to ensure that I know what is going on. > > Let's first just focus on a particular class. This is ok since presumably > each class is independent. Now, we have user_id, start_time, and end_time > (start_time+duration). If I understand correctly, a user_id should be > included up to end_time+30s, since this is a 30s moving window. As such, > we'll just ignore that side of things for now, because you can just > transform people's start times accordingly. Further, the assumption is that > for a given user_id, you will not have overlapping start and end times...you > can have multiple entries, ie "user 1, start 1, end 3; user 1, start 5, end > 7;" but you can't have them in this form: "user 1, start 1, end 3; user 1, > start 2, end 4." > > So we have simplified the question to this: given: user_id, start_time, and > end_time (which never overlap), how can I get a count of unique users for > every second? So now we will design a UDF to generate that output as a bag > of (time, # of people) pairs, for every second from min(start_time) to > max(end_time). The UDF will accept a bag sorted on the start time. Now, as I > write it it's going to be a simple evalfunc, but it should be an > accumulator. It's easy to make the transition. > > Here is what you do. Initialize a PriorityQueue. The natural ordering for > int and long is fine, as it will ensure that when we poll it, we'll get the > earliest end time, which is what we want. > > So step one is to pull the first tuple, and get the start_time and end_time. > The start time will set our time to start_time (which is min(start_time) > since it was sorted on start_time), and we add the end_time to the priority > queue. We have a counter "uniques" which we increment. > > Now, before we actually do increment, we grab the next tuple. Why do you do > this instead of go to the next end time? Because we don't know if someone > starts in between now and the next end time. So we grab the tuple and get > its start and end time. Now there are two cases. > > Case 1: the start time is less than the head of the priority queue, via a > peek. If this is the case, then we can safely increment up to the start_time > we just got, and then go from there. This is because it's impossible for > there to be a new end_time less than the start_time we just got, because > they are ordered by start_time and end_time>start_time. So we add the new > end_time, and then we increment our timer until we get to the new start_time > we just got, and add (timer,unique) at each step. When we get to start_time, > we unique++. Now we get the next tuple and repeat. > > Case 2: the start time comes after the head of the priority queue, via a > peek. If this is the case, then we need to increment up to the current head, > emitting (timer,unique). Then when we get to the time_value equal to that > end_time, we unique, and check again if the start_time comes before than > the head of the priority queue. Until it does, we repeat step 2. Once it > does, we do step 1. > > I've attached a crude, untested UDF that does this. Buyer beware. But it > shows the general flow, and should be better than exploding the data (I > really hate exploding data like that unless it's absolutely necessary). > > To use, generate some data, then... > > register window.jar; > define window com.jcoveney.Window('30'); > a = load 'data' using PigStorage(',') as (uid:long,start:long,end:long); > b = foreach (group a all) { > ord = order a by start asc; > generate flatten(window(ord)); 
Re: creating a graph over time
Just a heads up: I have a cleaner version (with tests!) here:
https://issues.apache.org/jira/browse/PIG2364 If you're still using this, I heavily suggest using the new version. 2011/11/4 Marco Cadetg <[EMAIL PROTECTED]> > Yeha, that is awesome. Thank you very much Jonathan. > Marco > > On Wed, Nov 2, 2011 at 7:52 PM, Jonathan Coveney <[EMAIL PROTECTED]> > wrote: > > > I'll make it less hideous and submit a patch this weekend, then :) > > > > 2011/11/2 Ashutosh Chauhan <[EMAIL PROTECTED]> > > > > > Hey Jon, > > > > > > Your windowing udf will be very useful outside of this particular > > usecase. > > > It will be great if you can contribute it to PiggyBank. > > > > > > Thanks, > > > Ashutosh > > > > > > On Tue, Nov 1, 2011 at 10:44, Jonathan Coveney <[EMAIL PROTECTED]> > > wrote: > > > > > > > Okie dokie. So first, let's clarify and simplify the problem a > little, > > > > especially to ensure that I know what is going on. > > > > > > > > Let's first just focus on a particular class. This is ok since > > presumably > > > > each class is independent. Now, we have user_id, start_time, and > > end_time > > > > (start_time+duration). If I understand correctly, a user_id should be > > > > included up to end_time+30s, since this is a 30s moving window. As > > such, > > > > we'll just ignore that side of things for now, because you can just > > > > transform people's start times accordingly. Further, the assumption > is > > > that > > > > for a given user_id, you will not have overlapping start and end > > > > times...you can have multiple entries, ie "user 1, start 1, end 3; > user > > > 1, > > > > start 5, end 7;" but you can't have them in this form: "user 1, start > > 1, > > > > end 3; user 1, start 2, end 4." > > > > > > > > So we have simplified the question to this: given: user_id, > start_time, > > > > and end_time (which never overlap), how can I get a count of unique > > users > > > > for every second? So now we will design a UDF to generate that output > > as > > > a > > > > bag of (time, # of people) pairs, for every second from > min(start_time) > > > to > > > > max(end_time). The UDF will accept a bag sorted on the start time. > Now, > > > as > > > > I write it it's going to be a simple evalfunc, but it should be an > > > > accumulator. It's easy to make the transition. > > > > > > > > Here is what you do. Initialize a PriorityQueue. The natural ordering > > for > > > > int and long is fine, as it will ensure that when we poll it, we'll > get > > > the > > > > earliest end time, which is what we want. > > > > > > > > So step one is to pull the first tuple, and get the start_time and > > > > end_time. The start time will set our time to start_time (which is > > > > min(start_time) since it was sorted on start_time), and we add the > > > end_time > > > > to the priority queue. We have a counter "uniques" which we > increment. > > > > > > > > Now, before we actually do increment, we grab the next tuple. Why do > > you > > > > do this instead of go to the next end time? Because we don't know if > > > > someone starts in between now and the next end time. So we grab the > > tuple > > > > and get its start and end time. Now there are two cases. > > > > > > > > Case 1: the start time is less than the head of the priority queue, > > via a > > > > peek. If this is the case, then we can safely increment up to the > > > > start_time we just got, and then go from there. This is because it's > > > > impossible for there to be a new end_time less than the start_time we > > > just > > > > got, because they are ordered by start_time and end_time>start_time. > So > > > we > > > > add the new end_time, and then we increment our timer until we get to > > the > > > > new start_time we just got, and add (timer,unique) at each step. When > > we > > > > get to start_time, we unique++. Now we get the next tuple and repeat. > > > > > > > > Case 2: the start time comes after the head of the priority queue, > via > > a > > > > peek. If this is the case, then we need to increment up to the 

