Pig, mail # user - creating a graph over time


Re: creating a graph over time
Norbert Burger 2011-10-28, 13:12
Perhaps another way to approach this problem is to visualize it
geometrically.  You have a long series of class session instances, where
each class session is like a 1D line segment, beginning/stopping at some
start/end time.

These segments naturally overlap, and I think the question you're asking is
equivalent to finding the number of overlaps at every subsegment.

To answer this, you want to first break every class session into a full list
of subsegments, where a subsegment is created by "breaking" each class
session/segment into multiple parts at the start/end point of any other
class session.  You can create this full set of subsegments in one pass by
comparing pairwise (CROSS) each start/end point with your original list of
class sessions.

Once you have the full list of "broken" segments, a final GROUP
BY/COUNT(*) will give you the number of overlaps.  This approach seems like
it would be faster than the previous one if your class sessions are very
long, or there are many overlaps.
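
To make the idea concrete, here is a minimal sketch of the same logic in plain Python rather than Pig. It assumes each session has already been reduced to a (course, start, end) tuple with end = start-time + duration, and that break points are only shared within a course; the function name count_overlaps is made up for illustration. The inner comparison of every break point against every session plays the role of the CROSS, and the counter plays the role of the GROUP BY/COUNT(*).

```python
from collections import Counter


def count_overlaps(sessions):
    """Count how many sessions cover each subsegment.

    sessions: iterable of (course, start, end) tuples with start < end.
    Returns a dict mapping (course, sub_start, sub_end) to a session count.
    """
    sessions = list(sessions)

    # Every start/end time is a potential break point for its course.
    points = {}
    for course, start, end in sessions:
        points.setdefault(course, set()).update((start, end))

    counts = Counter()
    for course, start, end in sessions:
        # Pairwise comparison (the CROSS step): keep the break points that
        # fall inside this session, including its own endpoints.
        cuts = sorted(p for p in points[course] if start <= p <= end)
        # Consecutive cuts delimit the "broken" subsegments of this session.
        for sub_start, sub_end in zip(cuts, cuts[1:]):
            counts[(course, sub_start, sub_end)] += 1  # GROUP BY / COUNT(*)
    return dict(counts)


if __name__ == "__main__":
    # Two of the math rows from the original question (end = start + duration).
    rows = [
        ("math", 1319708213, 1319708213 + 500),  # marco
        ("math", 1319708233, 1319708233 + 123),  # laura
    ]
    for key, n in sorted(count_overlaps(rows).items()):
        print(key, n)
```

Each session only produces as many rows as there are break points inside it, so the output size depends on the number of distinct start/end times rather than on the session length.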

Norbert

On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes <[EMAIL PROTECTED]> wrote:

> how big is your dataset?
>
> On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <[EMAIL PROTECTED]> wrote:
>
> > Thanks Bill and Norbert, that seems like what I was looking for. I'm a bit
> > worried about how much data/io this could create. But I'll see ;)
> >
> > Cheers
> > -Marco
> >
> > On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger <[EMAIL PROTECTED]> wrote:
> >
> > > In case what you're looking for is an analysis over the full learning
> > > duration, and not just the start interval, then one further insight is
> > > that each original record can be transformed into a sequence of
> > > records, where the size of the sequence corresponds to the session
> > > duration.  In other words, you can use a UDF to "explode" the original
> > > record:
> > >
> > > 1,marco,1319708213,500,math
> > >
> > > into:
> > >
> > > 1,marco,1319708190,500,math
> > > 1,marco,1319708220,500,math
> > > 1,marco,1319708250,500,math
> > > 1,marco,1319708280,500,math
> > > 1,marco,1319708310,500,math
> > > 1,marco,1319708340,500,math
> > > 1,marco,1319708370,500,math
> > > 1,marco,1319708400,500,math
> > > 1,marco,1319708430,500,math
> > > 1,marco,1319708460,500,math
> > > 1,marco,1319708490,500,math
> > > 1,marco,1319708520,500,math
> > > 1,marco,1319708550,500,math
> > > 1,marco,1319708580,500,math
> > > 1,marco,1319708610,500,math
> > > 1,marco,1319708640,500,math
> > > 1,marco,1319708670,500,math
> > > 1,marco,1319708700,500,math
> > >
> > > and then use Bill's suggestion to group by course, interval.
> > >
> > > Norbert
> > >
> > > > On Thu, Oct 27, 2011 at 11:05 AM, Bill Graham <[EMAIL PROTECTED]> wrote:
> > > > > You can pass your time to a udf that rounds it down to the nearest
> > > > > 30 second interval and then group by course, interval to get counts
> > > > > for each course, interval.
> > > >
> > > > > On Thursday, October 27, 2011, Marco Cadetg <[EMAIL PROTECTED]> wrote:
> > > > >> I have a problem where I don't know how or if pig is even suitable to
> > > > >> solve it.
> > > >>
> > > >> I have a schema like this:
> > > >>
> > > >> student-id,student-name,start-time,duration,course
> > > >> 1,marco,1319708213,500,math
> > > >> 2,ralf,1319708111,112,english
> > > >> 3,greg,1319708321,333,french
> > > >> 4,diva,1319708444,80,english
> > > >> 5,susanne,1319708123,2000,math
> > > >> 1,marco,1319708564,500,french
> > > >> 2,ralf,1319708789,123,french
> > > >> 7,fred,1319708213,5675,french
> > > >> 8,laura,1319708233,123,math
> > > >> 10,sab,1319708999,777,math
> > > >> 11,fibo,1319708789,565,math
> > > >> 6,dan,1319708456,50,english
> > > >> 9,marco,1319708123,60,english
> > > >> 12,bo,1319708456,345,math
> > > >> 1,marco,1319708789,673,math
> > > >> ...
> > > >> ...
> > > >>
> > > > >> I would like to retrieve a graph (interpolation) over time grouped by
> > > > >> course. Meaning how many students are learning for a course based on a 30