

Re: creating a graph over time
How big is your dataset?

On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <[EMAIL PROTECTED]> wrote:

> Thanks Bill and Norbert, that seems like what I was looking for. I'm a bit
> worried about how much data/IO this could create. But I'll see ;)
>
> Cheers
> -Marco
>
> On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger <[EMAIL PROTECTED]> wrote:
>
> > If what you're looking for is an analysis over the full learning
> > duration, and not just the start interval, then one further insight is
> > that each original record can be transformed into a sequence of
> > records, one per 30-second interval that the session overlaps. In other
> > words, you can use a UDF to "explode" the original record:
> >
> > 1,marco,1319708213,500,math
> >
> > into:
> >
> > 1,marco,1319708190,500,math
> > 1,marco,1319708220,500,math
> > 1,marco,1319708250,500,math
> > 1,marco,1319708280,500,math
> > 1,marco,1319708310,500,math
> > 1,marco,1319708340,500,math
> > 1,marco,1319708370,500,math
> > 1,marco,1319708400,500,math
> > 1,marco,1319708430,500,math
> > 1,marco,1319708460,500,math
> > 1,marco,1319708490,500,math
> > 1,marco,1319708520,500,math
> > 1,marco,1319708550,500,math
> > 1,marco,1319708580,500,math
> > 1,marco,1319708610,500,math
> > 1,marco,1319708640,500,math
> > 1,marco,1319708670,500,math
> > 1,marco,1319708700,500,math
> >
> > and then use Bill's suggestion to group by course, interval.
> >
> > Norbert
> >
> > On Thu, Oct 27, 2011 at 11:05 AM, Bill Graham <[EMAIL PROTECTED]> wrote:
> > > You can pass your time to a UDF that rounds it down to the nearest
> > > 30-second interval, and then group by course, interval to get counts
> > > for each course, interval.
> > >
> > > On Thursday, October 27, 2011, Marco Cadetg <[EMAIL PROTECTED]> wrote:
> > >> I have a problem where I don't know how, or if, Pig is even suitable to
> > >> solve it.
> > >>
> > >> I have a schema like this:
> > >>
> > >> student-id,student-name,start-time,duration,course
> > >> 1,marco,1319708213,500,math
> > >> 2,ralf,1319708111,112,english
> > >> 3,greg,1319708321,333,french
> > >> 4,diva,1319708444,80,english
> > >> 5,susanne,1319708123,2000,math
> > >> 1,marco,1319708564,500,french
> > >> 2,ralf,1319708789,123,french
> > >> 7,fred,1319708213,5675,french
> > >> 8,laura,1319708233,123,math
> > >> 10,sab,1319708999,777,math
> > >> 11,fibo,1319708789,565,math
> > >> 6,dan,1319708456,50,english
> > >> 9,marco,1319708123,60,english
> > >> 12,bo,1319708456,345,math
> > >> 1,marco,1319708789,673,math
> > >> ...
> > >> ...
> > >>
> > >> I would like to retrieve a graph (interpolation) over time grouped by
> > >> course, meaning how many students are learning for a course in each
> > >> 30-second interval.
> > >> The grouping by course is easy, but from there I have no clue how I would
> > >> achieve the rest. I guess the rest needs to be achieved via some UDF,
> > >> or is there any way to do this in Pig? I often think that I need a
> > >> "for loop" or something similar in Pig.
> > >>
> > >> Thanks for your help!
> > >> -Marco
> > >>
> > >
> >
>
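
The two suggestions in this thread (round each timestamp down to a 30-second
boundary, then "explode" each session into one record per interval and group
by course, interval) can be sketched in plain Python. This is only an
illustration of the logic, not a Pig UDF from the thread: the function names
round_down, explode and count_students are hypothetical, and the sketch
assumes a session covers every interval from start-time through
start-time + duration inclusive, which matches Norbert's example above.

```python
from collections import Counter

INTERVAL = 30  # seconds, the interval size used in the thread


def round_down(ts, interval=INTERVAL):
    """Round a Unix timestamp down to the start of its interval."""
    return ts - ts % interval


def explode(start, duration, interval=INTERVAL):
    """Return the start of every interval that the session overlaps.

    Assumption: the session runs from start through start + duration.
    """
    first = round_down(start, interval)
    last = round_down(start + duration, interval)
    return list(range(first, last + interval, interval))


def count_students(records, interval=INTERVAL):
    """Count records per (course, interval) after exploding each session."""
    counts = Counter()
    for _student_id, _name, start, duration, course in records:
        for bucket in explode(start, duration, interval):
            counts[(course, bucket)] += 1
    return counts


# Two rows taken from the sample data in the original question.
records = [
    (1, "marco", 1319708213, 500, "math"),
    (8, "laura", 1319708233, 123, "math"),
]
counts = count_students(records)
```

With these inputs, explode(1319708213, 500) returns the 18 timestamps from
1319708190 to 1319708700 listed in Norbert's message. Each input record turns
into roughly duration / 30 + 1 output records, which is the source of the
data/IO growth Marco mentions. Note that the sketch counts records, so a
student with two overlapping sessions for the same course would be counted
twice in the shared intervals.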