
Pig >> mail # user >> creating a graph over time


Re: creating a graph over time
Ahh, TV, that explains it.

A 12GB data file is a bit too big for R unless you sample; not sure if the
use case is conducive to sampling?

If it is, you could sample it down and structure it in Pig/Hadoop, then load
it into the analytical/visualization tool of your choice...

Guy

On Mon, Oct 31, 2011 at 8:55 AM, Marco Cadetg <[EMAIL PROTECTED]> wrote:

> The data is not about students but about television ;) Regarding the size:
> the raw input data size is about 150MB, although when I 'explode' the
> timeseries it will be around 80x bigger. I guess the average user duration
> will be around 40 minutes, which means sampling at a 30s interval will
> increase the size to ~12GB.
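The arithmetic above can be sketched quickly. This is only a back-of-envelope
estimate using the figures assumed in the thread (~150MB raw, 40-minute
average duration, 30s sampling interval), not measured values:

```python
# Rough size estimate for the "exploded" timeseries, using the
# figures quoted in the thread (assumptions, not measurements).
raw_mb = 150                  # raw input size: ~150MB
avg_duration_s = 40 * 60      # average user duration: ~40 minutes
interval_s = 30               # sampling interval: 30 seconds

blowup = avg_duration_s // interval_s   # exploded records per raw record
exploded_gb = raw_mb * blowup / 1024    # estimated exploded size in GB

print(blowup)                 # 80x blowup
print(round(exploded_gb, 1))  # ~11.7 GB, i.e. the ~12GB quoted above
```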
>
> I think that is a size my Hadoop cluster (five nodes, each 8-core / 8GB
> RAM / 2TB HD) should be able to cope with.
>
> I don't know about R. Are you able to handle 12GB files well in R (of
> course it depends on your computer, so assume an average business
> computer, e.g. 2-core, 2GHz, 4GB RAM)?
>
> Cheers
> -Marco
>
> On Fri, Oct 28, 2011 at 5:02 PM, Guy Bayes <[EMAIL PROTECTED]> wrote:
>
> > If it fits in R, it's trivial: draw a density plot or a histogram, about
> > three lines of R code.
> >
> > That's why I was wondering about the data volume.
> >
> > His example is students attending classes; if that is really the data,
> > it's hard to believe it's super huge?
> >
> > Guy
> >
> > > On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger <[EMAIL PROTECTED]>
> > > wrote:
> >
> > > Perhaps another way to approach this problem is to visualize it
> > > geometrically.  You have a long series of class session instances,
> > > where each class session is like a 1D line segment, beginning/stopping
> > > at some start/end time.
> > >
> > > These segments naturally overlap, and I think the question you're
> > > asking is equivalent to finding the number of overlaps at every
> > > subsegment.
> > >
> > > To answer this, you first want to break every class session into a
> > > full list of subsegments, where a subsegment is created by "breaking"
> > > each class session/segment into multiple parts at the start/end point
> > > of any other class session.  You can create this full set of
> > > subsegments in one pass by comparing pairwise (CROSS) each start/end
> > > point with your original list of class sessions.
> > >
> > > Once you have the full list of "broken" segments, a final GROUP
> > > BY/COUNT(*) will give you the number of overlaps.  This approach seems
> > > like it would be faster than the previous one if your class sessions
> > > are very long or there are many overlaps.
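The algorithm described above can be sketched in plain Python rather than Pig
Latin (the function name `overlap_counts` and the `(start, end)` tuple layout
are illustrative; a real pipeline would express the CROSS and GROUP BY in Pig):

```python
def overlap_counts(sessions):
    """Count overlapping sessions on every subsegment.

    sessions: list of (start, end) pairs. The breakpoints are the union
    of all start/end times; each adjacent pair of breakpoints defines a
    subsegment, and we count how many sessions fully cover it.
    """
    points = sorted({t for s, e in sessions for t in (s, e)})
    result = []
    for a, b in zip(points, points[1:]):
        # a session covers subsegment (a, b) iff it starts at or before a
        # and ends at or after b
        n = sum(1 for s, e in sessions if s <= a and e >= b)
        result.append(((a, b), n))
    return result

# Two sessions overlapping between t=5 and t=10:
print(overlap_counts([(0, 10), (5, 15)]))
# [((0, 5), 1), ((5, 10), 2), ((10, 15), 1)]
```

Note the pairwise check mirrors the cost of the CROSS: every subsegment is
compared against every session, so it is quadratic in the number of sessions.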
> > >
> > > Norbert
> > >
> > > On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes <[EMAIL PROTECTED]> wrote:
> > >
> > > > How big is your dataset?
> > > >
> > > > On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <[EMAIL PROTECTED]>
> > > > wrote:
> > > >
> > > > > Thanks Bill and Norbert, that seems like what I was looking for.
> > > > > I'm a bit worried about how much data/IO this could create. But
> > > > > I'll see ;)
> > > > >
> > > > > Cheers
> > > > > -Marco
> > > > >
> > > > > On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger
> > > > > <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > In case what you're looking for is an analysis over the full
> > > > > > learning duration, and not just the start interval, then one
> > > > > > further insight is that each original record can be transformed
> > > > > > into a sequence of records, where the size of the sequence
> > > > > > corresponds to the session duration.  In other words, you can
> > > > > > use a UDF to "explode" the original record:
> > > > > >
> > > > > > 1,marco,1319708213,500,math
> > > > > >
> > > > > > into:
> > > > > >
> > > > > > 1,marco,1319708190,500,math
> > > > > > 1,marco,1319708220,500,math
> > > > > > 1,marco,1319708250,500,math
> > > > > > 1,marco,1319708280,500,math
> > > > > > 1,marco,1319708310,500,math
> > > > > > 1,marco,1319708340,500,math
> > > > > > 1,marco,1319708370,500,math
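A sketch of that explode logic in plain Python (not an actual Pig UDF). It
assumes the record layout is `(id, user, start_ts, duration_s, subject)`, i.e.
that the fourth field is a duration in seconds, and that the start timestamp
is aligned down to the 30s boundary, as the example output suggests:

```python
def explode(record, interval=30):
    """Expand one session record into one record per sampling tick.

    Assumes (id, user, start_ts, duration_s, subject) layout; the start
    is aligned down to the interval boundary and one record is emitted
    for every tick at which the session is active.
    """
    rid, user, start, duration, subject = record
    end = start + duration
    t = start - (start % interval)      # align down to 30s boundary
    rows = []
    while t < end:
        rows.append((rid, user, t, duration, subject))
        t += interval
    return rows

rows = explode((1, "marco", 1319708213, 500, "math"))
print(rows[0])  # (1, 'marco', 1319708190, 500, 'math')
print(rows[1])  # (1, 'marco', 1319708220, 500, 'math')
```

The first emitted rows line up with the listing above; whether the fourth
field really is a duration in seconds is an assumption made here for the
sake of the sketch.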