Pig user mailing list: creating a graph over time


Re: creating a graph over time
Ahh, TV. That explains it.

A 12 GB data file is a bit too big for R unless you sample; not sure if the use
case is conducive to sampling?

If it is, you could sample it down and structure it in Pig/Hadoop, and then
load it into the analytical/visualization tool of choice...

Guy
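
A minimal Pig Latin sketch of that sample-then-visualize route (the input
path, schema, and 1% rate are assumptions for illustration, not details from
the thread):

    -- Load the raw viewing records (schema assumed).
    raw = LOAD 'viewing_data' USING PigStorage(',')
          AS (id:int, user:chararray, start_ts:long, duration:int, channel:chararray);

    -- SAMPLE keeps each record with the given probability; ~1% of 12 GB is
    -- small enough for R on an average desktop.
    sampled = SAMPLE raw 0.01;

    -- Write the sample out for loading into R or another visualization tool.
    STORE sampled INTO 'viewing_data_sampled' USING PigStorage(',');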

On Mon, Oct 31, 2011 at 8:55 AM, Marco Cadetg <[EMAIL PROTECTED]> wrote:

> The data is not about students but about television ;) Regarding the size:
> the raw input data size is about 150 MB, although when I 'explode' the
> timeseries it will be around 80x bigger. I guess the average user duration
> will be around 40 minutes, so sampling at a 30 s interval turns each record
> into ~80 (2,400 s / 30 s), growing the data to ~12 GB.
>
> I think that is a size which my Hadoop cluster of five nodes (8-core, 8 GB
> RAM, 2 TB disk each) should be able to cope with.
>
> I don't know about R. Are you able to handle 12 GB files well in R (of
> course it depends on your computer, so assume an average business computer,
> e.g. 2-core, 2 GHz, 4 GB RAM)?
>
> Cheers
> -Marco
>
> On Fri, Oct 28, 2011 at 5:02 PM, Guy Bayes <[EMAIL PROTECTED]> wrote:
>
> > If it fits in R, it's trivial: draw a density plot or a histogram, about
> > three lines of R code.
> >
> > That's why I was wondering about the data volume.
> >
> > His example is students attending classes; if that is really the data,
> > it's hard to believe it's super huge?
> >
> > Guy
> >
> > On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger <[EMAIL PROTECTED]> wrote:
> >
> > > Perhaps another way to approach this problem is to visualize it
> > > geometrically. You have a long series of class session instances, where
> > > each class session is like a 1D line segment, beginning and ending at
> > > some start/end time.
> > >
> > > These segments naturally overlap, and I think the question you're asking
> > > is equivalent to finding the number of overlaps at every subsegment.
> > >
> > > To answer this, you first want to break every class session into a full
> > > list of subsegments, where a subsegment is created by "breaking" each
> > > class session/segment into multiple parts at the start/end point of any
> > > other class session. You can create this full set of subsegments in one
> > > pass by comparing pairwise (CROSS) each start/end point with your
> > > original list of class sessions.
> > >
> > > Once you have the full list of "broken" segments, a final GROUP
> > > BY/COUNT(*) will give you the number of overlaps. This approach seems
> > > like it would be faster than the previous one if your class sessions
> > > are very long, or there are many overlaps.
> > >
> > > Norbert
> > >
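
A rough Pig Latin sketch of the CROSS-then-GROUP approach Norbert describes
above. The relation names, schema, and input path are assumptions for
illustration; the thread gives only the outline:

    -- Each class session is a 1D segment; schema assumed for illustration.
    sessions = LOAD 'sessions' USING PigStorage(',')
               AS (id:int, user:chararray, start_ts:long, end_ts:long);

    -- Every session start/end time is a potential subsegment boundary.
    starts = FOREACH sessions GENERATE start_ts AS ts;
    ends   = FOREACH sessions GENERATE end_ts AS ts;
    points = UNION starts, ends;
    bounds = DISTINCT points;

    -- Pairwise compare (CROSS) each boundary with every session, keeping
    -- the sessions that span the subsegment starting at that boundary.
    pairs    = CROSS bounds, sessions;
    spanning = FILTER pairs BY bounds::ts >= sessions::start_ts
                          AND bounds::ts <  sessions::end_ts;

    -- The final GROUP BY / COUNT gives the overlap count for the
    -- subsegment beginning at each boundary.
    grouped = GROUP spanning BY bounds::ts;
    counts  = FOREACH grouped GENERATE group AS subsegment_start,
                                       COUNT(spanning) AS overlaps;

The CROSS is the expensive step; it is what buys the single pass, at the cost
of a pairwise comparison between boundaries and sessions.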
> > > On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes <[EMAIL PROTECTED]> wrote:
> > >
> > > > how big is your dataset?
> > > >
> > > > On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Thanks Bill and Norbert, that seems like what I was looking for. I'm
> > > > > a bit worried about how much data/IO this could create. But I'll
> > > > > see ;)
> > > > >
> > > > > Cheers
> > > > > -Marco
> > > > >
> > > > > On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > In case what you're looking for is an analysis over the full
> > > > > > learning duration, and not just the start interval, then one
> > > > > > further insight is that each original record can be transformed
> > > > > > into a sequence of records, where the length of the sequence
> > > > > > corresponds to the session duration. In other words, you can use
> > > > > > a UDF to "explode" the original record:
> > > > > >
> > > > > > 1,marco,1319708213,500,math
> > > > > >
> > > > > > into:
> > > > > >
> > > > > > 1,marco,1319708190,500,math
> > > > > > 1,marco,1319708220,500,math
> > > > > > 1,marco,1319708250,500,math
> > > > > > 1,marco,1319708280,500,math
> > > > > > 1,marco,1319708310,500,math
> > > > > > 1,marco,1319708340,500,math
> > > > > > 1,marco,1319708370,500,math
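
A hedged sketch of how that explode step might look in Pig Latin.
ExplodeToIntervals and the jar path are hypothetical names, not from the
thread; the assumed UDF returns a bag of 30-second-aligned timestamps
covering [start_ts, start_ts + duration]:

    -- Jar containing the hypothetical UDF (path assumed).
    REGISTER 'myudfs.jar';

    -- (id, user, start_ts, duration_secs, subject); schema assumed.
    sessions = LOAD 'sessions' USING PigStorage(',')
               AS (id:int, user:chararray, start_ts:long, duration:int,
                   subject:chararray);

    -- FLATTEN expands the UDF's bag into one row per sample timestamp,
    -- reproducing the explosion shown above.
    exploded = FOREACH sessions GENERATE
                   id, user,
                   FLATTEN(myudfs.ExplodeToIntervals(start_ts, duration, 30))
                       AS sample_ts,
                   duration, subject;

    -- Grouping by sample timestamp then yields concurrent sessions over time.
    grouped = GROUP exploded BY sample_ts;
    counts  = FOREACH grouped GENERATE group AS sample_ts,
                                       COUNT(exploded) AS concurrent;

FLATTEN over a UDF-returned bag is the standard Pig idiom for this kind of
one-row-to-many explosion.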