Pig >> mail # user >> Design question - parsing clickstream with query parameters


Re: Design question - parsing clickstream with query parameters
Does it make sense to just use a UDF for each dimension? So, for
instance, if there are 2 dimensions:

1. geo/network
2. visitor

We write 2 UDFs that convert the query parameters into the respective format,
and the output then gets stored in a separate file for each dimension. I am
thinking UDFs would give more control over how we process the data than maps.
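
To make the per-dimension idea concrete, here is a minimal plain-Java sketch, not a tested Pig UDF. The parameter names (zip, state, vid, os, browser) and the tab-separated output are assumptions made for illustration only. In Pig, each of the two record methods would presumably live in the exec(Tuple) method of its own EvalFunc subclass.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch of "one function per dimension".
// Key names and output format are hypothetical, not taken from a real schema.
class DimensionFormatters {

    static final String[] GEO_KEYS = {"zip", "state"};
    static final String[] VISITOR_KEYS = {"vid", "os", "browser"};

    // Split "a=1&b=2" into an ordered key -> value map (no URL decoding here).
    static Map<String, String> parse(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        if (query == null || query.isEmpty()) {
            return params;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) { // skip pairs with no key or no '='
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    // Build one tab-separated record from the chosen keys.
    // A missing key becomes an empty field so the column count stays fixed.
    static String project(Map<String, String> params, String[] keys) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < keys.length; i++) {
            if (i > 0) {
                sb.append('\t');
            }
            String value = params.get(keys[i]);
            sb.append(value == null ? "" : value);
        }
        return sb.toString();
    }

    static String geoRecord(String query) {
        return project(parse(query), GEO_KEYS);
    }

    static String visitorRecord(String query) {
        return project(parse(query), VISITOR_KEYS);
    }

    public static void main(String[] args) {
        String query = "zip=92122&state=CA&vid=abc123&os=Unix&browser=FireFox";
        System.out.println(geoRecord(query));
        System.out.println(visitorRecord(query));
    }
}
```

Each record method could then feed its own STORE, giving one output file per dimension. Keeping the key lists as data rather than code means a new dimension only needs a new array.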

On Fri, Jun 15, 2012 at 3:34 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> We just use the Java Map class, with the restriction that the key must be a
> String. There are some helper methods in trunk to work with maps, and you
> can use # to dereference, i.e. map#'key'
>
> 2012/6/15 Mohit Anchlia <[EMAIL PROTECTED]>
>
> > On Fri, Jun 15, 2012 at 9:12 AM, Alan Gates <[EMAIL PROTECTED]> wrote:
> >
> > > This seems reasonable, except it seems like it would make more sense to
> > > convert query parameters to maps.  By definition a query parameter is
> > > key=value.  And a map is easier to work with in general than a bag, since
> > > there's no need to flatten it.
> > >
> > I've never used them. Is this Map format in Hadoop?
> >
> >
> > > Alan.
> > >
> > > On Jun 11, 2012, at 10:55 AM, Mohit Anchlia wrote:
> > >
> > > > > I am looking at how to parse URLs with query parameters to process
> > > > > clickstream data. Are there any examples I can look at? The steps I
> > > > > envision are:
> > > >
> > > > > 1) Read lines and convert query parameters into bags, where each bag
> > > > > is a group of fields for a particular dimension table. So if Geo is
> > > > > one of the dimensions, group all the geo-related information from
> > > > > that URL as a bag. In the end it would look like
> > > > > {{92122,CA},{Unix,FireFox}}. In this example the first bag is the
> > > > > GEO dimension and the second is the Browser dimension.
> > > > 2) Load these into OLAP staging database
> > > > 3) Populate star schema from staging tables
> > > >
> > > > > I am sure other people are already doing this, so I thought I would
> > > > > check whether this makes sense.
> > >
> > >
> >
>
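
For comparison, the map approach suggested in the quoted replies can be sketched in plain Java as well. This is only an illustration of turning a URL's key=value pairs into a Map with String keys. The class name QueryParams, the example URL, and its parameter names are hypothetical, and this is not Pig's own map helper code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

// Sketch: turn the query string of a URL into a Map<String, String>.
class QueryParams {

    // Percent-decode one key or value as UTF-8.
    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // URI.create throws IllegalArgumentException on a malformed URL,
    // so real clickstream lines may need cleaning before this call.
    static Map<String, String> toMap(String url) {
        Map<String, String> params = new HashMap<>();
        String query = URI.create(url).getRawQuery();
        if (query == null) {
            return params; // URL has no query part
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            // A bare key such as "flag" maps to the empty string.
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            // If a key repeats, the last occurrence wins.
            params.put(decode(key), decode(value));
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params =
            toMap("http://example.com/click?zip=92122&state=CA&city=San%20Diego");
        // Equivalent in spirit to params#'zip' in Pig Latin.
        System.out.println(params.get("zip"));
        System.out.println(params.get("city"));
    }
}
```

With the parameters in a map, a dimension is just a set of keys to look up, so no flattening step is needed before selecting fields.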