Pig >> mail # user >> Design question - parsing clickstream with query parameters


Re: Design question - parsing clickstream with query parameters
Does it make sense to just use a UDF for each dimension? For instance, if
there are 2 dimensions:

1. geo/network
2. visitor

We write 2 UDFs that convert the query parameters into the respective
format, and the output then gets stored in 2 separate files, one per
dimension. I am thinking UDFs would give more control over how we process
it than using maps.
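
A rough sketch of that idea, in plain Java so it runs without Pig on the
classpath. It parses the query string into a Map once, then one function per
dimension picks out its own fields. The class name, the method names, and the
parameter names (zip, state, os, browser) are made up for illustration and are
not from the real clickstream. In an actual Pig job each dimension function
would be wrapped in a class extending EvalFunc, and the two outputs would be
written with separate STORE statements.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryParamDimensions {

    // Parse the query string of a URL into decoded key -> value pairs.
    // Returns an empty map when the URL has no query string.
    public static Map<String, String> parseQuery(String url) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        int q = url.indexOf('?');
        if (q < 0) {
            return params;
        }
        String query = url.substring(q + 1);
        int hash = query.indexOf('#');          // drop any #fragment
        if (hash >= 0) {
            query = query.substring(0, hash);
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;                       // tolerate "a=1&&b=2"
            }
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(key), decode(value));
        }
        return params;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Missing parameters become empty fields so every record has the same width.
    private static String field(Map<String, String> params, String key) {
        String v = params.get(key);
        return v == null ? "" : v;
    }

    // Hypothetical geo dimension: zip and state, tab-separated.
    public static String geoRecord(Map<String, String> params) {
        return field(params, "zip") + "\t" + field(params, "state");
    }

    // Hypothetical visitor dimension: OS and browser, tab-separated.
    public static String visitorRecord(Map<String, String> params) {
        return field(params, "os") + "\t" + field(params, "browser");
    }

    public static void main(String[] args) {
        Map<String, String> p = parseQuery(
            "http://example.com/track?zip=92122&state=CA&os=Unix&browser=FireFox");
        System.out.println(geoRecord(p));
        System.out.println(visitorRecord(p));
    }
}
```

Parsing once into a Map and then projecting per dimension keeps the URL
handling in one place. The per-dimension functions still give full control
over the output format of each file.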

On Fri, Jun 15, 2012 at 3:34 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> We just use the Java Map class, with the restriction that the key must be a
> String. There are some helper methods in trunk to work with maps, and you
> can use # to dereference, i.e. map#'key'
>
> 2012/6/15 Mohit Anchlia <[EMAIL PROTECTED]>
>
> > On Fri, Jun 15, 2012 at 9:12 AM, Alan Gates <[EMAIL PROTECTED]> wrote:
> >
> > > This seems reasonable, except it seems like it would make more sense to
> > > convert query parameters to maps.  By definition a query parameter is
> > > key=value.  And a map is easier to work with in general than a bag,
> > > since there's no need to flatten them.
> > >
> > I've never used them. Is this a Map format in Hadoop?
> >
> >
> > > Alan.
> > >
> > > On Jun 11, 2012, at 10:55 AM, Mohit Anchlia wrote:
> > >
> > > > > I am looking at how to parse URLs with query parameters to process
> > > > > clickstream data. Are there any examples I can look at? The steps
> > > > > that I envision are:
> > > >
> > > > > 1) Read lines and convert query parameters into bags, where each bag
> > > > > is a group of fields for a particular dimension table. So if Geo is
> > > > > one of the dimensions, group all the geo-related information from
> > > > > that URL as a bag. In the end it would look like
> > > > > {{92122,CA},{Unix,FireFox}}. In this example the first bag is the
> > > > > GEO dimension and the second is the Browser dimension.
> > > > 2) Load these into OLAP staging database
> > > > 3) Populate star schema from staging tables
> > > >
> > > > > I am sure other people are already doing this, so I thought I'd
> > > > > check whether this makes sense.
> > >
> > >
> >
>