Design question - parsing clickstream with query parameters
Mohit Anchlia 2012-06-11, 17:55
I am looking at how to parse URLs with query parameters in order to process clickstream data. Are there any examples I can look at? The steps I envision are:
1) Read lines and convert the query parameters into bags, where each bag is the group of fields for a particular dimension table. So if Geo is one of the dimensions, group all the geo-related information from that URL into one bag. In the end it would look like {{92122,CA},{Unix,FireFox}}. In this example the first bag is the Geo dimension and the second is the Browser dimension.
2) Load these into an OLAP staging database.
3) Populate the star schema from the staging tables.
I am sure other people are already doing this, so I thought I'd check whether this makes sense.
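For reference, step 1 (grouping query parameters by dimension) could be sketched outside Pig in plain Python as follows. The parameter names (`zip`, `state`, `os`, `browser`), the `DIMENSIONS` layout, and the example URL are assumptions made up for illustration; none of them come from the thread.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical mapping from dimension name to the query parameters that feed it.
DIMENSIONS = {
    "geo": ("zip", "state"),
    "browser": ("os", "browser"),
}

def group_by_dimension(url):
    """Return one tuple of values per dimension, in DIMENSIONS order."""
    # parse_qs maps each parameter name to a list of values; keep the first value.
    params = parse_qs(urlsplit(url).query)
    return tuple(
        tuple(params.get(name, [None])[0] for name in fields)
        for fields in DIMENSIONS.values()
    )

row = group_by_dimension(
    "http://example.com/page?zip=92122&state=CA&os=Unix&browser=FireFox"
)
print(row)  # (('92122', 'CA'), ('Unix', 'FireFox'))
```

Parameters missing from a URL come back as `None` here, so every row keeps the same shape for the staging load.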
Re: Design question - parsing clickstream with query parameters
Alan Gates 2012-06-15, 16:12
This seems reasonable, except that it would make more sense to convert query parameters to maps. By definition a query parameter is key=value, and a map is generally easier to work with than a bag, since there's no need to flatten it.
Alan.
Re: Design question - parsing clickstream with query parameters
Mohit Anchlia 2012-06-15, 19:59
On Fri, Jun 15, 2012 at 9:12 AM, Alan Gates <[EMAIL PROTECTED]> wrote:
> This seems reasonable, except it seems like it would make more sense to convert query parameters to maps. By definition a query parameter is key=value. And a map is easier to work with in general than a bag, since there's no need to flatten them.

I've never used them. Is this Map format in Hadoop?
Re: Design question - parsing clickstream with query parameters
Jonathan Coveney 2012-06-15, 22:34
We just use the Java Map class, with the restriction that the key must be a String. There are some helper methods in trunk to work with maps, and you can use # to dereference, i.e. map#'key'
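As a rough illustration of the map approach, here is a sketch in plain Python rather than Pig Latin: the query string becomes a single map with string keys, and a field is read by key, much like `map#'key'`. The URL and parameter names are made up for the example.

```python
from urllib.parse import urlsplit, parse_qsl

def query_to_map(url):
    """Parse the query string of url into a dict with string keys and values."""
    # parse_qsl yields (key, value) pairs; if a key repeats, dict() keeps the last one.
    return dict(parse_qsl(urlsplit(url).query))

params = query_to_map("http://example.com/page?zip=92122&state=CA&browser=FireFox")
print(params["zip"])     # 92122
print(params.get("os"))  # None, because the key is absent
```

Because each value is addressed by key, nothing has to be flattened before a field is used.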
Re: Design question - parsing clickstream with query parameters
Mohit Anchlia 2012-06-15, 23:55
On Fri, Jun 15, 2012 at 3:34 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
> We just use the Java Map class, with the restriction that the key must be a String. There are some helper methods in trunk to work with maps, and you can use # to dereference, i.e. map#'key'

Thanks! If you don't mind sharing: once you flatten them, do you then load the result into the star schema in the database?
I think I need to look at maps.
Re: Design question - parsing clickstream with query parameters
Mohit Anchlia 2012-06-18, 17:36
Would it make sense to just use a UDF for each dimension? For instance, if there are two dimensions:
1. geo/network
2. visitor
We would write two UDFs, each converting the query parameters into the format for its dimension, and the output would then be stored in two separate files, one per dimension. I am thinking UDFs would give more control over how we process the data than using maps.
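The one-function-per-dimension idea could look roughly like this, sketched in plain Python instead of as Pig UDFs. The parameter names (`zip`, `state`, `isp`, `visitor_id`, `returning`) and the tab-separated output are assumptions for illustration only.

```python
import csv
import io
from urllib.parse import urlsplit, parse_qsl

def geo_network(params):
    """Hypothetical extractor for the geo/network dimension."""
    return [params.get("zip", ""), params.get("state", ""), params.get("isp", "")]

def visitor(params):
    """Hypothetical extractor for the visitor dimension."""
    return [params.get("visitor_id", ""), params.get("returning", "")]

EXTRACTORS = {"geo_network": geo_network, "visitor": visitor}

def split_by_dimension(urls, outputs):
    """Write one tab-separated row per URL to each dimension's output stream."""
    writers = {
        name: csv.writer(outputs[name], delimiter="\t", lineterminator="\n")
        for name in EXTRACTORS
    }
    for url in urls:
        params = dict(parse_qsl(urlsplit(url).query))
        for name, extract in EXTRACTORS.items():
            writers[name].writerow(extract(params))

# In-memory streams stand in for the two per-dimension files.
outputs = {name: io.StringIO() for name in EXTRACTORS}
split_by_dimension(["http://example.com/?zip=92122&state=CA&visitor_id=42"], outputs)
print(outputs["geo_network"].getvalue())
print(outputs["visitor"].getvalue())
```

Note that each extractor still reads from a parsed map, so the two approaches are not mutually exclusive: the map handles parsing, and the per-dimension function decides which fields to keep and in what order.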