HBase >> mail # user >> HBase Schema Design for clickstream data


Re: HBase Schema Design for clickstream data
That's not much information to base schema recommendations on. At a high level, though, you should structure your row keys so that you minimize the need for scans and can fetch the required data directly by row key.

So, putting the user ID in the row key would be desirable for the visitor-level aggregations. Append the session ID to it, and that will give you user+session-level aggregates.
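
To make the user+session key idea concrete, here is a minimal sketch in plain Python. It does not call any HBase client API; it only mimics how HBase orders rows by comparing key bytes lexicographically. The names `row_key`, `visitor_scan_range`, and `scan`, the NUL separator, the one-row-per-session layout, and the sample IDs and hit counts are all assumptions made for illustration, not something stated in this thread.

```python
SEP = b"\x00"  # assumed separator; visitor and session IDs must not contain a NUL byte


def row_key(visitor_id, session_id):
    """Build a composite key: visitor ID, separator, session ID (hypothetical layout)."""
    return visitor_id.encode("utf-8") + SEP + session_id.encode("utf-8")


def visitor_scan_range(visitor_id):
    """Return [start, stop) bounds that cover every session row of one visitor."""
    base = visitor_id.encode("utf-8")
    # Every key for this visitor starts with base + 0x00, so the first key
    # that can no longer match is base + 0x01.
    return base + SEP, base + b"\x01"


def scan(rows, start, stop):
    """Stand-in for a range scan: rows in sorted byte order with start <= key < stop."""
    return [(k, v) for k, v in sorted(rows.items()) if start <= k < stop]


# Made-up sample data: one row per session, with a hit counter as the value.
rows = {
    row_key("v1", "s1"): {"hits": 3},
    row_key("v1", "s2"): {"hits": 5},
    row_key("v10", "s1"): {"hits": 7},  # shares the text prefix "v1" but is a different visitor
    row_key("v2", "s9"): {"hits": 1},
}

# Session-level read: a direct lookup by the full key, no scan needed.
one_session = rows[row_key("v1", "s2")]

# Visitor-level read: a short range scan over that visitor's contiguous rows.
start, stop = visitor_scan_range("v1")
v1_sessions = scan(rows, start, stop)
total_hits = sum(v["hits"] for _, v in v1_sessions)
```

The separator is what keeps visitor `v10` out of the range for visitor `v1`. With fixed-width IDs it would be unnecessary. Whether keys that sort by visitor ID also distribute well across region servers depends on how the visitor IDs themselves are generated.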

Give us more information about the fields you are storing and the read patterns you need to support, and we'll try to give more concrete recommendations on the schema design.
On Wednesday, June 27, 2012 at 2:13 PM, Mohit Anchlia wrote:

> Analysis includes:
>
> Visitor level
> Session level - visitors could have multiple sessions
> Page hits, conversions - popular pages, sequence of pages hit in one session
> Orders purchased - mostly determined by URL and query parameters
>
> How should I go about designing schema?
>
> Thanks
>
>
> Sent from my iPad
>
> On Jun 27, 2012, at 2:01 PM, Amandeep Khurana <[EMAIL PROTECTED] (mailto:[EMAIL PROTECTED])> wrote:
>
> > Mohit,
> >
> > What would be your read patterns later on? Are you going to read per
> > session, or for a time period, or for a set of users, or process through
> > the entire dataset every time? That would play an important role in
> > defining your keys and columns.
> >
> > -Amandeep
> >
> > On Tue, Jun 26, 2012 at 1:34 PM, Mohit Anchlia <[EMAIL PROTECTED] (mailto:[EMAIL PROTECTED])> wrote:
> >
> > > I am starting out with a new application where I need to store users'
> > > clickstream data. I'll have a visitor ID and a session ID along with other
> > > page-related data. I am wondering if I should just key off a randomly
> > > generated session ID and store all the page-related data as columns inside
> > > that row, assuming that this would also give good distribution across region
> > > servers. In a session a user could send hundreds of HTTP requests and get
> > > responses. If someone is already doing this in HBase, I would like to learn
> > > more about how they have designed the schema.
> >
>