HBase, mail # user - data structure


Re: data structure
Otis Gospodnetic 2011-07-30, 00:42
Hi Andre,
 
In the course of developing some of our HBase-based products we've built a generic aggregation framework that is flexible and extensible, and it sounds like it could build the reports you are after. We run a MapReduce aggregation job that reads raw data (e.g. your impression data) from either HDFS or HBase (we write our data to HBase via Flume), computes aggregates of various kinds (defined in a config), and stores those aggregates back in HBase, from which a web front-end can retrieve them very quickly while filtering by various criteria.

I'm not exactly sure what takes 70 seconds in your case - aggregating the raw data, or retrieving data to generate the reports? If it's the former, 70 seconds sounds acceptable; if it's the latter, see above for how we handle it. I hope this helps a bit.
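To make the pre-aggregation idea concrete: here is a minimal, self-contained sketch in plain Java (class and method names are invented; Otis's actual framework is not shown in this thread) of the kind of aggregate such a job computes - bucketing raw impression timestamps by day, so the front-end reads precomputed per-day counts instead of scanning raw rows.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the aggregation step: bucket raw impression
// timestamps by UTC day, producing the per-day counts a reducer would
// write back to HBase (day bucket as part of the aggregate row key).
public class ImpressionAggregator {
    private static final DateTimeFormatter DAY =
        DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC);

    // epoch millis -> "yyyyMMdd" day bucket
    static String dayBucket(long epochMillis) {
        return DAY.format(Instant.ofEpochMilli(epochMillis));
    }

    // Aggregate raw impression timestamps into per-day counts.
    static Map<String, Long> countByDay(long[] impressionTimestamps) {
        Map<String, Long> counts = new HashMap<>();
        for (long ts : impressionTimestamps) {
            counts.merge(dayBucket(ts), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] raw = {1310680320000L, 1310683920000L, 1312070520000L};
        System.out.println(countByDay(raw));
    }
}
```

A "how many impressions in the last x days" report then becomes a handful of point reads over these precomputed rows rather than a full scan of the raw table.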

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Hadoop - HBase
Hadoop ecosystem search :: http://search-hadoop.com/
>________________________________
>From: Andre Reiter <[EMAIL PROTECTED]>
>To: [EMAIL PROTECTED]
>Sent: Thursday, July 14, 2011 3:52 PM
>Subject: data structure
>
>Hi everybody,
>
>We have our Hadoop + HBase cluster running at the moment on 6 servers.
>
>Everything is working just fine. We have a web application where data is stored with row key = user id (a meaningless UUID). Our users have a cookie, which holds the row key; behind this key are families with items, e.g. the family "impressions", where every impression is stored with its timestamp etc.
>
>The row key is defined as the user id to make real-time requests possible, so we can retrieve all of a user's data very fast.
>
>Now we are running MapReduce jobs to generate reports: for example, we want to know how many impressions were made by all users in the last x days. The scan of the MR job therefore runs over all data in our HBase table for that particular family. This currently takes about 70 seconds, which is actually a bit too long, and as the data grows the time will increase unless we add new workers to the cluster. We have 22 regions right now.
>
>The problem I see is that we cannot define a filter for the scan; the row key (user id) is just a UUID, with nothing meaningful in it.
>
>What can we do to improve (accelerate) the scan? Would it be advisable to store the data redundantly? For example, we could create a second table and store every impression twice: once with the user id as row key in the first table, and once with a timestamp as row key in the second table.
>The data volume would grow twice as fast, but scans on the second table would be many times faster than they are now.
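The second-table scheme described above amounts to a row-key design. A minimal sketch (invented names, assuming the standard HBase lexicographic byte ordering of row keys): an 8-byte big-endian timestamp followed by the 16-byte user UUID. For post-1970 timestamps, big-endian longs sort chronologically under byte-wise comparison, so "the last x days" becomes a bounded range scan instead of a full-table scan, while the UUID suffix keeps keys distinct across users.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical row-key layout for the proposed second table:
// [8-byte big-endian timestamp][16-byte user UUID] = 24 bytes total.
public class TimeRowKey {
    static byte[] rowKey(long timestampMillis, UUID userId) {
        ByteBuffer buf = ByteBuffer.allocate(8 + 16);
        buf.putLong(timestampMillis);                 // leading time prefix
        buf.putLong(userId.getMostSignificantBits()); // UUID high bits
        buf.putLong(userId.getLeastSignificantBits()); // UUID low bits
        return buf.array();
    }

    // Start/stop row bound for a half-open time range [from, to).
    static byte[] timePrefix(long timestampMillis) {
        return ByteBuffer.allocate(8).putLong(timestampMillis).array();
    }
}
```

A report over the last x days would then pass `timePrefix(from)` and `timePrefix(to)` as the scan's start and stop rows (e.g. `Scan.setStartRow`/`setStopRow` in the HBase client), touching only the regions holding that time window. Note that a purely time-leading key concentrates all current writes on one region, so some form of prefixing or salting may be worth considering as well.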
>
>comments are very appreciated
>
>andre
>