HBase, mail # user - schema help


Re: schema help
Sonal Goyal 2011-08-26, 05:08
Hi Jimson,

Here are a few links that talk about the sorted architecture:

http://wiki.apache.org/hadoop/Hbase/DataModel
http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable

I think the original BigTable paper ought to have some details too; I'm
sorry, I haven't read it recently enough to quote it with authority.

Best Regards,
Sonal
Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>

On Fri, Aug 26, 2011 at 9:04 AM, Jimson K. James <[EMAIL PROTECTED]> wrote:

> Hi Ian,
>
> Can you point me to some reference on the key-sorted architecture in
> HBase? It seems there is not much documentation out there.
>
>
> -----Original Message-----
> From: Ian Varley [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, August 25, 2011 8:33 PM
> To: [EMAIL PROTECTED]
> Subject: Re: schema help
>
> The rows don't need to be inserted in order; they're maintained in
> key-sorted order on the disk based on the architecture of HBase, which
> stores data sorted in memory and periodically flushes to immutable files
> in HDFS (which are later compacted to make read access more efficient).
> HBase keeps track of which physical files might contain a given key
> range, and only reads the ones it needs to.
>
> To do a query through the Java API, you could create a scanner with a
> startrow that is the concatenation of your value for fieldA and the
> start time, and an endrow that concatenates fieldA with the current time.
>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html
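
For illustration, a minimal sketch of such a scan against the 0.90-era Java
client API. It assumes a hypothetical row-key layout of fieldA padded to 10
characters followed by the epoch timestamp as an 8-byte big-endian long, and
a table named "inventory"; both are assumptions, not taken from the thread.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FieldATimeRangeScan {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "inventory");      // table name assumed

            // Row-key prefix: fieldA padded to a fixed width of 10 characters.
            byte[] fieldA = Bytes.toBytes(String.format("%-10s", "zCORE"));
            long start = 1314180693L;                          // range start from the question
            long now = System.currentTimeMillis() / 1000L;     // "to now", in epoch seconds

            // Start row = [fieldA][start]; stop row = [fieldA][now + 1], since the
            // stop row of a Scan is exclusive.
            Scan scan = new Scan(Bytes.add(fieldA, Bytes.toBytes(start)),
                                 Bytes.add(fieldA, Bytes.toBytes(now + 1)));

            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    // Each Result is one matching row; print its key as an example.
                    System.out.println(Bytes.toStringBinary(r.getRow()));
                }
            } finally {
                scanner.close();
                table.close();
            }
        }
    }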
>
> Ian
>
> On Aug 25, 2011, at 9:53 AM, Rita wrote:
>
> Thanks for your response.
>
> 30 million rows is the best case :-)
>
> A couple of questions about using [fieldA][time] as my key:
>  Would I have to insert in order?
>  If not, how would HBase know where to stop instead of scanning the entire table?
>  What would a query actually look like if my key were [fieldA][time]?
>
> As a matter of fact, I can make 100% of my queries fit this form; I will
> leave the other 5% out of my project/schema.
>
>
> On Thu, Aug 25, 2011 at 10:13 AM, Ian Varley <[EMAIL PROTECTED]> wrote:
> Rita,
>
> There's no need to create separate tables here--the table is really just
> a "namespace" for keys. A better option would probably be having one
> table with "[fieldA][time]" (the two fields concatenated) as your row
> key. Then, you can seek directly to the start of your records in
> constant time, and then scan forward until you get to the end of the
> data (linear time in the size of data you expect to get back).
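
For illustration, a minimal sketch of building such a concatenated row key
with HBase's Bytes utility, assuming (this is not from the thread) that fieldA
is padded to a fixed width of 10 characters and the timestamp is stored as an
8-byte big-endian long, so keys sort first by fieldA and then chronologically:

    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical key layout: fieldA padded to 10 characters, then epoch seconds
    // as an 8-byte long. Fixed-width padding keeps "zCORE" from bleeding into a
    // "zCOREX" range, and Bytes.toBytes(long) is big-endian, so non-negative
    // timestamps sort chronologically within each fieldA prefix.
    byte[] rowKey = Bytes.add(
            Bytes.toBytes(String.format("%-10s", "zCORE")),   // fieldA, space-padded
            Bytes.toBytes(1314180693L));                      // unix epoch seconds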
>
> The downside of this is that for the 5% of your queries that aren't in
> this form, you may have to do a full table scan. (Alternately, you could
> also maintain secondary indexes that help you get the data back with
> less than a full table scan; that would depend on the nature of the
> queries).
>
> In general, a good rule of thumb when designing a schema in HBase is,
> think first about how you'd ideally like to access the data. Then
> structure the data to match that access pattern. (This is obviously not
> ideal if you have lots of different access patterns, but then, that's
> what relational databases are for. Most commercial relational DBs
> wouldn't blink at doing analytical queries against 30 million rows.)
>
> Ian
>
> On Aug 25, 2011, at 9:03 AM, Rita wrote:
>
> Hello,
>
> I am trying to solve a time-related problem. I can certainly use OpenTSDB
> for this, but was wondering if anyone had a clever way to create this type
> of schema.
>
> I have an inventory table,
>
> time (unix epoch), fieldA, fieldB, data
>
>
> There are about 30 million of these entries.
>
> 95% of my queries will look like this:
> show me where fieldA=zCORE from range [1314180693 to now]
>
> For fieldA, there is a possibility of 4000 unique items.
> For fieldB, there is a possibility of 2 unique items (boolean).
>
> So, I was thinking of creating 4000*2 tables and place the data like
> that so