HBase >> mail # user >> how to store 100billion short text messages with hbase

Re: Re: Re: how to store 100billion short text messages with hbase

By sharding the table manually along the time dimension (which is what you're talking about: 365 * 24 different tables, one per hour), you can reduce the amount of data any one query has to deal with, because you can instruct your query to only go to the right table. However, that's (roughly) the same effect you'd get by making the time dimension the first part of your row key in HBase, and allowing HBase to do that "sharding" work for you, into Regions. The whole point of HBase is that it's a horizontally scalable database, which handles sharding of arbitrarily large (Petabyte-size) data sets into smaller, more manageable chunks called regions, and then manages running those regions smoothly even when machines fail. If you're going to do all of that yourself, you'd be better off using something like MySQL.

(I say "roughly" above because, by default, HBase won't choose nice even boundaries such as a single hour for its regions, so a query that scans an hour's worth of data might have to hit two regions instead of one. That won't make much difference; it can even help when the work is split by region, for example in a map/reduce job, since the two region servers can then be read in parallel. You can change the boundary behavior by implementing a custom region split policy (see HBASE-5304<https://issues.apache.org/jira/browse/HBASE-5304>), but you shouldn't need to; functionally, it's the same thing.)
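
As a rough illustration of the point above (a sketch, not HBase client code): if the row key leads with the hour, one hour of data is a contiguous key range inside a single sorted table, so a scan bounded by a start row and a stop row touches only those rows. The key layout (hour prefix plus zero-padded message id) and all function names below are assumptions made for illustration; a sorted Python list stands in for the table.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

def hour_prefix(ts):
    """Hypothetical key prefix: the hour, e.g. b"2012120612"."""
    return ts.strftime("%Y%m%d%H").encode("ascii")

def make_row_key(ts, message_id):
    """Hour prefix first, then a zero-padded message id (an assumed layout)."""
    return hour_prefix(ts) + b"%012d" % message_id

def hour_scan_range(ts):
    """Start row (inclusive) and stop row (exclusive) covering one hour."""
    start = ts.replace(minute=0, second=0, microsecond=0)
    return hour_prefix(start), hour_prefix(start + timedelta(hours=1))

def scan(sorted_keys, start, stop):
    """Range scan over sorted keys, standing in for a bounded table scan."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]

# One "table" holding three different hours of messages.
table = sorted(
    make_row_key(datetime(2012, 12, 6, hour, 30), i)
    for hour in (11, 12, 13)
    for i in range(3)
)

start_row, stop_row = hour_scan_range(datetime(2012, 12, 6, 12, 45))
rows = scan(table, start_row, stop_row)
```

Here the scan for the 12 o'clock hour returns only keys starting with 2012120612, even though all hours live in the same key space. This is the same selectivity a per-hour table name would give, without managing the tables by hand.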

If you're still confused about why regions are better than performing the sharding yourself, I'd recommend reading the links I sent in the previous email.


On Dec 6, 2012, at 2:01 AM, tgh wrote:

Meanwhile, we need Lucene to retrieve messages by keyword or by message
content (after NLP parsing), without using a timestamp or message ID; this
is a time-critical operation.
We also read one hour of data at a time, not through Lucene but by table
name, if we use the hour as the table name, such as 2012120612 for the
table holding data for 12 o'clock on Dec 6, 2012. Each such table holds
about 100 million to 200 million messages, and this operation is not very
time critical.
If we have 365*24 tables for one year, does that work? Or if we put one
year of data in ONE table, will it be faster than multiple tables, and why?
How does HBase manage ONE table, and how does it handle many tables?
I am really confused.

Could you help me?

Thank you
Tian Guanhua

[mailto:[EMAIL PROTECTED]] On Behalf Of tgh
Sent: December 6, 2012 15:27
Subject: Re: Re: how to store 100billion short text messages with hbase

Thank you for your reply.

In our case, we need to use the Lucene search engine to retrieve short
messages stored in HBase, and this operation is time critical.
We also need to access the last hour's data in HBase, that is, read out one
hour of data, and this operation is not very time critical. One hour of
data is about 100 million to 200 million messages.
Meanwhile, when Lucene retrieves data from HBase, a query may return 1k to
100k messages, and we need to guarantee this is fast enough.
In this case, if we use one table, then whenever Lucene fetches a message,
HBase has to locate it among 100 billion messages; if we use 365*24 tables
or 365 tables, HBase has to search through far fewer messages.

I am really confused about why ONE table is more suitable than multiple
tables.
Could you give me some help?

Thank you
Tian Guanhua

[mailto:[EMAIL PROTECTED]] On Behalf Of Ian
Sent: December 6, 2012 11:44
Subject: Re: Re: how to store 100billion short text messages with hbase

In this case, your best bet may be to come up with an ID structure for these
messages that incorporates (leads with) the timestamp; then have Lucene use
that as the key when retrieving any given message. For example, the ID could
consist of:

{timestamp} + {unique id}

(Beware: if you're going to load data with this schema in real time, you'll
hot spot one region server; see http://hbase.apache.org/book.html#timeseries
for considerations related to this.)
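
One common way to soften this hot spot is to prefix the key with a small bucket number derived from the message id, so that concurrent writes spread over several key ranges instead of one. The sketch below assumes that approach; the bucket count, the key layout, and the function names are made up for illustration and are not taken from this thread or from the HBase book.

```python
import zlib

NUM_BUCKETS = 16  # assumed value; in practice tuned to the cluster size

def bucket_for(unique_id):
    """Deterministic bucket from a CRC32 of the id (stable across processes)."""
    return zlib.crc32(unique_id) % NUM_BUCKETS

def salted_key(timestamp, unique_id):
    """Assumed layout: 2-digit bucket + timestamp + unique id."""
    prefix = b"%02d" % bucket_for(unique_id)
    return prefix + timestamp.encode("ascii") + unique_id

def scan_ranges(start_ts, stop_ts):
    """One (start_row, stop_row) pair per bucket for a time-range read."""
    start = start_ts.encode("ascii")
    stop = stop_ts.encode("ascii")
    return [
        (b"%02d" % bucket + start, b"%02d" % bucket + stop)
        for bucket in range(NUM_BUCKETS)
    ]

key = salted_key("20121206123045", b"msg-000042")
ranges = scan_ranges("2012120612", "2012120613")
```

The trade-off is on the read side: fetching one time period now takes one range scan per bucket rather than a single scan, so the bucket count should stay small.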

Then, you can either scan over all data from one time period, or GET a
particular message by this (combined) unique ID. There are also types of
UUIDs that work in this way. But, with that much data, you may want to tune
it to get the smallest possible row key; depending on the granularity of
your timestamp and how unique the "unique" part really needs to be, you
might be able to get this down to < 16 bytes. (Consider that the smallest
possible unique representation of 100B items is 37 bits - that is, log base
2 of 100 billion, rounded up; but because you also want time to be a part of
it, you probably can't get anywhere near that small.)
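
To make the size arithmetic concrete, here is a small sketch. It assumes a 4-byte epoch-seconds timestamp followed by an 8-byte unique id; both widths and the function names are illustrative choices, not something prescribed in this thread. The result is a 12-byte key that still sorts by time.

```python
import math
import struct
from datetime import datetime, timezone

# Bits needed to give each of 100 billion items a distinct number.
# log2(100 billion) is about 36.5, so this rounds up to whole bits.
bits_needed = math.ceil(math.log2(100_000_000_000))

def pack_row_key(ts, unique_id):
    """Pack as big-endian so byte order matches numeric (and time) order."""
    return struct.pack(">IQ", int(ts.timestamp()), unique_id)

def unpack_row_key(key):
    """Inverse of pack_row_key: returns (UTC datetime, unique id)."""
    seconds, unique_id = struct.unpack(">IQ", key)
    return datetime.fromtimestamp(seconds, tz=timezone.utc), unique_id

earlier = pack_row_key(datetime(2012, 12, 6, 12, 0, tzinfo=timezone.utc), 7)
later = pack_row_key(datetime(2012, 12, 6, 13, 0, tzinfo=timezone.utc), 3)
```

Because the timestamp comes first and is big-endian, `earlier` compares lower than `later` as raw bytes regardless of the id values, which is what keeps a time period contiguous in the key space.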

If you need to scan over LOTS of data (as opposed to just looking up single
messages, or small sequential chunks of messages), consider just writing the
data to a file in HDFS and using map/reduce to process it. Scanning all 100B
of your records won't be possible in any short time frame (by my estimate
that would take about 10 hours), but you could do that with map/reduce using
an asynchronous model.

One table is still best for this; read up on what Regions are and why they
mean you don't need multiple tables for the same data:

There are no secondary indexes in HBase:
http://hbase.apache.org/book.html#secondary.indexes. If you use Lucene for
this, it'd need its own storage (though there are indeed projects that run
Lucene on top of HBase: http://www.infoq.com/articles/LuceneHbase).

On Dec 5, 2012, at 9:28 PM, tgh wrote: