HBase, mail # user - how to store 100billion short text messages with hbase


Re: how to store 100billion short text messages with hbase
Otis Gospodnetic 2012-12-06, 21:45
Hello,

If you want to use Lucene... why not use Lucene, or one of the fancy search
servers built on top of it - Solr(Cloud), ElasticSearch, or SenseiDB?
You can easily shard the index by time, lookup by key, and search using
full-text search with results sorted by some key value or relevance to the
query.
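The time-sharded lookup Otis describes can be sketched in a few lines. This is an illustration only: the `messages-YYYY.MM.DD` index-naming convention and the function names are assumptions, not anything prescribed by Solr or ElasticSearch, but the routing idea (one index per day, fan a range query out to the days it covers) is the common pattern for time-series search.

```python
from datetime import datetime, timedelta

def index_for(ts: datetime) -> str:
    """Route a message to a daily index, e.g. 'messages-2012.12.06'."""
    return ts.strftime("messages-%Y.%m.%d")

def indices_for_range(start: datetime, end: datetime) -> list[str]:
    """Daily indices a time-range query must fan out to."""
    names = []
    day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    while day <= end:
        names.append(index_for(day))
        day += timedelta(days=1)
    return names
```

With daily indices, dropping a full day of expired data is a cheap index deletion rather than a mass delete of individual documents.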

Otis
--
Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html
On Wed, Dec 5, 2012 at 10:28 PM, tgh <[EMAIL PROTECTED]> wrote:

> Thank you for your reply.
>
> I want to access the data with a Lucene search engine: retrieve any
> message by key, and also fetch one hour of data together. So I am
> considering splitting the data into one table per hour. Or, if I store
> everything in one big table, is that better than 365 tables or 365*24
> tables? Which layout best fits my access pattern? I am also confused
> about how to build a secondary index in HBase if I use a keyword search
> engine such as Lucene.
>
>
> Could you help me?
> Thank you
>
> -------------
> Tian Guanhua
>
>
>
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]] On Behalf Of Ian
> Varley
> Sent: December 6, 2012 11:01
> To: [EMAIL PROTECTED]
> Subject: Re: how to store 100billion short text messages with hbase
>
> Tian,
>
> The best way to think about how to structure your data in HBase is to ask
> the question: "How will I access it?". Perhaps you could reply with the
> sorts of queries you expect to be able to do over this data? For example,
> retrieve any single conversation between two people in < 10 ms; or show all
> conversations that happened in a single hour, regardless of participants.
> HBase only gives you fast GET/SCAN access along a single "primary" key (the
> row key) so you must choose it carefully, or else duplicate & denormalize
> your data for fast access.
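Ian's point about choosing the row key for the access pattern can be sketched concretely. This is a hypothetical key scheme, not anything from the thread: an hour-bucket prefix (big-endian long) keeps each hour's messages contiguous for a prefix SCAN, while a message id suffix makes each key unique for point GETs.

```python
import struct

def row_key(epoch_millis: int, msg_id: str) -> bytes:
    """Hour-bucket prefix keeps one hour's messages contiguous in the
    table; the message id suffix makes the key unique."""
    hour = epoch_millis // 3_600_000  # milliseconds per hour
    return struct.pack(">q", hour) + msg_id.encode("utf-8")

def hour_scan_prefix(epoch_millis: int) -> bytes:
    """Prefix for a SCAN that returns exactly one hour of messages."""
    return struct.pack(">q", epoch_millis // 3_600_000)
```

One caveat: a purely time-leading key concentrates all current writes on one region (hotspotting), which is why HBase schema guides often add a salt or hash prefix when write throughput matters more than contiguous time scans.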
>
> Your data size seems reasonable (but not overwhelming) for HBase. 100B
> messages x 1K bytes per message on average comes out to 100TB. That, plus
> 3x
> replication in HDFS, means you need roughly 300TB of space. If you have 13
> nodes (taking out 2 for redundant master services) that's a requirement for
> about 23T of space per server. That's a lot, even these days. Did I get all
> that math right?
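The arithmetic above checks out; as a quick sanity check (using the thread's own assumptions of 100 billion messages, ~1KB each, 3x HDFS replication, and 13 of 15 nodes serving data):

```python
messages = 100_000_000_000   # 100 billion messages
avg_bytes = 1_000            # ~1KB per message, per the thread
replication = 3              # HDFS default replication factor
data_nodes = 15 - 2          # 15 nodes minus 2 for master services

raw_tb = messages * avg_bytes / 1e12      # raw data size in TB
replicated_tb = raw_tb * replication      # on-disk size after replication
per_node_tb = replicated_tb / data_nodes  # storage needed per data node
```

This ignores HBase/HDFS overhead (WALs, compaction headroom, block metadata), so real provisioning would need meaningfully more than the ~23TB per node it yields.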
>
> On your question about multiple tables: a table in HBase is only a
> namespace for rowkeys, and a container for a set of regions. If it's a
> homogeneous data set, there's no advantage to breaking it into multiple
> tables; that's what regions within the table are for.
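To make the regions-not-tables point concrete: instead of 365 or 365*24 tables, one table can be pre-split into regions along the key space. A minimal sketch, assuming the hour-bucketed key prefix is a big-endian long (the function and its parameters are illustrative, not an HBase API):

```python
import struct

def split_keys(start_hour: int, total_hours: int,
               hours_per_region: int) -> list[bytes]:
    """Region split points for ONE table keyed by an hour-bucket prefix.
    With hours_per_region=24, a year of data lands in ~365 regions of a
    single table instead of 365 separate tables."""
    return [struct.pack(">q", h)
            for h in range(start_hour + hours_per_region,
                           start_hour + total_hours,
                           hours_per_region)]
```

These split points would be handed to table creation (e.g. the splits argument of HBase's createTable) so regions exist up front rather than forming one at a time as data arrives.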
>
> Ian
>
> ps - Please don't cross post to both dev@ and user@.
>
> On Dec 5, 2012, at 8:51 PM, tgh wrote:
>
> > Hi,
> >       I am trying to use HBase to store 100 billion short text
> > messages. Each message has less than 1000 characters plus a few other
> > fields - fewer than 10 fields per message in total.
> >       The data arrives as a stream over about one year, and I plan to
> > create multiple tables to store it. I have two ideas: store each
> > hour's data in its own table (365*24 tables for a year), or store each
> > day's data in its own table (365 tables for a year).
> >
> >       I have about 15 computer nodes to handle this data, and I want
> > to know which layout is better - 365*24 tables, 365 tables, or some
> > other scheme entirely.
> >
> >       I am really confused about HBase; it is powerful yet a bit
> > complex for me, isn't it?
> >       Could you give me some advice on HBase data schema design?
> >       Could you help me?
> >
> >
> > Thank you
> > ---------------------------------
> > Tian Guanhua