HBase user mailing list: how to store 100 billion short text messages with HBase


Thread:
tgh 2012-12-06, 02:51
Ian Varley 2012-12-06, 03:00
tgh 2012-12-06, 03:28
Ian Varley 2012-12-06, 03:44
tgh 2012-12-06, 07:27
tgh 2012-12-06, 08:01
Ian Varley 2012-12-06, 13:40

Re: Reply: how to store 100 billion short text messages with HBase
Hello,

If you want to use Lucene....why not use Lucene, or one of the fancy search
servers built on top of it - Solr(Cloud), ElasticSearch, or SenseiDB?
You can easily shard the index by time, lookup by key, and search using
full-text search with results sorted by some key value or relevance to the
query.
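
For a concrete feel, here is a minimal sketch of the "shard by time, look up by
key, full-text search sorted by time" idea with plain Lucene. It is purely
illustrative, assumes a recent Lucene release (8.x/9.x), and the index path and
field names ("id", "body", "ts") are made up for the example:

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class HourlyMessageIndex {
    public static void main(String[] args) throws Exception {
        // One index directory per hour bucket -- "sharding the index by time".
        Directory dir = FSDirectory.open(Paths.get("/data/index/2012120602"));

        // Index one message: exact-match key, analyzed body, timestamp for sorting.
        try (IndexWriter writer =
                 new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("id", "msg-0001", Field.Store.YES));
            doc.add(new TextField("body", "hello, how are you?", Field.Store.YES));
            doc.add(new NumericDocValuesField("ts", 1354759860000L));
            writer.addDocument(doc);
        }

        // Query the same hour bucket: look up by key, or full-text sorted by time.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query byKey = new TermQuery(new Term("id", "msg-0001"));
            Query byText = new TermQuery(new Term("body", "hello"));
            Sort newestFirst = new Sort(new SortField("ts", SortField.Type.LONG, true));
            TopDocs hits = searcher.search(byText, 10, newestFirst);
            System.out.println("text hits: " + hits.scoreDocs.length
                + ", key hits: " + searcher.search(byKey, 1).scoreDocs.length);
        }
    }
}

Solr(Cloud) and ElasticSearch give you the same primitives over HTTP, with the
sharding and replication handled for you.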

Otis
--
Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html
On Wed, Dec 5, 2012 at 10:28 PM, tgh <[EMAIL PROTECTED]> wrote:

> Thank you for your reply.
>
> I want to access the data with a Lucene search engine, that is, to retrieve
> any message by key, and I also want to get one hour of data together, so I
> am thinking of splitting the data into one table per hour. Or, if I store it
> all in one big table, is that better than storing it in 365 tables or in
> 365*24 tables? Which layout is best for my data access schema? I am also
> confused about how to build a secondary index in HBase if I use a keyword
> search engine such as Lucene.
>
>
> Could you help me?
> Thank you
>
> -------------
> Tian Guanhua
>
>
>
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]] on behalf of Ian Varley
> Sent: December 6, 2012 11:01
> To: [EMAIL PROTECTED]
> Subject: Re: how to store 100 billion short text messages with HBase
>
> Tian,
>
> The best way to think about how to structure your data in HBase is to ask
> the question: "How will I access it?". Perhaps you could reply with the
> sorts of queries you expect to be able to do over this data? For example,
> retrieve any single conversation between two people in < 10 ms; or show all
> conversations that happened in a single hour, regardless of participants.
> HBase only gives you fast GET/SCAN access along a single "primary" key (the
> row key) so you must choose it carefully, or else duplicate & denormalize
> your data for fast access.
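
As a rough illustration of that single-row-key access pattern (not from the
original thread; it assumes the HBase 2.x Java client, a hypothetical
"messages" table with one column family "m", and a row key made of an hour
bucket plus a message id):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class MessageLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("messages"))) {

            // GET: one message by its full row key ("<hourBucket>#<messageId>").
            Result one = table.get(new Get(Bytes.toBytes("2012120602#msg-0001")));
            System.out.println(Bytes.toString(
                one.getValue(Bytes.toBytes("m"), Bytes.toBytes("body"))));

            // SCAN: every message in one hour, via a range over the key prefix.
            Scan hour = new Scan()
                .withStartRow(Bytes.toBytes("2012120602"))
                .withStopRow(Bytes.toBytes("2012120603"));   // stop row is exclusive
            try (ResultScanner scanner = table.getScanner(hour)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}

Any access path other than this row key (for example, full-text search over
the message body) is where a secondary index in Lucene/Solr comes in.
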
>
> Your data size seems reasonable (but not overwhelming) for HBase. 100B
> messages x 1K bytes per message on average comes out to 100TB. That, plus 3x
> replication in HDFS, means you need roughly 300TB of space. If you have 13
> nodes (taking out 2 for redundant master services), that's a requirement for
> about 23TB of space per server. That's a lot, even these days. Did I get all
> that math right?
>
> On your question about multiple tables: a table in HBase is only a namespace
> for rowkeys, and a container for a set of regions. If it's a homogeneous data
> set, there's no advantage to breaking it into multiple tables; that's what
> regions within the table are for.
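
Purely as a sketch of that advice (assuming the HBase 2.x admin API and
made-up month-boundary split points), a single pre-split table looks like:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateMessagesTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("messages"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("m"))
                .build();

            // Pre-split the one table on row-key time prefixes ("2012MM01"), so
            // regions -- not separate tables -- spread the year across the cluster.
            byte[][] splits = new byte[11][];
            for (int month = 2; month <= 12; month++) {
                splits[month - 2] = Bytes.toBytes(String.format("2012%02d01", month));
            }
            admin.createTable(desc, splits);
        }
    }
}

In practice you would pick many more split points (and likely salt or reverse
part of the key) so that writes for the current hour do not all land on one
region.
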
>
> Ian
>
> ps - Please don't cross post to both dev@ and user@.
>
> On Dec 5, 2012, at 8:51 PM, tgh wrote:
>
> > Hi
> >       I am trying to use HBase to store 100 billion short text messages.
> > Each message has fewer than 1000 characters plus some other items; in all,
> > each message has fewer than 10 items.
> >       The whole data set is a stream covering about one year, and I want to
> > create multiple tables to store it. I have two ideas: one is to store each
> > hour of data in its own table, which for one year means 365*24 tables; the
> > other is to store each day of data in its own table, which for one year
> > means 365 tables.
> >
> >       I have about 15 computer nodes to handle this data, and I want to
> > know how to deal with it: the 365*24-table layout, the 365-table layout, or
> > some better idea.
> >
> >       I am really confused about HBase; it is powerful yet a bit complex
> > for me, isn't it?
> >       Could you give me some advice on the HBase data schema and other
> > matters?
> >       Could you help me?
> >
> >
> > Thank you
> > ---------------------------------
> > Tian Guanhua
> >
>
>
>