HBase >> mail # user >> HBase load distribution vs. scan efficiency


Re: HBase load distribution vs. scan efficiency
If you use bulk load to insert your data, you could use the date as the key
prefix and choose the rest of the key in a way that splits each day evenly.
You'd have X regions for every day, i.e. 14X regions for the two-week
window.
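The key layout suggested above can be sketched as follows. This is a minimal, hypothetical illustration (the class name, delimiter, and `bucketsPerDay` parameter are all assumptions, not anything from the thread): the date leads the key so a day's data is contiguous, and a hashed suffix spreads each day across a configurable number of ranges for an even bulk-load split.

```java
// Hypothetical sketch, not from the thread: a row key with the date as
// prefix and a hash-derived suffix so a bulk load splits each day evenly
// across bucketsPerDay ranges (the "X regions per day" in the reply).
public class DailyKey {

    // bucketsPerDay is the "X" above (an assumed parameter; tune it to
    // your target region count per day).
    static String rowKey(String yyyymmdd, String recordId, int bucketsPerDay) {
        // floorMod keeps the bucket non-negative even for negative hashes
        int bucket = Math.floorMod(recordId.hashCode(), bucketsPerDay);
        // date first: one day's rows form bucketsPerDay contiguous ranges
        return String.format("%s|%02d|%s", yyyymmdd, bucket, recordId);
    }
}
```

With this layout a two-week read is still a single date-bounded scan per day-bucket range, rather than a full-table fan-out.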
On Jan 19, 2014 8:39 PM, "Bill Q" <[EMAIL PROTECTED]> wrote:

> Hi,
> I am designing a schema to host a large volume of data on HBase. We
> collect daily trading data for some markets and run a moving-window
> analysis to make predictions based on a two-week window.
>
> Since everybody is going to pull the latest two weeks of data every day, if
> we put the date in the lead position of the key, we will have some hot
> regions. So we can use a bucketing approach (hashing the key to a bucket
> number used as a prefix) to deal with this situation. However, if we have
> 200 buckets, we need to run 200 scans to extract all the data from the
> last two weeks.
>
> My questions are:
> 1. What happens when each scan returns its result? Will the scan results
> be sent to a sink-like place that collects and concatenates all of them?
> 2. Why might having 200 scans be worse than having only 10 scans?
> 3. Any suggestions to the design?
>
> Many thanks.
>
>
> Bill
>
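The 200-scan fan-out described in the question can be made concrete. Below is a hypothetical sketch (class and method names are my own, and the `<bucket>|<date>|...` key shape is assumed from the description): with a bucket prefix leading the key, reading one date window means building one `(startRow, stopRow)` range per bucket. With the HBase client, each pair would become a `Scan` whose results the application must merge itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the bucketed ("salted") layout from the question:
// keys look like "<bucket>|<date>|...", so reading one date window requires
// one scan per bucket. We only build the (startRow, stopRow) string pairs
// here; against a live cluster each pair would become an HBase Scan.
public class BucketedScans {

    static List<String[]> scanRanges(int buckets, String fromDate, String toDateExclusive) {
        List<String[]> ranges = new ArrayList<>();
        for (int b = 0; b < buckets; b++) {
            // zero-padded prefix so lexicographic row order matches bucket order
            String prefix = String.format("%03d|", b);
            ranges.add(new String[] { prefix + fromDate, prefix + toDateExclusive });
        }
        return ranges;
    }
}
```

With 200 buckets this yields 200 ranges, each hitting a different part of the keyspace; the client pays per-scan setup and RPC overhead 200 times, which is the cost the question is asking about relative to 10 buckets.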