HBase, mail # user - HBase load distribution vs. scan efficiency

Re: HBase load distribution vs. scan efficiency
Amit Sela 2014-01-19, 21:02
If you use bulk load to insert your data, you could use the date as the key
prefix and choose the rest of the key so that each day splits evenly across
regions. You'll have X regions for every day, so 14X regions for the two weeks.
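
To make the key layout above concrete, here is a minimal sketch in plain Python (no HBase client involved). The names `REGIONS_PER_DAY`, `row_key`, `split_points`, and `window_scan_range` are hypothetical, as are the `|` separator, the use of a ticker symbol as the rest of the key, and the MD5-based bucket; it only illustrates one way to get X evenly loaded key ranges per day while keeping the date as the prefix.

```python
import hashlib
from datetime import date, timedelta

REGIONS_PER_DAY = 4  # the "X" in the reply; the value 4 is an assumption


def row_key(day: date, symbol: str) -> bytes:
    """Date prefix first, then a small hash bucket, then the rest of the key."""
    # A stable hash keeps a given symbol in the same bucket every day.
    bucket = hashlib.md5(symbol.encode()).digest()[0] % REGIONS_PER_DAY
    return f"{day:%Y%m%d}|{bucket:02d}|{symbol}".encode()


def split_points(days):
    """One split key per (day, bucket), usable as pre-split boundaries."""
    return [
        f"{d:%Y%m%d}|{b:02d}|".encode()
        for d in days
        for b in range(REGIONS_PER_DAY)
    ]


def window_scan_range(end_day: date, window_days: int = 14):
    """Start row (inclusive) and stop row (exclusive) covering the window."""
    start = end_day - timedelta(days=window_days - 1)
    stop = end_day + timedelta(days=1)
    return f"{start:%Y%m%d}".encode(), f"{stop:%Y%m%d}".encode()
```

Because the date stays in the leading position, the whole two-week window is one contiguous key range, so a single start/stop pair covers it while the per-day buckets spread that range over 14X regions.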
On Jan 19, 2014 8:39 PM, "Bill Q" <[EMAIL PROTECTED]> wrote:

> Hi,
> I am designing a schema to host a large volume of data in HBase. We
> collect daily trading data for some markets, and we run a moving-window
> analysis to make predictions based on a two-week window.
> Since everybody is going to pull the latest two weeks of data every day, if we
> put the date in the leading position of the key, we will have some hot
> regions. So we can use a bucketing approach (date modulo the number of
> buckets) to deal with this situation. However, if we have 200 buckets, we
> need to run 200 scans to extract all the data from the last two weeks.
> My questions are:
> 1. What happens when each scan returns its result? Will the scan results be
> sent to a sink-like place that collects and concatenates all of them?
> 2. Why might having 200 scans be a bad thing compared to having only 10
> scans?
> 3. Any suggestions to the design?
> Many thanks.
> Bill
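
For reference, the bucketing approach described in the question can be sketched as follows, in plain Python with no HBase client. The key layout (`bucket|date|...`), the `|` separator, and the function names are assumptions made for illustration. The sketch shows why N buckets mean N scan ranges, and that the per-bucket results have to be merged by whoever issues the scans, since each scan returns its own independent result stream.

```python
import heapq

NUM_BUCKETS = 200  # bucket count taken from the question


def bucket_scan_ranges(start_day: str, stop_day: str,
                       num_buckets: int = NUM_BUCKETS):
    """One (start_row, stop_row) pair per bucket; stop_row is exclusive."""
    return [
        (f"{b:03d}|{start_day}".encode(), f"{b:03d}|{stop_day}".encode())
        for b in range(num_buckets)
    ]


def strip_bucket(row: bytes) -> bytes:
    """Drop the bucket prefix so rows compare by date and remainder."""
    return row.split(b"|", 1)[1]


def merge_scans(per_bucket_rows):
    """Merge per-bucket results (each already sorted) into date order."""
    return list(heapq.merge(*per_bucket_rows, key=strip_bucket))
```

Each range is a separate scan with its own setup and round trips, so the cost of the merge step and the number of concurrent scans both grow with the bucket count.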