HBase >> mail # user >> Key formats and very low cardinality leading fields

Re: Key formats and very low cardinality leading fields

So here's the larger question...
How does the data flow into the system? One source at a time?

The second field: is it sequential? If not sequential, will it at least be incrementally larger than the previous value? (That is, are you always inserting at the left side of the queue?)

How are you using the data when you pull it from the database?

'Hot spotting' may be unavoidable, and depending on other factors it may be a moot point.
On Sep 4, 2012, at 12:56 PM, Eric Czech <[EMAIL PROTECTED]> wrote:

> Longer term, what's really going to happen is more like this: I'll have a
> first field value of 1, 2, and maybe 3.  I won't know 4 - 10 for a while,
> and the *second* value after each initial value will be, although highly
> unique, relatively exclusive to a given first value.  This means that even
> if I didn't use the leading prefix, I'd have more or less the same problem
> where all the writes go to the same region when I introduce a new set of
> second values.
> In case the generalities are confusing, the prefix value is a data source
> identifier and the second value is an identifier for entities within that
> source.  The entity identifiers for a given source are likely to span
> different numeric or alpha-numeric ranges, but they probably won't be the
> same ranges across sources.  Also, I won't know all those ranges (or
> sources for that matter) upfront.
> I'm concerned about the introduction of a new data source (= leading prefix
> value) since the first writes will be to the same region and ideally I'd be
> able to get a sense of how the second values are split for the new leading
> prefix and split an HBase region to reflect that.  If that's not possible
> or just turns out to be a pain, then I can live with the introduction of
> the new prefix being a little slow until the regions split and distribute
> effectively.
> Does that make sense?
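The key layout Eric describes (a small data-source prefix followed by a per-source entity identifier) can be sketched in plain Java. This is an illustrative sketch, not code from the thread: the one-byte source id and the string entity id are assumptions about the layout, and the region-split observation in the comment follows from plain lexicographic byte ordering.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of the composite key layout described above: a one-byte data-source
// prefix followed by the entity identifier bytes. All names are illustrative.
public class CompositeKey {
    static byte[] rowKey(byte sourceId, String entityId) {
        byte[] entity = entityId.getBytes(StandardCharsets.UTF_8);
        byte[] key = new byte[1 + entity.length];
        key[0] = sourceId;
        System.arraycopy(entity, 0, key, 1, entity.length);
        return key;
    }

    public static void main(String[] args) {
        byte[] k1 = rowKey((byte) 1, "entity-0042");
        byte[] k2 = rowKey((byte) 2, "entity-0042");
        // Because HBase sorts rows lexicographically by byte, every key for
        // source 1 sorts before every key for source 2, so a single split
        // point at the bare prefix byte {2} cleanly separates the sources --
        // and, as Eric notes, all writes for a *new* source still start out
        // in whichever one region covers that new prefix.
        System.out.println(Arrays.compare(k1, k2) < 0);
    }
}
```

Note that `Arrays.compare` on `byte[]` treats bytes as signed, whereas HBase compares them as unsigned; for small positive prefix values like these the ordering is the same.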
> On Tue, Sep 4, 2012 at 1:34 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
>> I think you have to understand what happens as a table splits.
>> If you have a composite key where the first field has a value between
>> 0-9 and you pre-split your table, you will have all of your 1's going to
>> a single region until it splits. And both halves of that split will start
>> on the same node until they eventually get balanced out.
>> (Note: I'm not an expert on how HBase balances the regions across region
>> servers, so I couldn't tell you how it chooses which nodes to place each
>> region on.)
>> But what are you trying to do? Avoid a hot spot on the initial load, or
>> are you looking at the longer term picture?
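The pre-split Michael mentions for single-digit leading prefixes can be written out as explicit split keys. A minimal sketch, assuming ASCII digit prefixes; the `Admin.createTable` call that would consume these splits needs a live cluster, so it appears only in a comment.

```java
import java.nio.charset.StandardCharsets;

// Sketch: compute split keys so each digit prefix 1..9 opens its own region.
// With these splits, different prefixes land on different regions from the
// start -- but all rows beginning with "1" still share one region until that
// region itself splits, exactly as described above.
public class PreSplit {
    static byte[][] digitSplits() {
        byte[][] splits = new byte[9][];
        for (int d = 1; d <= 9; d++) {
            splits[d - 1] = String.valueOf(d).getBytes(StandardCharsets.UTF_8);
        }
        return splits;
    }

    public static void main(String[] args) {
        byte[][] splits = digitSplits();
        // On a real cluster these would be passed to HBase at creation time,
        // e.g.: admin.createTable(tableDescriptor, splits);
        for (byte[] s : splits) {
            System.out.println(new String(s, StandardCharsets.UTF_8));
        }
    }
}
```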
>> On Sep 3, 2012, at 2:58 PM, Eric Czech <[EMAIL PROTECTED]> wrote:
>>> With regards to:
>>>> If you have 3 region servers and your data is evenly distributed, that
>>>> means all the data starting with a 1 will be on server 1, and so on.
>>> Assuming there are multiple regions in existence for each prefix, why
>>> would they not be distributed across all the machines?
>>> In other words, if there are many regions with keys that generally
>>> start with 1, why would they ALL be on server 1 like you said?  It's
>>> my understanding that the regions aren't placed around the cluster
>>> according to the range of information they contain so I'm not quite
>>> following that explanation.
>>> Putting the higher cardinality values in front of the key isn't
>>> entirely out of the question, but I'd like to use the low cardinality
>>> key out front for the sake of selecting rows for MapReduce jobs.
>>> Otherwise, I always have to scan the full table for each job.
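Eric's reason for the low-cardinality prefix is row selection for MapReduce: a scan bounded to one prefix touches only that key range instead of the whole table. The usual way to bound such a scan is to compute an exclusive stop row by incrementing the last byte of the prefix. This is an illustrative, dependency-free sketch of that idiom; the HBase calls that would consume the range are shown only in comments.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: derive the [start, stop) row range covering exactly one leading
// prefix, so a job over source "1" avoids a full-table scan.
public class PrefixRange {
    static byte[] stopRowFor(byte[] prefix) {
        byte[] stop = Arrays.copyOf(prefix, prefix.length);
        // Walk from the right, incrementing the first byte that is not 0xFF
        // and truncating everything after it.
        for (int i = stop.length - 1; i >= 0; i--) {
            if (stop[i] != (byte) 0xFF) {
                stop[i]++;
                return Arrays.copyOf(stop, i + 1);
            }
        }
        return new byte[0]; // prefix was all 0xFF: scan to end of table
    }

    public static void main(String[] args) {
        byte[] start = "1".getBytes(StandardCharsets.UTF_8);
        byte[] stop = stopRowFor(start);
        // On a real cluster: scan.withStartRow(start).withStopRow(stop),
        // then hand the Scan to TableMapReduceUtil.initTableMapperJob(...).
        System.out.println(new String(stop, StandardCharsets.UTF_8)); // "2"
    }
}
```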
>>> On Mon, Sep 3, 2012 at 3:20 PM, Jean-Marc Spaggiari
>>> <[EMAIL PROTECTED]> wrote:
>>>> Yes, you're right, but again, it will depend on the number of
>>>> regionservers and the distribution of your data.
>>>> If you have 3 region servers and your data is evenly distributed, that
>>>> means all the data starting with a 1 will be on server 1, and so on.