HBase >> mail # user >> Key formats and very low cardinality leading fields


Jean-Marc Spaggiari 2012-09-03, 20:11
Re: Key formats and very low cardinality leading fields
You can also look at pre-splitting the regions for timeseries type data.
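
As a rough illustration of what pre-splitting could look like for a key with a low-cardinality leading field, here is a minimal sketch that only computes candidate split keys. It does not call any HBase API. The function name `make_split_keys` and the sub-boundaries `h` and `q` are hypothetical choices for the example; real boundaries should follow the actual distribution of the second key field.

```python
def make_split_keys(prefixes, sub_boundaries):
    """Build sorted split keys from a few leading prefixes.

    Each prefix gets one boundary at its start plus one per sub-boundary,
    so the data under a single prefix is spread over several regions.
    Keys are bytes because HBase compares row keys byte by byte.
    """
    keys = set()
    for prefix in prefixes:
        keys.add(prefix)
        for boundary in sub_boundaries:
            keys.add(prefix + boundary)
    ordered = sorted(keys)
    # Drop the lowest key: the first region already starts at the empty key,
    # so a split there would only create an empty leading region.
    return ordered[1:]


if __name__ == "__main__":
    # Hypothetical layout: leading field in 1..3, second field split at "h" and "q".
    for key in make_split_keys([b"1", b"2", b"3"], [b"h", b"q"]):
        print(key)
```

With three prefixes and two sub-boundaries this yields 8 split keys, that is 9 regions, which could then be supplied when the table is created (for example through the `SPLITS` option of the HBase shell `create` command).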

On Mon, Sep 3, 2012 at 1:11 PM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:

> Initially your table will contain only one region.
>
> When it reaches its maximum size, it will split into 2 regions,
> which are going to be distributed over the cluster.
>
> The 2 regions are going to be ordered by key. So all entries starting
> with 1 will be in the first region. And the middle key (let's say
> 25......) will start the 2nd region.
>
> So region 1 will contain keys 1 to 24999, and the 2nd region will contain
> keys from 25 onward.
>
> And so on.
>
> Since keys are ordered, all keys starting with a 1 are going to be
> close together in the same region, except if the region is big enough to be
> split and served by more region servers.
>
> So when you load all your entries starting with 1, or 3, they
> will go to one single region. Only entries starting with 2 are going to
> be sometimes in region 1, sometimes in region 2.
>
> Of course, the more data you load, the more regions you will
> have, and the less hotspotting you will have. But at the beginning, it
> might be difficult for some of your servers.
>
>
> 2012/9/3, Eric Czech <[EMAIL PROTECTED]>:
> > With regards to:
> >
> >> If you have 3 region servers and your data is evenly distributed, that
> >> means all the data starting with a 1 will be on server 1, and so on.
> >
> > Assuming there are multiple regions in existence for each prefix, why
> > would they not be distributed across all the machines?
> >
> > In other words, if there are many regions with keys that generally
> > start with 1, why would they ALL be on server 1 like you said?  It's
> > my understanding that the regions aren't placed around the cluster
> > according to the range of information they contain so I'm not quite
> > following that explanation.
> >
> > Putting the higher cardinality values in front of the key isn't
> > entirely out of the question, but I'd like to use the low cardinality
> > key out front for the sake of selecting rows for MapReduce jobs.
> > Otherwise, I always have to scan the full table for each job.
> >
> > On Mon, Sep 3, 2012 at 3:20 PM, Jean-Marc Spaggiari
> > <[EMAIL PROTECTED]> wrote:
> >> Yes, you're right, but again, it will depend on the number of
> >> regionservers and the distribution of your data.
> >>
> >> If you have 3 region servers and your data is evenly distributed, that
> >> means all the data starting with a 1 will be on server 1, and so on.
> >>
> >> So if you write a million lines starting with a 1, they will all
> >> land on the same server.
> >>
> >> Of course, you can pre-split your table. Like 1a to 1z and assign each
> >> region to one of your 3 servers. That way you will avoid hotspotting
> >> even if you write millions of lines starting with a 1.
> >>
> >> If you have one hundred regions, you will face the same issue at the
> >> beginning, but the more data you add, the more your table will
> >> be split across all the servers and the less hotspotting you will have.
> >>
> >> Can't you just reverse your fields and put the 1 to 30 at the end of the
> >> key?
> >>
> >> 2012/9/3, Eric Czech <[EMAIL PROTECTED]>:
> >>> Thanks for the response Jean-Marc!
> >>>
> >>> I understand what you're saying but in a more extreme case, let's say
> >>> I'm choosing the leading number on the range 1 - 3 instead of 1 - 30.
> >>> In that case, it seems like all of the data for any one prefix would
> >>> already be split well across the cluster and as long as the second
> >>> value isn't written sequentially, there wouldn't be an issue.
> >>>
> >>> Is my reasoning there flawed at all?
> >>>
> >>> On Mon, Sep 3, 2012 at 2:31 PM, Jean-Marc Spaggiari
> >>> <[EMAIL PROTECTED]> wrote:
> >>>> Hi Eric,
> >>>>
> >>>> In HBase, data is stored sequentially, in the alphabetical order of
> >>>> the keys.
> >>>>
> >>>> It will depend on the number of regions and regionservers you have but
> >>>
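
The region layout described in the quoted messages (sorted keys, one split at a middle key such as 25) can be sketched with a small simulation. This is only an illustration under simplifying assumptions: `region_for` and `write_distribution` are hypothetical helper names, regions are modelled as half-open key ranges, and assignment of regions to servers is not modelled at all.

```python
from bisect import bisect_right
from collections import Counter


def region_for(key, split_keys):
    """Return the index of the region that would hold `key`.

    `split_keys` must be sorted. Region 0 covers everything below the
    first split key; region i starts at split_keys[i - 1] (inclusive).
    """
    return bisect_right(split_keys, key)


def write_distribution(keys, split_keys):
    """Count how many of `keys` would land in each region."""
    return Counter(region_for(key, split_keys) for key in keys)


if __name__ == "__main__":
    # Two regions split at the middle key "25", as in the quoted example.
    split_keys = [b"25"]
    batch = [b"1%04d" % i for i in range(1000)]  # every key starts with "1"
    print(write_distribution(batch, split_keys))
```

In this model every key starting with 1 falls in the first region, keys starting with 3 fall in the second, and only keys starting with 2 are divided between the two, which is the hotspotting pattern discussed above.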