HBase, mail # user - Key formats and very low cardinality leading fields


Re: Key formats and very low cardinality leading fields
Jean-Marc Spaggiari 2012-09-04, 17:22
Hi Eric,

Yes, you can split an existing region. You can do that easily from the
web interface. After the split, at some point, one of the 2 regions
will be moved to another server to balance the load. You can also
move it manually.
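
To illustrate why splitting an existing region helps with a hot prefix,
here is a minimal sketch in plain Python. It is a toy model only, not the
HBase API: the names Table, region_for and split are hypothetical, and it
assumes nothing beyond "regions cover sorted, contiguous key ranges". It
does not model region servers or the balancer.

```python
import bisect


class Table:
    """Toy model: a table whose regions cover sorted, contiguous key ranges."""

    def __init__(self, split_points=()):
        # Region i covers [start_keys[i], start_keys[i + 1]). The first region
        # starts at the empty key, so a new table has exactly one region.
        self.start_keys = [""] + sorted(split_points)

    def region_for(self, row_key):
        # The owning region is the last one whose start key is <= row_key.
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def split(self, split_key):
        # Splitting an existing region only adds one more boundary.
        if split_key in self.start_keys:
            raise ValueError("region boundary already exists")
        bisect.insort(self.start_keys, split_key)


table = Table()
keys = ["1aaa", "1mmm", "1zzz", "2abc", "3xyz"]

# With a single region, every key lands in region 0.
before = [table.region_for(k) for k in keys]

# Later, once a hot prefix is known, split inside it and after it.
table.split("1h")
table.split("1q")
table.split("2")
after = [table.region_for(k) for k in keys]

print(before)
print(after)
```

In this model the keys starting with 1 end up in three different regions
after the splits, so they can be served by different servers once the
regions are moved.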

JM

2012/9/4, Eric Czech <[EMAIL PROTECTED]>:
> Thanks again, both of you.
>
> I'll look at pre splitting the regions so that there isn't so much initial
> contention.  The issue I'll have though is that I won't know all the prefix
> values at first and will have to be able to add them later.
>
> Is it possible to split regions on an existing table?  Or is that
> inadvisable in favor of doing the splits when the table is created?
>
> On Mon, Sep 3, 2012 at 5:19 PM, Mohit Anchlia
> <[EMAIL PROTECTED]> wrote:
>
>> You can also look at pre-splitting the regions for time-series data.
>>
>> On Mon, Sep 3, 2012 at 1:11 PM, Jean-Marc Spaggiari <
>> [EMAIL PROTECTED]
>> > wrote:
>>
>> > Initially your table will contain only one region.
>> >
>> > When it reaches its maximum size, it will split into 2 regions,
>> > which are going to be distributed over the cluster.
>> >
>> > The 2 regions are going to be ordered by keys. So all entries starting
>> > with 1 will be on the first region, and the middle key (let's say
>> > 25......) will start the 2nd region.
>> >
>> > So region 1 will contain 1 to 24999, and the 2nd region will contain
>> > keys from 25 onward.
>> >
>> > And so on.
>> >
>> > Since keys are ordered, all keys starting with a 1 are going to be
>> > close together on the same region, except if the region is big enough
>> > to be split and served by more region servers.
>> >
>> > So when you load all your entries starting with 1, or 3, they
>> > will go to a single region. Only entries starting with 2 are going to
>> > be sometimes on region 1, sometimes on region 2.
>> >
>> > Of course, the more data you load, the more regions you will
>> > have, and the less hotspotting you will have. But at the beginning, it
>> > might be difficult for some of your servers.
>> >
>> >
>> > 2012/9/3, Eric Czech <[EMAIL PROTECTED]>:
>> > > With regard to:
>> > >
>> > >> If you have 3 region servers and your data is evenly distributed,
>> > >> that means all the data starting with a 1 will be on server 1, and
>> > >> so on.
>> > >
>> > > Assuming there are multiple regions in existence for each prefix, why
>> > > would they not be distributed across all the machines?
>> > >
>> > > In other words, if there are many regions with keys that generally
>> > > start with 1, why would they ALL be on server 1 like you said?  It's
>> > > my understanding that the regions aren't placed around the cluster
>> > > according to the range of information they contain so I'm not quite
>> > > following that explanation.
>> > >
>> > > Putting the higher cardinality values in front of the key isn't
>> > > entirely out of the question, but I'd like to use the low cardinality
>> > > key out front for the sake of selecting rows for MapReduce jobs.
>> > > Otherwise, I always have to scan the full table for each job.
>> > >
>> > > On Mon, Sep 3, 2012 at 3:20 PM, Jean-Marc Spaggiari
>> > > <[EMAIL PROTECTED]> wrote:
>> > >> Yes, you're right, but again, it will depend on the number of
>> > >> regionservers and the distribution of your data.
>> > >>
>> > >> If you have 3 region servers and your data is evenly distributed,
>> > >> that means all the data starting with a 1 will be on server 1, and
>> > >> so on.
>> > >>
>> > >> So if you write a million lines starting with a 1, they will all
>> > >> land on the same server.
>> > >>
>> > >> Of course, you can pre-split your table, for example from 1a to 1z,
>> > >> and assign each region to one of your 3 servers. That way you will
>> > >> avoid hotspotting even if you write millions of lines starting with
>> > >> a 1.
>> > >>
>> > >> If you have one hundred regions, you will face the same issue at the
>> > >> beginning, but the more data you add, the more your table will
>> > >> be split across all the servers and the less hotspotting you will