HBase user mailing list - Key formats and very low cardinality leading fields


Re: Key formats and very low cardinality leading fields
Hi Eric,

Yes, you can split an existing region. You can do that easily with the
web interface. After the split, at some point, one of the 2 regions
will be moved to another server to balance the load. You can also
move it manually.

JM
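
A minimal sketch of the same split-and-move done through the Java client
API instead of the web interface, assuming the 0.94-era HBaseAdmin; the
table name, split point, encoded region name, and destination server below
are made-up placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class SplitAndMoveExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // Ask the master to split the table's region at a chosen key.
      // "mytable" and the split point "25" are made-up placeholders.
      admin.split("mytable", "25");

      // Optionally move one of the resulting regions by hand instead of
      // waiting for the balancer. Both values are made-up placeholders:
      // the region's encoded name (the hex suffix shown in the web UI)
      // and the destination server as host,port,startcode.
      admin.move(Bytes.toBytes("f1e2d3c4b5a6978877665544332211ff"),
                 Bytes.toBytes("regionserver2.example.com,60020,1346700000000"));
    } finally {
      admin.close();
    }
  }
}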

2012/9/4, Eric Czech <[EMAIL PROTECTED]>:
> Thanks again, both of you.
>
> I'll look at pre-splitting the regions so that there isn't so much initial
> contention.  The issue I'll have, though, is that I won't know all the prefix
> values at first and will have to be able to add them later.
>
> Is it possible to split regions on an existing table?  Or is that
> inadvisable in favor of doing the splits when the table is created?
>
> On Mon, Sep 3, 2012 at 5:19 PM, Mohit Anchlia
> <[EMAIL PROTECTED]>wrote:
>
>> You can also look at pre-splitting the regions for timeseries type data.
>>
>> On Mon, Sep 3, 2012 at 1:11 PM, Jean-Marc Spaggiari <
>> [EMAIL PROTECTED]
>> > wrote:
>>
>> > Initially your table will contain only one region.
>> >
>> > When it reaches its maximum size, it will split into 2 regions
>> > which are going to be distributed over the cluster.
>> >
>> > The 2 regions are going to be ordered by keys. So all entries starting
>> > with 1 will be on the first region. And the middle key (let's say
>> > 25......) will start the 2nd region.
>> >
>> > So region 1 will contain 1 to 24999, and the 2nd region will contain
>> > keys from 25 onwards.
>> >
>> > And so on.
>> >
>> > Since keys are ordered, all keys starting with a 1 are going to be
>> > close by on the same region, except if the region is big enough to be
>> > split and served by more region servers.
>> >
>> > So when you load all your entries starting with 1, or 3, they
>> > will go to one unique region. Only entries starting with 2 are going to
>> > be sometimes on region 1, sometimes on region 2.
>> >
>> > Of course, the more data you load, the more regions you will
>> > have, and the less hotspotting you will have. But at the beginning, it
>> > might be difficult for some of your servers.
>> >
>> >
>> > 2012/9/3, Eric Czech <[EMAIL PROTECTED]>:
>> > > With regards to:
>> > >
>> > >> If you have 3 region servers and your data is evenly distributed, that
>> > >> means all the data starting with a 1 will be on server 1, and so on.
>> > >
>> > > Assuming there are multiple regions in existence for each prefix, why
>> > > would they not be distributed across all the machines?
>> > >
>> > > In other words, if there are many regions with keys that generally
>> > > start with 1, why would they ALL be on server 1 like you said?  It's
>> > > my understanding that the regions aren't placed around the cluster
>> > > according to the range of information they contain so I'm not quite
>> > > following that explanation.
>> > >
>> > > Putting the higher cardinality values in front of the key isn't
>> > > entirely out of the question, but I'd like to use the low cardinality
>> > > key out front for the sake of selecting rows for MapReduce jobs.
>> > > Otherwise, I always have to scan the full table for each job.
>> > >
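
As a rough illustration of the scan-selection point above: restricting a
MapReduce job to one leading prefix might look like this with the 0.94-era
TableMapReduceUtil. The table name "mytable" and the trivial row-counting
mapper are made-up for the example.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class PrefixScanJob {

  // Trivial mapper just to keep the example self-contained; it only counts rows.
  static class CountMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      context.getCounter("prefix", "rows").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "scan-prefix-1");
    job.setJarByClass(PrefixScanJob.class);

    // Restrict the scan to keys starting with "1": the range ["1", "2")
    // covers every such key and nothing else, so only the regions holding
    // that prefix are read instead of the full table.
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("1"));
    scan.setStopRow(Bytes.toBytes("2"));
    scan.setCaching(500);        // bigger scanner batches for MR
    scan.setCacheBlocks(false);  // don't churn the block cache

    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, CountMapper.class,
        NullWritable.class, NullWritable.class, job);

    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}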
>> > > On Mon, Sep 3, 2012 at 3:20 PM, Jean-Marc Spaggiari
>> > > <[EMAIL PROTECTED]> wrote:
>> > >> Yes, you're right, but again, it will depend on the number of
>> > >> regionservers and the distribution of your data.
>> > >>
>> > >> If you have 3 region servers and your data is evenly distributed, that
>> > >> means all the data starting with a 1 will be on server 1, and so on.
>> > >>
>> > >> So if you write a million lines starting with a 1, they will all
>> > >> land on the same server.
>> > >>
>> > >> Of course, you can pre-split your table, like 1a to 1z, and assign
>> > >> each region to one of your 3 servers. That way you will avoid
>> > >> hotspotting even if you write millions of lines starting with a 1.
>> > >>
>> > >> If you have one hundred regions, you will face the same issue at the
>> > >> beginning, but the more data you add, the more your table will
>> > >> be split across all the servers and the less hotspotting you will have.
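
A minimal sketch of the pre-splitting suggested above, creating the table
with one region per leading prefix via the 0.94-era Java API; the table
name, column family, and split keys are made-up examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      HTableDescriptor desc = new HTableDescriptor("mytable"); // made-up name
      desc.addFamily(new HColumnDescriptor("d"));              // made-up family

      // One split key per prefix boundary gives regions
      // [start,1), [1,2), [2,3), [3,end), so each leading prefix begins
      // in its own region and writes are spread out from the first insert.
      byte[][] splitKeys = new byte[][] {
          Bytes.toBytes("1"),
          Bytes.toBytes("2"),
          Bytes.toBytes("3"),
      };

      admin.createTable(desc, splitKeys);
    } finally {
      admin.close();
    }
  }
}

Prefixes that only become known later can still get their own regions by
splitting an existing region at the new prefix, as discussed at the top of
the thread.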