-
md5 hash key and splits
Mohit Anchlia 2012-08-29, 22:56
If I use an md5 hash + timestamp rowkey, would HBase automatically detect the difference in ranges and perform splits? How does splitting work in such cases, or is it still advisable to split the regions manually?
-
Re: md5 hash key and splits
Stack 2012-08-30, 04:19
On Wed, Aug 29, 2012 at 3:56 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> If I use md5 hash + timestamp rowkey would hbase automatically detect the
> difference in ranges and perform splits? How does split work in such cases
> or is it still advisable to manually split the regions.
Yes.
On how split works, when a region hits the maximum configured size, it splits in two.
Manual splitting can be useful when you know your key distribution, since it saves HBase the work of doing the splits for you. It can speed up bulk loads, for instance.
St.Ack
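The size-triggered behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not how HBase is implemented: it counts rows instead of bytes, `MAX_REGION_ROWS` is a hypothetical stand-in for the configured maximum region size, and the split point is simply the middle key of the region.

```python
import hashlib

MAX_REGION_ROWS = 4  # hypothetical threshold standing in for the configured maximum region size


def insert(regions, key):
    """Add key to the region that covers it; split that region in two if it grows too large."""
    # Each region is a sorted list of keys. The covering region is the last one
    # whose first key is <= key (the first region also covers everything below it).
    idx = 0
    for i, region in enumerate(regions):
        if region and region[0] <= key:
            idx = i
    region = regions[idx]
    region.append(key)
    region.sort()
    if len(region) > MAX_REGION_ROWS:
        mid = len(region) // 2
        # Replace the oversized region with its lower and upper halves.
        regions[idx:idx + 1] = [region[:mid], region[mid:]]


regions = [[]]  # a new table starts as a single empty region
for n in range(20):
    insert(regions, hashlib.md5(str(n).encode()).hexdigest())
```

After the loop, `regions` holds several small sorted lists whose concatenation is still in key order, which is the property a split has to preserve.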
-
Re: md5 hash key and splits
Mohit Anchlia 2012-08-30, 04:38
On Wed, Aug 29, 2012 at 9:19 PM, Stack <[EMAIL PROTECTED]> wrote:
> Yes.
>
> On how split works, when a region hits the maximum configured size, it
> splits in two.
>
> Manual splitting can be useful when you know your distribution and
> you'd save on hbase doing it for you. It can speed up bulk loads for
> instance.
What logic would you recommend to split the table into multiple regions when using md5 hash?
-
Re: md5 hash key and splits
Stack 2012-08-30, 05:50
On Wed, Aug 29, 2012 at 9:38 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> What logic would you recommend to split the table into multiple regions
> when using md5 hash?
It's hard to know ahead of time how well your inserts will spread over the md5 namespace. You could try sampling, or just let HBase take care of the splits for you. (Is there a problem w/ letting HBase do the splits?)
St.Ack
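One way to read the sampling suggestion above: take a sample of the row keys you expect to write, sort it, and pick split keys at evenly spaced quantiles. The sketch below assumes that interpretation; the function name `split_points_from_sample` and the made-up user ids are hypothetical, and the resulting keys would still have to be handed to HBase when the table is created.

```python
import hashlib


def split_points_from_sample(sample_keys, num_regions):
    """Return num_regions - 1 split keys taken at evenly spaced quantiles of the sample."""
    ordered = sorted(sample_keys)
    return [ordered[(i * len(ordered)) // num_regions] for i in range(1, num_regions)]


# Hypothetical sample: md5 hex digests of 1,000 made-up user ids.
sample = [hashlib.md5(f"user-{n}".encode()).hexdigest() for n in range(1000)]
splits = split_points_from_sample(sample, 4)
```

With a representative sample, each of the four key ranges bounded by `splits` should receive about a quarter of the rows; with an unrepresentative sample the split points are only as good as the sample.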
-
Re: md5 hash key and splits
Mohit Anchlia 2012-08-30, 14:35
On Wed, Aug 29, 2012 at 10:50 PM, Stack <[EMAIL PROTECTED]> wrote:
> It's hard to know how well your inserts will spread over the md5
> namespace ahead of time. You could try sampling or just let HBase
> take care of the splits for you (Is there a problem w/ your letting
> HBase do the splits?)
From what I've read, it's advisable to do manual splits since you are able to spread the load in a more predictable way. If I am missing something, please let me know.
-
Re: md5 hash key and splits
Stack 2012-08-30, 22:45
On Thu, Aug 30, 2012 at 7:35 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> From what I've read it's advisable to do manual splits since you are able
> to spread the load in a more predictable way. If I am missing something
> please let me know.
Where did you read that?
St.Ack
-
Re: md5 hash key and splits
Ian Varley 2012-08-30, 23:26
The Facebook devs have mentioned in public talks that they pre-split their tables and don't use automated region splitting. But as far as I remember, the reason for that isn't predictability of spreading load, so much as predictability of uptime & latency (they don't want an automated split to happen at a random busy time). Maybe that's what you mean, Mohit?
Ian
-
Re: md5 hash key and splits
Amandeep Khurana 2012-08-30, 23:30
Also, you might have read that an initial load of data can be better distributed across the cluster if the table is pre-split, rather than starting with a single region and splitting (possibly aggressively, depending on the throughput) as the data loads in. Once you are in a stable state with regions distributed across the cluster, there is really no benefit in terms of spreading load from managing splits manually vs. letting HBase do it for you. At that point it's about what Ian mentioned: predictability of latencies by avoiding splits happening at a busy time.
-
Re: md5 hash key and splits
Mohit Anchlia 2012-08-31, 00:04
In general, isn't it better to split the regions so that the load can be spread across the cluster to avoid hotspots? I read about pre-splitting here:
http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
-
Re: md5 hash key and splits
Stack 2012-08-31, 06:52
On Thu, Aug 30, 2012 at 5:04 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> In general isn't it better to split the regions so that the load can be
> spread across the cluster to avoid HotSpots?
Time series data is a particular case [1] and the sematextians have tools to help w/ that particular loading pattern. Is time series your loading pattern? If so, yes, you need to employ some smarts (tsdb schema and write tricks or hbasewd tool) to avoid hotspotting. But hotspotting is an issue apart from splits; you can split all you want and if your row keys are time series, splitting won't undo the hotspots.
You would split to distribute load over the cluster and HBase should be doing this for you w/o need of human intervention (caveat the reasons you might want to manually split as listed above by AK and Ian).
St.Ack
1. http://hbase.apache.org/book.html#rowkey.design
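The point that splitting does not cure time-series hotspotting can be checked with a small model. This is a sketch with made-up numbers: the split points and timestamps are hypothetical, `region_for` is an illustrative helper, and hashing the whole key is used here only for contrast (it is not the hbasewd or tsdb approach, and it gives up ordered scans over time).

```python
import bisect
import hashlib
from collections import Counter


def region_for(key, split_points):
    """Index of the region whose key range contains key, given sorted split points."""
    return bisect.bisect_right(split_points, key)


# Hypothetical table already split into 4 regions over timestamps written so far.
ts_splits = ["1346300000", "1346310000", "1346320000"]
# New writes always carry later timestamps than anything already stored.
new_ts_keys = [str(1346330000 + i) for i in range(1000)]
ts_load = Counter(region_for(k, ts_splits) for k in new_ts_keys)

# Same writes, keyed by the md5 hex digest, on a table split evenly over hex prefixes.
hash_splits = ["4", "8", "c"]
hashed_keys = [hashlib.md5(k.encode()).hexdigest() for k in new_ts_keys]
hash_load = Counter(region_for(k, hash_splits) for k in hashed_keys)
```

Here `ts_load` shows every new write landing in the last region no matter how many split points precede it, while `hash_load` spreads the same writes over all four regions.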
-
Re: md5 hash key and splits
Doug Meil 2012-08-31, 13:09
Stack, re: "Where did you read that?", I think he might also be referring to this:
http://hbase.apache.org/book.html#important_configurations
-
Re: md5 hash key and splits
Mohit Anchlia 2012-08-31, 14:55
On Thu, Aug 30, 2012 at 11:52 PM, Stack <[EMAIL PROTECTED]> wrote:
> Time series data is a particular case [1] and the sematextians have
> tools to help w/ that particular loading pattern. Is time series your
> loading pattern? If so, yes, you need to employ some smarts (tsdb
> schema and write tricks or hbasewd tool) to avoid hotspotting. But
> hotspotting is an issue apart from splits; you can split all you want
> and if your row keys are time series, splitting won't undo them.
My data is time series. To get a random distribution and still keep all the keys for a user in the same region, I am thinking of using md5(userid)+reversetimestamp as the row key. But with this type of key, how can one do pre-splits? I have 30 nodes.
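A minimal sketch of the md5(userid)+reversetimestamp key described above. The byte layout is an assumption, since the thread does not specify one: a 16-byte md5 digest followed by an 8-byte big-endian value, with the reverse timestamp taken as the maximum signed 64-bit value minus the timestamp in milliseconds. The helper name `row_key` is hypothetical.

```python
import hashlib
import struct

LONG_MAX = 2**63 - 1  # assumed convention: reverse timestamp = max signed 64-bit value - timestamp


def row_key(user_id: str, ts_millis: int) -> bytes:
    """16-byte md5 of the user id followed by an 8-byte big-endian reversed timestamp."""
    prefix = hashlib.md5(user_id.encode("utf-8")).digest()
    return prefix + struct.pack(">q", LONG_MAX - ts_millis)


older = row_key("user-42", 1346400000000)
newer = row_key("user-42", 1346400001000)
```

Both keys share the same 16-byte prefix, so a user's rows stay adjacent in key order, and `newer` sorts before `older`, so the most recent rows for a user come first in a scan.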
-
Re: md5 hash key and splits
Stack 2012-08-31, 15:32
On Fri, Aug 31, 2012 at 7:55 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> My data is timeseries and to get random distribution and still have the
> keys in the same region for a user I am thinking of using
> md5(userid)+reversetimestamp as a row key. But with this type of key how
> can one do pre-splits? I have 30 nodes.
If you don't know the key spread ahead of time, let HBase do the splitting for you?
St.Ack
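For the opposite case, where one is willing to assume the spread, evenly spaced split points over the leading bytes of the md5 prefix could be computed as in the sketch below. The assumptions are loud ones: many distinct user ids with comparable write volume, one region per node, and a 4-byte prefix being enough resolution. A few very heavy users would still skew load, which is the uncertainty the reply above points at. The function name `uniform_split_points` is hypothetical.

```python
def uniform_split_points(num_regions, prefix_bytes=4):
    """Return num_regions - 1 evenly spaced split keys over a prefix_bytes-wide key prefix."""
    space = 256 ** prefix_bytes
    return [
        ((i * space) // num_regions).to_bytes(prefix_bytes, "big")
        for i in range(1, num_regions)
    ]


# 30 regions, matching the 30 nodes mentioned in the thread (one region per node is an assumption).
splits = uniform_split_points(30)
```

The result is 29 four-byte keys in ascending order; any row key compares against them by its leading bytes, so the md5 prefix alone decides which region a user's rows fall into.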