-
Rowkey hashing to avoid hotspotting
AnandaVelMurugan Chandra ... 2012-07-16, 05:30
Hi,
I am using HBase to store data about mechanical components. Each component has a model no., a serial no., and some other attributes. I will be querying my data mostly by model no. and serial no., so I created a composite key from these two attributes and added a timestamp to make it unique. To filter the data, I use a row key filter with a regex string comparator, and it works well with sample seed data.

Now I am worried that this setup will lead to region server hotspotting when we load production data into HBase. I read that hashing may solve this problem. Can someone help me implement hashing of the row key? I also want the row filter to keep working, because I have to display the number of components on a web page and I use the row key filter to implement that.

Any guidance would be of great help.

--
Regards,
Anand
-
Re: Rowkey hashing to avoid hotspotting
AnandaVelMurugan Chandra ... 2012-07-17, 14:44
Hi Cristofer,
Thanks for the elaborate response!

I don't have much information about the production data, as I work with partial data. But based on discussions with my project partners, I have some answers for you.

The number of model numbers and serial numbers will be finite, and not very large. As far as I know, there is no predefined rule for creating model numbers or serial numbers.

I have two access patterns. First, I count the number of rows for a specific model number; I use a row key filter for this. Second, I filter rows based on model, serial number, and some other columns; I scan the table with a column value filter for this case.

I will evaluate salting as you have explained.

Regards,
Anand.C

On Tue, Jul 17, 2012 at 12:30 AM, Cristofer Weber <[EMAIL PROTECTED]> wrote:

> Hi Anand,
>
> As usual, the answer is that 'it depends' :)
>
> I think the main question here is: why are you afraid that this setup would lead to region server hotspotting? Is it because you don't know what your production data will look like?
>
> Based on what you said about your rowkey, you will query mostly by providing model no. + serial no., but:
> 1 - How is your rowkey distributed? Are there tons of different modelNumbers AND serialNumbers? Few modelNumbers and a lot of serialNumbers? Few of both?
> 2 - Putting modelNumber at the front of your rowkey means that your data will be sorted by modelNumber. So, what rule determines how a modelNumber is created? Is it a sequential number that increases over time? If so, are newer members accessed a lot more than older ones? If not, what drives this number? Is it an encoding rule?
> 3 - Do you expect more write/read load on a few of these modelNumbers and/or serialNumbers? Will it be similar to a Pareto distribution? Distributed over what?
>
> Also, two other things got my attention here...
> 1 - Why are you filtering with a regex? If your queries are over model no. + serial no., why don't you just scan starting at your modelNumber+serialNumber and stopping at the next modelNumber+serialNumber? Or is there another access pattern that doesn't fit your composite rowkey?
> 2 - Why do you have to add a timestamp to ensure uniqueness?
>
> Now, answering your question without more info about your data, you can apply a hash in two ways:
> 1 - Generating a hash (MD5 is the most common, as far as I have read) and using only this hash as your rowkey. Based on what you have said, this doesn't fit your needs, because you would no longer be able to apply your filter.
> 2 - Salting, by prefixing your current rowkey with a pinch of hash. Notice that the hash portion must be your rowkey prefix to ensure a reasonably balanced distribution over something (where something is your region servers). I'm working on a case that is a bit similar to yours, and what I'm doing right now is calculating the hashValue of my rowkey and using a Java Formatter to create a hex string to prepend to my rowkey. Something like String.format("%03x", hashValue)
>
> In both cases, you still have to split your regions in advance, and it is better to work out your splitting before you start feeding your table with production data.
>
> Also, you have to study the consequences that changing your rowkey will bring. It's not free.
>
> There are a lot of words and a lot of questions here, so by now I feel I have started shooting in the dark. Try to understand your production data, and if you have more to share, it will certainly help!
>
> Regards,
> Cristofer
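
For readers following along, the salting idea Cristofer describes (a short hash-derived hex prefix in front of the existing rowkey) might look roughly like the sketch below. It is only an illustration under assumptions: the class name, the bucket count of 16, the use of `String.hashCode()` as the hash, and the `model|serial|timestamp` key layout are all made up here and are not taken from anyone's actual code in this thread. It uses plain Java only, with no HBase classes.

```java
final class SaltedRowKey {
    // Hypothetical bucket count; in practice it would be chosen together
    // with the number of pre-split regions.
    static final int BUCKETS = 16;

    // Formats a bucket number as a fixed-width, 3-digit hex prefix,
    // in the spirit of String.format("%03x", hashValue) from the thread.
    static String prefix(int bucket) {
        return String.format("%03x", bucket);
    }

    // Maps the original row key to a bucket in [0, BUCKETS).
    // String.hashCode() is an assumption; floorMod keeps the result non-negative.
    static int bucket(String rowKey) {
        return Math.floorMod(rowKey.hashCode(), BUCKETS);
    }

    // The salted key is the hex prefix followed by the unchanged original key.
    static String salted(String rowKey) {
        return prefix(bucket(rowKey)) + rowKey;
    }

    public static void main(String[] args) {
        String original = "M100|S0001|1342416600000"; // made-up model|serial|timestamp
        System.out.println(salted(original));
    }
}
```

Because the salt here is derived from the whole key, rows for one model end up in different buckets. That spreads load, but a "count rows for this model" query then has to look in every bucket, which is part of the cost Cristofer warns about.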
-
Re: Rowkey hashing to avoid hotspotting
Alex Baranau 2012-07-17, 15:53
The most common reason for RS hotspotting when writing data to HBase is writing rows with monotonically increasing or decreasing row keys. E.g. if you put a timestamp in the first part of your key, you are likely to have monotonically increasing row keys. You can find more info about this issue and how to solve it here: [1], and you may also want to look at an already implemented salting solution [2].

As for RS hotspotting during reading, it is hard to predict without knowing what the most common data access patterns are. E.g. putting the model # in the first part of a key may seem to give a good distribution, but if your web site is used mostly by Mercedes owners, the majority of the read load may be directed to just a few regions. Again, salting can help a lot here.

+1 to what Cristofer said on other things, esp.: use partial key scans where possible instead of filters, and pre-split your table.

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

[1] http://bit.ly/HnKjbc
[2] https://github.com/sematext/HBaseWD
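
As a rough illustration of the "partial key scan instead of a filter" advice, the sketch below computes the start and stop rows that bound every key beginning with a given prefix. The class and method names and the `M100|` prefix are hypothetical; the code is plain Java and only produces the byte arrays one would hand to a scan as its start row (inclusive) and stop row (exclusive).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PrefixScanRange {
    // Start row: the prefix itself, since every matching key sorts at or after it.
    static byte[] startRow(String prefix) {
        return prefix.getBytes(StandardCharsets.UTF_8);
    }

    // Stop row: the smallest key that sorts after every key starting with the
    // prefix. Trailing 0xFF bytes are dropped, then the last remaining byte is
    // incremented by one.
    static byte[] stopRow(byte[] prefix) {
        for (int i = prefix.length - 1; i >= 0; i--) {
            if (prefix[i] != (byte) 0xFF) {
                byte[] stop = Arrays.copyOf(prefix, i + 1);
                stop[i]++;
                return stop;
            }
        }
        // Prefix is empty or all 0xFF: there is no upper bound, so return an
        // empty stop row (HBase treats an empty stop row as "scan to the end").
        return new byte[0];
    }

    public static void main(String[] args) {
        byte[] start = startRow("M100|"); // made-up model prefix with a '|' delimiter
        byte[] stop = stopRow(start);
        System.out.println(new String(start, StandardCharsets.UTF_8)
                + " .. " + new String(stop, StandardCharsets.UTF_8));
    }
}
```

With a range like this, counting the rows for one model reads only the rows in that range, whereas a regex row filter is evaluated against every row the scan passes over.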
-
Re: Rowkey hashing to avoid hotspotting
Michel Segel 2012-07-17, 16:44
Reading hot spotting?
Hmmm, there's a cache, and I don't see any real use cases where you would have it occur naturally.

Sent from a remote device. Please excuse any typos...

Mike Segel
-
Re: Rowkey hashing to avoid hotspotting
Alex Baranau 2012-07-17, 18:49
You might be right: when read load is concentrated on a single RS or a few of them, they will not act as dead as they do when hotspotting happens during writing. I think I was referring more to "uneven read load distribution" when I called it hotspotting while reading. Caches will help for sure, but that might not be enough. Having one or a few RSs sweating more than the others in a cluster is already not a very desirable situation.

Also, it may be that it's not a specific set of records within the regions on an RS (read as "data blocks") that is under load, but whole regions that for some reason hold more hot data (as in the example above: with keys prefixed with the model, several whole regions containing data for the same model may hold data that is frequently accessed). In this case HBase (depending on hardware) may not be able to fit all that data in cache on the one (or few) hot RS, as opposed to the situation where this hot data is distributed over many more RSs (which will act like a distributed cache), e.g. with salting.

In general, yes, you will not see issues as big with uneven *read* load distribution over the cluster as you might see with uneven *write* load distribution.

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
-
Re: Rowkey hashing to avoid hotspotting
AnandaVelMurugan Chandra ... 2012-07-18, 16:04
Hi Cristofer,
The data I store is test cell reports about a component. I have many test cell reports for each model number + serial number combination. So, to make the rowkey unique, I added a timestamp.

On Wed, Jul 18, 2012 at 3:14 AM, Cristofer Weber <[EMAIL PROTECTED]> wrote:

> So, Anand, there are some things that can help, but again, most of them are related to the famous access patterns.
>
> Sometimes it is not easy to get more information about them in advance, but if you are replacing another system you can study its data distribution, grouping for counts, mean, changes over time, etc. It is possible to analyze partial data too, but it is risky because you will be subject to the way this partial data was gathered; sample data may not be representative.
>
> Salting your rowkey with a hash calculated over your model# will probably result in a uniform distribution over a range (if using modulus), and pre-splitting your table will balance your load over your Region Servers. Also, you will be able to recalculate the hash for your model# before scanning for it, allowing a scan over a specific rowkey range while restricting this scan by startRow and stopRow. Remember that if your rowkeys share the same prefix they will probably be located in the same region, and your scan will benefit from this.
>
> I'm still curious about your need to add a timestamp after your model#,serial#... I have some background in manufacturing systems, and usually a serial number is unique. But, of course, it's just curiosity. :-)
>
> Regards,
> Cristofer
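
Cristofer's second suggestion (salt computed from the model# only, with a modulus, plus pre-splitting) can be sketched in plain Java as below. Everything specific is an assumption made for illustration: the class and method names, 8 buckets, `String.hashCode()` as the hash, and the `|` delimiter. The split points are just the bucket prefixes that a pre-split table could use as region boundaries; no HBase API is called.

```java
import java.util.ArrayList;
import java.util.List;

final class ModelSaltedKeys {
    // Hypothetical number of salt buckets (and of pre-split regions).
    static final int BUCKETS = 8;

    // Salt depends on the model number only, so it can be recomputed at
    // query time from the model alone.
    static String salt(String model) {
        return String.format("%03x", Math.floorMod(model.hashCode(), BUCKETS));
    }

    // Common prefix of every row for a model; usable as the scan's start row.
    static String scanStart(String model) {
        return salt(model) + "|" + model + "|";
    }

    // Full row key: salt | model | serial | timestamp.
    static String rowKey(String model, String serial, long timestampMillis) {
        return scanStart(model) + serial + "|" + timestampMillis;
    }

    // BUCKETS - 1 boundaries ("001", "002", ...) that separate the buckets.
    static List<String> splitPoints() {
        List<String> points = new ArrayList<>();
        for (int b = 1; b < BUCKETS; b++) {
            points.add(String.format("%03x", b));
        }
        return points;
    }

    public static void main(String[] args) {
        System.out.println(rowKey("M100", "S0001", 1342416600000L)); // made-up values
        System.out.println(splitPoints());
    }
}
```

The trade-off in this variant is that all rows of one model stay together in one bucket, so counting a model is a single range scan, but a very popular model (the "Mercedes owners" case mentioned earlier in the thread) would still concentrate its load on one region.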
-
Re: Rowkey hashing to avoid hotspotting
AnandaVelMurugan Chandra ... 2012-07-19, 15:08
Hi Cristofer,
No problem... I am happy to share and learn. :)

Regarding a timestamp-based column family, I haven't thought about it. My only concern is the number of column families. I read somewhere that HBase is not good at handling more than 100 column families.

On Wed, Jul 18, 2012 at 11:15 PM, Cristofer Weber <[EMAIL PROTECTED]> wrote:

> Hi Anand!
>
> I see... sorry for being so curious, but since I started studying HBase I have been curious about how people are modeling their tables, and in what kinds of systems HBase is in use.
>
> Have you evaluated recording your reports in a distinct CF using timestamps as column qualifiers? It's my curiosity asking again!
>
> Thanks for sharing!
>
> Regards,
> Cristofer
-
Re: Rowkey hashing to avoid hotspotting
Alex Baranau 2012-07-19, 15:22
> I read somewhere that HBase is not
> good at handling more than 100 column families

Heh. Usually it is not good to have more than two or three, actually. See [1], and maybe also [2].

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

[1] http://hbase.apache.org/book/number.of.cfs.html
[2] http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know
-
Re: Rowkey hashing to avoid hotspotting
syed kather 2012-07-19, 16:52
Anand,

I had a case where I combined four fields into one row key. The serial number can be the first part of the rowkey and the model number the second part, so that a B-Search on the row key is faster, because it has fewer jumps to make.

Note: if the serial number changes frequently, use the serial number as the first part.

To solve the hotspotting problem I have just started implementing
http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/

In my case I had 20 million rows in my HBase table, and I had the same problem while reading in MapReduce.

Thanks and Regards,
S SYED ABDUL KATHER

On Thu, Jul 19, 2012 at 8:52 PM, Alex Baranau <[EMAIL PROTECTED]> wrote:

> > I read somewhere that HBase is not
> > good at handling more than 100 column families
>
> Heh. Usually it is not good to have more than two or three, actually.
> See [1], and maybe also [2].
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
> [1] http://hbase.apache.org/book/number.of.cfs.html
> [2] http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know
>
> On Thu, Jul 19, 2012 at 11:08 AM, AnandaVelMurugan Chandra Mohan <[EMAIL PROTECTED]> wrote:
>
> > Hi Cristofer,
> >
> > No problem... I am happy to share and learn.. :)
> >
> > Regarding a timestamp-based column family, I haven't thought about it.
> > But my only concern is the number of column families. I read somewhere
> > that HBase is not good at handling more than 100 column families.
> >
> > On Wed, Jul 18, 2012 at 11:15 PM, Cristofer Weber <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Anand!
> > >
> > > I see... sorry for being so curious, but since I started studying
> > > HBase I am curious about how people are modeling their tables, and in
> > > what kinds of systems HBase is in use.
> > >
> > > Have you evaluated recording your reports in a distinct CF using
> > > timestamps as column qualifiers? It's my curiosity asking again!
> > >
> > > Thanks for sharing!
> > >
> > > Regards,
> > > Cristofer
> > >
> > > -----Original Message-----
> > > From: AnandaVelMurugan Chandra Mohan [mailto:[EMAIL PROTECTED]]
> > > Sent: Wednesday, July 18, 2012 13:04
> > > To: [EMAIL PROTECTED]
> > > Subject: Re: Rowkey hashing to avoid hotspotting
> > >
> > > Hi Cristofer,
> > >
> > > The data I store is test cell reports about a component. I have many
> > > test cell reports for each model number + serial number combination.
> > > So, to make the rowkey unique, I added a timestamp.
> > >
> > > On Wed, Jul 18, 2012 at 3:14 AM, Cristofer Weber <[EMAIL PROTECTED]> wrote:
> > >
> > > > So, Anand, there are some things that can help, but again, most of
> > > > them are related to the famous access patterns.
> > > >
> > > > Sometimes it is not easy to get more information about them in
> > > > advance, but if you are replacing another system you can study its
> > > > data distribution, grouping for counts, mean, changes over time,
> > > > etc. It is possible to analyze with partial data too, but it is
> > > > risky because you will be subject to the way this partial data was
> > > > gathered; sample data may not be representative.
> > > >
> > > > Salting your rowkey with a hash calculated over your model# will
> > > > probably result in a uniform distribution over a range (if using
> > > > modulus), and pre-splitting your table will balance your load over
> > > > your Region Servers. Also, you will be able to recalculate your
> > > > hash for your model# before scanning for it, allowing for a scan
> > > > over a specific rowkey while restricting this scan by startRow and
> > > > stopRow. Remember that if your rowkeys share the same prefix they
> > > > will probably be located in the same region and your scan will
> > > > benefit from this.
> > > >
> > > > I'm still curious about your need of adding a timestamp after your
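
For readers following along, the salting idea quoted above (a hash of the model number, taken modulo a bucket count, used as a rowkey prefix and recomputed before a scan) might look roughly like the sketch below. It uses only the Python standard library and does not call any HBase client. The bucket count, the `|` delimiter, the key layout, and all function names are assumptions made for illustration; none of them come from the thread, and the layout only works if model numbers never contain the delimiter.

```python
import zlib
from typing import List, Tuple

NUM_BUCKETS = 16  # assumed bucket count; would normally match the pre-split regions


def salt_for(model_no: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Stable bucket for a model number: hash it, then take the modulus."""
    return zlib.crc32(model_no.encode("utf-8")) % num_buckets


def make_rowkey(model_no: str, serial_no: str, timestamp_ms: int) -> bytes:
    """Hypothetical layout: <2-digit salt>|<model>|<serial>|<timestamp>."""
    salt = salt_for(model_no)
    return f"{salt:02d}|{model_no}|{serial_no}|{timestamp_ms:013d}".encode("utf-8")


def scan_range_for_model(model_no: str) -> Tuple[bytes, bytes]:
    """startRow (inclusive) and stopRow (exclusive) covering one model number."""
    prefix = f"{salt_for(model_no):02d}|{model_no}|".encode("utf-8")
    # Bumping the last byte of the prefix gives the smallest key that sorts
    # after every key starting with that prefix.
    stop = prefix[:-1] + bytes([prefix[-1] + 1])
    return prefix, stop


def split_points(num_buckets: int = NUM_BUCKETS) -> List[bytes]:
    """Region boundaries for pre-splitting: one region per salt value."""
    return [f"{b:02d}".encode("utf-8") for b in range(1, num_buckets)]
```

Because the salt depends only on the model number, all rows for one model share a prefix, so counting rows per model can stay a single range scan between the two returned keys instead of a regex filter over the whole table. The trade-off is that one very busy model number still lands in one bucket.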
-
Re: Rowkey hashing to avoid hotspotting
AnandaVelMurugan Chandra ... 2012-07-20, 01:41
Thanks a lot, guys! I will evaluate and implement a solution based on your suggestions.

Regards,
Anand
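
As a footnote to the bucket-prefix approach linked earlier in the thread for sequential keys: when the prefix is not derived from the model number alone, rows for one logical key range are spread over every bucket, so a read becomes one scan per bucket followed by a merge. The sketch below illustrates only that general idea in plain Python. It is not the HBaseWD API; the one-byte prefix, the bucket count, and the function names are assumptions.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

NUM_BUCKETS = 4  # assumed bucket count


def bucket_ranges(start: bytes, stop: bytes,
                  num_buckets: int = NUM_BUCKETS) -> List[Tuple[bytes, bytes]]:
    """One (startRow, stopRow) pair per bucket for a logical key range."""
    return [(bytes([b]) + start, bytes([b]) + stop) for b in range(num_buckets)]


def merged_scan(per_bucket_rows: Iterable[Iterable[bytes]]) -> Iterator[bytes]:
    """Strip the one-byte prefix and merge bucket results into original key order.

    Each inner iterable stands in for the sorted output of one per-bucket scan.
    """
    stripped = [[row[1:] for row in rows] for rows in per_bucket_rows]
    return heapq.merge(*stripped)
```

This spreads sequential writes more evenly than a model-based salt, at the cost of one scan per bucket for every range read, which matters for the row-count use case described at the start of the thread.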