-
how to randomize the primary key which is a timestamp
Weishung Chung 2011-01-10, 15:33
What is a good way to randomize a primary key that is a timestamp in HBase, to avoid hotspotting? Thank you so much :)
-
Re: how to randomize the primary key which is a timestamp
Friso van Vollenhoven 2011-01-10, 15:50
Once the data is stored, how do you plan on querying it? If you want to scan for certain periods of time, having the order of timestamps randomized is not ideal.
If you plan to do only exact lookups for individual timestamps (which might be the case), I guess you can reverse the byte order of the timestamp, provided the granularity of the timestamps is fine enough.
Friso
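As a rough illustration of the byte-reversal idea, here is a minimal sketch in plain Python. It assumes the timestamp is a millisecond value stored as 8 big-endian bytes; the function names and the sample value are hypothetical, and nothing here calls HBase itself.

```python
import struct

def reversed_timestamp_key(ts_millis: int) -> bytes:
    """Encode the timestamp as 8 big-endian bytes, then reverse the byte order."""
    return struct.pack(">Q", ts_millis)[::-1]

def timestamp_from_key(key: bytes) -> int:
    """Undo the reversal to recover the original timestamp."""
    return struct.unpack(">Q", key[::-1])[0]

base = 1294673580000  # an arbitrary millisecond timestamp (assumption)
keys = [reversed_timestamp_key(base + i) for i in range(4)]
```

Consecutive timestamps differ in their fastest-changing byte, which becomes the leading byte of the key after reversal, so neighbouring writes no longer sort next to each other. The same property is why range scans over time stop being practical with this layout.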
-
Re: how to randomize the primary key which is a timestamp
Chirstopher Tarnas 2011-01-10, 16:05
Some options that I am aware of:
reverse the byte order of the timestamp
use UUIDs rather than a timestamp
use hashing; whether this works really depends on your requirements
-
Re: how to randomize the primary key which is a timestamp
Matt Corgan 2011-01-10, 16:08
You can also add a random (or hashed) prefix to the beginning of the key. If your prefix were one byte with values 0-63, you've divided the hot spot into 64 smaller ones, which is better for writing. The downside is that if you want to read a range of values, you will have to query all 64 "shards" and merge the sorted values. You can choose whatever prefix size is best for your scenario.
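A hashed one-byte prefix along these lines could be sketched as follows. This is only an illustration in plain Python: the use of MD5, the 8-byte big-endian timestamp encoding, and the names `NUM_BUCKETS` and `salted_key` are assumptions, not something the thread specifies.

```python
import hashlib
import struct
from collections import Counter

NUM_BUCKETS = 64  # one-byte prefix with values 0-63

def salted_key(ts_millis: int) -> bytes:
    """Prepend a hash-derived bucket byte to the big-endian timestamp."""
    ts_bytes = struct.pack(">Q", ts_millis)
    # The prefix is derived from the timestamp, so a reader can recompute it.
    bucket = hashlib.md5(ts_bytes).digest()[0] % NUM_BUCKETS
    return bytes([bucket]) + ts_bytes

base = 1294673580000  # an arbitrary millisecond timestamp (assumption)
# Count how many of 6400 consecutive timestamps land in each bucket.
counts = Counter(salted_key(base + i)[0] for i in range(6400))
```

A purely random prefix would spread writes just as well, but a reader could not recompute it, so even a single-row lookup would have to probe every bucket.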
-
Re: how to randomize the primary key which is a timestamp
Weishung Chung 2011-01-10, 16:20
Thank you for the replies. Most of the queries (70%) will scan a range of consecutive times, with some single-timestamp queries (30%). But there are multiple tables with the same range of timestamps. Will these same ranges of timestamps from multiple tables be stored on the same region server, and if so, could that affect the performance of MapReduce jobs (operating on those tables over the same time periods)? Would hotspotting defeat the purpose of MapReduce?
-
Re: how to randomize the primary key which is a timestamp
Ted Dunning 2011-01-10, 16:30
If multiple tables have the same key distribution and count, then they will have similar split points for their regions, but the locations of the regions will be randomized.
I wouldn't worry about this until you see evidence it is a problem.
-
Re: how to randomize the primary key which is a timestamp
Matt Corgan 2011-01-10, 16:41
You can put them all in the same table. If you prefix the keys when written, use a prefix filter when querying. I would choose a prefix window that's about 4 times the number of nodes.
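The sizing rule of thumb can be made concrete with a small sketch. It assumes a one-byte prefix (so at most 256 buckets) and a hypothetical 16-node cluster; `matches_prefix` is a plain-Python stand-in for the check a prefix filter performs, not an HBase call.

```python
def bucket_count(num_nodes: int, per_node: int = 4) -> int:
    """About 4 prefixes per node, following the rule of thumb in the post above."""
    n = num_nodes * per_node
    if not 1 <= n <= 256:
        raise ValueError("a one-byte prefix supports 1 to 256 buckets")
    return n

def prefixes(num_nodes: int) -> list:
    """All one-byte prefixes a full query would have to cover."""
    return [bytes([b]) for b in range(bucket_count(num_nodes))]

def matches_prefix(row_key: bytes, prefix: bytes) -> bool:
    """Stand-in for a prefix filter: keep only rows starting with the prefix."""
    return row_key.startswith(prefix)

cluster_prefixes = prefixes(16)  # hypothetical 16-node cluster gives 64 prefixes
```

Having several buckets per node leaves room for the cluster to grow without immediately re-keying the data, at the cost of more sub-queries per scan.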
-
Re: how to randomize the primary key which is a timestamp
Weishung Chung 2011-01-10, 16:56
Thank you for your prompt response. I am a bit confused about the prefix. If I were to use a prefix for the timestamp key, how should I specify the row key to search for at query time? How do I know which prefix was used for a certain timestamp and needs to be prepended to the timestamp for querying?
-
Re: how to randomize the primary key which is a timestamp
Matt Corgan 2011-01-10, 17:04
You could have prefix = timestamp % 64. Then for a single key lookup, you could calculate the prefix and query just one shard. For a scan, you have to query all shards and merge the results.
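To make the lookup and scan paths concrete, here is a minimal sketch that simulates the table with a sorted Python list instead of talking to HBase. The key layout (one prefix byte followed by an 8-byte big-endian timestamp) and all names are assumptions made for illustration.

```python
import heapq
import struct
from bisect import bisect_left

NUM_SHARDS = 64

def row_key(ts: int) -> bytes:
    """prefix = timestamp % 64, followed by the big-endian timestamp."""
    return bytes([ts % NUM_SHARDS]) + struct.pack(">Q", ts)

def key_timestamp(key: bytes) -> int:
    """Strip the prefix byte and decode the timestamp."""
    return struct.unpack(">Q", key[1:])[0]

# Stand-in for a table: row keys kept in sorted byte order.
table = sorted(row_key(ts) for ts in range(1000, 1200))

def get(ts: int) -> bool:
    """Point lookup: recompute the prefix and probe exactly one key."""
    key = row_key(ts)
    i = bisect_left(table, key)
    return i < len(table) and table[i] == key

def scan(start: int, stop: int) -> list:
    """Range scan over [start, stop): one sub-scan per shard, merged by time."""
    per_shard = []
    for shard in range(NUM_SHARDS):
        prefix = bytes([shard])
        lo = bisect_left(table, prefix + struct.pack(">Q", start))
        hi = bisect_left(table, prefix + struct.pack(">Q", stop))
        per_shard.append(table[lo:hi])
    merged = heapq.merge(*per_shard, key=key_timestamp)
    return [key_timestamp(k) for k in merged]
```

Because each shard's slice is already sorted by timestamp, a k-way merge is enough to restore global time order. The cost is 64 sub-scans per range query, which matters when most of the workload is range scans.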
-
Re: how to randomize the primary key which is a timestamp
Weishung Chung 2011-01-10, 17:42
Thanks a lot, this will get me started :D
-
Re: how to randomize the primary key which is a timestamp
Tost 2011-01-11, 00:18
How about the SecureRandom class? You can generate the key from a seed. See http://download.oracle.com/javase/6/docs/api/java/security/SecureRandom.html