|
Jean-Marc Spaggiari
2012-06-13, 16:16
Otis Gospodnetic
2012-06-14, 06:06
Jean-Marc Spaggiari
2012-06-14, 10:39
Michael Segel
2012-06-14, 11:55
Jean-Marc Spaggiari
2012-06-14, 12:22
Michael Segel
2012-06-14, 18:14
Jean-Marc Spaggiari
2012-06-14, 18:47
Michael Segel
2012-06-14, 19:46
Michael Segel
2012-06-15, 14:21
Jean-Marc Spaggiari
2012-06-16, 11:22
Michel Segel
2012-06-16, 14:35
Jean-Marc Spaggiari
2012-06-16, 14:42
Michael Segel
2012-06-16, 16:33
Rob Verkuylen
2012-06-16, 19:10
Jean-Marc Spaggiari
2012-06-21, 11:43
Michael Segel
2012-06-21, 14:20
Jean-Marc Spaggiari
2012-06-22, 19:43
Jean-Marc Spaggiari
2012-06-23, 02:20
Jean-Daniel Cryans
2012-06-26, 17:50
Jean-Marc Spaggiari
2012-06-26, 17:56
Jean-Daniel Cryans
2012-06-26, 18:12
Michael Segel
2012-06-26, 19:01
Jean-Marc Spaggiari
2012-06-26, 19:04
Doug Meil
2012-06-14, 21:18
|
-
Timestamp as a key good practice? Jean-Marc Spaggiari 2012-06-13, 16:16
I watched Lars George's video about HBase and read the documentation, and both say that it's not a good idea to use the timestamp as a key, because writes will always load the same region until the timestamp reaches a certain value and moves to the next region (hotspotting).

I have a table with a unique key, a file path and a "last update" field. I can easily find a file by its ID and see when it was last updated.

But I also need to find the files that have not been updated for more than a certain period of time.

If I want to retrieve that from this single table, I will have to do a full scan of the table, which might take a while.

So I thought of building a second table to reference that (a kind of secondary index). The key is the "last update", with one column family, and each column holds the ID of a file with dummy content.

When a file is updated, I remove its cell from this table and add a new cell with the new timestamp as the key. And so on.

With this schema, I can find files by ID very quickly, and I can find the files which need to be updated pretty quickly too. But it's hotspotting one region.

From the video (0:45:10) I can see 4 situations:
1) Hotspotting.
2) Salting.
3) Key field swap/promotion.
4) Randomization.

I need to avoid hotspotting, so I looked at the 3 other options.

I can do salting, for example by prefixing the timestamp with a number between 0 and 9. That will distribute the load over 10 servers. To find all the files with a timestamp below a specific value, I will need to run 10 requests instead of one. But when the load becomes too big for 10 servers, will I have to prefix with a value between 0 and 99? Which means 100 requests? And the more regions I have, the more requests I will have to make. Is that really a good approach?

Key field swap is close to salting. I can add the first few bytes of the path before the timestamp, but the issue will remain the same.

I looked at randomization, and I can't do that. Otherwise I will have no way to retrieve the information I'm looking for.

So the question is: is there a good way to store the data so that it can be retrieved based on the date?

Thanks,

JM
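The salting trade-off described in this post can be illustrated with a small sketch. It is not HBase client code, only byte-level key construction. The names `salted_key`, `scan_ranges` and `NUM_BUCKETS` are hypothetical, and deriving the salt from the file ID is an assumption made here for illustration (the post does not say how the prefix would be chosen).

```python
import struct

NUM_BUCKETS = 10  # assumed bucket count; the post suggests prefixes 0-9


def salted_key(timestamp_ms: int, file_id: bytes, buckets: int = NUM_BUCKETS) -> bytes:
    """Index row key: 1 salt byte followed by an 8-byte big-endian timestamp.

    The file ID itself would be stored as a column in that row.
    """
    # Assumption: derive the salt from the file ID so a given file
    # always lands in the same bucket.
    salt = sum(file_id) % buckets
    return bytes([salt]) + struct.pack(">Q", timestamp_ms)


def scan_ranges(cutoff_ms: int, buckets: int = NUM_BUCKETS):
    """One (start_row, stop_row) pair per bucket for timestamps < cutoff_ms."""
    return [
        (bytes([b]) + struct.pack(">Q", 0), bytes([b]) + struct.pack(">Q", cutoff_ms))
        for b in range(buckets)
    ]
```

Writes are spread over `NUM_BUCKETS` key ranges, but a "everything older than X" query needs one range scan per bucket, which is exactly the cost the post is worried about when the bucket count grows.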
-
Re: Timestamp as a key good practice? Otis Gospodnetic 2012-06-14, 06:06
JM, have a look at https://github.com/sematext/HBaseWD (this comes up often.... Doug, maybe you could add it to the Ref Guide?)

Otis
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm
-
Re: Timestamp as a key good practice? Jean-Marc Spaggiari 2012-06-14, 10:39
Wow! This is exactly what I was looking for. So I will read all of that now.

Need to read here at the bottom: https://github.com/sematext/HBaseWD
and here: http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/

Thanks,

JM
-
Re: Timestamp as a key good practice? Michael Segel 2012-06-14, 11:55
Actually I think you should revisit your key design...

Look at your access path to the data for each of the types of queries you are going to run. From your post:

"I have a table with a uniq key, a file path and a "last update" field. I can easily find back the file with the ID and find when it has been updated.

But what I need too is to find the files not updated for more than a certain period of time."

So your primary query is going to be against the key. Not sure if you meant to say that your key was a composite key or not... It sounds like your key is just the unique key and the rest are columns in the table.

The secondary query, or path to the data, is to find data where the files were not updated for more than a period of time.

If you make your key temporal, that is, if you add time as a component of your key, you will end up creating new rows of data while the old row still exists. Not a good side effect.

The other nasty side effect of using time as your key is that you not only have the potential for hotspotting, but you also create splits that will never grow.

How often are you going to ask to see the files that were not updated in the last couple of days/minutes? If it's infrequent, then you really shouldn't care if you have to do a complete table scan.
-
Re: Timestamp as a key good practice? Jean-Marc Spaggiari 2012-06-14, 12:22
Hi Michael,

Thanks for your feedback. Here are more details to describe what I'm trying to achieve.

My goal is to store information about files in the database. I need to find the oldest files in the database to refresh their information.

The key is an 8-byte ID of the server name in the network hosting the file + the MD5 of the file path. The total is a 24-byte key.

So each time I look at a file and gather its information, I update its row in the database based on the key, including a "last_update" field. I can calculate this key for any file on the drives.

In order to know which files I need to check in the network, I need to scan the table by the "last_update" field. So the idea is to build another table which contains the last_update as the key and the file IDs in columns. (This is where the hotspotting occurs.)

Each time I work on a file, I will have to update the main table by ID, remove the cell from the second table (the index), and put it back with the new "last_update" key.

I'm mainly doing 3 operations in the database:
1) I retrieve a list of 500 files which need to be updated.
2) I update the information for those 500 files (bulk update).
3) I load new file references to be checked.

For 2 and 3, I use the main table with the file ID as the key. The distribution is almost perfect because I'm using a hash. The prefix is the server ID, but it's not always going to the same server since it's done by last_update. But this allows quick access to the list of files from one server.
For 1, I expect to build this second table with "last_update" as the key.

Regarding the frequency, it really depends on the activity on the network, but it should be "often". The faster the database updates are, the more up to date I will be able to keep it.

JM
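The 24-byte key described above (8-byte server ID + 16-byte MD5 digest of the path) could be built roughly as follows. This is a minimal sketch: `file_row_key` is a hypothetical helper, and how the 8-byte server ID is derived from the server name is not stated in the thread, so it is simply taken as an input here.

```python
import hashlib


def file_row_key(server_id: bytes, path: str) -> bytes:
    """Return the 8-byte server ID followed by the 16-byte MD5 digest of the path."""
    if len(server_id) != 8:
        raise ValueError("server_id must be exactly 8 bytes")
    # MD5 always yields a 16-byte digest, so the result is 24 bytes long.
    return server_id + hashlib.md5(path.encode("utf-8")).digest()
```

Because the server ID is the prefix, all rows for one server are contiguous, which matches the "quick access to the list of files from one server" remark, while the hashed suffix spreads files evenly within that prefix.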
-
Re: Timestamp as a key good practice? Michael Segel 2012-06-14, 18:14
Jean-Marc,

You do realize that this really isn't a good use case for HBase, assuming that what you are describing is a standalone system. It would be easier and better if you just used a simple relational database.

Then you would have your table with an ID and a secondary index on the timestamp. Retrieve the data in ascending order by timestamp and take the top 500 off the list.

If you insist on using HBase, yes, you will have to have a secondary table. Then, using coprocessors: when you update the row in your base table, you then get() the row in your index by timestamp and remove the column for that row ID. Add the new column to the timestamp row, as you put it.

Now you can just do a partial scan on your index. Because your index table is so small, you shouldn't worry about hotspots. You may just want to rebuild your index every so often...

HTH

-Mike
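The index maintenance described in this message (update the base row, remove the file's column from the old timestamp row, add it to the new one) can be sketched with plain dictionaries standing in for the two tables. This is an illustration of the bookkeeping only, not HBase or coprocessor code. The names `main_table`, `index_table` and `update_file` are hypothetical.

```python
main_table = {}   # file_id -> {"path": ..., "last_update": ...}
index_table = {}  # last_update -> {file_id: b""}  (dummy cell content)


def update_file(file_id, path, new_ts):
    """Update the base row, then move the file's cell in the index."""
    old = main_table.get(file_id)
    if old is not None:
        old_ts = old["last_update"]
        row = index_table.get(old_ts, {})
        row.pop(file_id, None)            # remove the column for that row ID
        if not row:
            index_table.pop(old_ts, None)  # drop index rows left empty
    main_table[file_id] = {"path": path, "last_update": new_ts}
    index_table.setdefault(new_ts, {})[file_id] = b""
```

In a real deployment the two writes are not atomic, which is one reason the message suggests rebuilding the index every so often.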
-
Re: Timestamp as a key good practice? Jean-Marc Spaggiari 2012-06-14, 18:47
Hi Michael,

For now this is more a proof of concept than a production application. If it works, it should grow a lot, and the database will easily end up with over 1B rows. Each individual server will have to send its own information to one centralized server, which will insert it into a database. That's why it needs to be very quick, and that's why I'm looking in HBase's direction. I tried some relational databases with 4M rows in the table, but the insert time is too slow when I have to insert entries in bulk. Also, HBase's ability to store only the cells that have values will save a lot of disk space (future projects).

I'm not yet used to HBase and there are still many things I need to understand, but until I'm able to create a solution and test it, I will continue to read, learn and try that way. Then at the end I will be able to compare the 2 options I have (HBase or relational) and decide based on the results.

So yes, your reply helped, because it gives me a way to achieve this goal (using coprocessors). I don't know yet how this part works, so I will dig into the documentation for it.

Thanks,

JM
-
Re: Timestamp as a key good practice? Michael Segel 2012-06-14, 19:46
Ok...

Makes sense. You don't need to worry about coprocessors in your initial PoC. They just make it easier, instead of relying on the application to manage all of the database updates.

A billion rows shouldn't be a problem for an RDBMS, but that's a different issue.

To start with, you update the base table; your app then deletes the old column in the index and inserts the column value at the new timestamp.

Note the following: you may want to simplify the timestamp by rounding up to the nearest second rather than going down to the ms. This would give you more columns per row.
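The rounding suggestion can be sketched as below. `round_up_to_second` is a hypothetical helper, and it assumes the input is an integer count of milliseconds since the epoch. Every update that falls within the same second then maps to the same index row key.

```python
def round_up_to_second(ts_ms: int) -> int:
    """Round a millisecond timestamp up to the next whole second.

    The result is expressed in seconds. Exact multiples of 1000 are unchanged.
    """
    # Ceiling division using integer arithmetic only.
    return -(-ts_ms // 1000)
```

With coarser keys the index has fewer rows and more columns per row; rounding to the minute would follow the same pattern with a divisor of 60000.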
-
Re: Timestamp as a key good practice?Michael Segel 2012-06-15, 14:21
Thought about this a little bit more...
You will want two tables for a solution. 1 Table is Key: Unique ID Column: FilePath Value: Full Path to file Column: Last Update time Value: timestamp 2 Table is Key: Last Update time (The timestamp) Column 1-N: Unique ID Value: Full Path to the file Now if you want to get fancy, in Table 1, you could use the time stamp on the column File Path to hold the last update time. But its probably easier for you to start by keeping the data as a separate column and ignore the Timestamps on the columns for now. Note the following: 1) I used the notation Column 1-N to reflect that for a given timestamp you may or may not have multiple files that were updated. (You weren't specific as to the scale) This is a good example of HBase's column oriented approach where you may or may not have a column. It doesn't matter. :-) You could also modify the timestamp to be to the second or minute and have more entries per row. It doesn't matter. You insert based on timestamp:columnName, value, so you will add a column to this table. 2) First prove that the logic works. You insert/update table 1 to capture the ID of the file and its last update time. You then delete the old timestamp entry in table 2, then insert new entry in table 2. 3) You store Table 2 in ascending order. Then when you want to find your last 500 entries, you do a start scan at 0x000 and then limit the scan to 500 rows. Note that you may or may not have multiple entries so as you walk through the result set, you count the number of columns and stop when you have 500 columns, regardless of the number of rows you've processed. This should solve your problem and be pretty efficient. You can then work out the Coprocessors and add it to the solution to be even more efficient. With respect to 'hot-spotting' , can't be helped. You could hash your unique ID in table 1, this will reduce the potential of a hotspot as the table splits. 
On table 2, because you have temporal data and you want to efficiently scan a small portion of the table based on size, you will always scan the first bloc, however as data rolls off and compression occurs, you will probably have to do some cleanup. I'm not sure how HBase handles splits that no longer contain data. When you compress an empty split, does it go away? By switching to coprocessors, you now limit the update accessors to the second table so you should still have pretty good performance. You may also want to look at Asynchronous HBase, however I don't know how well it will work with Coprocessors or if you want to perform async operations in this specific use case. Good luck, HTH... -Mike On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote: > Hi Michael, > > For now this is more a proof of concept than a production application. > And if it's working, it should be growing a lot and database at the > end will easily be over 1B rows. each individual server will have to > send it's own information to one centralized server which will insert > that into a database. That's why it need to be very quick and that's > why I'm looking in HBase's direction. I tried with some relational > databases with 4M rows in the table but the insert time is to slow > when I have to introduce entries in bulk. Also, the ability for HBase > to keep only the cells with values will allow to save a lot on the > disk space (futur projects). > > I'm not yet used with HBase and there is still many things I need to > undertsand but until I'm able to create a solution and test it, I will > continue to read, learn and try that way. Then at then end I will be > able to compare the 2 options I have (HBase or relational) and decide > based on the results. > > So yes, your reply helped because it's giving me a way to achieve this > goal (using co-processors). I don't know ye thow this part is working, > so I will dig the documentation for it. > +
Re: Timestamp as a key good practice? Jean-Marc Spaggiari 2012-06-16, 11:22
Thanks all for your comments and suggestions. Regarding the hotspotting, I will try to salt the key in the 2nd table and see the results.

Yesterday I finished installing my 4-server cluster on old machines. It's slow, but it's working. So I will do some testing.

You are recommending truncating the timestamp to the second or minute to have more entries per row. Is that because it's better to have more columns than rows? Or is it more because that will give a more "squared" pattern (lots of rows, lots of columns), which is more efficient?

JM
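To make the "truncate the timestamp to the second or minute" suggestion concrete, here is a minimal sketch of what it does to the row keys of the second table. It assumes millisecond timestamps and a zero-padded decimal key so that string order matches numeric order; `row_key`, the bucket size and the sample values are illustrative choices, not something specified in the thread.

```python
def row_key(timestamp_ms, bucket_ms=60_000):
    """Truncate a millisecond timestamp to the start of its bucket.

    The key is zero-padded to 13 digits so that lexicographic order
    matches numeric order (an assumption of this sketch).
    """
    bucket_start = timestamp_ms - (timestamp_ms % bucket_ms)
    return f"{bucket_start:013d}"


# Three hypothetical updates, two of them within the same minute.
updates = [
    ("f1", 1_339_848_120_500),
    ("f2", 1_339_848_150_000),
    ("f3", 1_339_848_185_250),
]

rows = {}
for file_id, ts in updates:
    rows.setdefault(row_key(ts), []).append(file_id)
```

With one-minute buckets, `f1` and `f2` share a row and `f3` gets its own, so a coarser bucket trades more columns per row for fewer rows. Whether that is more efficient is a separate question that this sketch does not answer.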
Re: Timestamp as a key good practice? Michel Segel 2012-06-16, 14:35
You can't salt the key in the second table. By salting the key, you lose the ability to do range scans, which is what you want to do.

Sent from a remote device. Please excuse any typos...

Mike Segel
Re: Timestamp as a key good practice? Jean-Marc Spaggiari 2012-06-16, 14:42
Let's imagine the timestamp is "123456789".

If I salt it with a letter from 'a' to 'z', then it will always be split between a few RegionServers. I will have something like "t123456789". The issue is that I will have to do 26 queries to be able to find all the entries. I will need to query from A000000000 to Axxxxxxxxx, then the same for B, and so on.

So what's worse? Am I better off dealing with the hotspotting? Salting the key myself? Or what if I use something like HBaseWD?

JM
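The cost of the 26-letter salt can be sketched as follows. This is a rough illustration, assuming a dictionary in place of the table and a made-up salting rule (a letter derived from the file ID); `salted_key` and `scan_older_than` are hypothetical helpers, not HBase or HBaseWD calls.

```python
import heapq

SALTS = "abcdefghijklmnopqrstuvwxyz"


def salted_key(timestamp, file_id):
    """Prefix a zero-padded timestamp with a salt letter (hypothetical scheme)."""
    salt = SALTS[sum(file_id.encode()) % len(SALTS)]
    return f"{salt}{timestamp:010d}"


def scan_older_than(table, cutoff):
    """Run one range scan per salt, then merge the results by timestamp."""
    per_salt = []
    for salt in SALTS:
        start, stop = f"{salt}{0:010d}", f"{salt}{cutoff:010d}"
        hits = sorted(
            (int(key[1:]), value)
            for key, value in table.items()
            if start <= key < stop
        )
        per_salt.append(hits)
    return list(heapq.merge(*per_salt))


table = {
    salted_key(ts, file_id): file_id
    for file_id, ts in [("f1", 300), ("f2", 100), ("f3", 200), ("f4", 900)]
}
```

Sorting the salted keys no longer gives time order, so a single range scan cannot answer "everything older than the cutoff". `scan_older_than` has to issue one scan per salt value and merge the partial results, and the number of scans grows with the number of salt values.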
Re: Timestamp as a key good practice? Michael Segel 2012-06-16, 16:33
Jean-Marc,

You indicated that you didn't want to do full table scans when you want to find out which files hadn't been touched since X time has passed. (X could be months, weeks, days, hours, etc.)

So here's the thing. First, I am not convinced that you will have hot spotting. Second, with salting you end up having to do 26 scans instead of one. Then you need to join the result sets.

Not really a good solution if you think about it.

Oh, and I don't believe that you will be hitting a single region, although you may hit a region hard. (Your second table's key is on the timestamp of the last update to the file. If the file hadn't been touched in a week, there's the probability that, at scale, it won't be in the same region as a file that had recently been touched.)

I wouldn't recommend HBaseWD. It's cute, it's not novel, and it can only be applied to a subset of problems. (Think round-robin partitioning in an RDBMS. DB2 was big on this.)

HTH

-Mike
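The point about stale and recent files landing in different regions follows from regions covering contiguous row-key ranges. A minimal sketch, assuming three made-up split points on a timestamp-keyed table (`SPLIT_KEYS` and `region_for` are hypothetical names, and the numbers are arbitrary):

```python
import bisect

# Hypothetical split points: region 0 holds keys below 1000, region 1 holds
# [1000, 2000), region 2 holds [2000, 3000), region 3 holds 3000 and above.
SPLIT_KEYS = [1_000, 2_000, 3_000]


def region_for(timestamp):
    """Index of the region whose key range contains this row key."""
    return bisect.bisect_right(SPLIT_KEYS, timestamp)


now = 3_500
recent_write = region_for(now)        # new index entries land in the last region
stale_scan = region_for(now - 3_000)  # long-untouched entries sit in an early region
```

Under these assumptions the writes concentrate on the last region while the "not updated since X" scan starts in the first one, which matches the expectation that one region may be hit hard without every request hitting the same region.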
Re: Timestamp as a key good practice? Rob Verkuylen 2012-06-16, 19:10
Just to add from my experiences:

Yes, hotspotting is bad, but so are devops headaches. A reasonable machine can handle 3,000-4,000 puts a second with ease, and a simple time-range scan can give you the records you need. I have my doubts you will be hitting these amounts anytime soon. A simple setup will get you your PoC, and then you scale when you need to scale.

Rob
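For a sense of scale, the quoted rate can be turned into daily volume with simple arithmetic. This is a back-of-the-envelope sketch only: it assumes the 3,000-4,000 puts per second figure is sustained around the clock on one machine, and it reuses the "over 1B rows" target mentioned earlier in the thread. The function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def puts_per_day(puts_per_second):
    """Daily volume if the given rate is sustained for 24 hours."""
    return puts_per_second * SECONDS_PER_DAY


def hours_to_load(total_rows, puts_per_second):
    """Hours needed to write total_rows at a constant rate."""
    return total_rows / puts_per_second / 3600


low, high = puts_per_day(3_000), puts_per_day(4_000)
hours_low = hours_to_load(1_000_000_000, 4_000)
hours_high = hours_to_load(1_000_000_000, 3_000)
```

At those rates a single machine would absorb roughly 259 to 346 million puts per day, so writing one billion rows would take on the order of three to four days of continuous load.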
Re: Timestamp as a key good practice? Jean-Marc Spaggiari 2012-06-21, 11:43
Hi Mike, Hi Rob,

Thanks for your replies and advice. It seems that now I'm due for some implementation. I'm reading Lars' book first, and when I'm done I will start with the coding.

I already have my ZooKeeper/Hadoop/HBase running, and based on the first pages I read, I already know it's not well done, since I have put a DataNode and a ZooKeeper server on ALL the servers ;) So, more reading for me for the next few days, and then I will start.

Thanks again!

JM
Re: Timestamp as a key good practice? Michael Segel 2012-06-21, 14:20
If you have a really small cluster... You can put your HMaster, JobTracker, NameNode, and ZooKeeper all on a single node. (Secondary too.) Then you have data nodes that run DataNode (DN), TaskTracker (TT), and RegionServer (RS).

That would solve any ZK RS problems.
Re: Timestamp as a key good practice? Jean-Marc Spaggiari 2012-06-22, 19:43
Ok. So if I understand correctly, I need:

PC1 => HMaster (HBase), JobTracker (Hadoop), NameNode (Hadoop), and ZooKeeper (ZK)
PC2 => Secondary NameNode (Hadoop)
PC3 to x => DataNode (Hadoop), TaskTracker (Hadoop), RegionServer (HBase)

For PC2, should I run ZooKeeper, JobTracker and master too? Can I have 2 masters? Or do I just run the secondary name node?
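On the question of whether PC2 should also run ZooKeeper: a ZooKeeper ensemble stays available only while a majority of its configured servers are up, so the number of servers is usually chosen with that arithmetic in mind. A small sketch of it follows; the function names are illustrative and not part of any ZooKeeper API, and nothing here is specific to this cluster's configuration.

```python
def quorum_size(ensemble_size):
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can be lost while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


# Map each ensemble size to (quorum size, tolerated failures).
summary = {n: (quorum_size(n), tolerated_failures(n)) for n in (1, 2, 3, 5)}
```

Under this arithmetic a second ZooKeeper server adds no failure tolerance over a single one, and three is the smallest ensemble that survives the loss of a server.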
-
Re: Timestamp as a key good practice? Jean-Marc Spaggiari 2012-06-23, 02:20
Hum... Seems that it's not working that way:

ERROR [main:QuorumPeerConfig@283] - Invalid configuration, only one server specified (ignoring)

So most probably the secondary should look exactly like the master, but I'm not 100% sure...
-
Re: Timestamp as a key good practice? Jean-Daniel Cryans 2012-06-26, 17:50
A quorum with 2 members is worse than 1, so don't put a ZK on PC2. The exception you are seeing is that ZK is trying to get a quorum with 1 machine, but that doesn't make sense, so instead it should revert to a standalone server and still work.

J-D
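The point about a 2-member quorum can be checked with a little arithmetic. The following is only a minimal sketch, assuming ZooKeeper's default majority quorum (more than half of the ensemble must be reachable); the function names `majority` and `tolerated_failures` are illustrative and are not part of any ZooKeeper API.

```python
def majority(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of servers that can fail while a majority is still reachable."""
    return ensemble_size - majority(ensemble_size)


# Compare a few ensemble sizes (assumption: plain majority quorums).
for n in (1, 2, 3, 5):
    print(f"{n} server(s): majority={majority(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Under that assumption, 1 server and 2 servers both tolerate zero failures, but with 2 servers there are two machines whose loss stops the ensemble, which is why 2 is worse than 1. Three servers is the smallest ensemble that survives the loss of one member.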
-
Re: Timestamp as a key good practice? Jean-Marc Spaggiari 2012-06-26, 17:56
Am I better to run it on 1? Or on 3? I just want to do some testing for now. But I have issues with the performance. It's taking 20 seconds to do 1000 gets with the current configuration... I'm tracking down the issues. I think the network is one, so I will address it this week, but for ZK, can I keep it on only one server for now? Or will it be more efficient if I configure it on 3?

Thanks,

JM
-
Re: Timestamp as a key good practice? Jean-Daniel Cryans 2012-06-26, 18:12
On Tue, Jun 26, 2012 at 10:56 AM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:
> Am I better to run it on 1? Or on 3? I just want to do some testing
> for now. but for ZK, can I keep it on only one server for now? Or will it be
> more efficient if I configure it on 3?

FWIW your system will be as available as PC1 is, so just put 1 ZK on that node. ZK is not on the read path, so whether you have 1 or 10 it won't change anything.

> But I have issues with the performance. It's taking 20
> seconds to do 1000 gets with the current configuration... I'm tracking
> down the issues. I think the network is one, so I will address it this week,

Network is always good to check, it's all fun and games until an interface negotiates 100Mb.

50ms per get sounds a bit extreme.
-
Re: Timestamp as a key good practice? Michael Segel 2012-06-26, 19:01
> Network is always good to check, it's all fun and games until an
> interface negotiates 100Mb.
>
> 50ms per get sounds a bit extreme.

<mini-rant>
Funny you should mention hardware.
I did submit a talk on cluster design to Strata (NY and London). Seems it didn't make the cut in NY, but who knows about London...

It seems that people are now starting to get the idea that it's important to think about your hardware and cluster design before you actually start to build a cluster.
</mini-rant>

You're right, we don't know enough about the hardware and configuration to talk intelligently...

Depending on the size of the row... it could take a long time to do a single fetch. (err get() )
-
Re: Timestamp as a key good practice? Jean-Marc Spaggiari 2012-06-26, 19:04
Thanks JD. I will shut down one of the ZK instances.

To Michael and JD, I will start another thread regarding the performance with more details.

JM
-
Re: Timestamp as a key good practice? Doug Meil 2012-06-14, 21:18
Will do!

On 6/14/12 2:06 AM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:

>JM, have a look at https://github.com/sematext/HBaseWD (this comes up
>often.... Doug, maybe you could add it to the Ref Guide?)
>
>Otis
|