-
Insert into tall table 50% faster than wide table
Bryan Keller 2010-12-22, 23:41
I have been testing a couple of different approaches to storing customer orders. One is a tall table, where each order is a row. The other is a wide table where each customer is a row, and orders are columns in the row. I am finding that inserts into the tall table, i.e. adding rows for every order, are roughly 50% faster than inserts into the wide table, i.e. adding a row for a customer and then adding columns for orders.
In my test, there are 10,000 customers, each customer has 600 orders and each order has 10 columns. The tall table approach results in 6 million rows of 10 columns. The wide table approach results in 10,000 rows of 6,000 columns. I'm using HBase 0.89-20100924 and Hadoop 0.20.2. I am adding the orders using a Put for each order, submitted in batches of 1000 as a list of Puts.
Are there techniques to speed up inserts with the wide table approach that I am perhaps overlooking?
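The numbers in the test description can be checked with a few lines of arithmetic. This is only a sketch of the shapes described in the post; the constant and function names are made up for illustration and nothing here talks to HBase.

```python
# Shape of the two layouts described in the post (names are illustrative).
CUSTOMERS = 10_000
ORDERS_PER_CUSTOMER = 600
COLUMNS_PER_ORDER = 10
BATCH_SIZE = 1_000  # Puts per submitted batch, one Put per order


def layout_shapes():
    """Return (tall, wide, total_cells, batches) for the test data set."""
    orders = CUSTOMERS * ORDERS_PER_CUSTOMER
    tall = {"rows": orders, "columns_per_row": COLUMNS_PER_ORDER}
    wide = {
        "rows": CUSTOMERS,
        "columns_per_row": ORDERS_PER_CUSTOMER * COLUMNS_PER_ORDER,
    }
    total_cells = orders * COLUMNS_PER_ORDER  # identical for both layouts
    batches = orders // BATCH_SIZE
    return tall, wide, total_cells, batches


tall, wide, total_cells, batches = layout_shapes()
print(tall, wide, total_cells, batches)
```

Both layouts write the same number of cells in the same number of batches, so any difference in insert speed has to come from how those cells are keyed and grouped rather than from the cell count.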
-
RE: Insert into tall table 50% faster than wide table
Peter Haidinyak 2010-12-22, 23:52
Interesting, do you know what the time difference would be on the other side, doing a lookup/scan?
Thanks
-Pete
-
Re: Insert into tall table 50% faster than wide table
Bryan Keller 2010-12-23, 00:16
It appears to be the same or better, not to derail my original question. The much slower write performance will cause problems for me unless I can resolve that.
-
Re: Insert into tall table 50% faster than wide table
Bryan Keller 2010-12-23, 01:03
Perhaps slow wide table insert performance is related to row versioning? If I have a customer row and keep adding order columns one by one, I'm thinking that there might be a version kept of the row for every order I add? If I am simply inserting a new row for every order, there is no versioning going on. Could this be causing performance problems?
-
Re: Insert into tall table 50% faster than wide table
Bryan Keller 2010-12-23, 02:24
Actually I don't think this is the problem as HBase versions cells, not rows, if I understand correctly.
-
RE: Insert into tall table 50% faster than wide table
Michael Segel 2010-12-23, 02:53
HBase does version cells.
But I saw something of interest:
> In my test, there are 10,000 customers, each customer has 600 orders and each order has 10 columns. The tall table approach results in 6 mil rows of 10 columns. The wide table approach results in 10,000 rows of 6,000 columns. I'm using hbase 0.89-20100924 and hadoop 0.20.2. I am adding the orders using a Put for each order, submitted in batches of 1000 as a list of Puts.
>
> Are there techniques to speed up inserts with the wide table approach that I am perhaps overlooking?
Ok, so you have 10K by 600 by 10. So the 'tall' design has a row key of customer_id and order_id with 10 columns in a single column family. So you get 6 million rows and 10 column puts.
Now if you do a 'wide' table... Your row key is the 'customer_id' only. Each column is the order, so you write one column for each order and you have to figure out how you represent your columns in the order. (An example... your order of 10 items is represented by a string with a 'special character' used as a column separator in the order.) So you're doing one column write for each order and you have a total of 10K rows.
Unless I'm missing something, part of the 'slowness' could be how you're writing your orders on your wide table. There are a couple of other unknowns. Are you hashing your keys? I mean, are you getting a bit of 'randomness' in your keys?
So what am I missing?
-Mike
-
Re: Insert into tall table 50% faster than wide table
Ted Yu 2010-12-23, 03:00
> Each column is the order so you write one column for each order
As stated earlier, the wide table has 6,000 columns instead of 600. :-)
Bryan:
Can you describe how you form row keys in each case?
-
RE: Insert into tall table 50% faster than wide table
Michael Segel 2010-12-23, 03:35
Ted,
Yes, 10K rows, one for each customer. But if you write each order as a column, and there are 10 'columns' in an order, you have to somehow serialize the 10 columns that represent the order so you get one column per order_id. Of course you could still write out a column as order_id,order_column and then get your 6000 columns. If you did that, then you have the issue of your column id. Did you go column_id,order_id or did you go order_id,column_id? (One has to ask... :-) )
IMHO I'd elect to put the 10 columns of the order in a single column rather than write the 10 columns as individual columns. But that's just me. :-)
-Mike
-
Re: Insert into tall table 50% faster than wide table
Bryan Keller 2010-12-23, 04:34
So for the tall table the row key is customer ID + order ID.
For the wide table, the row key is customer ID. The column names (qualifiers) are prefixed by the order ID so they are unique per order, i.e. there are 10 columns prefixed with the first order's ID, 10 columns prefixed with the second order's ID, etc.
The order and customer IDs are random UUIDs (16 bytes).
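The two key schemes described in this message can be sketched as plain byte concatenation. This is an illustration only: the message does not say how the IDs and column names are actually joined, so raw concatenation, the helper names, and the column name `total` are all assumptions.

```python
import uuid


def tall_cell(customer_id: bytes, order_id: bytes, column: bytes):
    """Tall layout: row key = customer ID + order ID, qualifier = column name."""
    return customer_id + order_id, column


def wide_cell(customer_id: bytes, order_id: bytes, column: bytes):
    """Wide layout: row key = customer ID, qualifier = order ID + column name."""
    return customer_id, order_id + column


# Random 16-byte UUIDs, as in the message; b"total" is a hypothetical column.
customer = uuid.uuid4().bytes
order = uuid.uuid4().bytes

tall_row, tall_qualifier = tall_cell(customer, order, b"total")
wide_row, wide_qualifier = wide_cell(customer, order, b"total")

print(len(tall_row), len(tall_qualifier))  # 32-byte row key, short qualifier
print(len(wide_row), len(wide_qualifier))  # 16-byte row key, long qualifier
```

In the tall layout every order lands in its own row, while in the wide layout all 600 orders of a customer share one row key and differ only in the qualifier prefix.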
-
Re: Insert into tall table 50% faster than wide table
Andrey Stepachev 2010-12-23, 07:14
I think row locks are what slows things down here. Each row you insert tries to acquire a lock and then release it. The wide table has significantly fewer rows, and far fewer locks are acquired during insert.
-
Re: Insert into tall table 50% faster than wide table
Ted Dunning 2010-12-23, 09:28
But the tall table is FASTER than the wide table.
-
Re: Insert into tall table 50% faster than wide table
Andrey Stepachev 2010-12-23, 09:57
2010/12/23 Ted Dunning <[EMAIL PROTECTED]>
> But the tall table is FASTER than the wide table.
Oops. :)
Maybe you are putting more data? Are you using compression? (With prefixed qualifiers you get more data, since the UUID can have a length comparable to an order row.)
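The "more data" question can be estimated roughly by counting the row key and qualifier bytes each cell carries, since HBase stores both with every cell. The sketch below assumes the 16-byte UUIDs mentioned in the thread and a hypothetical column name length; it ignores the column family, timestamp, value and any compression.

```python
UUID_LEN = 16  # bytes, as stated in the thread


def key_bytes_per_cell(layout: str, column_name_len: int) -> int:
    """Row key bytes plus qualifier bytes carried by a single cell."""
    if layout == "tall":
        # row = customer UUID + order UUID, qualifier = column name
        row, qualifier = 2 * UUID_LEN, column_name_len
    elif layout == "wide":
        # row = customer UUID, qualifier = order UUID + column name
        row, qualifier = UUID_LEN, UUID_LEN + column_name_len
    else:
        raise ValueError(f"unknown layout: {layout}")
    return row + qualifier


# An 8-byte column name is an arbitrary example value.
print(key_bytes_per_cell("tall", 8), key_bytes_per_cell("wide", 8))
```

Under these assumptions the two layouts carry the same number of key bytes per cell, because the order UUID only moves from the row key into the qualifier. If that holds for the real test, raw data volume alone would not explain the gap.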
-
Re: Insert into tall table 50% faster than wide table
Lars George 2010-12-23, 10:55
Writing data only hits the WAL and MemStore, so that should result in the same performance for both models. One thing that Mike mentioned is how you distribute the load. How many servers are you using? How are you inserting your data (sequential or random)? Why do you use a Put, since this sounds like a bulk insert and hence should be much better done with an HFileOutputFormat-based MapReduce job?
You do have some row locking happening as mentioned earlier, which may block concurrent updates to the same row. Are you sending updates for one row in a single Put instance? Or are you creating many Puts for each order but the same row?
Lars
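The last question, one Put per row versus many Puts against the same row, can be illustrated without the HBase client. The sketch below models a put as a `(row_key, {qualifier: value})` pair and merges all pairs that share a row key; the function name and the sample data are hypothetical, and with the real client the merged result would correspond to one Put per row carrying all of that row's columns.

```python
from collections import defaultdict


def group_by_row(order_puts):
    """Merge per-order puts that target the same row into one put per row.

    order_puts: iterable of (row_key, {qualifier: value}) pairs.
    Returns a dict mapping row_key -> merged {qualifier: value}.
    """
    merged = defaultdict(dict)
    for row_key, cells in order_puts:
        merged[row_key].update(cells)
    return dict(merged)


# Three per-order puts for two customers in the wide layout (made-up values).
batch = [
    (b"cust-1", {b"order-1:total": b"10"}),
    (b"cust-2", {b"order-7:total": b"25"}),
    (b"cust-1", {b"order-2:total": b"40"}),
]
per_row = group_by_row(batch)
print(len(batch), "puts ->", len(per_row), "rows")
```

In the wide layout a batch of 1000 per-order puts can touch the same customer row repeatedly, so grouping like this reduces how often each row has to be locked. In the tall layout every order is already its own row, so there is nothing to merge.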
-
RE: Insert into tall table 50% faster than wide table
Michael Segel 2010-12-23, 13:20
Uhm... just a couple of thoughts... For clarification ... lets call Bryan's "order's columns" the detail of the order. Columns of columns is a bit confusing... Its becoming more apparent that schema design will play a large consideration in terms of performance, and because its going to be dependent on HBase's internals, its very possible that it can be tied to versions. This means that as HBase evolves, those seeking optimum performance may have to periodically review their schema decisions. The first thing I'd recommend on the 'wide table' schema is to not store the individual order's columns as separate columns, but as part of the order itself. The main reason for this is that you will never fetch an order's detail by itself. A quick and cheap way of serializing the order detail is to use something Dick Pick did around 40 years ago. In the Pick databases (ie Revelation), a non-printable ASCII character was used as a column delimiter. You could use the '|' (pipe) character, but someone could point out that its possible that it could occur in the data. A non-printable ascii character (char 254??) would less likely be part of the data. This works well because when you want to get the order, you can fetch it from HBase, then parse the order based on a string token. (Very fast and efficient) This will make life easier in the long run... It will also have a positive impact on your code. On each Mapper.map() iteration, or rather code iteration [see assumption below], you have your row_id, and then one put for the column write (that contains the 10 detail items.) Note: What has a higher cost? Using a string buffer and concatenation of your 10 detail items then taking its bytecode, or doing 10 put()s? Note the following: The discussion above is for uber performance gains. There will be code improvements, however they will be relatively modest when compared to other potential gains. 
Assumption(s):

Bryan is attempting to create a simulation with 10K customers, 600 orders each (10 items per order). This is a performance test. This probably isn't an m/r program but a single client doing an insert. Note that it's a relative performance issue, and it would be easier to do as a single program than as a distributed one. This could be an m/r job if Bryan pre-builds the list of customer orders before starting the job... Or it could be a multi-threaded client where each thread reads from the pre-built list and performs an insert.

If the assumption is true, then Bryan is going to randomly pick a customer id, create an order and insert the order into HBase. (Randomly pick a number between 1 and N, where N represents the number of customers who haven't placed 600 orders, then count the number of orders and remove each customer with 600 orders from the list.)

So this really wouldn't be a bulk load app, but a simulation of multiple clients hitting HBase and its relative performance. If this is the case, I don't know if you want to use the HFileOutputFormat...

With respect to the 'wide' row, I'd hash the key. (You wouldn't want to do this in the 'tall' table because you want each customer's orders to be near each other.)

HTH

-Mike
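As a rough illustration of hashing the wide-table row key, here is a minimal sketch. The post does not name a hash function, so MD5 is purely an assumption here, and `RowKeys` and `hashedKey` are hypothetical names. The point is only that adjacent customer ids map to unrelated keys, so writes are not concentrated in one key range.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper for deriving a wide-table row key from a customer id. */
final class RowKeys {
    /** Returns the 16-byte MD5 digest of the customer id (assumed hash choice). */
    static byte[] hashedKey(String customerId) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return md5.digest(customerId.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on standard Java platforms.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

The same customer id always yields the same key, so lookups by customer still work. Range scans over neighbouring customers do not, which is why this would not suit the tall table, where a customer's orders should stay adjacent.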
-
Re: Insert into tall table 50% faster than wide tableRyan Rawson 2010-12-23, 19:54
Hi all,
What does the region count look like between your tall and wide tables? If you don't get a good spread of regions across your cluster, you don't get full parallelism on all your hardware.

The row lock is another thing to watch out for: concurrent puts to the same row will serialize along the row lock.

-ryan
-
Re: Insert into tall table 50% faster than wide tableBryan Keller 2010-12-23, 22:28
I revised the test so that it creates a single Put for each customer. Previously I was creating a separate Put for each order, even if the order was for the same customer. I submit batches of Puts using HTable.put(List<Put>).
Performance with both approaches was about the same. It doesn't appear as if row locks are an issue in my case, perhaps because the Puts for a customer's orders are mostly in the same List<Put>?

As to cluster setup, I am testing tall vs wide on the exact same cluster. Keys are all random UUIDs so I'm assuming I should get a good spread. Are there configuration options I should be looking at that could help wide table performance for inserts?

I was thinking about serializing the order data, but then I will run into issues of versioning and such, and then I am back to a tightly structured schema. Thus I did like storing the order fields in separate columns. Read performance seems to be very good; it is the writes that are slower.
-
Re: Insert into tall table 50% faster than wide tableBryan Keller 2010-12-23, 22:44
Correction, I ran the wrong test. Consolidating the Puts increased performance back to that of the tall table. So it appears row locks were the issue. Thanks for the help everyone.
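The consolidation described above (one Put per customer instead of one Put per order) might look roughly like the grouping step below. This is a sketch under assumptions, not Bryan's actual test code: `Order`, `PutConsolidator`, and the `<orderId>:<fieldName>` qualifier scheme are all hypothetical, and plain maps stand in for HBase `Put` objects so the example runs without a cluster.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical order model: a customer id, an order id, and the detail fields. */
final class Order {
    final String customerId;
    final String orderId;
    final Map<String, String> detail;

    Order(String customerId, String orderId, Map<String, String> detail) {
        this.customerId = customerId;
        this.orderId = orderId;
        this.detail = detail;
    }
}

final class PutConsolidator {
    /**
     * Groups orders by customer so that each customer row is written once.
     * Returns rowKey -> (column qualifier -> value), in first-seen order.
     */
    static Map<String, Map<String, String>> consolidate(List<Order> orders) {
        Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        for (Order order : orders) {
            Map<String, String> columns = rows.get(order.customerId);
            if (columns == null) {
                columns = new LinkedHashMap<>();
                rows.put(order.customerId, columns);
            }
            for (Map.Entry<String, String> field : order.detail.entrySet()) {
                // Assumed qualifier scheme: "<orderId>:<fieldName>".
                columns.put(order.orderId + ":" + field.getKey(), field.getValue());
            }
        }
        // With the HBase client, each entry would become a single Put for that
        // row key, and the resulting list would be passed to HTable.put(List<Put>).
        return rows;
    }
}
```

Each batch then touches a given customer row once, rather than once per order, which would be consistent with the row-lock explanation given in the thread.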