-
Re: HBase - Secondary Index
anil gupta 2012-12-14, 08:41

Hi Anoop,

Nice presentation, and it seems like a smart implementation. Since the presentation only covered bullet points, I have a couple of questions on your implementation. :)

Here is a recap of my implementation and our previous discussion on secondary indexes. The previous email thread is at http://search-hadoop.com/m/1zWPMaaRtr .

The secondary index is stored in table "B" as rowkey B --> family:<rowkey A> . "<rowkey A>" is the column qualifier. Every row in B will have only one column, "k", and the value of that column is the rowkey of A.

Suppose I am storing customer events in table A. I have two requirements for data queries:
1. Query customer events on the basis of customer_Id and event_ID.
2. Query customer events on the basis of event_timestamp and customer_ID.

70% of querying is done by query #1, so I will use <customer_Id><event_ID> as the rowkey of table A. Now, in order to support fast results for query #2, I need to create a secondary index on A. I store that secondary index in B; the rowkey of B is <event_timestamp><customer_ID>. Every row stores the corresponding rowkey of A.

HBase querying approach:
1. Scan the secondary table using a prefix filter and startRow to get the list of rowkeys of the primary table.
2. Do a batch get on the primary table with the HTable.get(List<Get>) method, using the list of rowkeys obtained in step 1.

The only issue is that my solution needs at least two RPC calls, one each for step 1 and step 2 above. I want to reduce the number of RPCs to 1 if possible.

****** Questions on your implementation ******
1. In your presentation you mentioned that the region of the primary table and the region of the secondary table are always located on the same region server. How do you achieve it? By using the primary table rowkey as the prefix of the rowkey of the secondary table? Will your implementation work if the rowkey of the primary table cannot be used as a prefix in the rowkey of the secondary table (I have this limitation in my use case)?
2. Are you using an Endpoint or an Observer for building the secondary index table?
3. "Custom balancer do collocation". Is it a custom load balancer of the HBase Master or something else?
4. Your region split looks interesting. I don't have much info about it. Can you point to some docs on IndexHalfStoreFileReader?

Thanks,
Anil Gupta

On Tue, Dec 4, 2012 at 12:10 AM, Anoop Sam John <[EMAIL PROTECTED]> wrote:
> Hi All
>
> Last week I got a chance to present the secondary indexing solution that we have built at Huawei at the China Hadoop Conference. You can see the presentation at
> http://hbtc2012.hadooper.cn/subject/track4Anoop%20Sam%20John2.pdf
>
> I would like to hear what others think on this. :)
>
> -Anoop-

--
Thanks & Regards,
Anil Gupta
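The two-step client-side lookup described in this message can be sketched as below. This is only an illustration in plain Python, not HBase client code: the two tables are ordinary dictionaries, the fixed-width string rowkeys and the sample rows are assumptions made up for the sketch, and `scan_prefix`, `batch_get` and `events_on` are hypothetical helper names standing in for a prefix scan on table B and a batch get such as HTable.get(List<Get>) on table A.

```python
from bisect import bisect_left

# Table A (primary): rowkey <customer_id><event_id> -> event payload.
# Table B (index):   rowkey <event_timestamp><customer_id> -> rowkey of A (column "k").
# Fixed-width string fields and the sample data are assumptions for this sketch.
table_a = {
    "c001e0001": "login",
    "c001e0002": "purchase",
    "c002e0001": "login",
}
table_b = {
    "20121214c001": "c001e0001",
    "20121214c002": "c002e0001",
    "20121215c001": "c001e0002",
}


def scan_prefix(table, prefix):
    """Step 1: mimic a scan with startRow=prefix plus a prefix filter."""
    keys = sorted(table)
    i = bisect_left(keys, prefix)
    while i < len(keys) and keys[i].startswith(prefix):
        yield keys[i], table[keys[i]]
        i += 1


def batch_get(table, rowkeys):
    """Step 2: mimic a batch get on the primary table."""
    return [(k, table[k]) for k in rowkeys if k in table]


def events_on(timestamp_prefix):
    # Round trip 1: read primary rowkeys out of the index table.
    primary_keys = [a_key for _, a_key in scan_prefix(table_b, timestamp_prefix)]
    # Round trip 2: fetch the actual rows from the primary table.
    return batch_get(table_a, primary_keys)
```

Calling `events_on("20121214")` walks the index rows for that day and then fetches the matching primary rows, which is where the two round trips mentioned above come from.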
-
RE: HBase - Secondary Index
Anoop Sam John 2012-12-14, 08:54

Hi Anil,

> 1. In your presentation you mentioned that region of Primary Table and Region of Secondary Table are always located on the same region server. How do you achieve it? By using the Primary table rowkey as prefix of Rowkey of Secondary Table? Will your implementation work if the rowkey of primary table cannot be used as prefix in rowkey of Secondary table (i have this limitation in my use case)?

First of all, there will be the same number of regions in both the primary and index tables. All the start/stop keys of the regions will also be the same. Suppose there are 2 regions in the main table, say for keys 0-10 and 10-20. Then we will create 2 regions in the index table with the same key ranges. At the master balancing level it is easy to collocate these regions by looking at the start and end keys.

The selection of the rowkey used in the index table is the key here. All the rowkeys in the index table are prefixed with the start key of the region. When an entry is added to the main table with rowkey 5, it goes to the 1st region (0-10). There is an index region with range 0-10, and we select this region to store the index data. The row added to the index region for this entry will have the rowkey 0_x_5. I am using '_' as a separator here just to illustrate; actually we won't have any separator.

So the rowkeys in the index region always have a static begin part. At scan time we also know this part, so creating the startrow and endrow for the scan is possible. Note that we store the actual table rowkey as the last part of the index rowkey itself, not as a value. This is the better option in our case because the index is also used on the server side during scans. No index data is fetched to the client side.

I feel your use case fits our model perfectly.

> 2. Are you using an Endpoint or Observer for building the secondary index table?

Observer.

> 3. "Custom balancer do collocation". Is it a custom load balancer of HBase Master or something else?

It is a balancer implementation which is plugged into the Master.

> 4. Your region split looks interesting. I dont have much info about it. Can you point to some docs on IndexHalfStoreFileReader?

Sorry, I am not able to publish any design doc or code, as the company has not decided to open-source the solution yet. If you come across any particular question, please feel free to ask me. :) You can look at the HalfStoreFileReader class first.

-Anoop-
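The key-construction rule in this reply (prefix every index rowkey with the start key of the region that holds the main row) can be sketched as follows. This is a plain-Python illustration, not the unpublished implementation: the two-digit zero-padded keys, the `_` separator (which the reply says is not used in practice) and the helper names `region_start_for` and `index_rowkey` are all assumptions made for readability.

```python
from bisect import bisect_right

# Region start keys shared by the main table and the index table, following
# the example of two regions covering 0-10 and 10-20. Zero-padded two-digit
# keys are an assumption so that string order matches numeric order.
REGION_START_KEYS = ["00", "10"]


def region_start_for(main_rowkey):
    """Return the start key of the region that holds main_rowkey."""
    idx = bisect_right(REGION_START_KEYS, main_rowkey) - 1
    if idx < 0:
        raise ValueError("rowkey sorts before the first region")
    return REGION_START_KEYS[idx]


def index_rowkey(main_rowkey, middle):
    """Build <region start key>_<middle part>_<main rowkey>.

    `middle` stands for the 'x' in the 0_x_5 example; the '_' separator is
    only for illustration.
    """
    return "_".join([region_start_for(main_rowkey), middle, main_rowkey])
```

Because the index rowkey starts with the main region's start key, the index entry for main rowkey `05` sorts into the index region that has the same 0-10 range, which is what lets a balancer keep the two regions on one server.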
-
Re: HBase - Secondary Index
ramkrishna vasudevan 2012-12-14, 11:34

Nice explanation Anoop. :)

No prefix filters are needed to query the secondary index table. As Anoop said, the index rowkey is laid out as:
<StartKey><IndexName> -> static part
<Value> -> the indexed value from the main table
<ActualRowkey> -> actual rowkey

So you just need to set a start row of <StartKey><IndexName><Value>. This gives you the actual rowkey, since it is part of the index rowkey. Just use this rowkey on the primary table and you get the exact row needed. All of this happens on the server side. Nothing comes to the client until the final actual rowkey is fetched.

Regards
Ram
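The lookup Ram describes (seek to `<StartKey><IndexName><Value>` and read the actual rowkey off the tail of each index key) can be sketched like this. It is a plain-Python illustration only: the fixed field widths are an assumption introduced so the tail can be sliced off without a separator, the `startswith` check stands in for a stop row, and `make_index_key` and `lookup` are hypothetical helper names.

```python
from bisect import bisect_left

# Assumed fixed widths for <StartKey>, <IndexName> and <Value>; the thread
# does not say how the real implementation delimits these fields.
START_LEN, NAME_LEN, VALUE_LEN = 2, 4, 8


def make_index_key(start_key, index_name, value, main_rowkey):
    """Concatenate the four parts with no separator."""
    return start_key + index_name + value + main_rowkey


def lookup(index_region_keys, start_key, index_name, value):
    """Seek to <StartKey><IndexName><Value> and collect the main rowkeys."""
    start_row = start_key + index_name + value
    keys = sorted(index_region_keys)
    i = bisect_left(keys, start_row)
    found = []
    # The startswith test plays the role of the scan's stop row.
    while i < len(keys) and keys[i].startswith(start_row):
        found.append(keys[i][START_LEN + NAME_LEN + VALUE_LEN:])
        i += 1
    return found
```

The rowkeys returned by `lookup` are the ones that would then be used directly against the primary table region.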
-
Re: HBase - Secondary Index
anil gupta 2012-12-14, 18:01

On Fri, Dec 14, 2012 at 12:54 AM, Anoop Sam John <[EMAIL PROTECTED]> wrote:
> So the rowkeys (in index region) will have a static begin part always. At scan time also we know this part and so the startrow and endrow creation for the scan will be possible. Note that we will store the actual table row key as the last part of the index rowkey itself, not as a value. This is the better option in our case of handling the scan index usage also at server side. There is no index data fetch to client side.

Anil: My primary table rowkey is customerId+event_id, and my secondary table rowkey is timestamp+customerid. In your implementation it seems that, to use the secondary index, the application needs to know the "start_key" of the region (the static begin part) it wants to query. Right? Do you separately manage the logic of determining the region "start_key" (static begin part) for a scan? Also, it is possible that the customerId is not provided when using the secondary index, so I won't have the customer id for all queries. Hence I cannot use customer_id as a prefix in the rowkey of my secondary table.

> I feel your use case perfectly fit with our model

Anil: Somehow I am unable to fit your implementation to my use case, due to the constraint of the static begin part of the rowkey in the secondary table. There seems to be a disconnect. Can you tell me how my use case fits into your implementation?

Thanks & Regards,
Anil Gupta
-
RE: HBase - Secondary Index
Anoop Sam John 2012-12-17, 04:02

Hi Anil

During the scan, there is no need to fetch any index data to the client side, so there is no need to create any scanner on the index table at the client side. This happens on the server side.

For a scan on the main table with conditions on timestamp and customer id, a scanner is created with Filters, just as you would normally do when there is no secondary index. This scan from the client will go through all the regions in the main table. When it scans one particular region, say [x, y), of the main table, the CP can get the index table region object corresponding to this main table region from the RS. There is no issue in creating the static part of the rowkey: you know 'x' is the region start key. The server side then creates a scanner on the index region directly, and here we can specify the start key 'x' + <timestamp value> + <customer id>. Using the results from the index scan, we reseek on the main region to the exact rows where the data we are interested in is available. So no full region data scan happens.

In the cases where only the timestamp is given but no customer id, it is simple again. Create a scanner on the main table with only one filter. On the CP side the scanner on the index region is created with the start key 'x' + <timestamp value>. When you create the scan object and set startRow on it, it need not be the full rowkey. It can be part of the rowkey, like a prefix.

Hope you got it now :)

-Anoop-
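The server-side flow in this message can be sketched as a small simulation. Everything here is an assumption for illustration rather than the actual coprocessor code: each `Region` object bundles a main region with its colocated index region, index keys are laid out as `<region start key><timestamp><customer id><main rowkey>` with fixed widths, a dictionary lookup stands in for the reseek on the main region, and `client_scan` is a hypothetical stand-in for the client visiting every region in turn.

```python
from bisect import bisect_left

TS_LEN = 8    # assumed fixed-width timestamp, e.g. yyyymmdd
CUST_LEN = 4  # assumed fixed-width customer id (first 4 chars of the main rowkey)


class Region:
    """One main-table region plus its colocated index region (same start key)."""

    def __init__(self, start_key, rows):
        self.start_key = start_key
        self.rows = dict(rows)  # main rowkey -> (timestamp, payload)
        # Index key layout: <region start key><timestamp><customer id><main rowkey>
        self.index = sorted(
            start_key + ts + rk[:CUST_LEN] + rk for rk, (ts, _) in self.rows.items()
        )

    def scan_with_index(self, timestamp, customer_id=""):
        # The start row may be a partial key: start key + timestamp is enough.
        start_row = self.start_key + timestamp + customer_id
        i = bisect_left(self.index, start_row)
        hits = []
        while i < len(self.index) and self.index[i].startswith(start_row):
            main_rowkey = self.index[i][len(self.start_key) + TS_LEN + CUST_LEN:]
            # Stand-in for the reseek to the exact row in the main region.
            hits.append((main_rowkey, self.rows[main_rowkey][1]))
            i += 1
        return hits


def client_scan(regions, timestamp, customer_id=""):
    """The client still visits every region; only the server-side work shrinks."""
    out = []
    for region in regions:
        out.extend(region.scan_with_index(timestamp, customer_id))
    return out
```

Passing only a timestamp corresponds to the "no customer id" case above: the start row is simply shorter, and the same index scan still works.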
-
Re: HBase - Secondary Index
anil gupta 2012-12-18, 08:28

Hi Anoop,

Please find my reply inline.

Thanks,
Anil Gupta

On Sun, Dec 16, 2012 at 8:02 PM, Anoop Sam John <[EMAIL PROTECTED]> wrote:
> For the Scan on the main table with condition on timestamp and customer id, a scanner to be created with Filters. Yes like normal when there is no secondary index. So this scan from the client will go through all the regions in the main table.

Anil: Do you mean that if the table is spread across 50 region servers in a 60-node cluster, then we need to send a scan request to all 50 RSs? Doesn't that sound expensive? IMHO you were not doing this in your solution. Your solution looked cleaner than this, since you knew exactly which node to query when using the secondary index, thanks to the co-location of the primary table region and the secondary index table region (due to the static begin part of the secondary table rowkey). My problem is a little more complicated due to the constraint that I cannot have a "static begin part" in the rowkey of my secondary table.

> When you create the scan object and set startRow on that it need not be the full rowkey. It can be part of the rowkey also. Yes like prefix.
>
> Hope u got it now :)

Anil: I hope we are now on the same page. Thanks a lot for your valuable time to discuss this stuff.

Thanks & Regards,
Anil Gupta
-
Re: HBase - Secondary Index
Michel Segel 2012-12-18, 09:02

Just a couple of questions...

First, since you don't have any natural secondary indices, you can create one from a couple of choices. Keeping it simple, you choose an inverted table as your index. In doing so, you have one column containing all of the row ids for a given value. This means that it is a simple get().

My question is: since you don't have any formal SQL syntax, how are you doing this all server side?

Sent from a remote device. Please excuse any typos...

Mike Segel
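The inverted-table idea mentioned in this message (one index row per value, holding all the row ids for that value, read with a single get) can be sketched as below. This is a plain-Python illustration with dictionaries in place of tables; `build_inverted_index` and `get_rows` are hypothetical helper names, and the sample field name is an assumption.

```python
from collections import defaultdict


def build_inverted_index(main_table, field):
    """Map each value of `field` to the sorted list of main-table rowkeys holding it."""
    inverted = defaultdict(list)
    for rowkey in sorted(main_table):
        inverted[main_table[rowkey][field]].append(rowkey)
    return dict(inverted)


def get_rows(main_table, inverted, value):
    """One lookup on the index row for `value`, then fetch the listed main rows."""
    return [(rk, main_table[rk]) for rk in inverted.get(value, [])]
```

The index read is a single keyed lookup rather than a scan, at the cost of one index row growing with the number of main rows that share a value.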
-
RE: HBase - Secondary IndexAnoop Sam John 2012-12-18, 09:27
Anil:
If the scan from the client side does not specify any rowkey range but only the filter condition, then yes, it will go to all the primary table regions. In each region it will first scan the index table region and then seek to the exact rows in the main table region. If a region does not have any data matching the filter condition, the entire region simply gets skipped.

In a normal scan also, if there is a rowkey range that we can specify, the request goes only to the specific regions. It is the same in our secondary index case.

Put simply, for the scan there is no change at all in what happens at the client side: it uses the meta data to find which regions and RSs to contact, then contacts those regions one by one and gets the data from each region. The only difference is in what happens at the server side.

Without the index, all the data from all the HFiles is read at the server side and the filter is applied to every row. Only the rows which pass the filter go back to the client side.

With the index, when the scan happens at the server side, the index data is scanned first from the index region. This region is on the same RS, so there are no extra RPCs. The amount of data to be scanned from the index table is limited, because we can create a start key and a stop key for that scan. From the result of the index scan we know the rowkeys where the data we are interested in resides, so a reseek happens to those rows and only those rows are read. The time spent at the server side scanning a region is therefore reduced to a large extent.

Yes, there will still be calls from the client side to the RS for each region.

I think it might be clear to you now. The ppt that I shared says the same thing; it shows what happens at the server side.

-Anoop-
________________________________________
From: anil gupta [[EMAIL PROTECTED]]
Sent: Tuesday, December 18, 2012 1:58 PM
To: [EMAIL PROTECTED]
Subject: Re: HBase - Secondary Index

Hi Anoop,

Please find my reply inline.

Thanks,
Anil Gupta

On Sun, Dec 16, 2012 at 8:02 PM, Anoop Sam John <[EMAIL PROTECTED]> wrote:

> Hi Anil
> During the scan, there is no need to fetch any index data to client side. So there is no need to create any scanner on the index table at the client side. This happens at the server side.
>
> For the Scan on the main table with condition on timestamp and customer id, a scanner to be created with Filters. Yes like normal when there is no secondary index. So this scan from the client will go through all the regions in the main table.

Anil: Do you mean that if the table is spread across 50 region servers in a 60 node cluster then we need to send a scan request to all the 50 RS? Doesn't it sound expensive? IMHO you were not doing this in your solution. Your solution looked cleaner than this, since you knew exactly which node you need to go to for querying while using the secondary index, due to co-location (due to the static begin part of the secondary table rowkey) of the region of the primary table and the secondary index table. My problem is a little more complicated due to the constraint that I cannot have a "static begin part" in the rowkey of my secondary table.

> When it scans one particular region say (x,y] on the main table, using the CP we can get the index table region object corresponding to this main table region from the RS. There is no issue in creating the static part of the rowkey. You know 'x' is the region start key. Then at the server side will create a scanner on the index region directly and here we can specify the startkey. 'x' + <timestamp value> + <customer id>.. Using the results from the index scan we will make reseek on the main region to the exact rows where the data what we are interested in is available. So there wont be a full region data scan happening.
>
> When in the cases where only timestamp is there but no customer id, it

Anil: I hope now we are on the same page. Thanks a lot for your valuable time to discuss this stuff.

Thanks & Regards,
Anil Gupta
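The server-side difference Anoop describes in this message can be modelled in a few lines of Python. This is only an illustrative sketch, not the Huawei implementation: the region is a sorted in-memory list, the index key layout `(region start key, timestamp, main rowkey)` is assumed from the thread, and the names `filter_scan`, `index_scan` and `REGION_START` are made up for the example.

```python
from bisect import bisect_left

# One main-table region: rows sorted by rowkey, as HBase keeps them.
# Rowkey is (customer_id, event_id); the row value carries the event timestamp.
REGION_START = "c000"  # hypothetical region start key
main_rows = [
    (("c001", "e1"), {"ts": 100}),
    (("c001", "e2"), {"ts": 250}),
    (("c002", "e1"), {"ts": 120}),
    (("c003", "e7"), {"ts": 300}),
]
main_keys = [key for key, _ in main_rows]

# Co-located index region: sorted keys of (region start key, ts, main rowkey).
index_keys = sorted((REGION_START, row["ts"], key) for key, row in main_rows)


def filter_scan(ts_from, ts_to):
    """No index: touch every row in the region and apply the filter to each."""
    touched, hits = 0, []
    for key, row in main_rows:
        touched += 1
        if ts_from <= row["ts"] < ts_to:
            hits.append(key)
    return hits, touched


def index_scan(ts_from, ts_to):
    """With index: bounded scan of the index region, then seek to exact rows."""
    # A 2-tuple sorts before every 3-tuple that starts with the same two items,
    # so these behave like a start row and a stop row built from a key prefix.
    lo = bisect_left(index_keys, (REGION_START, ts_from))
    hi = bisect_left(index_keys, (REGION_START, ts_to))
    touched, hits = 0, []
    for _, _, main_key in index_keys[lo:hi]:
        pos = bisect_left(main_keys, main_key)  # the "reseek" into the main region
        touched += 1
        hits.append(main_rows[pos][0])
    return sorted(hits), touched
```

Both functions return the same rows for a timestamp range; the second value shows how many main-table rows had to be touched. The client still calls every region, as noted above, but a region with no matching index entries touches no main rows at all.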
-
RE: HBase - Secondary Index
Anoop Sam John 2012-12-18, 09:35
Hi Mike
>My question is that since you don't have any formal SQL syntax, how are you doing this all server side?

I think the question is for Anil. In his case he is not doing the index data scan at the server side. He scans the index table data back to the client, and from the client does gets to fetch the main table data. Correct, Anil? Just making it clear... :)

-Anoop-
________________________________________
From: Michel Segel [[EMAIL PROTECTED]]
Sent: Tuesday, December 18, 2012 2:32 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: HBase - Secondary Index

Just a couple of questions...

First, since you don't have any natural secondary indices, you can create one from a couple of choices. Keeping it simple, you choose an inverted table as your index.

In doing so, you have one column containing all of the row ids for a given value. This means that it is a simple get().

My question is that since you don't have any formal SQL syntax, how are you doing this all server side?

Sent from a remote device. Please excuse any typos...

Mike Segel
-
Re: HBase - Secondary Index
anil gupta 2012-12-19, 08:24
Hi Anoop,

For my use case, scans will never have a primary table rowkey range whenever I query using the secondary index. IMHO, if I am sending the request to all the RSs of the table, then I am concerned about too many unnecessary RPCs across the cluster for every single query based on the secondary index. Essentially, every time it will look like a full table scan, but under the hood the CPs will do the magic using the secondary table. Your solution works well when a rowkey range on the primary table can be specified. Unfortunately, I don't have the luxury of using a "primary table rowkey range" for now. It seems like I will have to stick to my current solution. However, it's always good to have a healthy discussion on different approaches. :)

PS: My current secondary index implementation is not yet in production. I did some preliminary testing and it seems to work fine, but I think I need to do some more testing.

Thanks,
Anil Gupta
-
Re: HBase - Secondary Index
anil gupta 2012-12-19, 08:39
Hi Michael,

Please find my replies inline.

Thanks,
Anil

On Tue, Dec 18, 2012 at 1:02 AM, Michel Segel <[EMAIL PROTECTED]> wrote:

> Just a couple of questions...
>
> First, since you don't have any natural secondary indices, you can create one from a couple of choices. Keeping it simple, you choose an inverted table as your index.

Reasons for not creating an inverted table:
1. There can be millions of columns corresponding to a rowkey in my secondary index. In future it can grow even more.
2. While using the secondary index, we are also planning to filter on the basis of other non-rowkey columns. For example, one row of the secondary table might look like this:
Rowkey: cf:PrimarytableRowKey=x, cf:customerFirstName=xyz, cf:customerAddress=123, Union Sq, LA
My primary table has around 50 columns, and in the secondary table I duplicate two columns to be used along with the secondary index for filtering.

> In doing so, you have one column containing all of the row ids for a given value.
> This means that it is a simple get().
>
> My question is that since you don't have any formal SQL syntax, how are you doing this all server side?

As Anoop said, I am not doing the index data scan at the server side. I scan the index table data back to the client, and from the client I do gets to fetch the main table data.

> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Dec 18, 2012, at 2:28 AM, anil gupta <[EMAIL PROTECTED]> wrote:
>
> > Hi Anoop,
> >
> > Please find my reply inline.
> >
> > Thanks,
> > Anil Gupta
> >
> > On Sun, Dec 16, 2012 at 8:02 PM, Anoop Sam John <[EMAIL PROTECTED]> wrote:
> >
> >> Hi Anil
> >> During the scan, there is no need to fetch any index data to client side. So there is no need to create any scanner on the index table at the client side. This happens at the server side.
> >>
> >> For the Scan on the main table with condition on timestamp and customer id, a scanner to be created with Filters. Yes like normal when there is no secondary index. So this scan from the client will go through all the regions in the main table.
> >
> > Anil: Do you mean that if the table is spread across 50 region servers in 60 node cluster then we need to send a scan request to all the 50 RS. Right? Doesn't it sounds expensive? IMHO you were not doing this in your solution. Your solution looked cleaner than this since you exactly knew which Node you need to go to for querying while using secondary index due to co-location(due to static begin part for secondary table rowkey) of region of primary table and secondary index table. My problem is little more complicated due to the constraints that: I cannot have a "static begin part" in the rowkey of my secondary table.
> >
> >> When it scans one particular region say (x,y] on the main table, using the CP we can get the index table region object corresponding to this main table region from the RS. There is no issue in creating the static part of the rowkey. You know 'x' is the region start key. Then at the server side will create a scanner on the index region directly and here we can specify the startkey. 'x' + <timestamp value> + <customer id>.. Using the results from the index scan we will make reseek on the main region to the exact rows where the data what we are interested in is available. So there wont be a full region data scan happening.
> >>
> >> When in the cases where only timestamp is there but no customer id, it will be simple again. Create a scanner on the main table with only one filter. At the CP side the scanner on the index region will get created with startkey as 'x' + <timestamp value>.. When you create the scan object and set startRow on that it need not be the full rowkey. It can be part of the rowkey also. Yes like prefix.
> >>
> >> Hope u got it now :)
> > Anil: I hope now we are on same page. Thanks a lot for your valuable time

Thanks & Regards,
Anil Gupta
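Anil's client-side approach (scan the secondary table, then fetch the primary rows with gets) might be sketched as below. This is a toy model and not his code: the two tables are plain Python dicts, the `|` in the rowkeys is added only for readability, and the function names `scan_secondary`, `batch_get` and `query_by_timestamp` are hypothetical. The column names come from the example row in his message.

```python
# Hypothetical in-memory stand-ins for the two HBase tables.
# Primary table: rowkey = <customer_id><event_id>
primary = {
    "cust1|ev1": {"customerFirstName": "xyz", "amount": 10},
    "cust2|ev9": {"customerFirstName": "abc", "amount": 25},
    "cust1|ev4": {"customerFirstName": "xyz", "amount": 40},
}

# Secondary table: rowkey = <event_timestamp><customer_id>, one row per event.
# Each row stores the primary rowkey plus a duplicated column used for filtering.
secondary = {
    "20121201|cust1": {"PrimarytableRowKey": "cust1|ev1", "customerFirstName": "xyz"},
    "20121201|cust2": {"PrimarytableRowKey": "cust2|ev9", "customerFirstName": "abc"},
    "20121205|cust1": {"PrimarytableRowKey": "cust1|ev4", "customerFirstName": "xyz"},
}


def scan_secondary(prefix, first_name=None):
    """Step 1 (first round trip): prefix scan on the secondary table.

    The optional filter uses a column duplicated into the secondary row,
    so no primary-table read is needed to apply it.
    """
    keys = []
    for rowkey in sorted(secondary):
        if not rowkey.startswith(prefix):
            continue
        row = secondary[rowkey]
        if first_name is None or row["customerFirstName"] == first_name:
            keys.append(row["PrimarytableRowKey"])
    return keys


def batch_get(rowkeys):
    """Step 2 (second round trip): batch get on the primary table."""
    return [primary[key] for key in rowkeys]


def query_by_timestamp(prefix, first_name=None):
    """Both steps run from the client, which is where the two RPC rounds come from."""
    return batch_get(scan_secondary(prefix, first_name))
```

Keeping one secondary row per event, rather than one wide inverted row per value, is what lets the duplicated columns sit next to each primary rowkey.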
-
Re: HBase - Secondary Index
David Arthur 2012-12-20, 02:47
Very cool design. Just curious, for the index did you write something custom or use an existing library like Lucene?

-David

On 12/4/12 3:10 AM, Anoop Sam John wrote:
> Hi All
>
> Last week I got a chance to present the secondary indexing solution what we have done in Huawei at the China Hadoop Conference. You can see the presentation from http://hbtc2012.hadooper.cn/subject/track4Anoop%20Sam%20John2.pdf
>
> I would like to hear what others think on this. :)
>
> -Anoop-
-
RE: HBase - Secondary Index
Anoop Sam John 2012-12-20, 03:33
Hi Nick, Andrew

I am discussing this with the management here. I can share more details after the new year. Some people are not available due to the Christmas and New Year holidays :)

-Anoop-
________________________________________
From: Andrew Purtell [[EMAIL PROTECTED]]
Sent: Wednesday, December 19, 2012 6:21 AM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: HBase - Secondary Index

Hi Anoop,

What Nick asked. I've also heard people wonder this out loud in a few places.

On Tue, Dec 18, 2012 at 9:48 AM, Nick Dimiduk <[EMAIL PROTECTED]> wrote:

> Hi Anoop,
>
> Your presentation has garnered quite a bit of community interest. Have you considered providing your implementation to the community, perhaps in an HBase-contrib module?
>
> Thanks,
> Nick

--
Best regards,
- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
-
RE: HBase - Secondary Index
Anoop Sam John 2012-12-20, 03:44
David

We are not using any existing library like Lucene. The index data of a table is written to another HBase table.

-Anoop-
-
Re: HBase - Secondary Index
Farah Karim 2012-12-25, 10:14
Hi, I am an MS student and I am doing my thesis work on Hadoop MapReduce and HBase. I have done the major implementation of my work. Now I have to use indexing to reduce the response time of queries on HBase. I saw your presentation and it seems great; I would like you to make this open source as soon as possible. I welcome any other suggestions for indexing on HBase.

Thanks :)
-
Re: HBase - Secondary Index
Shengjie Min 2012-12-27, 11:23
Hi Anoop,

>First of all there will be same number of regions in both primary and index tables. All the start/stop keys of the regions also will be same.
>Suppose there are 2 regions on main table say for keys 0-10 and 10-20. Then we will create 2 regions in index table also with same key ranges.
>At the master balancing level it is easy to collocate these regions seeing the start and end keys.
>The selection of the rowkey that will be used in the index table is the key here.
>What we will do is all the rowkeys in the index table will be prefixed with the start key of the region.
>When an entry is added to the main table with rowkey as 5 it will go to the 1st region (0-10)
>Now there will be index region with range as 0-10. We will select this region to store this index data.
>The row getting added into the index region for this entry will have a rowkey 0_x_5
>I am just using '_' as a separator here just to show this. Actually we wont be having any separator.
>So the rowkeys (in index region) will have a static begin part always. At scan time also we know this part and so the startrow and endrow creation for the scan will be possible.. Note that we will store the actual table row key as the last part of the index rowkey itself not as a value.
>This is better option in our case of handling the scan index usage also at server side. There is no index data fetch to client side..

What happens when regions get split? Do you update the start key on the index table?

-Shengjie

On 14 December 2012 08:54, Anoop Sam John <[EMAIL PROTECTED]> wrote:

> Hi Anil,
>
> >1. In your presentation you mentioned that region of Primary Table and Region of Secondary Table are always located on the same region server. How do you achieve it? By using the Primary table rowkey as prefix of Rowkey of Secondary Table? Will your implementation work if the rowkey of primary table cannot be used as prefix in rowkey of Secondary table( i have this limitation in my use case)?
> First of all there will be same number of regions in both primary and index tables. All the start/stop keys of the regions also will be same.
> Suppose there are 2 regions on main table say for keys 0-10 and 10-20. Then we will create 2 regions in index table also with same key ranges.
> At the master balancing level it is easy to collocate these regions seeing the start and end keys.
> The selection of the rowkey that will be used in the index table is the key here.
> What we will do is all the rowkeys in the index table will be prefixed with the start key of the region.
> When an entry is added to the main table with rowkey as 5 it will go to the 1st region (0-10)
> Now there will be index region with range as 0-10. We will select this region to store this index data.
> The row getting added into the index region for this entry will have a rowkey 0_x_5
> I am just using '_' as a separator here just to show this. Actually we wont be having any separator.
> So the rowkeys (in index region) will have a static begin part always. At scan time also we know this part and so the startrow and endrow creation for the scan will be possible.. Note that we will store the actual table row key as the last part of the index rowkey itself not as a value.
> This is better option in our case of handling the scan index usage also at server side. There is no index data fetch to client side..
>
> I feel your use case perfectly fit with our model
>
> >2. Are you using an Endpoint or Observer for building the secondary index table?
> Observer
>
> >3. "Custom balancer do collocation". Is it a custom load balancer of HBase Master or something else?
> It is a balancer implementation which will be plugged into Master
>
> >4. Your region split looks interesting. I dont have much info about it. Can you point to some docs on IndexHalfStoreFileReader?
> Sorry I am not able to publish any design doc or code as the company has not decided to open src the solution yet.

All the best,
Shengjie Min
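The index rowkey layout quoted in this message (region start key, then the indexed value, then the main rowkey as the last part) could be sketched like this. It is an illustration under stated assumptions, not the actual implementation: the `_` separator is kept only because the thread uses it for display, region start keys are zero-padded strings so that string order matches numeric order (so the thread's `0_x_5` appears here as `00_x_05`), and all function names are made up.

```python
from bisect import bisect_right

SEP = "_"  # shown only for readability; the thread says real keys have no separator

# Start keys of the main-table regions 0-10 and 10-20, zero-padded so that
# string order matches numeric order (an assumption of this sketch).
region_start_keys = ["00", "10"]


def region_start_for(main_rowkey):
    """Find the start key of the region that holds main_rowkey."""
    pos = bisect_right(region_start_keys, main_rowkey) - 1
    return region_start_keys[pos]


def index_rowkey(main_rowkey, indexed_value):
    """Index key = region start key + indexed value + main rowkey (stored last)."""
    return SEP.join([region_start_for(main_rowkey), indexed_value, main_rowkey])


def main_rowkey_of(index_key):
    """Recover the main rowkey from the tail of the index key."""
    return index_key.rsplit(SEP, 1)[1]


def index_scan_range(region_start, indexed_value):
    """Start and stop rows for an equality lookup inside one index region."""
    prefix = region_start + SEP + indexed_value + SEP
    # Stop row: the same prefix with its last character bumped by one.
    return prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)
```

Because the prefix is always the start key of the region being scanned, the server can build the start and stop rows without knowing anything about the main rowkey, which is the point made in the quoted answer.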
-
RE: HBase - Secondary Index
Anoop Sam John 2012-12-27, 11:30
>What happens when regions get splitted ? do you update the startkey on the index table?

We have a custom HalfStoreFileReader to read the split index region data. This reader changes the rowkey it returns by replacing the start key part. Immediately after a split, HBase initiates a compaction, and the compaction uses this new reader. So the rowkeys coming out are the changed ones, and thus the newly written HFiles have the changed rowkeys. A normal read (as part of a scan) during this time also uses this new reader, so we always get the rowkey in the expected format. :) Hope that makes it clear for you.

-Anoop-
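The prefix rewriting that Anoop describes for split index regions can be pictured with a small model. The thread only says that the custom reader replaces the start key part of the rowkeys it returns; how an entry is assigned to a daughter region is not stated, so the rule used below (compare the main rowkey stored at the tail of the index key with the split key) is an assumption of this sketch. The function name `split_index_entries` and the `_` separator are also illustrative.

```python
SEP = "_"  # illustration only; the thread says real index keys carry no separator


def split_index_entries(index_keys, parent_start, split_key):
    """Model how index keys could be served after the main region splits.

    Assumption of this sketch: an index entry belongs to the upper daughter
    when its embedded main rowkey (the last key component) is >= split_key.
    Entries of the upper daughter are returned with the start-key prefix
    replaced by the daughter's start key; the lower daughter keeps its prefix.
    """
    lower, upper = [], []
    for key in index_keys:
        prefix, rest = key.split(SEP, 1)
        assert prefix == parent_start, "all parent entries share one prefix"
        main_rowkey = key.rsplit(SEP, 1)[1]
        if main_rowkey >= split_key:
            upper.append(split_key + SEP + rest)  # rewritten prefix
        else:
            lower.append(key)  # prefix unchanged
    return sorted(lower), sorted(upper)
```

For example, splitting a parent with start key `00` at `05` leaves `00_b_02` as it is and serves `00_a_07` as `05_a_07`. In the design described above, the compaction that follows the split would then write the rewritten keys into the new HFiles.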
-
Re: HBase - Secondary Index
Shengjie Min 2012-12-27, 13:07
Thanks, Anoop. That makes sense. Hope you guys make this open source soon.

This model seems to work OK if the result set is not huge: you get the main keys and re-seek the exact rows from the main table. But one thing concerns me a little: if the result set of main keys you get after querying the index table is very big, how is the massive number of get() calls going to perform against the main table? Potentially the results can be scattered across all the different regions.

- Shengjie

All the best,
Shengjie Min
-
Re: HBase - Secondary Index
Anoop John 2012-12-27, 15:54
>how the massive number of get() is going to perform againt the main table

I didn't follow you completely here. There won't be any get() happening. As we get the exact rowkeys within a region from the index table, we can seek to the exact position and return that row.

-Anoop-
-
Re: HBase - Secondary Index
ramkrishna vasudevan 2012-12-27, 16:11
As per the design there is no get() operation at all. In case of an equals query nothing is cached in memory. For a range query we may need to cache some intermediate results.

Regards
Ram
-
Re: HBase - Secondary Index
Shengjie Min 2012-12-27, 16:29
>Didnt follow u completely here. There wont be any get() happening.. As the exact rowkey in a region we get from the index table, we can seek to the exact position and return that row.

Sorry, I misused "get()" here; I meant seeking. Yes, if just a small number of rows is returned, this works perfectly. As you said, you get the exact rowkey positions per region and simply seek to them. I was trying to work out the case where the number of result rows increases massively. Like in Anil's case, he wants to do a scan query against the secondary index (timestamp): "select all rows from timestamp1 to timestamp2", with no customerId provided. During that time period, he might have a big chunk of rows from different customerIds. The index table returns a lot of rowkey positions for different customerIds (I believe they are scattered across different regions), so you end up seeking to all the different positions in different regions and returning all the rows needed.

According to your presentation, page 14 - Performance Test Results (Scan): without index, the time increases linearly as the number of result rows increases. With index, on the other hand, the time spent climbs much more quickly than in the case without index.

BTW, quick question: in your presentation, is the scale there seconds or milliseconds? :)

- Shengjie

--
All the best,
Shengjie Min
-
RE: HBase - Secondary IndexAnoop Sam John 2012-12-28, 03:33
Yes, as you say, latency grows as the number of rows to be returned grows. Seeks within an HFile block are a somewhat expensive operation now (not much, but still). The new prefix trie encoding will be a huge bonus here; seeks will fly with it. [Ted also presented this at Hadoop China.] Thanks to Matt... :) I am trying to measure the scan performance with this new encoding, and am trying to backport a simple patch to the 0.94 version just for testing. Yes, as per my study, any index performs worse as the number of results to be returned grows :)

>btw, quick question- in your presentation, the scale there is seconds or mill-seconds:)

It is seconds. Don't consider the exact values; what matters is the % increase in latency :) Those were not high-end machines.

-Anoop-
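As a rough illustration of why a seek inside an encoded block has a cost: with a simple prefix (delta) encoding, each key is stored relative to the previous one, so locating a key means decoding entries in order from the start of the block. The sketch below is a toy model in plain Python, not HBase's actual on-disk format, and the function names are made up for the example. The prefix trie encoding mentioned above is, as far as the thread indicates, aimed at making such seeks cheaper.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def encode_block(sorted_keys):
    """Store each key as (prefix length shared with previous key, suffix)."""
    encoded, prev = [], ""
    for key in sorted_keys:
        shared = common_prefix_len(prev, key)
        encoded.append((shared, key[shared:]))
        prev = key
    return encoded


def seek(encoded, target):
    """Return (first key >= target, entries decoded), or (None, n) if none."""
    prev = ""
    for decoded, (shared, suffix) in enumerate(encoded, start=1):
        # Each key can only be rebuilt from the key decoded just before it.
        prev = prev[:shared] + suffix
        if prev >= target:
            return prev, decoded
    return None, len(encoded)


keys = ["cust1-e1", "cust1-e2", "cust1-e3", "cust2-e1"]
block = encode_block(keys)
print(block)
print(seek(block, "cust1-e3"))
```

The second value returned by `seek` is the number of entries that had to be decoded, which grows linearly with the target's position in the block under this scheme.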
-
Re: HBase - Secondary IndexMohit Anchlia 2012-12-28, 03:42
On Thu, Dec 27, 2012 at 7:33 PM, Anoop Sam John <[EMAIL PROTECTED]> wrote:
> The new encoding prefix trie will be a huge bonus here. There the seeks
> will be flying.. [Ted also presented this in the Hadoop China]

Do you have a link to that presentation?
-
RE: HBase - Secondary IndexAnoop Sam John 2012-12-28, 04:14
> Do you have link to that presentation?
http://hbtc2012.hadooper.cn/subject/track4TedYu4.pdf

-Anoop-
-
Re: HBase - Secondary IndexShengjie Min 2012-12-28, 10:55
>Yes when the no of results to be returned is more and more any index will
>become less performing as per my study :)

Yes, you are right. I guess it's just a drawback of any index approach. Thanks for the explanation.

All the best,
Shengjie Min
-
Re: HBase - Secondary IndexAdrien Mogenet 2013-01-06, 20:30
Nice topic, perhaps one of the most important for 2013 :-)
I still don't get how you're ensuring consistency between the index table and the main table without an external component (such as BookKeeper/ZooKeeper). What's the exact write path in your situation when inserting data? (WAL/RegionObserver, pre/post put/WALEdit...)

The underlying question is how you ensure that the WALEdits in the index and main tables are perfectly in sync, and how you're able to roll back in case of an issue in both WALs.

Adrien Mogenet
06.59.16.64.22
http://www.mogenet.me
-
Re: HBase - Secondary IndexMohit Anchlia 2013-01-06, 20:36
Does anyone have any links to or information about the new prefix encoding feature in HBase that's being referred to in this mail?
-
Re: HBase - Secondary IndexAdrien Mogenet 2013-01-06, 20:40
Are you talking about data block encoding of K/V?
https://issues.apache.org/jira/browse/HBASE-4218

Adrien Mogenet
-
Re: HBase - Secondary Indexanil gupta 2013-01-06, 22:12
@Mohit: Here is the JIRA for the prefix compression discussed here:
https://issues.apache.org/jira/browse/HBASE-4676

HTH,
Anil Gupta
-
RE: HBase - Secondary IndexAnoop Sam John 2013-01-07, 03:48
Hi Adrien
We maintain consistency between the main table and the index table, and handle the rollback you mentioned, using the CP (coprocessor) hooks. The current hooks were not enough for that, though. I am in the process of trying to contribute those new hooks, core changes, etc. now. Once all of that is done I will be able to explain in detail.

-Anoop-
-
Re: HBase - Secondary Index
Mohit Anchlia 2013-01-07, 04:17
Hi Anoop,

Am I correct in understanding that this indexing mechanism is only applicable when you know the row key? It's not truly an inverted index based on the column value.

Mohit

On Sun, Jan 6, 2013 at 7:48 PM, Anoop Sam John <[EMAIL PROTECTED]> wrote:

> Hi Adrien
> We are maintaining consistency between the main table and the index
> table, and handling the rollback mentioned below, using the CP hooks. The
> current hooks were not enough for that, though. I am in the process of
> trying to contribute those new hooks, core changes, etc. now. Once all of
> that is done I will be able to explain in detail.
>
> -Anoop-
-
RE: HBase - Secondary Index
Anoop Sam John 2013-01-07, 13:49
Hi,

It is an inverted index based on column value(s).
It is region-wise indexing. It can work whether or not someone knows the rowkey range.

-Anoop-
________________________________________
From: Mohit Anchlia [[EMAIL PROTECTED]]
Sent: Monday, January 07, 2013 9:47 AM
To: [EMAIL PROTECTED]
Subject: Re: HBase - Secondary Index

Hi Anoop,

Am I correct in understanding that this indexing mechanism is only applicable when you know the row key? It's not truly an inverted index based on the column value.

Mohit
-
Re: HBase - Secondary Index
Michael Segel 2013-01-08, 14:33
So if you're using an inverted table / index, why on earth are you doing it at the region level?

I've tried to explain this to others over 6 months ago, and it's not really a good idea.

You're overcomplicating this, and you will end up creating performance bottlenecks when your secondary index is completely orthogonal to your row key.

To give you an example...

Suppose you're CCCIS and you have a large database of auto insurance claims that you've acquired over the years from your Pathways product.

Your primary key would be a combination of the insurance company's ID and their internal claim ID for the individual claim.
Your row would be all of the data associated with that claim.

So now let's say you want to find the average cost to repair a front end collision of an S80 Volvo.
The make and model of the car would be orthogonal to the initial key. This means that the result set containing insurance records for front end collisions of S80 Volvos would most likely be evenly distributed across the cluster's regions.

If you used a series of inverted tables, you would be able to use a series of get()s to get the result set from each index and then find their intersections. (Note that you could also put them in sort order so that the intersections would be fairly straightforward to find.)

Doing this at the region level isn't so simple.

So I have to ask again: why go through all this and overcomplicate things?

Just saying...

On Jan 7, 2013, at 7:49 AM, Anoop Sam John <[EMAIL PROTECTED]> wrote:

> Hi,
> It is an inverted index based on column value(s).
> It is region-wise indexing. It can work whether or not someone knows the rowkey range.
>
> -Anoop-
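The intersection step in the message above can be sketched roughly as follows. This is only an illustration under stated assumptions: the two inverted "tables" are plain in-memory dicts rather than HBase tables, and the names `intersect_sorted`, `model_index`, `damage_index` and the sample claim keys are hypothetical. It shows why keeping each index entry's row keys in sort order turns the intersection into a single linear merge.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of row keys with a linear merge."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical inverted tables: indexed value -> sorted row keys of the claims table.
# Row keys follow the "<insurer id>|<claim id>" layout from the example (assumed format).
model_index = {
    "Volvo S80": ["ins01|c0007", "ins01|c0042", "ins02|c0003", "ins03|c0019"],
}
damage_index = {
    "front end": ["ins01|c0042", "ins02|c0003", "ins02|c0011"],
}

# One lookup per index (a get() in HBase terms), then a merge on the client.
matches = intersect_sorted(model_index["Volvo S80"], damage_index["front end"])
print(matches)
```

The merge touches each list once, so the cost grows with the combined length of the two key lists; the claim rows themselves would still have to be fetched afterwards.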
-
Re: HBase - Secondary Index
Asaf Mesika 2013-01-08, 23:00
I guess one reason is the amount of data traveling. In your design, you have to query a secondary index table, read all the matched original-table row keys, send them back to the client, and then issue a special scan that retrieves only those row keys' values. In his example, he retrieved 2% of the data, which was around 10 million records, which is around 1 GB according to his key size (800 bytes). That's a lot of bytes being transferred and throttling your switches.

In his design you read the rowkeys locally, and are thus able to apply the rest of the filters, and may eventually return just 100 key values which match those extra filters, thus saving tons of bandwidth and lots of RPC calls.

In your example, and using his design, you can treat each region as a mini table, each indexing its own data.

A secondary indexing solution which also supports joins, as any RDBMS does, has yet to be found, since it's fairly complex.

On Tuesday, January 8, 2013, Michael Segel wrote:

> So if you're using an inverted table / index, why on earth are you doing it
> at the region level?
>
> I've tried to explain this to others over 6 months ago, and it's not really
> a good idea.
>
> You're overcomplicating this, and you will end up creating performance
> bottlenecks when your secondary index is completely orthogonal to your row
> key.
>
> To give you an example...
>
> Suppose you're CCCIS and you have a large database of auto insurance
> claims that you've acquired over the years from your Pathways product.
>
> Your primary key would be a combination of the insurance company's ID and
> their internal claim ID for the individual claim.
> Your row would be all of the data associated with that claim.
>
> So now let's say you want to find the average cost to repair a front end
> collision of an S80 Volvo.
> The make and model of the car would be orthogonal to the initial key. This
> means that the result set containing insurance records for front end
> collisions of S80 Volvos would most likely be evenly distributed across the
> cluster's regions.
>
> If you used a series of inverted tables, you would be able to use a series
> of get()s to get the result set from each index and then find their
> intersections. (Note that you could also put them in sort order so that the
> intersections would be fairly straightforward to find.)
>
> Doing this at the region level isn't so simple.
>
> So I have to ask again: why go through all this and overcomplicate things?
>
> Just saying...
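The "each region as a mini table, each indexing its own data" idea from the message above might look something like the toy model below. It is a sketch under assumptions, not Anoop's implementation: `Region`, its `index` layout, and the sample rows are all hypothetical, and nothing here uses the HBase API. It only illustrates that a region which owns both the rows and their index can apply the remaining filters before anything is sent back.

```python
from collections import defaultdict


class Region:
    """A toy region: a slice of rows plus an index over its own rows only."""

    def __init__(self, rows):
        self.rows = rows                    # row key -> {column: value}
        self.index = defaultdict(list)      # (column, value) -> row keys
        for key, cols in rows.items():
            for col, val in cols.items():
                self.index[(col, val)].append(key)

    def query(self, col, val, extra_filter):
        """Use the local index, then apply the remaining filter before replying."""
        candidates = self.index.get((col, val), [])
        return [k for k in candidates if extra_filter(self.rows[k])]


regions = [
    Region({"a1": {"model": "S80", "cost": 900}, "a2": {"model": "S80", "cost": 4200}}),
    Region({"m1": {"model": "V70", "cost": 5000}, "m2": {"model": "S80", "cost": 3100}}),
]


def expensive(row):
    return row["cost"] > 3000


# Keys that a lookup against a separate index table would hand to the client
# before the rows could be fetched and filtered.
keys_shipped_first = sum(len(r.index.get(("model", "S80"), [])) for r in regions)

# Keys returned when every region consults its own index and filters locally.
result = [k for r in regions for k in r.query("model", "S80", expensive)]
print(keys_shipped_first, result)
```

In this toy data the gap is small; the argument in the message is about the case where the candidate set runs to millions of keys while only a handful survive the extra filters.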
-
Re: HBase - Secondary Index
lars hofhansl 2013-01-09, 00:30
Different use cases.

For global point queries you want exactly what you said below.
For range scans across many rows you want Anoop's design. As usual, it depends.

The tradeoff is bringing a lot of unnecessary data to the client vs. having to contact each region (or at least each region server).

-- Lars

________________________________
From: Michael Segel <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Tuesday, January 8, 2013 6:33 AM
Subject: Re: HBase - Secondary Index

So if you're using an inverted table / index, why on earth are you doing it at the region level?

I've tried to explain this to others over 6 months ago, and it's not really a good idea.

You're overcomplicating this, and you will end up creating performance bottlenecks when your secondary index is completely orthogonal to your row key.

To give you an example...

Suppose you're CCCIS and you have a large database of auto insurance claims that you've acquired over the years from your Pathways product.

Your primary key would be a combination of the insurance company's ID and their internal claim ID for the individual claim.
Your row would be all of the data associated with that claim.

So now let's say you want to find the average cost to repair a front end collision of an S80 Volvo.
The make and model of the car would be orthogonal to the initial key. This means that the result set containing insurance records for front end collisions of S80 Volvos would most likely be evenly distributed across the cluster's regions.

If you used a series of inverted tables, you would be able to use a series of get()s to get the result set from each index and then find their intersections. (Note that you could also put them in sort order so that the intersections would be fairly straightforward to find.)

Doing this at the region level isn't so simple.

So I have to ask again: why go through all this and overcomplicate things?

Just saying...
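The tradeoff stated above can be made concrete with a deliberately crude cost model. Everything in it is an assumption made for illustration: the function names are hypothetical, the index lookup is counted as a single request, and the 200-region table is invented. The 10 million candidate keys and 100 final matches are borrowed from the figures quoted earlier in the thread.

```python
def global_index_cost(candidate_keys, regions_with_matches):
    """Separate index table: one index lookup, then gets to the regions that
    hold matching rows. All candidate keys travel to the client first."""
    return {"requests": 1 + regions_with_matches, "keys_to_client": candidate_keys}


def local_index_cost(total_regions, final_matches):
    """Region-level index: every region is asked, but only final matches return."""
    return {"requests": total_regions, "keys_to_client": final_matches}


TOTAL_REGIONS = 200  # hypothetical table size

# Global point query: one matching row, living in one region.
point_global = global_index_cost(candidate_keys=1, regions_with_matches=1)
point_local = local_index_cost(TOTAL_REGIONS, final_matches=1)

# Wide range scan with extra filters: many candidates, few survivors.
scan_global = global_index_cost(candidate_keys=10_000_000, regions_with_matches=TOTAL_REGIONS)
scan_local = local_index_cost(TOTAL_REGIONS, final_matches=100)

for name, cost in [("point/global", point_global), ("point/local", point_local),
                   ("scan/global", scan_global), ("scan/local", scan_local)]:
    print(name, cost)
```

Under these assumptions the separate index table wins the point query on request count, while the region-level index wins the filtered range scan on data shipped to the client, which matches the "different use cases" summary.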
-
Re: HBase - Secondary Index
anil gupta 2013-01-09, 01:28
+1 on Lars' comment.

Either the client gets the rowkey from the secondary table and then gets the real data from the primary table, ** OR ** the request is sent to all the RS (or regions) hosting a region of the primary table.

Anoop is using the latter mechanism. Both mechanisms have their pros and cons. IMO, there is no outright winner.

~Anil Gupta

On Tue, Jan 8, 2013 at 4:30 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> Different use cases.
>
> For global point queries you want exactly what you said below.
> For range scans across many rows you want Anoop's design. As usual, it
> depends.
>
> The tradeoff is bringing a lot of unnecessary data to the client vs. having
> to contact each region (or at least each region server).
>
> -- Lars

Thanks & Regards,
Anil Gupta
-
Re: HBase - Secondary Index
Michel Segel 2013-01-09, 01:30
Can you provide a use case?

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jan 8, 2013, at 6:30 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> Different use cases.
>
> For global point queries you want exactly what you said below.
> For range scans across many rows you want Anoop's design. As usual, it depends.
>
> The tradeoff is bringing a lot of unnecessary data to the client vs. having to contact each region (or at least each region server).
>
> -- Lars
-
Re: HBase - Secondary Index
Mohit Anchlia 2013-01-09, 01:50
It makes sense to use inverted indexes when you have unique index columns. But if you have columns that are evenly distributed, then parallel search makes more sense. It just depends on the cardinality of your indexed columns.

On Tue, Jan 8, 2013 at 5:28 PM, anil gupta <[EMAIL PROTECTED]> wrote:

> +1 on Lars' comment.
>
> Either the client gets the rowkey from the secondary table and then gets the
> real data from the primary table, ** OR ** the request is sent to all the RS
> (or regions) hosting a region of the primary table.
>
> Anoop is using the latter mechanism. Both mechanisms have their pros and
> cons. IMO, there is no outright winner.
>
> ~Anil Gupta
-
RE: HBase - Secondary Index Anoop Sam John 2013-01-09, 03:22
Totally agree with Lars. The design came out of our own usage and data distribution pattern.
We also could not compromise on put performance. That is why the region-colocation-based, region-level indexing design came about :) Also, since both index maintenance and index usage happen on the server side, no change is needed on the client side, whatever type of client you use: Java code, REST APIs, or anything else. MR-based parallel scans should also be comparatively easy, I feel, since absolutely no client-side changes are needed. :) As Anil said, every approach has its pros and cons, and you should adopt whichever one suits your usage. :)
-Anoop-
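As a rough illustration of why colocating the index with the region helps put performance, the write path can be modelled in a few lines. This is only a sketch under stated assumptions, not the actual coprocessor code: `IndexWritePathSketch`, `RegionServer`, and both `put...` methods are hypothetical names, and the model simply charges one RPC per server call.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory model only; every name here is hypothetical and no HBase API is used. */
final class IndexWritePathSketch {

    /** A region server holding some primary data and some index entries. */
    static final class RegionServer {
        final Map<String, String> data = new HashMap<>();
        final Map<String, String> index = new HashMap<>();
        int rpcsReceived = 0;
    }

    /**
     * Region-level index: the index entry is written on the same server, inside the
     * same call that handles the put (the role a server-side hook would play).
     * Returns the number of RPCs this model charges for the write.
     */
    static int putWithColocatedIndex(RegionServer rs, String rowKey, String value) {
        rs.rpcsReceived++;                           // the only network call
        rs.data.put(rowKey, value);
        rs.index.put(value + '\0' + rowKey, rowKey); // local write, no extra RPC
        return 1;
    }

    /**
     * Global index table: the index row is keyed by the indexed value, so in general
     * it lives on a different server and needs its own call.
     */
    static int putWithGlobalIndex(RegionServer primary, RegionServer indexHost,
                                  String rowKey, String value) {
        primary.rpcsReceived++;                      // call 1: primary table
        primary.data.put(rowKey, value);
        indexHost.rpcsReceived++;                    // call 2: index table
        indexHost.index.put(value + '\0' + rowKey, rowKey);
        return 2;
    }

    public static void main(String[] args) {
        RegionServer rs1 = new RegionServer();
        RegionServer rs2 = new RegionServer();
        // prints "colocated=1 global=2"
        System.out.println("colocated=" + putWithColocatedIndex(rs1, "c1", "volvo-s80")
                + " global=" + putWithGlobalIndex(rs1, rs2, "c2", "volvo-s80"));
    }
}
```

The model leaves out everything that makes the real problem hard (WAL, consistency between data and index, rollback, region splits); it only shows where the extra call on the write path comes from.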
-
Re: HBase - Secondary Index ramkrishna vasudevan 2013-01-09, 04:11
As far as I can see, it is more about using the coprocessor framework in
this solution, which helps us a great deal in avoiding unnecessary RPC calls when we go with region-level indexing.
Regards,
Ram