Re: Regarding Indexing columns in HBASE
Ok...

A little bit more detail...

First, it's possible to store your data in multiple tables, each with a different key.
Not a good idea, for some fairly obvious reasons (you duplicate all of the data and have to keep every copy consistent yourself)...

You could, however, create a secondary table which is an inverted index: the row key of the index is the value from the base table, the column name is the row key in the base table, and the cell value is the base table's row key.

This will work well, as long as you're not indexing a column that has a small, finite set of values, like a binary column (Male/Female as an example)...
(That would create a very wide row...)
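
To make that concrete, here's a minimal sketch of maintaining such an inverted index with the HBase 1.x client API. The table names ("vehicles" keyed by VIN, "vehicles_idx"), the column families and the sample values are all made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical tables: "vehicles" (base, keyed by VIN) and "vehicles_idx" (inverted index).
public class InvertedIndexWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table base = conn.getTable(TableName.valueOf("vehicles"));
         Table index = conn.getTable(TableName.valueOf("vehicles_idx"))) {

      String vin = "YV1TS92D511234567";   // base-table row key (made-up VIN)
      String color = "blue";              // value being indexed

      // Base row: row key = VIN, one column holding the indexed value.
      Put basePut = new Put(Bytes.toBytes(vin));
      basePut.addColumn(Bytes.toBytes("d"), Bytes.toBytes("color"), Bytes.toBytes(color));
      base.put(basePut);

      // Index row: row key = the value, column qualifier = the base-table row key.
      // Every VIN with the same color lands in this one (potentially very wide) row.
      Put idxPut = new Put(Bytes.toBytes(color));
      idxPut.addColumn(Bytes.toBytes("i"), Bytes.toBytes(vin), Bytes.toBytes(vin));
      index.put(idxPut);
    }
  }
}

(Note that the two puts are not atomic; if the second one fails you're left with a base row the index doesn't know about, or vice versa.)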

But in the general case it should work OK.  Note that you can also create a compound key for the index.

As an example... you could create an index on manufacturer, model, year, color, where the value is the VIN, which would be the row key for the base table.

Then if you want to find all of the 2005 Volvo S80s on the road, you can do a partial scan of the index by setting start and stop rows.
Then filter the result set based on the state listed on the vehicle's registration.
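
As a rough sketch (the index table name "vehicles_by_mmyc", the key layout manufacturer|model|year|color, and storing the VINs as column qualifiers are all assumptions made for illustration), the partial scan could look something like:

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import java.util.ArrayList;
import java.util.List;

public class CompoundIndexScan {
  // Find the row keys (VINs) of all 2005 Volvo S80s, any color.
  static List<byte[]> findVins(Connection conn) throws Exception {
    List<byte[]> vins = new ArrayList<>();
    try (Table idx = conn.getTable(TableName.valueOf("vehicles_by_mmyc"))) {
      // Partial scan: fix manufacturer|model|year, leave color open.
      byte[] start = Bytes.toBytes("volvo|s80|2005|");
      byte[] stop  = Bytes.toBytes("volvo|s80|2005|~");  // '~' sorts after lowercase color names
      Scan scan = new Scan(start, stop);
      try (ResultScanner rs = idx.getScanner(scan)) {
        for (Result r : rs) {
          for (Cell c : r.rawCells()) {
            vins.add(CellUtil.cloneQualifier(c));        // each qualifier is a base-table row key
          }
        }
      }
    }
    return vins;
  }
}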

The idea is that you fetch the row keys from the index query's result set, and that list is what you use for your next query against the base table.

Again, there is more to this... like if you have multiple indexes on the data, you'd take the intersection of the result set(s) and then apply the filters that are not indexed.  

Each initial index lookup should normally be a simple fetch of a single row, yielding a list of row keys in the base table.
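
A rough sketch of that follow-up step, assuming you already have the row keys back from two index scans: intersect them, multi-get the base table, and apply the unindexed registration-state filter client-side (table, family and qualifier names are again hypothetical):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import java.util.*;

public class IndexLookup {
  static List<Result> fetchMatching(Connection conn,
                                    Set<String> vinsFromIndexA,
                                    Set<String> vinsFromIndexB) throws Exception {
    // Intersection of the two index result sets.
    Set<String> vins = new HashSet<>(vinsFromIndexA);
    vins.retainAll(vinsFromIndexB);

    List<Get> gets = new ArrayList<>();
    for (String vin : vins) {
      gets.add(new Get(Bytes.toBytes(vin)));
    }

    List<Result> matches = new ArrayList<>();
    try (Table base = conn.getTable(TableName.valueOf("vehicles"))) {
      for (Result r : base.get(gets)) {                  // point lookups against the base table
        byte[] state = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("reg_state"));
        if (state != null && "IL".equals(Bytes.toString(state))) {
          matches.add(r);                                // filter on the column that isn't indexed
        }
      }
    }
    return matches;
  }
}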

PLEASE NOTE THE FOLLOWING:

1) This is a general use case example.
2) YMMV based on the use case
3) YMMV based on the data contained in your underlying table
4) This is one simple way that can work with or without coprocessors
5) There is more to the solution; I'm only painting it at a very high level.

And of course I'm waiting for someone to suggest that you look at Phoenix, which can implement this (or a variation on it) to do indexing.

And of course you have other indexing options.

HTH...

-Mike

On Jun 4, 2013, at 12:30 PM, Ian Varley <[EMAIL PROTECTED]> wrote:

> Rams - you might enjoy this blog post from HBase committer Jesse Yates (from last summer):
>
> http://jyates.github.io/2012/07/09/consistent-enough-secondary-indexes.html
>
> Secondary Indexing doesn't exist in HBase core today, but there are various proposals and early implementations of it in flight.
>
> In the meantime, as Mike and others have said, if you don't need them to be immediately consistent in a real-time write scenario, you can simply write the same data into multiple tables in different sort orders. (This is hard in a real-time write scenario because, without cross-table transactions, you'd have to handle all the cases where the record was written but the index wasn't, or vice versa.)
>
> Ian
>
> On Jun 4, 2013, at 12:22 PM, Ramasubramanian Narayanan wrote:
>
> Hi Michel,
>
> If you don't mind, can you please explain this in more detail...
>
> Also, can you please let me know whether we have secondary indexes in HBase?
>
> regards,
> Rams
>
>
> On Tue, Jun 4, 2013 at 1:13 PM, Michel Segel <[EMAIL PROTECTED]> wrote:
>
> Quick and dirty...
>
> Create an inverted table for each index....
> Then you can take the intersection of the result set(s) to get your list
> of rows for further filtering.
>
> There is obviously more to this, but its the core idea...
>
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Jun 4, 2013, at 11:51 AM, Shahab Yunus <[EMAIL PROTECTED]> wrote:
>
> Just a quick thought: why don't you create different tables and duplicate
> data, i.e. go for denormalization and data redundancy? Are all your read
> access patterns that require 70 columns incorporated into one
> application/client? Or will it be a bunch of different
> clients/applications?
> If that is not the case, then I think: why not take advantage of more
> storage?