HBase, mail # user - Index building process design


Re: Index building process design
Michael Segel 2012-07-23, 23:13
Ok, I'll take a stab at the shorter one. :-)

You can create a base data table which contains your raw data.
Depending on the kind of index you want... an inverted index, for example... you can run a map/reduce job that builds a second table from it.  And a third, a fourth... one for each inverted index you want.
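
Roughly, such a map/reduce pass could look like the sketch below. The table and column names ("data", "idx_value", family "d", qualifier "v") are just placeholders for the example, and it is written against the newer HBase client/mapreduce API rather than the 0.92-era API that was current for this thread:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class BuildValueIndex {

  private static final byte[] FAM = Bytes.toBytes("d");
  private static final byte[] QUAL = Bytes.toBytes("v");

  // For every base-table row, emit one Put into the index table: the index
  // row key is the indexed value, and the column qualifier is the base-table
  // row key, so one index row fans out to all matching base rows.
  static class IndexMapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
        throws IOException, InterruptedException {
      byte[] value = row.getValue(FAM, QUAL);
      if (value == null) {
        return;                                   // nothing to index for this row
      }
      byte[] baseKey = rowKey.copyBytes();
      Put put = new Put(value);                   // index row key = the value itself
      put.addColumn(FAM, baseKey, baseKey);       // qualifier = base-table row key
      ctx.write(new ImmutableBytesWritable(value), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "build-value-index");
    job.setJarByClass(BuildValueIndex.class);

    Scan scan = new Scan();
    scan.setCaching(500);                         // bigger scanner batches for a full-table pass
    scan.setCacheBlocks(false);                   // don't churn the block cache while rebuilding

    TableMapReduceUtil.initTableMapperJob("data", scan, IndexMapper.class,
        ImmutableBytesWritable.class, Put.class, job);
    TableMapReduceUtil.initTableReducerJob("idx_value", null, job);  // identity reducer writes the Puts
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}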

When you want to find a data set based on a known value in the index, you can scan the index table, and the result set will contain a list of keys for the data in the base table.

You can then just fetch those rows from HBase.
If you are using multiple indexes, you take the intersection of the result sets, and that gives you the final set of rows to fetch.
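
And the read path just described, sketched with the same made-up table names and the same newer client API: look up each known value in the index table, intersect the sets of base-table row keys, then batch-fetch the surviving rows.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexedLookup {

  private static final byte[] FAM = Bytes.toBytes("d");

  // One index row per value; every qualifier in that row is a base-table row
  // key, so the key set for a value is just the qualifiers of its index row.
  static NavigableSet<byte[]> keysForValue(Table index, byte[] value) throws IOException {
    NavigableSet<byte[]> keys = new TreeSet<>(Bytes.BYTES_COMPARATOR);
    Result r = index.get(new Get(value).addFamily(FAM));
    if (!r.isEmpty()) {
      keys.addAll(r.getFamilyMap(FAM).keySet());
    }
    return keys;
  }

  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table index = conn.getTable(TableName.valueOf("idx_value"));
         Table data = conn.getTable(TableName.valueOf("data"))) {

      // Intersect the key sets for two indexed values (two query terms, say).
      NavigableSet<byte[]> keys = keysForValue(index, Bytes.toBytes("term1"));
      keys.retainAll(keysForValue(index, Bytes.toBytes("term2")));

      // Batch-fetch the surviving rows from the base table.
      List<Get> gets = new ArrayList<>();
      for (byte[] key : keys) {
        gets.add(new Get(key));
      }
      for (Result row : data.get(gets)) {
        if (!row.isEmpty()) {                     // index may be stale; skip missing rows
          System.out.println(Bytes.toStringBinary(row.getRow()));
        }
      }
    }
  }
}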

Not sure why you would want a second cluster. Could you expand on your use case?

On Jul 23, 2012, at 3:06 PM, Eric Czech wrote:

> Hmm, maybe that was too long -- I'll keep this one shorter I swear:
>
> Would it make sense to build indexes with two Hadoop/Hbase clusters by
> simply pointing client traffic at the cluster that is currently NOT
> building indexes via M/R jobs?  Basically, has anyone ever tried switching
> back and forth between clusters instead of building indexes on one cluster
> and copying them to another?
>
>
> On Thu, Jul 12, 2012 at 1:26 AM, Eric Czech <[EMAIL PROTECTED]> wrote:
>
>> Hi everyone,
>>
>> I have a general design question (apologies in advance if this has
>> been asked before).
>>
>> I'd like to build indexes off of a raw data store, and I'm trying to
>> think of the best way to control processing so that some part of my
>> cluster can still serve reads and writes without being heavily affected
>> by the index-building process.
>>
>> I get the sense that the typical process for this involves something
>> like the following:
>>
>> 1.  Dedicate one cluster for index building (let's call it the INDEX
>> cluster) and one for serving application reads on the indexes as well
>> as writes/reads on the raw data set (let's call it the MAIN cluster).
>> 2.  Have the raw data set replicated from the MAIN cluster to the INDEX
>> cluster.
>> 3.  On the INDEX cluster, use the replicated raw data to constantly
>> rebuild indexes and copy the new versions to the MAIN cluster,
>> overwriting the old versions if necessary.
>>
>> While conceptually simple, I can't help but wonder if it doesn't make
>> more sense to simply switch application reads / writes from one
>> cluster to another based on which one is NOT currently building
>> indexes (but still have the raw data set replicate master-master
>> between them).
>>
>> To be more clear, I'm proposing doing this:
>>
>> 1.  Have two clusters, call them CLUSTER_1 and CLUSTER_2, and have the
>> raw data set replicated master-master between them.
>> 2.  If CLUSTER_1 is currently rebuilding indexes, redirect all
>> application traffic to CLUSTER_2, including reads from the indexes as
>> well as writes to the raw data set (and vice versa).
>>
>> I know I'm not addressing a lot of details here but I'm just curious
>> if anyone has ever implemented something along these lines.
>>
>> The main advantage of what I'm proposing would be not having to copy
>> potentially massive indexes across the network, at the cost of having
>> clients not always read from the same cluster (which seems doable).
>>
>> Any advice would be much appreciated!
>>
>> Thanks
>>