You have an interesting question. I haven't come across anyone trying this approach yet. Replication is a relatively new feature so that might be one reason. But I think you are over engineering the solution here. I'll try to break it down to the best of my abilities.
1. First question to answer is - where are the indices going to be stored? Are they going to be other HBase tables (something like secondary indices)? Or are they going to be in an independent indexing system like Lucene/Solr/Elastic Search/Blurr? The number of fields you want to index on will determine that answer IMO.
2. Does the index building happen synchronously as the writes come in or is it an asynchronous process? The answer to this question will depend on the expected throughput and the cluster size. Also, the answer to this question will determine your indexing process.
a) If you are going to build indices synchronously, you have two options. First being - get your client to write to HBase as well as create the index. Second option - use coprocessors. Coprocs are also a new feature and I'm not certain yet that they'll solve your problem. Try them out though, they are pretty neat.
b) If you are going to build indices asynchronously, you'll likely be using MR for it. MR can run on the same cluster but if you pound your cluster with full table scans, your latencies for real time serving will get affected. Now, if you have relatively low throughput for data ingests, you might be able to get away with running fewer tasks in your MR job. So, data serving plus index creation can happen on the same cluster. If your throughput is high and you need MR jobs to run full force, separate the clusters out. Have a cluster to run the MR jobs and a separate cluster to serve the data. My approach in the second case would be to dump new data into HDFS directly as flat files, run MR over them to create the index and also put them into HBase for serving.
Hope that gives you some more ideas to think about.
On Wednesday, July 11, 2012 at 10:26 PM, Eric Czech wrote:
> Hi everyone,
> I have a general design question (apologies in advanced if this has
> been asked before).
> I'd like to build indexes off of a raw data store and I'm trying to
> think of the best way to control processing so some part of my cluster
> can still serve reads and writes without being affected heavily by the
> index building process.
> I get the sense that the typical process for this involves something
> like the following:
> 1. Dedicate one cluster for index building (let's call it the INDEX
> cluster) and one for serving application reads on the indexes as well
> as writes/reads on the raw data set (let's call it the MAIN cluster).
> 2. Have the raw data set replicated from the MAIN cluster to the INDEX cluster.
> 3. On the INDEX cluster, use the replicated raw data to constantly
> rebuild indexes and copy the new versions to the MAIN cluster,
> overwriting the old versions if necessary.
> While conceptually simple, I can't help but wonder if it doesn't make
> more sense to simply switch application reads / writes from one
> cluster to another based on which one is NOT currently building
> indexes (but still have the raw data set replicate master-master
> between them).
> To be more clear, I'm proposing doing this:
> 1. Have two clusters, call them CLUSTER_1 and CLUSTER_2, and have the
> raw data set replicated master-master between them.
> 2. if CLUSTER_1 is currently rebuilding indexes, redirect all
> application traffic to CLUSTER_2 including reads from the indexes as
> well as writes to the raw data set (and vise-versa).
> I know I'm not addressing a lot of details here but I'm just curious
> if anyone has ever implemented something along these lines.
> The main advantage to what I'm proposing would be not having to copy
> potentially massive indexes across the network but at the cost of
> having to deal with having clients not always read from the same
> cluster (seems doable though).