Re: Indexing w/ HBase
Michael Segel 2012-10-12, 13:00
1) What sort of indexes do you want to build?
2) Why would you want to store your indexes outside of HBase?
(OK, they are not so silly. But I don't want people to think that I'm against the idea, just that it's more of an issue of design.)
On Oct 12, 2012, at 7:03 AM, Eric Czech <[EMAIL PROTECTED]> wrote:
> Hi everyone,
> Are there any tools or libraries for managing HDFS files that are used
> solely for the purpose of creating indexes in HBase? In other words, is
> there any way to seamlessly integrate new HDFS files into a periodic
> MapReduce process that builds indexes and also reprocess those files if the
> index building logic or underlying HDFS files change?
> I'm looking for something similar to HCatalog but the limitation I find
> with it is that there's no way to rebuild parts of an index without
> deleting the old index entries or having to guarantee that the new index
> cells will completely overwrite the old ones.
> Here's an example to better explain:
> - Assume I want to build an index in HBase on HDFS files A, B, and C.
> - Let's say I build that index with a MapReduce job and then realize that
> one of the auxiliary lookup files used in that job was not completely
> - I'd like to rerun the indexing job at this point but it's entirely
> possible that the new index won't involve all the same cells as the old
> - Now, I can't delete all the old index entries before running the new job
> since that index may still be in use so there's no obvious way to update
> the index in isolation.
> The prevailing approach to solving this seems to be continually rebuilding
> the indexes in full and having a way to atomically switch the old indexes
> out with the new ones. A better approach might be to do the same thing
> with a higher granularity, and what I'm really asking is whether or not
> there is any tool that does exactly that.
> A naive approach at "versioning" like this with higher granularity might
> simply tie HDFS files to cells in HBase, give that association a version
> number, and allow clients to only read cells from HBase associated with
> active versions (as opposed to versions that are currently being inserted
> into HBase). Then the "active" version could be incremented at the end of
> a successful MapReduce index build for all files used in that job.
> If there are no existing tools for something like this, then doing what I
> mentioned above is probably the route I'll take and I'm very curious to
> hear if others are facing similar problems and whether or not a tool to
> solve them would be more widely beneficial.
> Thank you!
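
For anyone following the thread, the per-file "versioning" idea described in the quoted message can be sketched roughly as below. This is only an in-memory model of the bookkeeping, not HBase code: the class VersionedIndex and its methods (begin_build, put, commit, get) are hypothetical names invented for illustration, and a plain dict stands in for the index table. It assumes each index cell is tagged with its source HDFS file and a version number, and that readers only see cells whose version is currently active for that file.

```python
from collections import defaultdict


class VersionedIndex:
    """In-memory stand-in for an index table with per-file versions (hypothetical)."""

    def __init__(self):
        # (source_file, version) -> {index_key: value}
        self._cells = defaultdict(dict)
        # source_file -> version that readers are allowed to see
        self._active = {}

    def begin_build(self, source_files):
        """Reserve the next, not yet active, version for each file in a build."""
        pending = {f: self._active.get(f, 0) + 1 for f in source_files}
        for f, version in pending.items():
            # Discard leftovers from an earlier build that failed before commit.
            self._cells.pop((f, version), None)
        return pending

    def put(self, source_file, version, key, value):
        """Write one index cell tagged with its source file and version."""
        self._cells[(source_file, version)][key] = value

    def commit(self, pending):
        """After a successful build, activate the new versions and drop old cells."""
        for f, new_version in pending.items():
            old_version = self._active.get(f)
            self._active[f] = new_version
            if old_version is not None:
                self._cells.pop((f, old_version), None)

    def get(self, key):
        """Return the values for key, reading active versions only."""
        results = []
        for f, version in self._active.items():
            cells = self._cells.get((f, version), {})
            if key in cells:
                results.append(cells[key])
        return results


# Initial build over files A and B.
index = VersionedIndex()
build1 = index.begin_build(["A", "B"])
index.put("A", build1["A"], "user42", "row-a1")
index.put("B", build1["B"], "user42", "row-b1")
index.commit(build1)

# Rebuild file A only; the new index does not touch the same cells.
build2 = index.begin_build(["A"])
index.put("A", build2["A"], "user99", "row-a2")
before = index.get("user42")  # readers still see the old version of A
index.commit(build2)
after = index.get("user42")   # the stale cell from A is no longer visible
```

Note that commit here flips one file at a time, so it is not atomic across files. In a real deployment the active-version map would have to live somewhere shared, and how to switch several files at once is left open in this sketch.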