MapReduce >> mail # user >> Managing index generation processes
Managing index generation processes
Hi everyone,

Are there any tools or libraries for managing HDFS files that are used
solely for the purpose of creating indexes in HBase?  In other words, is
there any way to seamlessly integrate new HDFS files into a periodic
MapReduce process that builds indexes and also reprocess those files if the
index building logic or underlying HDFS files change?

I'm looking for something similar to HCatalog, but the limitation I find
with it is that there's no way to rebuild parts of an index without
deleting the old index entries or having to guarantee that the new index
cells will completely overwrite the old ones.

Here's an example to better explain:

-  Assume I want to build an index in HBase on HDFS files A, B, and C.
-  Let's say I build that index with a MapReduce job and then realize that
one of the auxiliary lookup files used in that job was not completely
correct.
-  I'd like to rerun the indexing job at this point but it's entirely
possible that the new index won't involve all the same cells as the old
index.
-  Now, I can't delete all the old index entries before running the new job,
since that index may still be in use, so there's no obvious way to update
the index in isolation.

The prevailing approach to solving this seems to be continually rebuilding
the indexes in full and having a way to atomically switch the old indexes
out with the new ones.  A better approach might be to do the same thing
at a finer granularity (per file rather than per index).  What I'm really
asking is whether there is any tool that does exactly that.
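
To make the "rebuild in full, then switch" pattern concrete, here is a minimal sketch.  It uses plain Python dictionaries as a stand-in for HBase tables, and the names (SwappableIndex, rebuild, switch) are made up for illustration; they don't come from any existing tool.

```python
class SwappableIndex:
    """Hypothetical in-memory model of full rebuild plus atomic switch."""

    def __init__(self):
        self.tables = {}    # table name -> {row_key: value}
        self.active = None  # name of the table readers should use

    def rebuild(self, name, records):
        # Build the complete replacement index off to the side.
        self.tables[name] = dict(records)

    def switch(self, name):
        # One pointer update: readers see the old index or the new one,
        # never a mixture of the two.
        previous, self.active = self.active, name
        if previous is not None and previous != name:
            del self.tables[previous]

    def get(self, row_key):
        return self.tables[self.active].get(row_key)


index = SwappableIndex()
index.rebuild("index_v1", {"row1": "from A", "row2": "from B"})
index.switch("index_v1")

# The rerun no longer produces row2, and nothing stale is left behind.
index.rebuild("index_v2", {"row1": "from A (fixed)"})
before_switch = index.get("row2")
index.switch("index_v2")
after_switch = index.get("row2")
```

The cost is that every rerun rebuilds everything, even when only one input file or lookup file changed.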

A naive approach to "versioning" like this at a finer granularity might
simply tie HDFS files to cells in HBase, give that association a version
number, and allow clients to only read cells from HBase associated with
active versions (as opposed to versions that are currently being inserted
into HBase).  Then the "active" version could be incremented at the end of
a successful MapReduce index build for all files used in that job.
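
A rough sketch of that per-file versioning idea, again with in-memory dictionaries standing in for HBase.  Everything here (VersionedIndex, begin_build, put, commit) is a hypothetical name for illustration, and it assumes the active-version map can be updated atomically, which a real HBase-backed version would have to arrange somehow.

```python
from collections import defaultdict


class VersionedIndex:
    """Hypothetical model: index cells tagged with (source file, version)."""

    def __init__(self):
        self._cells = defaultdict(dict)  # (file, version) -> {row_key: value}
        self._active = {}                # file -> version visible to readers
        self._latest = defaultdict(int)  # file -> last version handed out

    def begin_build(self, source_file):
        # Reserve a new version that readers cannot see yet.
        self._latest[source_file] += 1
        return self._latest[source_file]

    def put(self, source_file, version, row_key, value):
        self._cells[(source_file, version)][row_key] = value

    def commit(self, versions):
        # At the end of a successful job, activate the new version of
        # every file it used and drop the cells of the version it replaces.
        for source_file, version in versions.items():
            old = self._active.get(source_file)
            self._active[source_file] = version
            if old is not None and old != version:
                self._cells.pop((source_file, old), None)

    def get(self, row_key):
        # Readers only ever consult active versions.
        return [
            self._cells[(f, v)][row_key]
            for f, v in sorted(self._active.items())
            if row_key in self._cells.get((f, v), {})
        ]


index = VersionedIndex()
first = {"A": index.begin_build("A")}
index.put("A", first["A"], "row1", "old value")
index.put("A", first["A"], "row2", "only in old index")
index.commit(first)

# Rerun for file A with a corrected lookup file; row2 is no longer produced.
second = {"A": index.begin_build("A")}
index.put("A", second["A"], "row1", "new value")
during_rebuild = index.get("row1")  # readers still see the old version
index.commit(second)
after_commit = index.get("row1")
stale = index.get("row2")
```

Only file A's cells are rewritten here; index cells built from B and C would be left alone, which is the point of versioning per file rather than per index.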

If there are no existing tools for something like this, then doing what I
mentioned above is probably the route I'll take and I'm very curious to
hear if others are facing similar problems and whether or not a tool to
solve them would be more widely beneficial.

Thank you for your time, and I apologize if this might be a better question
for the HBase users list.

- Eric