Subject: Indexes

I am using Hadoop 1.0.4 and Hive 0.11.0, but I've tried and have the same problems
with Hive versions 0.10, 0.11, 0.12 and 0.13.

I am trying to create my own indexes. As I've mentioned before (24/1/14), I have created my own class derived from TableBasedIndexHandler.
I copied all the methods from CompactIndexHandler but added lots of System.out.println calls so that I
could see what was going on. So this is, effectively, an instrumented copy of CompactIndexHandler.
Of course it doesn't work. I have now built Hive 0.13 from source and investigated, and I would like to discuss a few points.
1)      The reason that SHOW INDEX and SHOW INDEX FORMATTED fail is that on line 127 of file we find this code:
    IndexType indexType = HiveIndex.getIndexTypeByClassName(indexHandlerClass);

This code fails with an NPE because the enum in the HiveIndex class includes compact and bitmap indexes only,
so the lookup returns null for any other handler class. This code can easily be fixed with something like:
    IndexType indexType = HiveIndex.getIndexTypeByClassName(indexHandlerClass);
    indexColumns.add((indexType == null) ? "" : indexType.getName());
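
The failure mode can be reproduced in isolation. The following is only a minimal sketch, not the Hive source: the enum, its constants and the handler class names (example.CompactHandler, example.BitmapHandler, com.example.MyIndexHandler) are stand-ins invented for illustration, and it assumes that the lookup returns null for a handler class it does not recognise.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexTypeLookupSketch {
    // Stand-in for the index type enum: only two built-in kinds are known.
    // The handler class names are hypothetical, not the real Hive ones.
    enum IndexType {
        COMPACT("compact", "example.CompactHandler"),
        BITMAP("bitmap", "example.BitmapHandler");

        private final String name;
        private final String handlerClass;

        IndexType(String name, String handlerClass) {
            this.name = name;
            this.handlerClass = handlerClass;
        }

        String getName() { return name; }
        String getHandlerClass() { return handlerClass; }
    }

    // Returns null when no constant matches (assumed behaviour).
    static IndexType getIndexTypeByClassName(String className) {
        for (IndexType t : IndexType.values()) {
            if (t.getHandlerClass().equals(className)) {
                return t;
            }
        }
        return null;
    }

    // Null-safe variant of the display logic: unknown handlers yield "".
    static String describe(String handlerClass) {
        IndexType t = getIndexTypeByClassName(handlerClass);
        return (t == null) ? "" : t.getName();
    }

    public static void main(String[] args) {
        List<String> indexColumns = new ArrayList<>();
        indexColumns.add(describe("example.CompactHandler"));
        // Without the null check, calling getName() here would throw an NPE.
        indexColumns.add(describe("com.example.MyIndexHandler"));
        System.out.println(indexColumns);
    }
}
```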

2)      The next problem I run into is that the generateIndexQuery method of my index class
is not being invoked. It's not hard to track this down. It's because in the IndexWhereTaskDispatcher
method createOperatorRules the code checks that the index class name is
in a list of supported indexes. It builds a list of supported indexes and puts compact and bitmap only in it.
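
The effect of such a check can be sketched as follows. This is an illustration of the pattern only, not the code in createOperatorRules: the class, the method wouldRewriteQuery and all handler class names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SupportedIndexCheckSketch {
    // Hard-coded list of supported handlers (hypothetical class names).
    static final Set<String> SUPPORTED = new HashSet<>(
            Arrays.asList("example.CompactHandler", "example.BitmapHandler"));

    // A handler outside the list is skipped, so its query-generation
    // method is never reached, however it is implemented.
    static boolean wouldRewriteQuery(String handlerClass) {
        return SUPPORTED.contains(handlerClass);
    }

    public static void main(String[] args) {
        System.out.println(wouldRewriteQuery("example.CompactHandler"));
        System.out.println(wouldRewriteQuery("com.example.MyIndexHandler"));
    }
}
```

With a check of this shape, registering a new handler class elsewhere has no effect unless the list itself is changed.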

In other words the code seems to be written quite explicitly so that it only supports bitmap and compact
indexes. It would seem that to add any more indexes you have to build your own custom version of Hive.
However, I thought otherwise after reading this page,
which has this text:
"This document explains the proposed design for adding index support to Hive (HIVE-417). Indexing is a standard database technique, but with many possible variations. Rather than trying to provide a "one-size-fits-all" index implementation, the approach we are taking is to define indexing in a pluggable manner (related to StorageHandlers) and provide one concrete indexing implementation as a reference, leaving it open for contributors to plug in other indexing schemes as time goes by."

Surely this implies that end-users can plug their own index implementations in. (Similarly chapter 8 of
the Programming Hive book gave me the same impression.) Is it just me? Have I got the
wrong end of the stick? Is the Hive implementation of indexes supposed to be
non-extensible or is it fundamentally broken?

I also have another fundamental problem.
The reason that I'm doing all this in the first place is that I want to be able to use my
indexes but without running Map/Reduce. I know that I will have to modify Hive
quite a lot to do this because it currently assumes that indexes can only be used
when running map/reduce jobs. The current compact and bitmap index implementations
require a map/reduce job and so I will have to stop them from being used when there
is no map/reduce job. My inclination would be to extend the HiveIndexHandler interface
so that there's another method boolean requiresMapReduce() which defaults to true in
the AbstractIndexHandler base class. Would this be viewed as a sensible start?
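
A minimal sketch of the shape of that change, using stand-in types rather than the real HiveIndexHandler and AbstractIndexHandler: the interface IndexHandler, the class AbstractHandler, the two example handlers and the canUse helper are all hypothetical names chosen for illustration.

```java
final class RequiresMapReduceSketch {
    // Stand-in for the handler interface, with the proposed extra method.
    interface IndexHandler {
        boolean requiresMapReduce();
    }

    // Stand-in for the abstract base class: defaults to true, so
    // existing handlers keep their current behaviour without changes.
    static abstract class AbstractHandler implements IndexHandler {
        @Override
        public boolean requiresMapReduce() {
            return true;
        }
    }

    // Behaves like the existing compact/bitmap handlers: inherits the default.
    static final class CompactLikeHandler extends AbstractHandler {
    }

    // A custom handler that can be consulted without a map/reduce job.
    static final class InProcessHandler extends AbstractHandler {
        @Override
        public boolean requiresMapReduce() {
            return false;
        }
    }

    // Hypothetical planner-side check: an index is usable if a job is
    // being planned anyway, or if the handler does not need one.
    static boolean canUse(IndexHandler handler, boolean mapReduceJobPlanned) {
        return mapReduceJobPlanned || !handler.requiresMapReduce();
    }

    public static void main(String[] args) {
        System.out.println(canUse(new CompactLikeHandler(), false));
        System.out.println(canUse(new InProcessHandler(), false));
    }
}
```

Defaulting to true in the base class means that only handlers which explicitly opt out would be considered when no map/reduce job is run.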

I'm only just starting and so I'm not really in a position to submit patches yet
but I thought that it would be sensible to see if this sort of change is going
to be acceptable.


Peter Marron
Senior Developer
Trillium Software, A Harte Hanks Company
Theale Court, 1st Floor, 11-13 High Street
+44 (0) 118 940 7609 office
+44 (0) 118 940 7699 fax