I hope this is the correct list for this question. Apologies if not.
I am using Hive 0.11.0 and Hadoop 1.0.4.
My goal is to run my Hive queries without Map/Reduce, using my custom
indexes instead. To this end I have been building Hive version 13 from source
and working through the sources to see what I can do.
The non-M/R path through Hive splits off very early. In SemanticAnalyzer.java,
if the analyzer determines that a FetchTask is sufficient for the query, the
genMapRedTasks method returns early and never gets near the code that uses
indexes.
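To make the control flow concrete, here is a toy model of the early return as I understand it. It is only a sketch: PlannerSketch, Query and the step names are hypothetical stand-ins that I made up for illustration, not Hive's real classes or the real body of genMapRedTasks.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: every name here is a hypothetical stand-in, not Hive's API.
class PlannerSketch {

    /** Hypothetical, heavily simplified description of a query. */
    static final class Query {
        final boolean fetchOnly; // true if a plain fetch can answer the query
        final boolean hasWhere;  // true if there is a predicate an index could use

        Query(boolean fetchOnly, boolean hasWhere) {
            this.fetchOnly = fetchOnly;
            this.hasWhere = hasWhere;
        }
    }

    /** Returns the ordered list of planning steps chosen for the query. */
    static List<String> plan(Query q) {
        List<String> steps = new ArrayList<>();
        if (q.fetchOnly) {
            steps.add("FetchTask");
            return steps; // early return: the index step below is never reached
        }
        if (q.hasWhere) {
            steps.add("IndexRewrite"); // only reachable on the Map/Reduce path
        }
        steps.add("MapRedTask");
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(plan(new Query(true, true)));
        System.out.println(plan(new Query(false, true)));
    }
}
```

In this model a fetch-only query never sees the index step, even when it has a predicate that an index could serve, which is the behaviour I am trying to get around.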
I have also followed the index code and I can see that in
IndexWhereProcessor.java an index can insert an "index query" task
to run before the main query. (By also calling the queryContext
setIndexInputFormat and setIndexIntermediateFile methods, it can redirect
the main query to pick up the data generated by the index.)
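The redirect mechanism can be modelled roughly as follows. Again this is a sketch under assumptions: only the two setter names come from the Hive code I mention above. The QueryContext class, addPreTask, buildPlan, and the input format name and file path used in main are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: a hypothetical stand-in for the context an index fills in.
class IndexRedirectSketch {

    static final class QueryContext {
        private String indexInputFormat;      // input format for the main query
        private String indexIntermediateFile; // where the index query writes
        private final List<String> preTasks = new ArrayList<>();

        void setIndexInputFormat(String format) { this.indexInputFormat = format; }
        void setIndexIntermediateFile(String file) { this.indexIntermediateFile = file; }
        void addPreTask(String task) { preTasks.add(task); } // hypothetical helper
    }

    /** Builds the task list: index tasks first, then the (redirected) main query. */
    static List<String> buildPlan(QueryContext ctx, String table) {
        List<String> plan = new ArrayList<>(ctx.preTasks);
        if (ctx.indexIntermediateFile != null) {
            // The index supplied its output, so the main query reads that instead.
            plan.add("main query reads " + ctx.indexIntermediateFile
                    + " via " + ctx.indexInputFormat);
        } else {
            plan.add("main query scans " + table);
        }
        return plan;
    }

    public static void main(String[] args) {
        QueryContext ctx = new QueryContext();
        ctx.addPreTask("index query");
        ctx.setIndexInputFormat("MyIndexInputFormat");      // made-up name
        ctx.setIndexIntermediateFile("/tmp/index-result");  // made-up path
        System.out.println(buildPlan(ctx, "my_table"));
    }
}
```

The point of the model is that the index does not replace the main query. It only prepends a task and changes where the main query reads from.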
So I can see two approaches to achieve my goal.
1) I can modify the FetchTask path to support the use of indexes.
2) I can allow the query to start down the Map/Reduce path and then
I can arrange for my index code to trash the original query completely and
replace it with a query that will run as a FetchTask that will do what I want.
Of course there are pros and cons to both of these approaches.
1) This approach has the advantage that I don't need to change the
current index path at all, so it's much less likely that I will
damage it. However I will probably end up replicating some of the
existing index code, which is not desirable. Also I am not sufficiently
au fait with the Hive code to feel confident that I would make such
a major change in the way that a real Hive developer might.
2) This approach has the advantage that I am building on top of the
existing index infrastructure and so I probably will end up writing
less code. However it means that my queries will run once as Map/Reduce
and again as FetchTasks, which will make them slower than
I would like. The approach is also more complicated than I would like.
I also don't really know how cleanly I can "abort" the initial query and
replace it with a FetchTask (if, indeed, this is possible).
Obviously at some point I would like for my changes to get submitted
back into the main Hive source, so I want to maximize the chances that
they will be viewed positively.
Does anyone have any opinions or advice to offer?