Hive, mail # dev - Indexes without Map/Reduce


Indexes without Map/Reduce
Peter Marron 2014-03-18, 11:38
Hi,

Hope that this is to the correct list. Apologies if not.

I am using Hive 0.11.0 and Hadoop 1.0.4.

My goal is to get my Hive queries running without Map/Reduce
but using my custom indexes. To this end I have been building Hive version 13 from source
and working through the sources to see what I can do.

I can see that the non-M/R path through Hive splits off really early.
I can see that in SemanticAnalyzer.java if it determines that a FetchTask
is sufficient for the query then the genMapRedTasks method returns really
early and it never gets near the code that uses indexes.
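The control flow described above can be illustrated roughly as follows. This is a simplified, hypothetical sketch and not Hive's actual code: the class `PlannerSketch`, the enum `Plan`, and both boolean parameters are invented stand-ins for the decision made inside genMapRedTasks.

```java
// Hypothetical stand-in for the planning decision; none of these names exist in Hive.
final class PlannerSketch {
    enum Plan { FETCH_ONLY, MAP_REDUCE_WITH_INDEX, MAP_REDUCE }

    // fetchTaskSufficient: assumed flag meaning "a FetchTask can answer the query".
    // indexApplicable: assumed flag meaning "an index could be used for the query".
    static Plan plan(boolean fetchTaskSufficient, boolean indexApplicable) {
        if (fetchTaskSufficient) {
            // Early return: the index handling below is never reached,
            // even if an index would have been applicable.
            return Plan.FETCH_ONLY;
        }
        // Only the Map/Reduce path gets as far as considering indexes.
        return indexApplicable ? Plan.MAP_REDUCE_WITH_INDEX : Plan.MAP_REDUCE;
    }
}
```

In this sketch, `PlannerSketch.plan(true, true)` yields `FETCH_ONLY`, which mirrors the problem: the index is ignored whenever the fetch-only path is chosen.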

I have also followed the code through the index code and I can see that in
IndexWhereProcessor.java an index can insert an "index query" task
to run before the main query. (By also calling the queryContext methods
setIndexInputFormat and setIndexIntermediateFile, it can redirect the
main query to pick up the data generated by the index.)
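A minimal sketch of that mechanism, under stated assumptions: the classes below are hypothetical stand-ins written only for illustration, tasks are modelled as plain strings, and only the two setter names (setIndexInputFormat, setIndexIntermediateFile) are taken from the description above. The real Hive types and signatures may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the query context an index handler fills in.
final class IndexQueryContextSketch {
    private String indexInputFormat;
    private String indexIntermediateFile;
    private final List<String> queryTasks = new ArrayList<>();

    void setIndexInputFormat(String format) { this.indexInputFormat = format; }
    void setIndexIntermediateFile(String file) { this.indexIntermediateFile = file; }
    // addQueryTask is an assumed helper name, not taken from Hive.
    void addQueryTask(String task) { queryTasks.add(task); }

    String getIndexInputFormat() { return indexInputFormat; }
    String getIndexIntermediateFile() { return indexIntermediateFile; }
    List<String> getQueryTasks() { return queryTasks; }
}

// Hypothetical stand-in for the step that wires index tasks in front of the main query.
final class IndexRewriteSketch {
    // Returns the execution order: index tasks first, then the main query.
    static List<String> planWithIndex(IndexQueryContextSketch ctx, String mainQuery) {
        List<String> order = new ArrayList<>(ctx.getQueryTasks());
        if (ctx.getIndexIntermediateFile() != null) {
            // The main query is redirected to read what the index task produced.
            order.add(mainQuery + " [input=" + ctx.getIndexIntermediateFile()
                    + ", format=" + ctx.getIndexInputFormat() + "]");
        } else {
            order.add(mainQuery);
        }
        return order;
    }
}
```

With a context that has one index task, an input format and an intermediate file set, `planWithIndex` returns the index task followed by the redirected main query; with an empty context it returns the main query unchanged.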

So I can see two approaches to achieve my goal.
1)      I can modify the FetchTask path to support the use of indexes.
2)      I can allow the query to start down the Map/Reduce path and then
        I can arrange for my index code to trash the original query completely and
        replace it with a query that will run as a FetchTask that will do what I want.

Of course there are pros and cons to both of these approaches.
1)      This approach has the advantage that I don't need to change the
        current index path at all and so it's much less likely that I will
        damage it. However I will probably end up replicating some of the
        existing index code, which is not desirable. Also I am not sufficiently
        au fait with the Hive code to feel confident that I would make such
        a major change in the way that a real Hive developer might.

2)      This approach has the advantage that I am building on top of the
        existing index infrastructure and so I will probably end up writing
        less code. However it means that my queries will run once as
        Map/Reduce and again as FetchTasks, which will make them slower than
        I would like. The approach is also more complicated than I would like,
        and I don't really know how cleanly I can "abort" the initial query and
        replace it with a FetchTask. (If, indeed, this is possible.)

Obviously at some point I would like my changes to be submitted
back into the main Hive source and so I want to maximize the chances that
they will be viewed positively.

Does anyone have any opinions or advice to offer?

Regards,

Peter Marron
Senior Developer
Trillium Software, A Harte Hanks Company
Theale Court, 1st Floor, 11-13 High Street
Theale
RG7 5AH
+44 (0) 118 940 7609 office
+44 (0) 118 940 7699 fax