I am playing with Hive indexing and a little discouraged by the gap between
the potential seen and the amount of documentation around indexing. I am
running Hive 0.9 and started playing with indexing as follows:
I have a table logs that has a bunch of fields but for this, lets say
three. sessionutc, srcip, dstip and partitioned by DAY.
CREATE TABLE logs(sessionutc STRING, srcip STRING, dstip STRING)
PARTITIONED by (day STRING)
The field I am hoping to index is srcip, so I created this:
CREATE INDEX idx_srcip ON TABLE logs(srcip) as
'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED
ALTER INDEX idx_srcip ON pcaps REBUILD;
This alter index ran overnight.... Also be warned, the data in my table is
522GB. The buidling of index took all night and wrote a ton of data (49
GB) to the hive history file in the CLI... that just aint right if you ask
me. :) Can we limit that somehow?
The actual index table ended up being around 1.8 GB on 522GB of data. That
wasn't too bad I guess.
Then time for queries... I thought Hive0.9 just supported indexes on
queries, the old school DB guy just ran a query
SELECT sessionutc, srcip, dstip FROM logs WHERE srcip='127.0.0.1'
Started running 1028 Map Tasks... obviously not usiung an index, takes
So I google, and find this:
INSERT OVERWRITE DIRECTORY "/tmp/index_result" SELECT `_bucketname` ,
`_offsets` FROM web__logs_idx_srcip__ WHERE srcip = '127.0.0.1';
SELECT sessionutc, srcip, dst IP FROM logs WHERE srcip='127.0.0.1'
43 Map task, done quickly. Indexes WORK! WOOOOOOO.
So Do Indexes work when you have a non-indexed field in your where clause.
Hypothesis: If you "and" your indexed field, it will work (like partition
SELECT sessionutc, srcip, dst IP FROM logs WHERE srcip='127.0.0.1' and
sessionutc like '2012-04-04%'
43 Map Tasks: Quick Hits Index Hypothesis: True
Hypothesis If you "or" your indexed field it will not work (like partition
SELECT sessionutc, srcip, dst IP FROM logs WHERE srcip='127.0.0.1' or
sessionutc like '2012-04-04%'
43 Map Tasks: Quick Hits Index Hypothesis: False
What the heck? HOW DOES THAT WORK? That's impossible. The data returned is
truly that data without the srcip = 127.0.0.1 It's from all partitions. I
am so confused there.
Ok: Other questions.
How are indexed updated? How does INSERT OVERWRITE or INSERT APPEND affect
currently built indexes? Are indexes rebuilt automatically or do you have
to have separate job to rebuild the indexes on partitions that you've added
data too? If index updating is automagic, does it slow INSERT times? Need
to do some testing here.
Also: Is there a better way to enable Index hitting automagically without
having to do the separate index query and the two SET commands prior to my
query I want to hit the index? I'd like to utilize the indices to help our
operations staff write better queries (read use less cluster resources) but
teaching that process may be difficult.
Any other good non-googlable sources of information on indexes, The reason
I posted my results and my questions here is the general lack of
information around this topic.