Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Indexing in hive


Copy link to this message
-
Re: Indexing in hive
Hi Ranjith,
Here are the steps for using an index in Hive 0.7.1.

1) Create the index
CREATE INDEX x ON TABLE t(j)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

2) Build the index (since you specified the 'DEFERRED REBUILD' flag in the create index statement

ALTER INDEX x ON t REBUILD;

3) Use the index
If your original query is: select a, count(*) from t where j='and' group by a;
and you wanted to use the index on column j, do something like:

INSERT OVERWRITE DIRECTORY '/tmp/indexes/x' SELECT `_bucketname`, `_offsets` FROM default__t_x__ where j='and';
(The name default__t_x__ can be found in the output of step 2. Also, /tmp/indexes directory needs to exist in HDFS. You can substitute this to be any pre-existing directory in HDFS)
SET hive.index.compact.file=/tmp/indexes/x;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
SELECT a, count(*) from t where j='and' group by a;

Since the semantics of this usage make you specify the compact file, I have not been able to figure out a way to use multiple indexes in the same query. In this case we are using the index on j, the column in the where clause.

I hope you now understand why indexing in Hive is a work in progress:-)

Good luck!
Mark
Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: oanda.com www: fxtrade.com

----- Original Message -----
From: "Ranjith Raghunath" <[EMAIL PROTECTED]>
To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
Cc: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
Sent: Wednesday, May 16, 2012 9:46:23 PM
Subject: Re: Indexing in hive

Thanks Mark, Carl, and Ransom. I really appreciate the answers here. I am using Hive 0.7.1 and currently trying to create an index to help with performance associated to a particular where clause. I have not set any properties as mentioned below. I can try the options you listed below. Based on the feedback, I do have a few questions:

1. It seems like if you optimize for a groupby you cannot optimize for the where clause. I am reading this correctly?

2. How do you build the index?

3. Does the build process create a dataset that contains the index keys along with an offset value associated with the row?

4. And I am guessing that you need to use the columns in the same order as it is defined in the index?
Thanks,
Ranjith      

----- Original Message -----
From: Mark Grover [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, May 16, 2012 07:52 PM
To: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
Cc: Zhaojun (Terry) <[EMAIL PROTECTED]>
Subject: Re: Indexing in hive

Ransom,
>From this JIRA (https://issues.apache.org/jira/browse/HIVE-1644), it looks like automatic use of indexes using hive.optimize.index.filter was introduced in Hive 0.8. However, Ranjith seems to be using Hive 0.7.1 which doesn't support those properties.

Ranjith, you need to set the appropriate priorities before calling your query to make use of indexes. Are you setting any properties? If so, what?
I will try to dig up what those properties are, in the meanwhile.

Mark

----- Original Message -----
From: "Hezhiqiang (Ransom)" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: "Zhaojun (Terry)" <[EMAIL PROTECTED]>
Sent: Wednesday, May 16, 2012 8:32:55 PM
Subject: RE: Indexing in hive
“ hive.optimize.index.filter ” is the conf automatically use indexes

If u set hive.optimize.index.groupby = true.

It will set hive.optimize.index.filter =false.

See your configurations.

And you need to build index after create index.
Best regards

Ransom.
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB