Hive >> mail # user >> Creating Indexes


Creating Indexes
Hi,

I am still having problems building my index.
In an attempt to find someone who can help me
I'll go through all the steps that I try.
1)      First I load my data into Hive.

hive> LOAD DATA INPATH 'E3/score.csv' OVERWRITE INTO TABLE score;
Loading data to table default.score
Deleted hdfs://localhost/data/warehouse/score
OK
Time taken: 7.817 seconds
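(For context, the score table must already exist before LOAD DATA will work. The schema below is only an illustrative guess; apart from Ath_Seq_Num, the real column names and types are not shown in this thread.)

```sql
-- Hypothetical schema for the score table. Only Ath_Seq_Num appears in
-- this thread; the other column is a placeholder.
CREATE TABLE score (
  Ath_Seq_Num INT,
  score_value DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```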
2)      Then I try to create the index

hive> CREATE INDEX bigIndex
    > ON TABLE score(Ath_Seq_Num)
    > AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
FAILED: Error in metadata: java.lang.RuntimeException: Please specify deferred rebuild using " WITH DEFERRED REBUILD ".
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive>
3)      OK, so it suggests that I use "WITH DEFERRED REBUILD", and so I do:
hive>
    >
    > CREATE INDEX bigIndex
    > ON TABLE score(Ath_Seq_Num)
    > AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    > WITH DEFERRED REBUILD;
OK
Time taken: 0.603 seconds
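(At this point only the index metadata exists; with DEFERRED REBUILD the index table stays empty until an ALTER INDEX ... REBUILD runs. One way to confirm the index was registered, using Hive's SHOW INDEX DDL:)

```sql
-- Lists the indexes defined on the table, including the generated index
-- table name (default__score_bigindex__) and the handler class.
SHOW FORMATTED INDEX ON score;
```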
4)      Now, to build the index I assume that I use ALTER INDEX as follows:

hive> ALTER INDEX bigIndex ON score REBUILD;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 138
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201210311448_0001, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201210311448_0001
Kill Command = /data/hadoop-1.0.3/libexec/../bin/hadoop job  -Dmapred.job.tracker=localhost:8021 -kill job_201210311448_0001
Hadoop job information for Stage-1: number of mappers: 511; number of reducers: 138
2012-10-31 15:59:27,076 Stage-1 map = 0%,  reduce = 0%
5)      This all looks promising, and after increasing my heap size to get the MapReduce job to complete, I see this an hour later:

2012-10-31 17:08:23,572 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4135.47 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 8 minutes 55 seconds 470 msec
Ended Job = job_201210311448_0001
Loading data to table default.default__score_bigindex__
Deleted hdfs://localhost/data/warehouse/default__score_bigindex__
Invalid alter operation: Unable to alter index.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

So what have I done wrong, and what am I to do to get this index to build successfully?

Any help appreciated.

Peter Marron

From: Peter Marron [mailto:[EMAIL PROTECTED]]
Sent: 24 October 2012 13:27
To: [EMAIL PROTECTED]
Subject: RE: Indexes

Hi Shreepadma,

Thanks for this. Looks exactly like the information I need.
I was going to reply once I had tried it all out, but I'm having
problems creating the index at the moment (I'm getting an
OutOfMemoryError). So I thought that I had better reply now
to say thank you.

Peter Marron
From: Shreepadma Venugopalan [mailto:[EMAIL PROTECTED]]
Sent: 23 October 2012 19:49
To: [EMAIL PROTECTED]
Subject: Re: Indexes

Hi Peter,

Indexing support was added to Hive in 0.7, and in 0.8 the query compiler was enhanced to optimize some classes of queries (certain group-bys and joins) using indexes. Assuming you are using the built-in index handler, you need to do the following _after_ you have created and rebuilt the index:

SET hive.index.compact.file='/tmp/index_result';
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

You will then notice a speed-up for a query of the form:

select count(*) from tab where indexed_col = some_val
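(Putting Shreepadma's steps together with the CREATE/REBUILD from earlier in the thread, the full workflow sketched here is as follows; paths and the handler class are as quoted above, and the value in the final query is purely illustrative.)

```sql
-- 1) Create a compact index with deferred rebuild
CREATE INDEX bigIndex
ON TABLE score(Ath_Seq_Num)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

-- 2) Populate the index table (runs a MapReduce job)
ALTER INDEX bigIndex ON score REBUILD;

-- 3) Point Hive at the index result and switch the input format
SET hive.index.compact.file='/tmp/index_result';
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

-- 4) A query filtering on the indexed column can now use the index
SELECT count(*) FROM score WHERE Ath_Seq_Num = 12345;
```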

Thanks,
Shreepadma

On Tue, Oct 23, 2012 at 5:44 AM, Peter Marron <[EMAIL PROTECTED]> wrote:
Hi,

I'm very much a Hive newbie but I've been looking at HIVE-417 and this page in particular:
http://cwiki.apache.org/confluence/display/Hive/IndexDev
Using this information I've been able to create an index (using Hive 0.8.1)
and when I look at the contents it all looks very promising indeed.
However on the same page there's this comment:

"...This document currently only covers index creation and maintenance. A follow-on will explain how indexes are used to optimize queries (building on FilterPushdownDev <https://cwiki.apache.org/confluence/display/Hive/FilterPushdownDev>)...."

However I can't find the "follow-on" which tells me how to exploit the index that I've
created to "optimize" subsequent queries.
Now I've been told that I can create and use indexes with the current
release of Hive _without_ writing and developing any Java code of my own.
Is this true? If so, how?

Any help appreciated.

Peter Marron.