Hive, mail # user - Indexes, again


Indexes, again
Peter Marron 2014-01-27, 14:30
Hi,

I am using Hadoop 1.0.4 and Hive 0.11.0.

I am trying to create my own indexes. Given the problems that I have had in the past, I thought
it best to take things slowly. So I created my own class, derived from TableBasedIndexHandler.
I copied all the methods from CompactIndexHandler but added lots of System.out.println calls so that I
could see what was going on. So this is, effectively, an instrumented copy of CompactIndexHandler.
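
For readers unfamiliar with the approach, the following is a minimal sketch of what such an instrumented handler might look like. It is not the actual ProfilerIndex class. StandInIndexHandler is a hypothetical stand-in for Hive's TableBasedIndexHandler so that the example compiles on its own, the String parameters replace the metastore Table and Index objects, and TracingIndexHandler and its trace list are names invented for illustration. The printed messages mirror the kind of tracing shown in the transcripts further down.

```java
// Hypothetical stand-in for Hive's TableBasedIndexHandler, so the sketch
// compiles without Hive on the classpath. The real class has more methods.
abstract class StandInIndexHandler {
    public boolean usesIndexTable() {
        return false;
    }

    // Assumption: plain strings stand in for the metastore Table and Index
    // objects that the real method receives.
    public void analyzeIndexDefinition(String baseTable, String index, String indexTable) {
    }
}

// Hypothetical instrumented handler: every overridden method announces itself.
class TracingIndexHandler extends StandInIndexHandler {
    // Trace lines are kept in memory as well as printed, so they can be inspected.
    final java.util.List<String> trace = new java.util.ArrayList<>();

    private void log(String message) {
        trace.add(message);
        System.out.println(message);
    }

    @Override
    public boolean usesIndexTable() {
        log("My usesIndexTable - returning true!");
        return true;
    }

    @Override
    public void analyzeIndexDefinition(String baseTable, String index, String indexTable) {
        log("My analyzeIndexDefinition IN");
        log("table ->" + baseTable + "<-");
        log("index ->" + index + "<-");
        if (usesIndexTable()) {
            log("Going into the branch");
            // The logic copied from CompactIndexHandler would set up the index table here.
        }
        log("My analyzeIndexDefinition OUT");
    }
}
```

Calling analyzeIndexDefinition on a TracingIndexHandler instance prints one line per step, which is how the absence of any output can later be read as "this code was never reached".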

When I try to create an index using 'compact', most things seem to work:

hive> DROP INDEX champions_attendance ON champions;
OK
Time taken: 0.139 seconds
hive> CREATE INDEX champions_attendance ON TABLE champions(attendance) AS 'compact' WITH DEFERRED REBUILD;
OK
Time taken: 0.173 seconds
hive> SHOW INDEX ON champions;
OK
champions_attendance    champions               attendance              default__champions_champions_attendance__       compact
Time taken: 0.073 seconds, Fetched: 1 row(s)
hive> SHOW FORMATTED INDEX ON champions;
OK
idx_name                tab_name                col_names               idx_tab_name            idx_type                comment
champions_attendance    champions               attendance              default__champions_champions_attendance__       compact
Time taken: 0.067 seconds, Fetched: 4 row(s)
hive>

However, when I try the same thing with my own class, things start out promisingly:

Time taken: 0.149 seconds
hive> CREATE INDEX champions_attendance ON TABLE champions (attendance) AS 'com.trilliumsoftware.profiling.index.ProfilerIndex' WITH DEFERRED REBUILD;
My usesIndexTable - returning true!
My analyzeIndexDefinitionYYY
table ->Table(tableName:champions, dbName:default, owner:pmarron, createTime:1390214100, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:year, type:string, comment:null), FieldSchema(name:home, type:string, comment:null), FieldSchema(name:away, type:string, comment:null), FieldSchema(name:score, type:string, comment:null), FieldSchema(name:venue, type:string, comment:null), FieldSchema(name:attendance, type:string, comment:null)], location:hdfs://hpcluster1/user/pmarron/Ex/data, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1390214100}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)<-
index ->Index(indexName:champions_attendance, indexHandlerClass:com.trilliumsoftware.profiling.index.ProfilerIndex, dbName:default, origTableName:champions, createTime:1390832429, lastAccessTime:1390832429, indexTableName:default__champions_champions_attendance__, sd:StorageDescriptor(cols:[FieldSchema(name:attendance, type:string, comment:null)], location:null, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:null, sortCols:[Order(col:attendance, order:1)], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), parameters:{}, deferredRebuild:true)<-
My usesIndexTable - returning true!
usesIndexTable ->true<-
indexTable ->Table(tableName:default__champions_champions_attendance__, dbName:default, owner:null, createTime:0, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[], location:null, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{}, viewOriginalText:null, viewExpandedText:null, tableType:INDEX_TABLE)<-
storageDesc ->StorageDescriptor(cols:[FieldSchema(name:attendance, type:string, comment:null)], location:null, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:null, sortCols:[Order(col:attendance, order:1)], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false)<-
My usesIndexTable - returning true!
Going into the branch
My analyzeIndexDefinition OUT
My usesIndexTable - returning true!
OK
Time taken: 0.263 seconds
hive>
    >
But then things seem to go wrong.
Time taken: 0.149 seconds
    > SHOW INDEX ON champions;
FAILED: Error in metadata: java.lang.NullPointerException
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive>

I have instrumented all of the method calls, so the fact that I don't see any tracing suggests that none
of my code is on the path that makes this fail. So I am at a loss to know where to start.
Is there some other sort of registration of my index handler class that I have to make somewhere?

If I ignore this error and carry on, then the command

                ALTER INDEX champions_attendance ON champions REBUILD;

seems to succeed _and_ build an index. However when I issue a query o