Indexes, again
Hi,

I am using Hadoop 1.0.4 and Hive 0.11.0.

I am trying to create my own indexes. Given the problems I have had in the past, I thought it
best to take things slowly, so I created my own class derived from TableBasedIndexHandler.
I copied all the methods from CompactIndexHandler, but added lots of System.out.printlns so that
I could see what was going on. It is, effectively, an instrumented copy of CompactIndexHandler.
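
In case it helps, the class looks roughly like this. This is only a trimmed sketch, not the full source: the bodies copied from CompactIndexHandler and the index-build/query-rewrite methods are omitted, and the imports/signatures are simply the ones I believe Hive 0.11 declares for index handlers.

package com.trilliumsoftware.profiling.index;

import org.apache.hadoop.hive.metastore.api.Index;
import org.apache.hadoop.hive.metastore.api.StorageDescriptor;
import org.apache.hadoop.hive.metastore.api.Table;
import org.apache.hadoop.hive.ql.index.TableBasedIndexHandler;
import org.apache.hadoop.hive.ql.metadata.HiveException;

// Instrumented copy of CompactIndexHandler: same logic, plus System.out.println
// tracing so I can see which methods Hive actually calls.
public class ProfilerIndex extends TableBasedIndexHandler {

  @Override
  public boolean usesIndexTable() {
    System.out.println("My usesIndexTable - returning true!");
    return true;
  }

  @Override
  public void analyzeIndexDefinition(Table baseTable, Index index,
      Table indexTable) throws HiveException {
    System.out.println("My analyzeIndexDefinitionYYY");
    System.out.println("table ->" + baseTable + "<-");
    System.out.println("index ->" + index + "<-");

    StorageDescriptor storageDesc = index.getSd();
    System.out.println("usesIndexTable ->" + usesIndexTable() + "<-");
    System.out.println("indexTable ->" + indexTable + "<-");
    System.out.println("storageDesc ->" + storageDesc + "<-");

    if (usesIndexTable() && indexTable != null) {
      System.out.println("Going into the branch");
      // ... body copied from CompactIndexHandler (builds the index table's
      // storage descriptor from a copy of the index's) ...
    }
    System.out.println("My analyzeIndexDefinition OUT");
  }

  // The remaining methods (index-build task generation, query rewriting,
  // query size check) are likewise copied from CompactIndexHandler with
  // printlns added; omitted here for brevity.
}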

When I try to create an index using the built-in 'compact' handler, most things seem to work:

> DROP INDEX champions_attendance ON champions;
OK
Time taken: 0.139 seconds
hive> CREATE INDEX champions_attendance ON TABLE champions(attendance) AS 'compact' WITH DEFERRED REBUILD;
OK
Time taken: 0.173 seconds
hive> SHOW INDEX ON champions;
OK
champions_attendance    champions               attendance              default__champions_champions_attendance__       compact
Time taken: 0.073 seconds, Fetched: 1 row(s)
hive> SHOW FORMATTED INDEX ON champions;
OK
idx_name                tab_name                col_names               idx_tab_name            idx_type                comment
champions_attendance    champions               attendance              default__champions_champions_attendance__       compact
Time taken: 0.067 seconds, Fetched: 4 row(s)
hive>

However, when I try the same thing with my class, things start out promisingly:

Time taken: 0.149 seconds
hive> CREATE INDEX champions_attendance ON TABLE champions (attendance) AS 'com.trilliumsoftware.profiling.index.ProfilerIndex' WITH DEFERRED REBUILD;
My usesIndexTable - returning true!
My analyzeIndexDefinitionYYY
table ->Table(tableName:champions, dbName:default, owner:pmarron, createTime:1390214100, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:year, type:string, comment:null), FieldSchema(name:home, type:string, comment:null), FieldSchema(name:away, type:string, comment:null), FieldSchema(name:score, type:string, comment:null), FieldSchema(name:venue, type:string, comment:null), FieldSchema(name:attendance, type:string, comment:null)], location:hdfs://hpcluster1/user/pmarron/Ex/data, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1390214100}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)<-
index ->Index(indexName:champions_attendance, indexHandlerClass:com.trilliumsoftware.profiling.index.ProfilerIndex, dbName:default, origTableName:champions, createTime:1390832429, lastAccessTime:1390832429, indexTableName:default__champions_champions_attendance__, sd:StorageDescriptor(cols:[FieldSchema(name:attendance, type:string, comment:null)], location:null, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:null, sortCols:[Order(col:attendance, order:1)], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), parameters:{}, deferredRebuild:true)<-
My usesIndexTable - returning true!
usesIndexTable ->true<-
indexTable ->Table(tableName:default__champions_champions_attendance__, dbName:default, owner:null, createTime:0, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[], location:null, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{}, viewOriginalText:null, viewExpandedText:null, tableType:INDEX_TABLE)<-
storageDesc ->StorageDescriptor(cols:[FieldSchema(name:attendance, type:string, comment:null)], location:null, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:null, sortCols:[Order(col:attendance, order:1)], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false)<-
My usesIndexTable - returning true!
Going into the branch
My analyzeIndexDefinition OUT
My usesIndexTable - returning true!
OK
Time taken: 0.263 seconds
hive>

But then things seem to go wrong.

Time taken: 0.149 seconds
    > SHOW INDEX ON champions;
FAILED: Error in metadata: java.lang.NullPointerException
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive>

I have instrumented all of the method calls, so the fact that I don't see any tracing suggests that
none of my code is on the path that makes this fail. So I am at a loss to know where to start.
Is there some other sort of registration of my index handler class that I have to make somewhere?

If I ignore this error and carry on, then the command

                ALTER INDEX champions_attendance ON champions REBUILD;

seems to succeed _and_ build an index. However when I issue a query o