Problem with Custom InputFormat
Hi,

I seem to have a problem getting Hive to use a custom InputFormat.

I am using Hive version 0.10.0 with Hadoop 1.0.4 on CentOS 6.3,
currently in standalone mode. At this stage I am just experimenting.

I have a file with 10 records which I am using for testing.
I've created a table called zownvehead to access this file.
So if I do
                select * from zownvehead;
I get the 10 records, and if I do
                select count(1) from zownvehead;
then I get the result 10. No surprises.

Now I've created my own class:

        package com.trilliumsoftware.loader.duality;

        public class WrappedInputFormat
                implements InputFormat<LongWritable, Text>, JobConfigurable {

I've written this class to restrict the number of records. Specifically, instead of
returning splits covering the whole file, the getSplits method returns two splits which
effectively limit the data scanned to two records instead of 10.
(Inside my class I create an instance of TextInputFormat and delegate all the calls to
it, apart from getSplits, where I call the method on TextInputFormat and then use the
result to build two new FileSplits, which I return instead.)
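In outline the class looks something like this (a sketch only: the fixed 40-byte
record length and the split arithmetic are illustrative stand-ins, not my real code):

        package com.trilliumsoftware.loader.duality;

        import java.io.IOException;

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.FileSplit;
        import org.apache.hadoop.mapred.InputFormat;
        import org.apache.hadoop.mapred.InputSplit;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.JobConfigurable;
        import org.apache.hadoop.mapred.RecordReader;
        import org.apache.hadoop.mapred.Reporter;
        import org.apache.hadoop.mapred.TextInputFormat;

        public class WrappedInputFormat
                implements InputFormat<LongWritable, Text>, JobConfigurable {

            // Everything is delegated to an ordinary TextInputFormat...
            private final TextInputFormat delegate = new TextInputFormat();

            public void configure(JobConf job) {
                delegate.configure(job);
            }

            // ...except getSplits, which cuts the delegate's answer down to
            // two small FileSplits so that only two records get scanned.
            public InputSplit[] getSplits(JobConf job, int numSplits)
                    throws IOException {
                FileSplit whole = (FileSplit) delegate.getSplits(job, numSplits)[0];
                Path path = whole.getPath();
                long recLen = 40L; // illustrative record length
                return new InputSplit[] {
                    new FileSplit(path, 0L, recLen, whole.getLocations()),
                    new FileSplit(path, recLen, recLen, whole.getLocations())
                };
            }

            public RecordReader<LongWritable, Text> getRecordReader(
                    InputSplit split, JobConf jc, Reporter rprtr)
                    throws IOException {
                return delegate.getRecordReader(split, jc, rprtr);
            }
        }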
I delete the table and re-create it with the following:

                CREATE EXTERNAL TABLE zownvehead (PID STRING,
                ... lots of other columns elided...
                AHM_STAT_CODE STRING)
                ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
                STORED AS
                INPUTFORMAT 'com.trilliumsoftware.loader.duality.WrappedInputFormat'
                OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
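(For reference, the class has to be visible to Hive, e.g. via something like the
following, where the path is just a placeholder rather than my real one.)

                ADD JAR /path/to/my-inputformat.jar;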

Now when I perform
                select * from zownvehead;
then, much to my delight, I see only the two records.
However when I perform
                select count(1) from zownvehead;
I get the result 10, and not the 2 that I would expect.

So the results of the two queries are inconsistent.

When I investigate I can see that, in the second query, the class
CombineHiveInputFormat is being used. I can see that an
instance of my class WrappedInputFormat is being constructed
and configured. I can also see that, when the query runs, this
instance of my class is being used to obtain a record reader;
that is, the

        public RecordReader<LongWritable, Text> getRecordReader(
                InputSplit split, JobConf jc, Reporter rprtr) throws IOException {

method is being invoked. However, the getSplits method
is _not_ being invoked, and the split being passed to getRecordReader is
a FileSplit (or a derived class) covering the whole file.
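(This is roughly how I can see the whole-file split arriving; the println is
illustrative, and delegate is the wrapped TextInputFormat from the sketch above:)

        public RecordReader<LongWritable, Text> getRecordReader(
                InputSplit split, JobConf jc, Reporter rprtr) throws IOException {
            FileSplit fs = (FileSplit) split;
            // For the count(1) query this prints the whole file's length,
            // not the two-record lengths my getSplits would have produced.
            System.err.println("getRecordReader: " + fs.getPath()
                    + " start=" + fs.getStart() + " length=" + fs.getLength());
            return delegate.getRecordReader(split, jc, rprtr);
        }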

I've had a look at the source of CombineHiveInputFormat and it
seems to look up, based on the path, the InputFormat class on which
to invoke getSplits. But I can't see why it might get this wrong, or
what I can do to help it get it right. I suppose that I could build
my own version of Hive with instrumentation to see exactly
what's going on, but I'd like to avoid that if I can.
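(I also see that there is a hive.input.format property, which on my installation
evidently resolves to CombineHiveInputFormat. Presumably switching it, e.g.

                set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

would sidestep the combining wrapper altogether, but that feels like a workaround
rather than an explanation.)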

So can anyone tell me why CombineHiveInputFormat is not calling
the getSplits of the class it wraps? And why does this only seem to happen when a
Map/Reduce job is required? And, most importantly, what do I have to
do to get this to work the way that I expect?

Any help or comments would be welcome.

Peter Marron
Trillium Software UK Limited

Tel : +44 (0) 118 940 7609
Fax : +44 (0) 118 940 7699
E: [EMAIL PROTECTED]
