
Problem with Custom InputFormat

I seem to have a problem getting Hive to use a custom InputFormat.

I am using Hive version 0.10.0 with Hadoop 1.0.4 on CentOS 6.3,
currently in standalone mode. At this stage I am just experimenting.

I have a file with 10 records which I am using for testing.
I've created a table called zownvehead to access this file.
So if I do
                select * from zownvehead;
I get the 10 records and if I do
                select count(1) from zownvehead;
then I get the result 10. No surprises.

Now I've created my own class

       package com.trilliumsoftware.loader.duality;
       public class WrappedInputFormat implements InputFormat<LongWritable, Text>, JobConfigurable {

I've written this class to restrict the number of records. Specifically, in the getSplits method, instead of
returning splits that cover the whole file, I return two splits which effectively limit the data scanned to two records instead of 10.
(Inside my class I create an instance of TextInputFormat and delegate all calls to that instance apart
from getSplits. There I call getSplits on the TextInputFormat and use the result to build two new FileSplits, which I return instead.)
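The narrowing step in getSplits can be sketched roughly as follows. This is a simplified illustration and not the actual WrappedInputFormat: it uses a stand-in Split class in place of Hadoop's FileSplit so that it runs without Hadoop on the classpath, and the names SplitNarrowing and narrow, the file name, and the byte sizes are all hypothetical. How a byte range maps to a number of records depends on the record reader, so the sizes here are placeholders only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SplitNarrowing {

    /** Stand-in for Hadoop's FileSplit (assumption): a byte range within one file. */
    static final class Split {
        final String path;
        final long start;
        final long length;

        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return path + ":" + start + "+" + length;
        }
    }

    /**
     * Takes the splits the delegate would have returned and builds at most
     * `count` consecutive, narrower splits of up to `bytesEach` bytes each,
     * starting at the beginning of the first delegate split.
     */
    static List<Split> narrow(List<Split> delegateSplits, int count, long bytesEach) {
        List<Split> result = new ArrayList<>();
        if (delegateSplits.isEmpty() || count <= 0 || bytesEach <= 0) {
            return result;
        }
        Split first = delegateSplits.get(0);
        long end = first.start + first.length;
        long offset = first.start;
        for (int i = 0; i < count && offset < end; i++) {
            // Never run past the end of the range the delegate reported.
            long len = Math.min(bytesEach, end - offset);
            result.add(new Split(first.path, offset, len));
            offset += len;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical input: one split covering a 1000-byte file.
        List<Split> whole = Collections.singletonList(new Split("data.txt", 0L, 1000L));
        for (Split s : narrow(whole, 2, 100L)) {
            System.out.println(s);
        }
    }
}
```

In the real class the same idea would apply to the InputSplit[] returned by the TextInputFormat delegate, with the new FileSplits built from the path and offsets of the originals.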
I then delete the table and re-create it with the following:

                CREATE EXTERNAL TABLE zownvehead (PID STRING,
                ... lots of other columns elided...
                AHM_STAT_CODE STRING)
                STORED AS
                INPUTFORMAT 'com.trilliumsoftware.loader.duality.WrappedInputFormat'
                OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Now when I perform
                select * from zownvehead;
then, much to my delight, I see only the two records.
However when I perform
                select count(1) from zownvehead;
I get the result 10 and not 2, as I would expect.

So the results of the two queries are inconsistent.

When I investigate I can see that, in the second query, the class
CombineHiveInputFormat is being used. I can see that an
instance of my class WrappedInputFormat is being constructed
and configured. I can also see that when the query runs this
instance of my class is being used to obtain a record reader
(that is, the

       public RecordReader<LongWritable,Text> getRecordReader(InputSplit split, JobConf jc, Reporter rprtr) throws IOException {

method is being invoked). However, the getSplits method
is _not_ being invoked, and the split being passed to the getRecordReader method is
a FileSplit (or derived class) for the whole file.

I've had a look at the source of CombineHiveInputFormat and it
seems to be looking for an InputFormat class on which to invoke getSplits
based on the path. But I can't see why it might get it wrong, or
what I can do to help it get it right. I suppose that I could build
my own version of Hive with instrumentation to see exactly
what's going on, but I'd like to avoid that if I can.

So can anyone tell me why CombineHiveInputFormat is not calling
getSplits on my wrapped class? And why this only seems to happen when a
Map/Reduce job is required? And, most importantly, what do I have to
do to get it to work the way that I expect?

Any help or comments would be welcome.

Peter Marron
Trillium Software UK Limited

Tel : +44 (0) 118 940 7609
Fax : +44 (0) 118 940 7699