Hadoop, mail # user - help on CombineFileInputFormat


Re: help on CombineFileInputFormat
Aaron Kimball 2010-05-10, 09:12
Zhenyu,

It's a bit complicated and involves some layers of
indirection. CombineFileRecordReader is a sort of shell RecordReader that
passes the actual work of reading records to another child record reader.
That's the class name provided in the third parameter. Instructing it to use
CombineFileRecordReader again as its child RR doesn't tell it to do anything
useful. You must give it the name of another RecordReader class that
actually understands how to parse your particular records.

Unfortunately, TextInputFormat's LineRecordReader and
SequenceFileInputFormat's SequenceFileRecordReader both require the
InputSplit to be a FileSplit. So you can't use them directly.
(CombineFileInputFormat will pass a CombineFileSplit to the
CombineFileRecordReader which is then passed along to the child RR that you
specify.)

In Sqoop I got around this by creating (another!) indirection class called
CombineShimRecordReader.

The export functionality of Sqoop uses CombineFileInputFormat to allow the
user to specify the number of map tasks; it then organizes a set of input
files into that many tasks. This instantiates a CombineFileRecordReader
configured to forward its InputSplit to CombineShimRecordReader.
CombineShimRecordReader then translates the CombineFileSplit into a regular
FileSplit and forwards that to LineRecordReader (for text) or
SequenceFileRecordReader (for SequenceFiles). The grandchild (LineRR or
SequenceFileRR) is determined on a file-by-file basis by
CombineShimRecordReader, by calling a static method of Sqoop's
ExportJobBase.
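
As a rough illustration of that chain, here is a minimal sketch in plain Java.
None of these types are the real Hadoop or Sqoop classes: SingleFileSplit,
CombinedSplit, Reader, FakeLineReader, ShimReader, ShellReader and ShimDemo are
hypothetical stand-ins, the "file contents" are made up, and the shell takes a
factory function where Hadoop takes a Class object. It only shows the shape of
the indirection: shell -> shim -> reader that understands a single file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiFunction;

// Stand-in types only; these are NOT the Hadoop or Sqoop classes.

/** One file's worth of input (plays the role of FileSplit). */
final class SingleFileSplit {
    final String path;
    SingleFileSplit(String path) { this.path = path; }
}

/** Several files grouped into one task (plays the role of CombineFileSplit). */
final class CombinedSplit {
    final List<String> paths;
    CombinedSplit(String... paths) { this.paths = Arrays.asList(paths); }
}

/** Minimal reader contract: returns null once the input is exhausted. */
interface Reader {
    String next();
}

/** Grandchild: only accepts a SingleFileSplit, as LineRecordReader only accepts a FileSplit. */
final class FakeLineReader implements Reader {
    private final Iterator<String> lines;
    FakeLineReader(SingleFileSplit split) {
        // Made-up content; a real reader would open split.path and parse it.
        lines = Arrays.asList(split.path + ":line1", split.path + ":line2").iterator();
    }
    public String next() { return lines.hasNext() ? lines.next() : null; }
}

/** Shim: takes (combined split, index), builds a single-file split, delegates. */
final class ShimReader implements Reader {
    private final Reader delegate;
    ShimReader(CombinedSplit split, int index) {
        SingleFileSplit one = new SingleFileSplit(split.paths.get(index));
        // A per-file choice between text and SequenceFile readers would go here.
        delegate = new FakeLineReader(one);
    }
    public String next() { return delegate.next(); }
}

/** Shell: walks the files of a combined split, creating one child reader per file. */
final class ShellReader implements Reader {
    private final CombinedSplit split;
    private final BiFunction<CombinedSplit, Integer, Reader> childFactory;
    private int index = 0;
    private Reader current;

    ShellReader(CombinedSplit split, BiFunction<CombinedSplit, Integer, Reader> childFactory) {
        this.split = split;
        this.childFactory = childFactory;
    }

    public String next() {
        while (true) {
            if (current == null) {
                if (index >= split.paths.size()) return null; // no files left
                current = childFactory.apply(split, index++);
            }
            String rec = current.next();
            if (rec != null) return rec;
            current = null; // this file is done; move on to the next one
        }
    }
}

final class ShimDemo {
    /** Reads every record of every file in the combined split, in order. */
    static List<String> readAll(CombinedSplit split) {
        Reader shell = new ShellReader(split, ShimReader::new);
        List<String> out = new ArrayList<>();
        for (String rec = shell.next(); rec != null; rec = shell.next()) out.add(rec);
        return out;
    }

    public static void main(String[] args) {
        for (String rec : readAll(new CombinedSplit("a.txt", "b.txt"))) {
            System.out.println(rec);
        }
    }
}
```

In this sketch, handing the shell a factory that builds another ShellReader
would never reach code that reads a file, which is the analogue of passing
CombineFileRecordReader.class as the child class.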

You can take a look at the source of these classes here:

* http://github.com/cloudera/sqoop/blob/master/src/shims/common/org/apache/hadoop/sqoop/mapreduce/ExportInputFormat.java
* http://github.com/cloudera/sqoop/blob/master/src/shims/common/org/apache/hadoop/sqoop/mapreduce/CombineShimRecordReader.java
* http://github.com/cloudera/sqoop/blob/master/src/java/org/apache/hadoop/sqoop/mapreduce/ExportJobBase.java

(apologies for the lengthy URLs; you could also just download the whole
project's source at http://github.com/cloudera/sqoop) :)

Cheers,
- Aaron
On Thu, May 6, 2010 at 7:32 AM, Zhenyu Zhong <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I tried to use CombineFileInputFormat in 0.20.2. It seems I need to extend
> it because it is an abstract class.
> However, I need to implement getRecordReader method in the extended class.
>
> May I ask how to implement this getRecordReader method?
>
> I tried to do something like this:
>
> public RecordReader getRecordReader(InputSplit genericSplit, JobConf job,
>     Reporter reporter) throws IOException {
>   // TODO Auto-generated method stub
>   reporter.setStatus(genericSplit.toString());
>   return new CombineFileRecordReader(job, (CombineFileSplit) genericSplit,
>       reporter, CombineFileRecordReader.class);
> }
>
> It doesn't seem to be working. I would greatly appreciate it if someone
> could shed some light on this.
>
> thanks
> zhenyu
>