MapReduce, mail # user - Does anyone have sample code for forcing a custom InputFormat to use a small split
Does anyone have sample code for forcing a custom InputFormat to use a small split
Steve Lewis 2011-09-12, 02:27
I have a problem where there is a single, relatively small (10-20 MB) input
file. (As it happens, it is a FASTA file, which will have meaning if you are
a biologist.) I am already using a custom InputFormat and a custom
RecordReader to force custom parsing. The file may generate tens or hundreds
of millions of key/value pairs, and the mapper does a fair amount of work on
each record.
The standard implementation of

    public List<InputSplit> getSplits(JobContext job) throws IOException

uses fs.getFileBlockLocations(file, 0, length) to determine the blocks, and
for a file of this size it will come up with a single InputSplit and
therefore a single mapper.
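The arithmetic such an override would need can be sketched in plain Java. The class and method names below are illustrative only, not Hadoop API; in a real FileInputFormat subclass, each (start, length) pair would become a FileSplit(path, start, length, hosts) returned from getSplits(JobContext):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: chop a file of fileLength bytes into at most
// numSplits byte ranges, ignoring the HDFS block layout entirely.
public class ForcedSplits {

    /** Returns {start, length} pairs covering the file exactly once. */
    public static long[][] computeSplits(long fileLength, int numSplits) {
        long chunk = (fileLength + numSplits - 1) / numSplits; // ceiling division
        List<long[]> out = new ArrayList<>();
        for (long start = 0; start < fileLength; start += chunk) {
            out.add(new long[] { start, Math.min(chunk, fileLength - start) });
        }
        return out.toArray(new long[0][]);
    }

    public static void main(String[] args) {
        // e.g. a 10 MB file forced into 4 splits of 2.5 MB each
        for (long[] s : computeSplits(10_000_000L, 4)) {
            System.out.println(s[0] + " + " + s[1]);
        }
    }
}
```

Note that splitting on raw byte offsets only works if the RecordReader knows how to skip forward to the next record boundary (here, the next FASTA header line), the same way LineRecordReader skips to the next newline.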

I am looking for a good example of forcing the generation of multiple
InputSplits for a small file. In this case I am happy if every Mapper
instance is required to read and parse the entire file, as long as I can
guarantee that every record is processed by exactly one mapper.
While I think I see how I might modify getSplits(JobContext job), I am
not sure how and when that code is called while the job is running on the
cluster.
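The "every mapper parses the whole file" scheme described above can be sketched without any Hadoop classes; everything below is illustrative only. The idea is that getSplits() would emit N splits each carrying just an index i, and each mapper's RecordReader would parse the full file but keep record r only when r % N == i, so every record reaches exactly one mapper:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo of modulo-based record filtering. In a real job,
// numSplits and splitIndex would come from the InputSplit handed to
// each mapper's RecordReader, not from method arguments.
public class ModuloFilterDemo {

    /** Record indices that the mapper with the given split index keeps. */
    public static List<Integer> recordsForSplit(int totalRecords,
                                                int numSplits,
                                                int splitIndex) {
        List<Integer> kept = new ArrayList<>();
        for (int r = 0; r < totalRecords; r++) {
            if (r % numSplits == splitIndex) { // exactly one index matches each r
                kept.add(r);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // 10 records across 3 "mappers": together they cover each record once.
        for (int i = 0; i < 3; i++) {
            System.out.println("split " + i + " -> " + recordsForSplit(10, 3, i));
        }
    }
}
```

The trade-off is that every mapper pays the full parse cost of the file, which is acceptable here only because the file is small and the per-record map work dominates.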

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com