MapReduce user mailing list


Does anyone have sample code for forcing a custom InputFormat to use a small split
I have a problem where there is a single, relatively small (10-20 MB) input
file. (It happens to be a FASTA file, which will have meaning if you are a
biologist.) I am already using a custom InputFormat and a custom RecordReader
to force custom parsing. The file may generate tens or hundreds of millions
of key-value pairs, and the mapper does a fair amount of work on each record.
The standard implementation of

    public List<InputSplit> getSplits(JobContext job) throws IOException

uses fs.getFileBlockLocations(file, 0, length) to determine the blocks, and
for a file of this size it will come up with a single InputSplit and
therefore a single mapper.
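
For context: the stock FileInputFormat.getSplits() computes each split size
as max(minSize, min(maxSize, blockSize)), so the cheapest fix is often to
lower the maximum split size rather than rewriting getSplits(). A minimal
sketch, assuming the custom InputFormat extends the new-API FileInputFormat,
inherits the stock getSplits(), and does not override isSplitable() to return
false; the 1 MB cap and the job name are illustrative only:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SmallSplitJobSetup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "fasta-parse"); // illustrative job name
            // Cap the split size at 1 MB so a 10-20 MB input yields 10-20
            // map tasks instead of the single split a 64 MB block would give.
            FileInputFormat.setMaxInputSplitSize(job, 1L * 1024 * 1024);
        }
    }

The catch is that the default splitter cuts at arbitrary byte offsets, so the
RecordReader must cope with FASTA records straddling split boundaries;
overriding getSplits() directly, as sketched after the next paragraph, gives
full control.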

I am looking for a good example of forcing the generation of multiple
InputSplits for a small file. In this case I am happy if every Mapper
instance is required to read and parse the entire file, as long as I can
guarantee that every record is processed by exactly one mapper. While I
think I see how I might modify getSplits(JobContext job), I am not sure how
and when that code is called while the job is running on the cluster.
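
For what it's worth, getSplits() runs once on the client at job-submission
time, not on the cluster: the framework serializes the returned InputSplits,
schedules one map task per split, and each task deserializes its split before
handing it to the RecordReader. A minimal sketch of forcing several splits
out of one small file; the class name, the configuration key, and the default
of 16 are all hypothetical, and locality hints are omitted since the file is
tiny:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Abstract so the existing custom createRecordReader() stays in a subclass.
    public abstract class SmallFileSplitInputFormat<K, V> extends FileInputFormat<K, V> {

        // Hypothetical configuration key naming how many splits to force.
        public static final String NUM_SPLITS_KEY = "custom.input.num.splits";

        @Override
        public List<InputSplit> getSplits(JobContext job) throws IOException {
            int target = job.getConfiguration().getInt(NUM_SPLITS_KEY, 16);
            List<InputSplit> splits = new ArrayList<InputSplit>();
            for (FileStatus status : listStatus(job)) {
                Path path = status.getPath();
                long length = status.getLen();
                long chunk = Math.max(1L, (length + target - 1) / target); // ceiling division
                for (long start = 0; start < length; start += chunk) {
                    // Empty host array: locality hints buy nothing for a 10-20 MB file.
                    splits.add(new FileSplit(path, start,
                            Math.min(chunk, length - start), new String[0]));
                }
            }
            return splits;
        }
    }

Since every mapper may read the whole file anyway, one way to guarantee each
record is handled exactly once is to treat the split purely as a partition
id: reader N of M parses everything but emits only the records whose ordinal
modulo M equals N.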

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com