Re: Does anyone have sample code for forcing a custom InputFormat to use a small split
Thanks - NLineInputFormat is pretty close to what I want.
In most cases the file is text and quite splittable, although that raises another
issue - sometimes the file is compressed. Even though it may only be tens of
megabytes, compression is useful to speed transport.
In the case of a small file with enough work in the mapper it may be useful to
split even a zipped file - even if that means reading from the beginning of the
compressed stream to reach a specific offset in the unzipped data.
Has anyone ever seen that done?
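Not in the original thread, but to make the NLineInputFormat suggestion concrete: a minimal sketch, assuming the old org.apache.hadoop.mapred API that ships with 0.20.x (the class name NLineSetup and the 10,000-line figure are illustrative; if you are on the new mapreduce API, check whether your release ships a port of NLineInputFormat before relying on this):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class NLineSetup {
    // Give every map task a fixed number of input lines, so that even a
    // 10-20 MB text file fans out across many mappers.
    public static void configure(JobConf conf) {
        conf.setInputFormat(NLineInputFormat.class);
        // N lines per split; choose N so totalLines / N is roughly the
        // number of mappers you want.
        conf.setInt("mapred.line.input.format.linespermap", 10000);
    }
}

This only helps while the file is plain text; for the compressed case each mapper would still have to decompress from the start of the stream to reach its logical range, which is workable for tens of megabytes but has to be handled in the record reader rather than here.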
On Mon, Sep 12, 2011 at 1:36 AM, Harsh J <[EMAIL PROTECTED]> wrote:

> Hello Steve,
>
> On Mon, Sep 12, 2011 at 7:57 AM, Steve Lewis <[EMAIL PROTECTED]>
> wrote:
> > I have a problem where there is a single, relatively small (10-20 MB) input
> > file. (It happens it is a fasta file, which will have meaning if you are a
> > biologist.) I am already using a custom InputFormat and a custom reader
> > to force a custom parsing. The file may generate tens or hundreds of
> > millions of key-value pairs, and the mapper does a fair amount of work on
> > each record.
> > The standard implementation of
> >   public List<InputSplit> getSplits(JobContext job) throws IOException {
> >
> > uses fs.getFileBlockLocations(file, 0, length); to determine the blocks and
> > for a file of this size will come up with a single InputSplit and a single
> > mapper.
> > I am looking for a good example of forcing the generation of multiple
> > InputSplits for a small file. In this case I am happy if every Mapper
> > instance is required to read and parse the entire file, as long as I can
> > guarantee that every record is processed by only a single mapper.
>
> Is the file splittable?
>
> You may look at the FileInputFormat's "mapred.min.split.size"
> property. See
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html#setMinInputSplitSize(org.apache.hadoop.mapreduce.Job, long)
>
> Perhaps the 'NLineInputFormat' may also be what you're really looking
> for, which lets you limit no. of records per mapper instead of
> fiddling around with byte sizes with the above.
>
> > While I think I see how I might modify getSplits(JobContext job), I am not
> > sure how and when the code is called when the job is running on the cluster.
>
> The method is called in the client-end, at the job-submission point.
>
> --
> Harsh J
>
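Not part of the original exchange, but a rough sketch of the kind of getSplits() override discussed above, against the new org.apache.hadoop.mapreduce API: it chops a small, uncompressed file into a fixed number of byte-range splits. ForcedSplitInputFormat, NUM_SPLITS, and the LineRecordReader stand-in are placeholders; the real record reader would be the custom fasta parser, and it still has to cope with records that straddle split boundaries.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class ForcedSplitInputFormat extends FileInputFormat<LongWritable, Text> {

    private static final int NUM_SPLITS = 8; // illustrative fan-out

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        // Runs on the client at job-submission time (see above), not on the cluster.
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (FileStatus status : listStatus(job)) {
            Path path = status.getPath();
            long length = status.getLen();
            long chunk = (length + NUM_SPLITS - 1) / NUM_SPLITS;
            for (long start = 0; start < length; start += chunk) {
                long size = Math.min(chunk, length - start);
                // Host hints omitted; locality matters little for a 10-20 MB file.
                splits.add(new FileSplit(path, start, size, null));
            }
        }
        return splits;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new LineRecordReader(); // replace with the custom fasta reader
    }
}

For a splittable file there is also a lighter-weight route: leave getSplits() alone and call FileInputFormat.setMaxInputSplitSize(job, someSmallByteCount) so the stock implementation emits several splits on its own; either way the split computation happens client-side, as noted above.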

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com