Re: Custom InputFormat error
Hi Harsh

Does that mean I have to lose input data because Hadoop's FileSplit
divides the input file evenly according to "numSplits"? I want to
prevent this. Is there any way?

Regards!

Chen

On Wed, Aug 29, 2012 at 9:49 PM, Harsh J <[EMAIL PROTECTED]> wrote:

> No, what I mean is that your RecordReader should be able to handle the
> case where it starts in the middle of a record and hence cannot read
> any record (i.e. it returns false or whatever right up front).
>
> On Wed, Aug 29, 2012 at 1:27 PM, Chen He <[EMAIL PROTECTED]> wrote:
> > Hi Harsh
> >
> > Thank you for your reply. Do you mean I need to change the FileSplit to
> > avoid the errors I mentioned?
> >
> > Regards!
> >
> > Chen
> >
> > On Wed, Aug 29, 2012 at 2:46 AM, Harsh J <[EMAIL PROTECTED]> wrote:
> >>
> >> Hi Chen,
> >>
> >> Do your record reader and mapper handle the case where one map split
> >> may not get a whole record exactly? Your case is not very different
> >> from the newlines logic presented here:
> >> http://wiki.apache.org/hadoop/HadoopMapReduce
> >>
> >> On Wed, Aug 29, 2012 at 11:13 AM, Chen He <[EMAIL PROTECTED]> wrote:
> >> > Hi guys
> >> >
> >> > I ran into an interesting problem while implementing a custom
> >> > InputFormat that extends FileInputFormat. (I rewrote the RecordReader
> >> > class but not the InputSplit class.)
> >> >
> >> > My RecordReader extends LineRecordReader and treats the following
> >> > format as one basic record. It returns a record once it reaches
> >> > #Trailer#, provided the record contains #Header#. I have a single
> >> > input file composed of many such records:
> >> >
> >> > #Header#
> >> > .....(many lines, may be 0 lines or 1000 lines, it varies)
> >> > #Trailer#
> >> >
> >> > Everything works fine if the number of basic records in the file is an
> >> > integer multiple of the number of mappers. For example, I use 2 mappers
> >> > and the input file has 2 basic records, or I use 3 mappers and the
> >> > input file has 6 basic records.
> >> >
> >> > However, if I use 4 mappers and the input file has 3 basic records
> >> > (not an integer multiple), the final output is incorrect. The "Map
> >> > Input Bytes" job counter is also less than the input file size. How
> >> > can I fix it? Do I need to rewrite the InputSplit?
> >> >
> >> > Any reply will be appreciated!
> >> >
> >> > Regards!
> >> >
> >> > Chen
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
>
>
>
> --
> Harsh J
>
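
To make the advice above concrete, here is a minimal sketch of the
boundary handling in plain Java, with no Hadoop classes involved. It is
not Chen's RecordReader and not Hadoop's LineRecordReader; the class
name, the method names and the sample data are made up for
illustration. The assumed convention is that a split owns every record
whose #Header# line begins inside the split's byte range, and that the
reader keeps reading past the end of its split until it reaches the
matching #Trailer#. A reader that starts in the middle of a record
skips lines until the next #Header# and may return nothing at all.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitRecordSketch {
    static final String HEADER = "#Header#";
    static final String TRAILER = "#Trailer#";

    // Hypothetical sample input: three records of different lengths.
    static final String DEMO =
        "#Header#\na\n#Trailer#\n"
      + "#Header#\n#Trailer#\n"
      + "#Header#\nb\nc\n#Trailer#\n";

    /** Returns the records owned by the byte range [start, end) of data. */
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // If the split begins mid-line, move to the start of the next line.
        if (start > 0 && data[start - 1] != '\n') {
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step past the newline
        }
        StringBuilder current = null; // non-null while inside a record
        while (pos < data.length) {
            // Outside a record and past the split end: the rest belongs to later splits.
            if (current == null && pos >= end) break;
            int nl = pos;
            while (nl < data.length && data[nl] != '\n') nl++;
            String line = new String(data, pos, nl - pos, StandardCharsets.UTF_8);
            pos = nl + 1;
            if (current == null) {
                // Lines before the first header are the tail of a record
                // owned by the previous split, so they are skipped.
                if (line.equals(HEADER)) current = new StringBuilder(line).append('\n');
            } else {
                current.append(line).append('\n');
                if (line.equals(TRAILER)) {
                    records.add(current.toString());
                    current = null;
                }
            }
        }
        return records;
    }

    /** Cuts data into roughly equal byte ranges and reads each one independently. */
    static List<String> readAll(byte[] data, int numSplits) {
        List<String> all = new ArrayList<>();
        int splitSize = (data.length + numSplits - 1) / numSplits;
        for (int start = 0; start < data.length; start += splitSize) {
            all.addAll(readSplit(data, start, Math.min(data.length, start + splitSize)));
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] data = DEMO.getBytes(StandardCharsets.UTF_8);
        // Three records read through four splits, the failing case from the thread.
        System.out.println(readAll(data, 4).size() + " records from 4 splits");
    }
}
```

With this ownership rule no data is lost, because the split boundaries
only decide which reader emits a record, not where a record is cut. If
records must never cross a split at all, the other common route is to
override isSplitable in the custom FileInputFormat so that it returns
false, at the cost of one mapper per file.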