MapReduce >> mail # user >> Custom InputFormat error


Re: Custom InputFormat error
Hi Harsh

Does that mean I will lose input data, since Hadoop's FileSplit evenly
splits the input file according to "numSplits"? I want to prevent
this. Is there any way?

Regards!

Chen

On Wed, Aug 29, 2012 at 9:49 PM, Harsh J <[EMAIL PROTECTED]> wrote:

> No, what I mean is that your RecordReader should be able to handle the
> case where it may start from the middle of a record and hence not be
> able to read any record (i.e. return false or whatever right up front).
>
> On Wed, Aug 29, 2012 at 1:27 PM, Chen He <[EMAIL PROTECTED]> wrote:
> > Hi Harsh
> >
> > Thank you for your reply. Do you mean I need to change the FileSplit to
> > avoid the errors I mentioned?
> >
> > Regards!
> >
> > Chen
> >
> > On Wed, Aug 29, 2012 at 2:46 AM, Harsh J <[EMAIL PROTECTED]> wrote:
> >>
> >> Hi Chen,
> >>
> >> Do your record reader and mapper handle the case where one map split
> >> may not get exactly a whole record? Your case is not very different
> >> from the newlines logic presented here:
> >> http://wiki.apache.org/hadoop/HadoopMapReduce
> >>
> >> On Wed, Aug 29, 2012 at 11:13 AM, Chen He <[EMAIL PROTECTED]> wrote:
> >> > Hi guys
> >> >
> >> > I ran into an interesting problem while implementing my own custom
> >> > InputFormat, which extends FileInputFormat. (I rewrote the
> >> > RecordReader class but not the InputSplit class.)
> >> >
> >> > My RecordReader treats the following format as one basic record (it
> >> > extends LineRecordReader and returns a record when it meets
> >> > #Trailer# and the record contains #Header#; my single input file is
> >> > composed of many such basic records):
> >> >
> >> > #Header#
> >> > .....(many lines, may be 0 lines or 1000 lines, it varies)
> >> > #Trailer#
> >> >
> >> > Everything works fine when the number of basic records in the file
> >> > is an integer multiple of the number of mappers. For example, I use
> >> > 2 mappers and there are two basic records in my input file, or I
> >> > use 3 mappers and there are 6 basic records in the input file.
> >> >
> >> > However, if I use 4 mappers and there are 3 basic records in the
> >> > input file (not an integer multiple), the final output is
> >> > incorrect. The "Map Input Bytes" value in the job counters is also
> >> > less than the input file size. How can I fix this? Do I need to
> >> > rewrite the InputSplit?
> >> >
> >> > Any reply will be appreciated!
> >> >
> >> > Regards!
> >> >
> >> > Chen
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
>
>
>
> --
> Harsh J
>
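Harsh's suggestion (that the RecordReader must tolerate a split beginning in the middle of a record) can be sketched outside of Hadoop. The class below is a simplified, self-contained illustration, not Hadoop API: the names `HeaderTrailerSplits` and `readSplit` are hypothetical, and each "split" is modeled as a plain byte range over an in-memory string. The convention is the same one LineRecordReader uses for newlines: a record is owned by the split in which its #Header# line starts, a reader that starts mid-record skips forward to the next #Header#, and a reader that opens a record near its split end keeps reading past the end until it sees #Trailer#. Under that convention every record is read exactly once, regardless of how many splits there are.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the split-boundary convention for #Header#/#Trailer# records.
// A record belongs to the split in which its #Header# line begins; the
// reader may read past the split end to reach the matching #Trailer#.
public class HeaderTrailerSplits {

    // Return the records whose #Header# line starts at a byte offset in
    // [start, end). Offsets are computed over the full data, the way a
    // RecordReader would seek within a FileSplit.
    static List<String> readSplit(String data, int start, int end) {
        List<String> records = new ArrayList<>();
        String[] lines = data.split("\n", -1);
        int offset = 0;              // byte offset of the current line start
        List<String> current = null; // body lines of the record being built
        for (String line : lines) {
            int lineStart = offset;
            offset += line.length() + 1; // +1 for the '\n' separator
            if (line.equals("#Header#")) {
                // Open a record only if its header starts inside this split;
                // headers before `start` belong to the previous split.
                if (lineStart >= start && lineStart < end) {
                    current = new ArrayList<>();
                }
            } else if (line.equals("#Trailer#")) {
                if (current != null) {
                    // May happen past `end`: we finish the open record anyway.
                    records.add(String.join("\n", current));
                    current = null;
                }
            } else if (current != null) {
                current.add(line);
            }
            // Past the split end with no open record: nothing left to do.
            if (lineStart >= end && current == null) break;
        }
        return records;
    }
}
```

With 4 splits over 3 records (the failing case in the thread), the union of all four `readSplit` results is exactly the 3 records: one split simply yields nothing, instead of producing a torn record or dropping bytes.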