MapReduce, mail # user - Re: How to process only input files containing 100% valid rows


Re: How to process only input files containing 100% valid rows
Niels Basjes 2013-04-19, 08:21
How about a different approach:
If you use the multiple output option, you can process the valid lines in the
normal way and write the invalid lines to a separate output file.
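
The routing idea can be sketched in plain Java, without any Hadoop dependency. In a real job the two destinations would presumably be the regular map output and a named output written through Hadoop's `MultipleOutputs` class. The class name `LineRouter`, the validity rule (three non-empty comma-separated fields), and the file names are all hypothetical and only illustrate the logic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java simulation of "valid lines to the normal output,
// invalid lines to a separate output". Not Hadoop API code.
class LineRouter {
    // Stand-in for the normal output of the job.
    final List<String> valid = new ArrayList<>();
    // Stand-in for the separate output, grouped by source file.
    final Map<String, List<String>> invalidByFile = new LinkedHashMap<>();

    // Hypothetical rule: a row is valid if it has exactly three
    // non-empty comma-separated fields.
    static boolean isValid(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != 3) {
            return false;
        }
        for (String field : fields) {
            if (field.trim().isEmpty()) {
                return false;
            }
        }
        return true;
    }

    // Route one input line to the matching output.
    void route(String fileName, String line) {
        if (isValid(line)) {
            valid.add(line);
        } else {
            invalidByFile.computeIfAbsent(fileName, k -> new ArrayList<>()).add(line);
        }
    }

    // Files that produced at least one invalid line.
    Set<String> filesNeedingCleansing() {
        return new LinkedHashSet<>(invalidByFile.keySet());
    }
}
```

The separate output doubles as a list of the files that still need cleansing. Note that in this sketch the valid lines of a partly invalid file are still processed, which is the trade-off of this approach compared with rejecting whole files.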
On Apr 18, 2013 9:36 PM, "Matthias Scherer" <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> In my mapreduce job, I would like to process only whole input files
> containing only valid rows. If one map task processing an input split of a
> file detects an invalid row, the whole file should be “marked” as invalid
> and not processed at all. This input file will then be cleansed by another
> process, and taken again as input to the next run of my mapreduce job.
>
> My first idea was to set a counter in the mapper after detecting an
> invalid line, using the name of the file (derived from the input split)
> as the counter name. Additionally, I would add the input filename to the
> map output value (which is already a MapWritable, so adding the filename
> is no problem). In the reducer I could then filter out any rows belonging
> to files for which the mappers wrote a counter.
>
> Each job has several thousand input files. So in the worst case just as
> many counters could be written to mark invalid input files. Is this a
> feasible approach? Does the framework guarantee that all counters written
> in the mappers are synchronized (visible) in the reducers? And could this
> number of counters lead to an OOME in the jobtracker?
>
> Are there better approaches? I could also process the files using a
> non-splittable input format. Is there a way to reject the rows already
> emitted by the map task processing an input split?
>
> Thanks,
>
> Matthias