MapReduce >> mail # user >> Re: How to process only input files containing 100% valid rows


Re: How to process only input files containing 100% valid rows
With files that small, it is much better to write a custom input format
which checks the entire file and only passes records from good files. If
you need Hadoop, you are probably processing a large number of these files,
and an input format can easily read an entire file and handle it if it is
as short as a few thousand lines.
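The core of that idea can be sketched in plain Java. In a real job this check would live inside the RecordReader of a custom, non-splittable FileInputFormat (with isSplitable() returning false so each mapper sees one whole file); the class name and the validity predicate below are illustrative, not from this thread:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch of "validate the whole file, then emit": read every line first,
// and only pass records on if the entire file is clean. In Hadoop this
// logic would sit in a custom RecordReader; names here are invented.
public class WholeFileValidator {

    // Returns all lines if every line passes the check, or an empty
    // list if any line is invalid (i.e. the whole file is rejected).
    public static List<String> recordsIfAllValid(List<String> lines,
                                                 Predicate<String> isValid) {
        for (String line : lines) {
            if (!isValid.test(line)) {
                return new ArrayList<>();   // reject the whole file
            }
        }
        return new ArrayList<>(lines);      // file is clean: emit all rows
    }

    public static void main(String[] args) {
        // Toy validity rule: a row must have exactly three columns.
        Predicate<String> threeCols = s -> s.split(",", -1).length == 3;

        List<String> good = List.of("a,b,c", "1,2,3");
        List<String> bad  = List.of("a,b,c", "broken row");

        System.out.println(recordsIfAllValid(good, threeCols).size()); // 2
        System.out.println(recordsIfAllValid(bad, threeCols).size());  // 0
    }
}
```

Since the files are small and unsplittable anyway, buffering all lines before emitting costs little and keeps partially-bad files entirely out of the job's output.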
On Thu, Apr 18, 2013 at 12:34 PM, Matthias Scherer <
[EMAIL PROTECTED]> wrote:

> Hi all,
>
> In my mapreduce job, I would like to process only whole input files
> containing only valid rows. If one map task processing an input split of a
> file detects an invalid row, the whole file should be “marked” as invalid
> and not processed at all. This input file will then be cleansed by another
> process, and taken again as input to the next run of my mapreduce job.
>
> My first idea was to set a counter in the mapper after detecting an
> invalid line, with the name of the file (derived from the input split) as
> the counter name. Then additionally put the input filename into the map
> output value (which is already a MapWritable, so adding the filename is no
> problem). In the reducer I could then filter out any rows belonging to the
> counters written in the mapper.
>
> Each job has some thousand input files, so in the worst case there could
> be as many counters written to mark invalid input files. Is this a feasible
> approach? Does the framework guarantee that all counters written in the
> mappers are synchronized (visible) in the reducers? And could this number
> of counters lead to an OOME in the jobtracker?
>
> Are there better approaches? I could also process the files using a
> non-splittable input format. Is there a way to reject the already emitted
> rows of the map task processing an input split?
>
> Thanks,
>
> Matthias
>
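For comparison, the counter-based bookkeeping Matthias describes above can be simulated in plain Java: tag every emitted row with its source file, bump a per-file counter on invalid rows, then drop rows from any flagged file. All names and the toy validity rule are invented; in a real job the counters would be Hadoop Counters and the file tag would travel inside the MapWritable value:

```java
import java.util.*;

// Simulation of the per-file-counter idea: the "map" phase emits every
// row tagged with its file and counts invalid rows per file; the
// "reduce"-side filter then drops rows from files with a non-zero
// invalid count. Names are illustrative, not Hadoop API.
public class PerFileCounterFilter {

    // One emitted record: the row plus the file it came from.
    record Tagged(String file, String row) {}

    public static List<String> mapAndFilter(Map<String, List<String>> files) {
        Map<String, Long> invalidCounter = new HashMap<>();
        List<Tagged> emitted = new ArrayList<>();

        // "Map" phase: emit every row, count invalid rows per file.
        for (var e : files.entrySet()) {
            for (String row : e.getValue()) {
                if (row.isBlank()) {                      // toy validity rule
                    invalidCounter.merge(e.getKey(), 1L, Long::sum);
                }
                emitted.add(new Tagged(e.getKey(), row));
            }
        }

        // Filter phase: keep only rows from files with no invalid rows.
        List<String> kept = new ArrayList<>();
        for (Tagged t : emitted) {
            if (!invalidCounter.containsKey(t.file)) {
                kept.add(t.row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("clean.csv", List.of("a", "b"));
        files.put("dirty.csv", List.of("c", "   "));   // blank row: invalid
        System.out.println(mapAndFilter(files));       // [a, b]
    }
}
```

Note the simulation sidesteps the open question in the mail: it can only filter because both phases run in one process, whereas a real job would need the set of bad files to reach the reducers somehow.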

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com