Hadoop, mail # user - Re: AW: How to process only input files containing 100% valid rows


Re: AW: How to process only input files containing 100% valid rows
Nitin Pawar 2013-04-19, 10:16
Reject the entire file even if a single record is invalid? There has to be
a really serious reason to take this approach.
If not: just to check that a file contains only valid lines, you are already
opening the files and parsing them. Why not parse and separate out the
incorrect lines, as suggested in previous mails?
That way you get a count of the invalid records as well, and you will not
lose the valid records because of a small number of invalid records in a file.
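
A minimal sketch of the parse-and-separate idea, in plain Python rather than
Hadoop code. The record format (three comma-separated fields, the first an
integer id), the file names, and the helper names is_valid and split_records
are all assumptions made up for illustration:

```python
from collections import Counter


def is_valid(line):
    """Hypothetical validity rule: exactly three comma-separated fields,
    the first of which is an integer event id."""
    fields = line.split(",")
    return len(fields) == 3 and fields[0].strip().isdigit()


def split_records(records):
    """Single pass over (filename, line) pairs.

    Returns the valid lines, the invalid lines, and a per-file count of
    invalid lines, so no file has to be read twice.
    """
    valid, invalid = [], []
    invalid_per_file = Counter()
    for filename, line in records:
        if is_valid(line):
            valid.append((filename, line))
        else:
            invalid.append((filename, line))
            invalid_per_file[filename] += 1
    return valid, invalid, invalid_per_file


# Made-up sample input: two files, one of which has a bad line.
records = [
    ("a.log", "1,login,2013-04-19"),
    ("a.log", "2,logout,2013-04-19"),
    ("b.log", "3,login,2013-04-19"),
    ("b.log", "broken line"),
]
valid, invalid, invalid_per_file = split_records(records)

# Files with no invalid lines at all.
clean_files = {name for name, _ in records} - set(invalid_per_file)
```

The per-file count also leaves the door open for rejecting whole files
afterwards: keep only the output that came from files in clean_files. In a
real job the same bookkeeping could probably be done with job counters and
MultipleOutputs, but that mapping is not shown here.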
On Apr 19, 2013 3:23 PM, "Matthias Scherer" <[EMAIL PROTECTED]>
wrote:

> I have to add that we have 1-2 billion events per day, split into some
> thousands of files. So pre-reading each file in the InputFormat should be
> avoided.
>
> And yes, we could use MultipleOutputs and write bad files to process each
> input file. But we (our Operations team) think that there is more / better
> control if we reject whole files containing bad records.
>
> Regards
>
> Matthias
>