Re: AW: How to process only input files containing 100% valid rows
As far as I know, there are no guarantees about when counters are updated during a job. One thing you can do is write a metadata file alongside your parsed events, listing which files contain errors and should be ignored in the next step of your ETL workflow.
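A minimal, untested sketch of that metadata idea, using MultipleOutputs (mentioned later in this thread) to write a "rejects" manifest next to the normal output. The class name, the "rejects" named output, keying by filename, and the isBroken() check are all made up for illustration:

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Assumes the map output is keyed by source filename; isBroken() stands in
// for whatever validity check your parser performs.
public class EventReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context ctx) {
        mos = new MultipleOutputs<>(ctx);
    }

    @Override
    protected void reduce(Text filename, Iterable<Text> events, Context ctx)
            throws IOException, InterruptedException {
        boolean dirty = false;
        for (Text event : events) {
            if (isBroken(event)) {
                dirty = true;                  // remember, but keep going
            } else {
                ctx.write(filename, event);    // normal parsed-event output
            }
        }
        if (dirty) {
            // One manifest line per bad file; downstream steps skip these.
            mos.write("rejects", filename, NullWritable.get());
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        mos.close();                           // flush the side output
    }

    private boolean isBroken(Text event) {
        return event.getLength() == 0;         // placeholder validity check
    }
}

The driver would have to declare the side output up front, along the lines of MultipleOutputs.addNamedOutput(job, "rejects", TextOutputFormat.class, Text.class, NullWritable.class).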
If you really don't want to have "dirty" records mixed in, you can accomplish it with a secondary sort. In a nutshell (a rough sketch follows the list):
- create a composite key from the filename and a status enum (BROKEN = 0, CLEAN = 1)
- create a sorting comparator that ensures BROKEN comes before CLEAN
- create a grouping comparator and a partitioner on filename only, to ensure both BROKEN and CLEAN are processed by the same reducer
- when you find a broken line, emit it under a BROKEN key
- in the reducer, if the first key you see for a file is BROKEN, write that filename somewhere so you know you have to scrub and re-submit it, and ignore both its BROKEN and CLEAN records
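Here is a rough, untested sketch of those pieces against the org.apache.hadoop.mapreduce API. All class names are invented, and instead of a separate sort comparator I've folded the BROKEN-before-CLEAN ordering into the key's natural compareTo, which serves as the sort order by default:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

// Composite key: (filename, status), with BROKEN = 0 < CLEAN = 1.
class FileKey implements WritableComparable<FileKey> {
    static final int BROKEN = 0;
    static final int CLEAN = 1;

    Text filename = new Text();
    int status = CLEAN;

    public void write(DataOutput out) throws IOException {
        filename.write(out);
        out.writeInt(status);
    }

    public void readFields(DataInput in) throws IOException {
        filename.readFields(in);
        status = in.readInt();
    }

    // Natural sort order: filename first, then status, so within one file
    // every BROKEN record precedes every CLEAN record.
    public int compareTo(FileKey other) {
        int cmp = filename.compareTo(other.filename);
        return cmp != 0 ? cmp : Integer.compare(status, other.status);
    }
}

// Partition on filename only, so all records of a file hit one reducer.
class FilenamePartitioner extends Partitioner<FileKey, Text> {
    @Override
    public int getPartition(FileKey key, Text value, int numPartitions) {
        return (key.filename.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Group on filename only, so one reduce() call sees the whole file.
class FilenameGroupingComparator extends WritableComparator {
    FilenameGroupingComparator() {
        super(FileKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return ((FileKey) a).filename.compareTo(((FileKey) b).filename);
    }
}

// Because BROKEN sorts first, the key at the head of each group already
// tells us whether the whole file must be rejected.
class FileFilterReducer extends Reducer<FileKey, Text, Text, Text> {
    @Override
    protected void reduce(FileKey key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        if (key.status == FileKey.BROKEN) {
            // Dirty file: note it for scrubbing/re-submission, emit nothing.
            ctx.write(new Text("REJECTED"), new Text(key.filename));
            return;
        }
        for (Text value : values) {             // clean file: pass through
            ctx.write(new Text(key.filename), value);
        }
    }
}

The driver wires it together with job.setMapOutputKeyClass(FileKey.class), job.setPartitionerClass(FilenamePartitioner.class) and job.setGroupingComparatorClass(FilenameGroupingComparator.class).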
On 19-04-2013 06:39, Matthias Scherer wrote:
I have to add that we have 1-2 billion events per day, split into some thousands of files. So pre-reading each file in the InputFormat should be avoided.
And yes, we could use MultipleOutputs to write the bad records of each input file to separate files. But we (our Operations team) think that there is more / better control if we reject whole files containing bad records.