Re: Possible to include open .avro file in Map/Reduce job?
Terry Healy 2013-01-18, 14:51
In this case I could truncate the logs earlier, but then I have to go
back at some point and recombine the small files. For now, I can live
with moving the files daily.
I was unable to find a way to trap the "Invalid Sync" exception
(org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid
sync!). Since my mapper extends AvroMapper, and map throws exceptions, I don't
know where to trap it. Another person suggested using low-level avro
functions for this. Perhaps I need to write an avro file validator of
some sort to be run before the Map/Reduce job? This seems nasty. But I
had another M/R job fail with this error overnight, and even finding
the offending file via the logs is quite a pain.
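For what it's worth, a pre-flight validator of the kind described above does not have to decode any records. The following is only a rough sketch, assuming the object container file layout as I understand it from the Avro spec (4-byte magic "Obj" plus 1, a metadata map, a 16-byte sync marker, then blocks of object count, byte size, data, and sync marker). It uses plain java.io rather than Avro's own DataFileReader, and the names AvroContainerCheck, Result, and check are made up for illustration. It reports how many blocks were intact and whether the file ended on a block boundary.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper: walks an Avro container file block by block without decoding records. */
final class AvroContainerCheck {

    /** Outcome of a scan: how many blocks were intact, and what went wrong, if anything. */
    static final class Result {
        final long completeBlocks;
        final boolean clean;
        final String problem; // null when the file is clean

        Result(long completeBlocks, boolean clean, String problem) {
            this.completeBlocks = completeBlocks;
            this.clean = clean;
            this.problem = problem;
        }
    }

    static Result check(InputStream raw) {
        DataInputStream in = new DataInputStream(raw);
        long blocks = 0;
        try {
            byte[] magic = new byte[4];
            in.readFully(magic);
            if (!Arrays.equals(magic, new byte[] {'O', 'b', 'j', 1})) {
                return new Result(0, false, "not an Avro container file");
            }
            // File metadata: an Avro map of string -> bytes, ended by a zero count.
            long count;
            while ((count = readLong(in, in.read())) != 0) {
                if (count < 0) {              // a negative count is followed by a byte size
                    count = -count;
                    readLong(in, in.read());
                }
                for (long i = 0; i < count; i++) {
                    skip(in, readLong(in, in.read())); // key
                    skip(in, readLong(in, in.read())); // value
                }
            }
            byte[] sync = new byte[16];
            in.readFully(sync);

            byte[] marker = new byte[16];
            while (true) {
                int first = in.read();
                if (first < 0) {
                    return new Result(blocks, true, null); // EOF exactly on a block boundary
                }
                readLong(in, first);                   // object count, not needed here
                skip(in, readLong(in, in.read()));     // serialized block bytes
                in.readFully(marker);
                if (!Arrays.equals(marker, sync)) {
                    return new Result(blocks, false, "invalid sync marker");
                }
                blocks++;
            }
        } catch (EOFException e) {
            return new Result(blocks, false, "truncated");
        } catch (IOException e) {
            return new Result(blocks, false, e.getMessage());
        }
    }

    /** Decodes one zigzag varint-encoded Avro long whose first byte has already been read. */
    private static long readLong(InputStream in, int first) throws IOException {
        long n = 0;
        int shift = 0;
        int b = first;
        while (true) {
            if (b < 0) throw new EOFException("truncated varint");
            n |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
            if (shift > 63) throw new IOException("varint too long");
            b = in.read();
        }
        return (n >>> 1) ^ -(n & 1); // undo zigzag encoding
    }

    /** Consumes exactly n bytes, failing with EOFException if the stream ends first. */
    private static void skip(DataInputStream in, long n) throws IOException {
        if (n < 0) throw new IOException("negative length");
        byte[] buf = new byte[8192];
        while (n > 0) {
            int chunk = (int) Math.min(buf.length, n);
            in.readFully(buf, 0, chunk);
            n -= chunk;
        }
    }
}
```

Since check takes any InputStream, it could presumably be fed the stream returned by FileSystem.open(path) for each candidate input, and files that do not come back clean could be left out of the job. It says nothing about whether the records inside a block are decodable.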
On 01/17/2013 04:36 PM, Doug Cutting wrote:
> Folks often move files once they're closed into a directory where
> they're processed to avoid issues with partially written data. Maybe
> you could start a new log file every hour rather than every day?
> We could add an ignoreTruncation or ignoreCorruption option to
> DataFileReader that attempts to read files that might be truncated or
> corrupted.
> And yes, you can probably just catch those exceptions and exit the map
> at that point.
> On Mon, Jan 14, 2013 at 11:22 AM, Terry Healy <[EMAIL PROTECTED]> wrote:
>> I have a log collection application that writes .avro files within HDFS.
>> Ideally I would like to include the current day's (open for append) file
>> as one of the input files for a periodic M/R job.
>> I tried this but the Map job exited in error with the dreaded "Invalid
>> Sync!" IOException. I guess I should have expected this, but is there a
>> reasonable way around it? Can I catch the exception and just exit the
>> map at that point?
>> All suggestions appreciated.
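
On "catch the exception and just exit at that point": the failure surfaces while the next record is being fetched, not inside the user's map method, so any catch has to sit around the record source. Below is a minimal sketch of that idea using a plain java.util.Iterator as a stand-in for the reader. StopOnErrorIterator is a hypothetical name, not an Avro or Hadoop class, and wiring something like it into an input format or record reader is not shown. It catches RuntimeException broadly for illustration; real code would likely narrow that to AvroRuntimeException.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical wrapper: ends iteration quietly when the underlying source fails. */
final class StopOnErrorIterator<T> implements Iterator<T> {
    private final Iterator<T> source;
    private T buffered;
    private boolean hasBuffered;
    private boolean done;
    private RuntimeException failure; // the first error seen, if any

    StopOnErrorIterator(Iterator<T> source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        if (hasBuffered) return true;
        if (done) return false;
        try {
            // Fetch eagerly so an error from either hasNext() or next() is caught here.
            if (source.hasNext()) {
                buffered = source.next();
                hasBuffered = true;
            } else {
                done = true;
            }
        } catch (RuntimeException e) {
            failure = e;   // remember why we stopped, then treat it as end of input
            done = true;
        }
        return hasBuffered;
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        hasBuffered = false;
        T value = buffered;
        buffered = null;
        return value;
    }

    /** Returns the error that ended iteration early, or null if the source ended normally. */
    RuntimeException failure() {
        return failure;
    }
}
```

Records read before the bad block are still delivered, and failure() lets the caller log which input was cut short instead of failing the whole job.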